vector libs: file based spatial index

29 messages Options
Embed this post
Permalink
1 2
Markus Metz-2

vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Hi,

I have now a completely file based spatial index for vector libs that
could reduce memory consumption considerably. The spatial index file is
usually 2 - 3 times larger than the topo file, for point datasets it is
about 5 times larger than the topo file. In GRASS6.x all that is always
kept in memory and built from scratch.
My implementation is completely file based, also when creating or
updating a vector. This comes obviously with a speed penalty because
reading in memory is faster than reading from file. With all sorts of
tricks and relying on the system to cache files, the file based spatial
index needs 2 times as much time to get built than the memory based
spatial index, e.g. v.build takes twice as long (on my test system).
That's less than I was afraid it would take but still two times as long.
Other new features are dynamic 2D or 3D spatial index (was fixed to 3D
also for 2D vectors), higher portability across different platforms,
speed increase and more control over memory consumption by translating
all recursive functions to non-recursive functions, and it recycles dead
space in files (principle of Radim's suggestion in vector TODO). A
complete rewrite of the rtree library I'm afraid.

Considering that a file based spatial index is only useful for massive
vectors where memory can become a limiting factor, I hesitate to commit
to trunk. A sophisticated compromise would be to build the spatial index
in memory and then write it out to disk. Opening an old vector for
reading only (Vect_open_old()) would be much faster, spatial queries not
much slower than before because the spatial index head is loaded only,
searches in file are fairly fast. When opening a new vector, the spatial
index could be built in memory and then written out. When opening an old
vector for update the spatial index could be read to memory completely
and written out when closing.

What to do now? Leave it all in memory as in grass6, build in memory
then write out (risk of running out of memory on massive datasets), or
keep it always in file? I'll not commit any time soon (also waiting for
the lib/raster commotion to settle down), I need feedback on how to
proceed or if I should forget about it.

Markus M


_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Paolo Cavallini

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Markus GRASS ha scritto:
> What to do now? Leave it all in memory as in grass6, build in memory
> then write out (risk of running out of memory on massive datasets), or
> keep it always in file? I'll not commit any time soon (also waiting for
> the lib/raster commotion to settle down), I need feedback on how to
> proceed or if I should forget about it.

I think advice from Radim would be very useful here.
All the best.
--
Paolo Cavallini: http://www.faunalia.it/pc
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Paolo Cavallini wrote:

> Markus GRASS ha scritto:
>  
>> What to do now? Leave it all in memory as in grass6, build in memory
>> then write out (risk of running out of memory on massive datasets), or
>> keep it always in file? I'll not commit any time soon (also waiting for
>> the lib/raster commotion to settle down), I need feedback on how to
>> proceed or if I should forget about it.
>>    
>
> I think advice from Radim would be very useful here.
> All the best.
>  
OK, let me rephrase. I think I have two alternatives to the current
implementation of the vector spatial index and would like to know if
grass7 should get 1) faster vector display and lower memory consumption
at the cost of (sometimes) slower vector processing [1], 2) faster
vector display, a similar speed in vector processing but keep the risk
of running out of memory when processing large datasets, or 3) no
changes to the spatial index. IMHO this should be a general decision of
the GRASS community, not of one or two developers.

The actual implementation is not easy and advice from Radim and other
developers would really be very useful.

Markus M

[1] the sometimes slow speed of v.lidar.edgedetection is caused by the
module not the vector libs, as is probably true for other modules e.g.
v.extract (I remember a remark earlier this year, haven't searched the MLs)

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Jachym Cepicky

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Nice work.

What about make it scalable? In case, the vector library would take more
then GRASS_VECTOR_MEMORY_MAX (or similar), make it file-based. Keep it
in memory otherwise.

j

On Tue, Jun 23, 2009 at 08:28:49PM +0200, Markus GRASS wrote:

> Paolo Cavallini wrote:
> > Markus GRASS ha scritto:
> >  
> >> What to do now? Leave it all in memory as in grass6, build in memory
> >> then write out (risk of running out of memory on massive datasets), or
> >> keep it always in file? I'll not commit any time soon (also waiting for
> >> the lib/raster commotion to settle down), I need feedback on how to
> >> proceed or if I should forget about it.
> >>    
> >
> > I think advice from Radim would be very useful here.
> > All the best.
> >  
> OK, let me rephrase. I think I have two alternatives to the current
> implementation of the vector spatial index and would like to know if
> grass7 should get 1) faster vector display and lower memory consumption
> at the cost of (sometimes) slower vector processing [1], 2) faster
> vector display, a similar speed in vector processing but keep the risk
> of running out of memory when processing large datasets, or 3) no
> changes to the spatial index. IMHO this should be a general decision of
> the GRASS community, not of one or two developers.
>
> The actual implementation is not easy and advice from Radim and other
> developers would really be very useful.
>
> Markus M
>
> [1] the sometimes slow speed of v.lidar.edgedetection is caused by the
> module not the vector libs, as is probably true for other modules e.g.
> v.extract (I remember a remark earlier this year, haven't searched the MLs)
>
> _______________________________________________
> grass-dev mailing list
> [hidden email]
> http://lists.osgeo.org/mailman/listinfo/grass-dev
--
Jachym Cepicky
e-mail: jachym.cepicky gmail com
URL: http://les-ejk.cz
GPG: http://www.les-ejk.cz/pgp/JachymCepicky.pgp
Key fingerprint: 0C6D 0EAE 76BD 506C F299  ED8A C8AB 74B8 08D4 E08F


_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev

signature.asc (204 bytes) Download Attachment
Moritz Lennert

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2
On 23/06/09 20:28, Markus GRASS wrote:

> Paolo Cavallini wrote:
>> Markus GRASS ha scritto:
>>  
>>> What to do now? Leave it all in memory as in grass6, build in memory
>>> then write out (risk of running out of memory on massive datasets), or
>>> keep it always in file? I'll not commit any time soon (also waiting for
>>> the lib/raster commotion to settle down), I need feedback on how to
>>> proceed or if I should forget about it.
>>>    
>> I think advice from Radim would be very useful here.
>> All the best.
>>  
> OK, let me rephrase. I think I have two alternatives to the current
> implementation of the vector spatial index and would like to know if
> grass7 should get 1) faster vector display and lower memory consumption
> at the cost of (sometimes) slower vector processing [1], 2) faster
> vector display, a similar speed in vector processing but keep the risk
> of running out of memory when processing large datasets, or 3) no
> changes to the spatial index. IMHO this should be a general decision of
> the GRASS community, not of one or two developers.

What size of vector data are we talking about concerning the risk of
running out of memory ? Would it be possible to implement both 1) and 2)
with 2) being the default and a flag to switch to 1) for very large
vectors ?

You wrote:
> Considering that a file based spatial index is only useful for massive
> vectors where memory can become a limiting factor, I hesitate to commit
> to trunk.

Well, I don't know what you call massive, but one of the main problems
with the memory based index is that currently the spatial index is
rebuilt for each run of v.what as it is stored no where. This makes
querying large (e.g. 20-30.000 polygons) _very_ slow when using the GUI
(which will be the only option in grass7 as xmons have disappeared).

I'm not sure I understand everything correctly here, but I have the
feeling that there are two questions here:

1) Should we have a file-based storage of the spatial index ? This can
then be read into memory when necessary, which still should be faster
than rebuilding it each time.

To this I clearly say +1.
The question here is: how to make sure the index is up to date ?

2) Should the entire treatment of the index be file-based, i.e. the
index is never read into memory, but always accessed via file, with the
speed penalties you spoke about.

To this I would say, if we can have a file-based, permanent storage of
the index, but read into memory for treatment, unless a flag says not to
load it into memory, than this would probably be the ideal, within my
limited understanding of the issue.

Moritz
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Benjamin Ducke

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Great work!

In keeping with good GRASS traditions, I would suggest to have this
user-configurable via an env var. Perhaps:

GRASS_SPATIAL_INDEX =
        AUTO (default: keep in mem if enough RAM available)
        FILE
        MEMORY

Cheers,

Ben


Moritz Lennert wrote:

> On 23/06/09 20:28, Markus GRASS wrote:
>> Paolo Cavallini wrote:
>>> Markus GRASS ha scritto:
>>>  
>>>> What to do now? Leave it all in memory as in grass6, build in memory
>>>> then write out (risk of running out of memory on massive datasets), or
>>>> keep it always in file? I'll not commit any time soon (also waiting for
>>>> the lib/raster commotion to settle down), I need feedback on how to
>>>> proceed or if I should forget about it.
>>>>    
>>> I think advice from Radim would be very useful here.
>>> All the best.
>>>  
>> OK, let me rephrase. I think I have two alternatives to the current
>> implementation of the vector spatial index and would like to know if
>> grass7 should get 1) faster vector display and lower memory consumption
>> at the cost of (sometimes) slower vector processing [1], 2) faster
>> vector display, a similar speed in vector processing but keep the risk
>> of running out of memory when processing large datasets, or 3) no
>> changes to the spatial index. IMHO this should be a general decision of
>> the GRASS community, not of one or two developers.
>
> What size of vector data are we talking about concerning the risk of
> running out of memory ? Would it be possible to implement both 1) and 2)
> with 2) being the default and a flag to switch to 1) for very large
> vectors ?
>
> You wrote:
>> Considering that a file based spatial index is only useful for massive
>> vectors where memory can become a limiting factor, I hesitate to commit
>> to trunk.
>
> Well, I don't know what you call massive, but one of the main problems
> with the memory based index is that currently the spatial index is
> rebuilt for each run of v.what as it is stored no where. This makes
> querying large (e.g. 20-30.000 polygons) _very_ slow when using the GUI
> (which will be the only option in grass7 as xmons have disappeared).
>
> I'm not sure I understand everything correctly here, but I have the
> feeling that there are two questions here:
>
> 1) Should we have a file-based storage of the spatial index ? This can
> then be read into memory when necessary, which still should be faster
> than rebuilding it each time.
>
> To this I clearly say +1.
> The question here is: how to make sure the index is up to date ?
>
> 2) Should the entire treatment of the index be file-based, i.e. the
> index is never read into memory, but always accessed via file, with the
> speed penalties you spoke about.
>
> To this I would say, if we can have a file-based, permanent storage of
> the index, but read into memory for treatment, unless a flag says not to
> load it into memory, than this would probably be the ideal, within my
> limited understanding of the issue.
>
> Moritz
> _______________________________________________
> grass-dev mailing list
> [hidden email]
> http://lists.osgeo.org/mailman/listinfo/grass-dev
>
>


--
Benjamin Ducke
Senior Geospatial Consultant

Oxford Archaeology
Janus House
Osney Mead
OX2 0ES
Oxford, U.K.

Tel: +44 (0)1865 263 800 (switchboard)
Tel: +44 (0)1865 980 758 (direct)
Fax :+44 (0)1865 793 496
[hidden email]




------
Files attached to this email may be in ISO 26300 format (OASIS Open Document Format). If you have difficulty opening them, please visit http://iso26300.info for more information.

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Paul Kelly

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2
On Tue, 23 Jun 2009, Markus GRASS wrote:

> My implementation is completely file based, also when creating or
> updating a vector. This comes obviously with a speed penalty because
> reading in memory is faster than reading from file. With all sorts of
> tricks and relying on the system to cache files, the file based spatial

Hi Markus, The last point above sounds interesting. What sort of tricks?
Are you using memory-mapped files? My thinking would be to always read it
from the file, and rely on the operating system as best as possible to not
slow down performance too much (compared to holding the index in memory)
for small sizes of indexes. I think that would be a seamless foolproof
(from the user's point of view) approach that would scale very well with
both increasing size of datasets and performance of computer hardware.

Paul
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Paul Kelly wrote:
> On Tue, 23 Jun 2009, Markus GRASS wrote:
>
>> My implementation is completely file based, also when creating or
>> updating a vector. This comes obviously with a speed penalty because
>> reading in memory is faster than reading from file. With all sorts of
>> tricks and relying on the system to cache files, the file based spatial
>
> Hi Markus, The last point above sounds interesting. What sort of tricks?
When inserting or deleting or searching, the search patch is remembered
(kept in memory) and walked back by using a stack in a non-recursive
functions. This reduces file IO operations about by half and should be a
faster algorithm than the recursive alternative. As an example,
inserting a new item into an RTree that already has about 50,000 items
requires a total of 3 file reads and 3 - 5 file writes, each file IO
reads or writes 528 bytes on a 32 bit system and 586 bytes on a 64 bit
system. And when using intermediate files, the whole chunk of 528/568
bytes is read or written at once.
> Are you using memory-mapped files?
What are memory-mapped files? Excuse my ignorance, I'm just a
self-trained coder (learning by doing).
> My thinking would be to always read it from the file, and rely on the
> operating system as best as possible to not slow down performance too
> much (compared to holding the index in memory) for small sizes of
> indexes. I think that would be a seamless foolproof (from the user's
> point of view) approach that would scale very well with both
> increasing size of datasets and performance of computer hardware.
It would also require less complex code, it's already quite a bit more
complex by now.

Markus M

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Moritz Lennert

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Moritz Lennert
On 24/06/09 22:49, Markus GRASS wrote:

> Moritz:
>> I'm not sure I understand everything correctly here, but I have the
>> feeling that there are two questions here:
>>
>> 1) Should we have a file-based storage of the spatial index ? This can
>> then be read into memory when necessary, which still should be faster
>> than rebuilding it each time.
> If an old vector is opened just for reading (v.what, v.info, probably
> also d.vect), the fastest solution is probably to only load the header
> of the spatial index, as is done for the coor file, and perform spatial
> queries in file. This is very fast AFAIKT.

Then the main issue is during editing ? I guess it then depends on the
use cases, but I don't know if frequent editing is something happening
very often on large files... IMHO, a slight speed penality for the
infrequent updates of vectors is acceptable compared to the huge
advantage of not having to rebuild the index every time.

>> To this I clearly say +1.
>> The question here is: how to make sure the index is up to date ?
> Keeping it up to date is not a problem per se with any method. The real
> issue here is whether to keep it on file or loading it to memory when
> modifying it, speed vs. memory consumption. And this is where I would
> like to get feedback, what is your experience, do larger vectors use too
> much memory or is vector processing relatively slow and should not get
> any slower?

I haven't used vector files, yet, that have caused memory problems, but
I have had serious speed problems...So, I would plead for whatever makes
things faster.

> Hmm yes, I think these massive datasets are still the exception, now and
> then someone tries to work with huge vectors but this is not the
> everyday case (maybe because it takes so long...).

They will probably become more normal (cf Lidar), so we should plan with
them in mind, without making everyone else "suffer" because a few people
need to use them.

 > I would rather want to hard-code
> the way of modifying a spatial index. There are different possibilities:
> 1) let the vector libs figure out what is best (very difficult)

-1

> 2) have
> an env variable (could work),

As this retains flexibility for the future, I would favor this, but have
no idea of what this entails in terms of increase of complexity of the code.

> 3) have a new flag in all vector modules
> (users shouldn't be bothered on module level with vector model details)

Yeah, I agree that this is probably not the best idea.

> 4) decide on a new standard method and hard-code that method as was done
> for the coor file which is never loaded to memory.

The largest file I have used is about 125000 areas with a topo file
weighing 42M, so taking your worst estimation, this would mean around
200MB of spatial index, which is still largely acceptable for me.

I find it a bit difficult to give you a definitive answer on the base of
theory alone. Do you have any means of testing the impact of one choice
over the other for different use cases (editing, v.build, v.what - the
latter especially when using in the GUI) ?

If the above is difficult, I would say go for your current preference
which seems to be file-based. Would it be possible (just thinking out
loud, without any idea what this entails) to work on two levels, with
high-level functions which then can call either file-based or
memory-based low-level functions ? This way you can create the
high-level API with a file-based system behind that, but allow future
creation of another memory-based "backend" if the need arises. E.g.
something like the high-level db functions which you can call whatever
the actual db driver.

Moritz
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
hamish-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2

Moritz wrote:
> The largest file I have used is about 125000 areas with a
> topo file weighing 42M, so taking your worst estimation,
> this would mean around 200MB of spatial index, which is
> still largely acceptable for me.

lidar and swath bathymetry data will easily have millions of points,
and as time goes on this will only expand. I seem to recall that one of
Radim's big disappointments was that the need to handle this technology/
data density only really became apparent just when GRASS's new vector
engine was nearing completion. With some earlier notice it could have
been designed to scale better. Still, there is much tuning which can
be done with the present model to reduce the memory overheads, etc.

FWIW the sites type (now vector points) in GRASS 4/5 scales well, just
as much as you can fit in the text file. (not sure if fseeks are 64bit-
proof there, probably not)

the biggest lidar file used that I know about is Doug's 379GB dataset
(14.5 billion points). The vector engine couldn't handle that* so r.in.xyz
was used. Certainly count on 5 million features with topology and DB table
for dataset sizes /today/.

* I don't know what limitation there is if imported without topology+
DB table.


In future memory, CPU, and HD sizes will only increase, but one thing I've
come to respect is that GRASS's raster modules scale so well today because
they were designed to function in the days of extremely tight memory and
CPU constraints.


you might look at libLAS (for lidar data -- an OSGeo semi-affiliated
project:   http://liblas.org/   It is my understanding that Howard is
currently adding spatial index support in the development version.
You might check out his approach.

I have been, and still am ignorant of what advantage a spatial index
gives you for point data. ... interested to learn why "topology" would
be useful for points-only data.


In general I'm fairly happy with the no-topology solution for lidar
data in grass, but a few targeted modules (eg v.info) really need to
be modified to deal with them.



Hamish


ps- we still need to hunt through the archives for Radim's posts on these
issues which explain quite a bit.



     

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Moritz Lennert
Moritz Lennert wrote:
> Markus:
>> If an old vector is opened just for reading (v.what, v.info, probably
>> also d.vect), the fastest solution is probably to only load the header
>> of the spatial index, as is done for the coor file, and perform spatial
>> queries in file. This is very fast AFAIKT.
>
> Then the main issue is during editing ?
Yes. Modules where a speed decrease will be observed are e.g. v.in.ogr,
v.clean, v.buffer. Some modules only build topology together with the
spatial index for a new vector at the end, when the new vector is ready,
e.g. v.select, v.kernel, in these cases there will be no obvious speed
difference.
> I haven't used vector files, yet, that have caused memory problems,
> but I have had serious speed problems...So, I would plead for whatever
> makes things faster.
Speed problems can have many reasons and need to be tackled one at a
time. Poor speed can be caused by either the module or the vector libs
or both.
>> Hmm yes, I think these massive datasets are still the exception, now and
>> then someone tries to work with huge vectors but this is not the
>> everyday case (maybe because it takes so long...).
>
> They will probably become more normal (cf Lidar), so we should plan
> with them in mind, without making everyone else "suffer" because a few
> people need to use them.
Lidar is a special case, I don't see a reason to drag along with them
topo and the spatial index, maybe the spatial index, but not topo (will
reply to Hamish too).
>
>> 2) have
>> an env variable (could work),
>
> As this retains flexibility for the future, I would favor this, but
> have no idea of what this entails in terms of increase of complexity
> of the code.
Two different rtree libraries, one memory based and one file based, need
to be available and maintained. Higher level stuff needs to prepare
temporary files only if needed, lower level stuff (diglib/spindex.c and
spindex_rw.c) need to handle both cases properly. Some functions must be
present in two versions, file based and memory based. Not sure how much
change is required in the headers, particularly dig_structs.h.

>
> The largest file I have used is about 125000 areas with a topo file
> weighing 42M, so taking your worst estimation, this would mean around
> 200MB of spatial index, which is still largely acceptable for me.
When testing LFS in the vector libs, I had vectors with > 2GB coor file,
800MB topo (>400000 areas). Importing them required 5 - 6GB memory,
handling them was simply a PITA, lots of coffee breaks.
>
> I find it a bit difficult to give you a definitive answer on the base
> of theory alone. Do you have any means of testing the impact of one
> choice over the other for different use cases (editing, v.build,
> v.what - the latter especially when using in the GUI) ?
v.build is slower when keeping the modified spatial index in file, but
v.build means rebuilding the spatial index, lots of file IO if
intermediate data are stored on file.

v.what is simply a charm when using from the gui :-), both with query in
display mode and query in edit mode (db speed is limiting). There is
nothing that needs to be rebuilt, a full 113 bytes are read (sidx
header), then search can immediately take place in file.

I would suggest that I first implement a new version were the spatial
index is always written out when a new or modifed vector is closed.
Intermediate data are still stored in memory. Opening an old vector in
read-only mode would then be faster, opening an old vector in update
mode would be the same like currently done, the spatial index is loaded
to memory. This can then be tested and polished, and once that is
stable, an env var could be added to keep the spatial index in file when
modifying (Vect_open_new or Vect_open_update). This would only be needed
for massive vectors.

Markus M

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Paul Kelly

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2
On Wed, 24 Jun 2009, Markus GRASS wrote:

> Paul Kelly wrote:
>> On Tue, 23 Jun 2009, Markus GRASS wrote:
>>
>>> My implementation is completely file based, also when creating or
>>> updating a vector. This comes obviously with a speed penalty because
>>> reading in memory is faster than reading from file. With all sorts of
>>> tricks and relying on the system to cache files, the file based spatial
>>
>> Hi Markus, The last point above sounds interesting. What sort of tricks?
> When inserting or deleting or searching, the search patch is remembered
> (kept in memory) and walked back by using a stack in a non-recursive
> functions. This reduces file IO operations about by half and should be a
> faster algorithm than the recursive alternative. As an example,
> inserting a new item into an RTree that already has about 50,000 items
> requires a total of 3 file reads and 3 - 5 file writes, each file IO
> reads or writes 528 bytes on a 32 bit system and 586 bytes on a 64 bit
> system. And when using intermediate files, the whole chunk of 528/568
> bytes is read or written at once.

>> Are you using memory-mapped files?
>
> What are memory-mapped files? Excuse my ignorance, I'm just a
> self-trained coder (learning by doing).

http://en.wikipedia.org/wiki/Memory-mapped_file
A chunk of a disk file is directly mapped into memory so you can access it
using normal pointers as if it was permanently in memory, and the OS
transparently handles paging the relevant chunks of the file in and out
of memory as required.
It can be a useful and elegant way of managing access to large files a
section at a time. For files small enough to be completely cached in
memory the performance penalty over keeping the same data in memory
(without a file) should be relatively small, and will be faster than
accessing a file using conventional fopen()/fread() etc. functions.
But how much of a performance gain it would give for very large files
strongly depends on the nature of access to the file, which I know very
little about in this case. E.g. for completely random access there might
not be a lot of gain. But if there was random access only within a certain
section of a file, that section could be mapped into memory and access
would then be quite fast. Overwriting "dead" areas of the index would be
particularly simple I think, if it was memory mapped.

Paul
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by hamish-2
Hamish wrote:

> Moritz wrote:
>  
>> The largest file I have used is about 125000 areas with a
>> topo file weighing 42M, so taking your worst estimation,
>> this would mean around 200MB of spatial index, which is
>> still largely acceptable for me.
>>    
>
> lidar and swath bathymetry data will easily have millions of points,
> and as time goes on this will only expand. I seem to recall that one of
> Radim's big disappointments was that the need to handle this technology/
> data density only really became apparent just when GRASS's new vector
> engine was nearing completion. With some earlier notice it could have
> been designed to scale better. Still, there is much tuning which can
> be done with the present model to reduce the memory overheads, etc.
>  
Yes. As an example, for a 2D point dataset, the topo file should be
about 4 times as large as the coor file, same for the spatial index.
This is because each x,y coordinate pair is stored 3 times in the topo
file, plus some other information that is for points not needed, e.g.
area/isle to the left and to the right, start node and end node (start
node = end node for points/centroids). Each x,y coordinate pair is
stored 2 times in the spatial index (rectangle of size zero with N S E W
and N = S, E = W). I see some potential for cleaning up.
> FWIW the sites type (now vector points) in GRASS 4/5 scales well, just
> as much as you can fit in the text file. (not sure if fseeks are 64bit-
> proof there, probably not)
>  
I guess that was without topo?
> the biggest lidar file used that I know about is Doug's 379GB dataset
> (14.5 billion points).
Frightening.
> you might look at libLAS (for lidar data -- an OSGeo semi-affiliated
> project:   http://liblas.org/   It is my understanding that Howard is
> currently adding spatial index support in the development version.
> You might check out his approach.
>  
Will do.
> I have been, and still am ignorant of what advantage a spatial index
> gives you for point data. ... interested to learn why "topology" would
> be useful for points-only data.
>  
Strictly speaking, topology and spatial index are two different things,
you could have a spatial index without topo. I can also not see the
usefulness of topology for point data. A spatial index may be useful to
extract a subset (v.select), but in this case you could just as well go
through the points in the coor file, read one at a time and select the
ones that fall into the study area. Should be slower than with a spatial
index but then you're not dragging along the spatial index.

>
> In general I'm fairly happy with the no-topology solution for lidar
> data in grass, but a few targeted modules (eg v.info) really need to
> be modified to deal with them.
>
>
>
> Hamish
>
>
> ps- we still need to hunt through the archives for Radim's posts on these
> issues which explain quite a bit.
>  
I remember one comment where he said that the spatial index is not
written out because of time and space concerns. Space should not be an
issue today, and opening an old vector is faster if the spatial index is
available in a file. Of course I would like a solution that needs less
memory and is faster when modifying a spatial index, but I have not the
faintest idea how to do that. Maybe Paul Kelly's tip on memory mapping
can help.

Markus M
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
hamish-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2

Markus M wrote:
> Lidar is a special case, I don't see a reason to drag along
> with them topo and the spatial index, maybe the spatial index, but
> not topo

there is nothing at all special about lidar data, it's just a bunch of
x,y,z points. (often with other variables like signal return strength)

if you prefer to think about a more vectory massive dataset try importing
the full OpenStreetMap PostGIS database into GRASS. Of even just the OSM
subset for your country of choice.


observation-prediction-suggestion: Bigger datasets are coming faster
than faster/bigger computers are, and will be very common in the future
(& the future is now in many cases). Keep overheads of all types low and
the rest will take care of itself.

aka life's a compromise, do your best.


Hamish



     

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Paul Kelly
Paul Kelly wrote:

> Markus:
>
>> What are memory-mapped files? Excuse my ignorance, I'm just a
>> self-trained coder (learning by doing).
>
> http://en.wikipedia.org/wiki/Memory-mapped_file
> A chunk of a disk file is directly mapped into memory so you can
> access it using normal pointers as if it was permanently in memory,
> and the OS transparently handles paging the relevant chunks of the
> file in and out of memory as required.
> It can be a useful and elegant way of managing access to large files a
> section at a time. For files small enough to be completely cached in
> memory the performance penalty over keeping the same data in memory
> (without a file) should be relatively small, and will be faster than
> accessing a file using conventional fopen()/fread() etc. functions.
> But how much of a performance gain it would give for very large files
> strongly depends on the nature of access to the file, which I know
> very little about in this case. E.g. for completely random access
> there might not be a lot of gain.
It is completely random, the next chunk to be read/written can be
anywhere in the file.
> But if there was random access only within a certain section of a
> file, that section could be mapped into memory and access would then
> be quite fast.
Does it make sense to always map a different section of the file? This
mapping would then need to replace each call to fread/fwrite because of
completely random access. I'm currently using the same method like for
the coor file which seems to work fine so far.

Markus M
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Paul Kelly

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
On Thu, 25 Jun 2009, Markus GRASS wrote:

[...]
>> very little about in this case. E.g. for completely random access
>> there might not be a lot of gain.
>
> It is completely random, the next chunk to be read/written can be
> anywhere in the file.
>
[...]
>> But if there was random access only within a certain section of a
>> file, that section could be mapped into memory and access would then
>> be quite fast.
>
> Does it make sense to always map a different section of the file? This
> mapping would then need to replace each call to fread/fwrite because of
> completely random access. I'm currently using the same method like for
> the coor file which seems to work fine so far.

No, I can't imagine that would give much improvement. The scenario I had
hoped for was that when processing a particular sub-region of a vector
map, all the needed data from the index would be near each other in memory
(and thus in the file), thus that area of the file (I was thinking up to
several hundred megabytes max. for performance reasons, but this could
obviously scale with increased datasizes) could be mapped into memory and
intensive I/O (or just reading) done there, with the memory-mapping
mechanism automatically taking care of the reads and writes from/to the
underlying file.

>From what you have said it sounds like that would need a complete redesign
of the algorithm used for indexing, or might not be possible. The other
option would be to map the whole of the index file into memory; it might
be worth trying, but as you have already tried to optimise reading/writing
performance based on knowledge of the algorithm, it is unlikely the OS
could do any better. Still might be worth trying, but maybe too much work.
So I think I should shut up now since (as is obvious) I don't know how the
indexing algorithm works...

Paul
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
Paul Kelly wrote:

> On Thu, 25 Jun 2009, Markus GRASS wrote:
>
> [...]
>>> very little about in this case. E.g. for completely random access
>>> there might not be a lot of gain.
>>
>> It is completely random, the next chunk to be read/written can be
>> anywhere in the file.
>>
> [...]
>>> But if there was random access only within a certain section of a
>>> file, that section could be mapped into memory and access would then
>>> be quite fast.
>>
>> Does it make sense to always map a different section of the file? This
>> mapping would then need to replace each call to fread/fwrite because of
>> completely random access. I'm currently using the same method like for
>> the coor file which seems to work fine so far.
>
> No, I can't imagine that would give much improvement. The scenario I
> had hoped for was that when processing a particular sub-region of a
> vector map, all the needed data from the index would be near each
> other in memory
Unfortunately not, you need the whole search tree also for processing a
subregion. The index is not a liner array but a search tree, and for a
search tree I don't know how to keep spatially close items close to each
other in file, particularly when modifying the tree.

> (and thus in the file), thus that area of the file (I was thinking up
> to several hundred megabytes max. for performance reasons, but this
> could obviously scale with increased datasizes) could be mapped into
> memory and intensive I/O (or just reading) done there, with the
> memory-mapping mechanism automatically taking care of the reads and
> writes from/to the underlying file.
>
> From what you have said it sounds like that would need a complete
> redesign of the algorithm used for indexing, or might not be possible.
> The other option would be to map the whole of the index file into memory;
Then we can just as well keep the whole index in memory as before ?
> it might be worth trying, but as you have already tried to optimise
> reading/writing performance based on knowledge of the algorithm, it is
> unlikely the OS could do any better. Still might be worth trying, but
> maybe too much work. So I think I should shut up now since (as is
> obvious) I don't know how the indexing algorithm works...
It's essentially a glorified binary search tree where a node has not 2
but (in this case) 10 children. As for BSTs, a search always starts at
the root of the tree and then descends down to end nodes. Such a search
is also performed for insertion and deletion. With the current
implementation, the location in file is completely mixed up and has very
little to do with spatial distance or search tree structure.

The principle of a search tree is very fast and should be kept for
speed. AFAIU the nature of the tree does not allow to have spatially
nearby features close to each other in file, if only because the root
node and following nodes spatially cover the whole vector, i.e. on that
lowest level all features are close to each other as far as the search
tree is concerned. I'm happy that I found a way to write out the index
in a way that allows search in file, always keeping memory requirements
down to a very few KB. There are other file-based RTree implementations,
but I don't understand them, cryptic code, no documentation (SQLite for
example has some RTree, didn't test it, also Norbert Beckmann's R*-tree
implementation).

Markus M


_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Moritz Lennert

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2
On 25/06/09 08:51, Markus GRASS wrote:
> I would suggest that I first implement a new version were the spatial
> index is always written out when a new or modifed vector is closed.
> Intermediate data are still stored in memory. Opening an old vector in
> read-only mode would then be faster, opening an old vector in update
> mode would be the same like currently done, the spatial index is loaded
> to memory. This can then be tested and polished, and once that is
> stable, an env var could be added to keep the spatial index in file when
> modifying (Vect_open_new or Vect_open_update). This would only be needed
> for massive vectors.

+1

Moritz
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
Doug_Newcomb

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2
Some javascript/style in this post has been disabled (why?)

>> the biggest lidar file used that I know about is Doug's 379GB dataset
>> (14.5 billion points).
>Frightening.


The above dataset was for two watersheds collected in 2001, the larger of the two watersheds is 9000 square miles . I've recently been working with newer lidar data ( 2007) from a single county with an area of 744 sq. miles ( Craven county, North Carolina, USA) . This county had lidar flown at a submeter posting ( 0.7m? as a guess). The aggregated ASCII x,y,z,intensity file that I created ( for processing using r.in.xyz) from the input las files for that single county was 80 GB .

I guess my point is that lidar datasets are getting quite massive. If we are going to be working with the lidar data as point data in the GRASS vector framework, go with the most scalable options. Scalability in working with large data sets is a huge benefit in using GRASS over other solutions.

Doug



Doug Newcomb
USFWS
Raleigh, NC
919-856-4520 ext. 14 [hidden email]
---------------------------------------------------------------------------------------------------------
The opinions I express are my own and are not representative of the official policy of the U.S.Fish and Wildlife Service or Dept. of Interior. Life is too short for undocumented, proprietary data formats.
Inactive hide details for Markus GRASS <markus.metz.giswork@googlemail.com>Markus GRASS <[hidden email]>



To

Hamish <[hidden email]>

cc

GRASS devel <[hidden email]>

Subject

Re: [GRASS-dev] vector libs: file based spatial index

Hamish wrote:
> Moritz wrote:
>  
>> The largest file I have used is about 125000 areas with a
>> topo file weighing 42M, so taking your worst estimation,
>> this would mean around 200MB of spatial index, which is
>> still largely acceptable for me.
>>    
>
> lidar and swath bathymetry data will easily have millions of points,
> and as time goes on this will only expand. I seem to recall that one of
> Radim's big disappointments was that the need to handle this technology/
> data density only really became apparent just when GRASS's new vector
> engine was nearing completion. With some earlier notice it could have
> been designed to scale better. Still, there is much tuning which can
> be done with the present model to reduce the memory overheads, etc.
>  
Yes. As an example, for a 2D point dataset, the topo file should be
about 4 times as large as the coor file, same for the spatial index.
This is because each x,y coordinate pair is stored 3 times in the topo
file, plus some other information that is for points not needed, e.g.
area/isle to the left and to the right, start node and end node (start
node = end node for points/centroids). Each x,y coordinate pair is
stored 2 times in the spatial index (rectangle of size zero with N S E W
and N = S, E = W). I see some potential for cleaning up.
> FWIW the sites type (now vector points) in GRASS 4/5 scales well, just
> as much as you can fit in the text file. (not sure if fseeks are 64bit-
> proof there, probably not)
>  
I guess that was without topo?
> the biggest lidar file used that I know about is Doug's 379GB dataset
> (14.5 billion points).
Frightening.
> you might look at libLAS (for lidar data -- an OSGeo semi-affiliated
> project:  
http://liblas.org/   It is my understanding that Howard is
> currently adding spatial index support in the development version.
> You might check out his approach.
>  
Will do.
> I have been, and still am ignorant of what advantage a spatial index
> gives you for point data. ... interested to learn why "topology" would
> be useful for points-only data.
>  
Strictly speaking, topology and spatial index are two different things,
you could have a spatial index without topo. I can also not see the
usefulness of topology for point data. A spatial index may be useful to
extract a subset (v.select), but in this case you could just as well go
through the points in the coor file, read one at a time and select the
ones that fall into the study area. Should be slower than with a spatial
index but then you're not dragging along the spatial index.
>
> In general I'm fairly happy with the no-topology solution for lidar
> data in grass, but a few targeted modules (eg v.info) really need to
> be modified to deal with them.
>
>
>
> Hamish
>
>
> ps- we still need to hunt through the archives for Radim's posts on these
> issues which explain quite a bit.
>  
I remember one comment where he said that the spatial index is not
written out because of time and space concerns. Space should not be an
issue today, and opening an old vector is faster if the spatial index is
available in a file. Of course I would like a solution that needs less
memory and is faster when modifying a spatial index, but I have not the
faintest idea how to do that. Maybe Paul Kelly's tip on memory mapping
can help.

Markus M
_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev





_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev

pic09789.gif (1K) Download Attachment
Markus Metz-2

Re: vector libs: file based spatial index

Reply Threaded More More options
Print post
Permalink
[hidden email] wrote:
>
>
> I guess my point is that lidar datasets are getting quite massive. If
> we are going to be working with the lidar data as point data in the
> GRASS vector framework, go with the most scalable options. Scalability
> in working with large data sets is a huge benefit in using GRASS over
> other solutions.
>
For the time being, the only reasonable way to deal with these massive
datasets is to *not* build topology. It's not not only the spatial index
that is getting out of hand, also topology itself and the category
index. The grass vector libs must be told that there is nothing special
about point datasets (to cite Hamish) which means rewriting major parts
of the vector libs, and that takes time.

Markus M

_______________________________________________
grass-dev mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-dev
1 2