Earthdistance and Cube: Simple, Fast, Approximate Spatial PostgreSQL Without PostGIS

GIS in PostgreSQL without PostGIS?

PostGIS is a fantastic open-source 3rd party extension for full geospatial work in PostgreSQL. If you’re not GIS-minded, however, and just want to do basic operations with the lat/long points you’ve gotten from either geocoding addresses or, say, from EXIF data embedded in pictures, then you may well be intimidated in traversing PostGIS’s documentation to figure out how to do the small subset of things you’re looking to do, especially starting with “what does all this SRID stuff mean?

If you need to relate your geocodes to arbitrary shapes like geopolitical boundaries, then PostGIS is where it’s at. No avoiding it. But if you just need to efficiently store and query your points, say, in terms of within a radius of an arbitrary lat/long, or within a rectangle described by a pair of lat/long, or perhaps find what 10 lat/long points you’ve got which are nearest to an arbitrary point (K-nearest-neighbors, aka KNN), then you can dive right in with stock PostgreSQL, leaving you with fewer dependencies and a simpler overall working system. And plus, at any time in the future, if/when you realize you need functionality that only PostGIS can provide, then upgrading your data to be managed and queried by PostGIS is just some legwork and a schema migration or two away. Until then, you may want to err on the side of YAGNI.

In this multi-part series, we’ll explore what can and cannot be done with basic PostgreSQL, then also show how to smoothly upgrade the schema to use PostGIS, migrate the data from the built-in PostgreSQL datatypes to PostGIS types, and then do some fancier things that you can only do in PostGIS.

Geometric Types and PostgreSQL

One of the core raison d’être’s of the PostgreSQL codebase, coming from its original implementation as INGRES and intermediate implementation PostGRES, was to support rich data types (PDF link), including easily user-defined types. Project founder Michael Stonebreaker’s team received funding for developing a relational database to support computer aided design work, leading to geometric types like points, lines, circles, and boxes being a part of the codebase for decades.

These types, functions, and operators are well documented, but what is lacking is a little bit of glue work and the knowledge of indexing types in order to put together a working system, as well as knowing what these types can and cannot do for you. In this article, I’m going to imagine that we’re working on a photo sharing website / database, as inspired by this Reddit question.

We’ll start by working with built-in types point, box, and circle, then we’ll introduce an arbitrary-dimensional cube type provided by the cube extension. Note that this cube is a type representing an arbitrary-dimensioned box in a coordinate space, not a cube as in Grouping Sets, Cube, and Rollup OLAP functionality. Ugh! Terminology conflict!

Geocodes as Points

Latitude and longitude name a point on earth using “graticule” coordinates, with latitude measuring the degrees north of the equator and longitude being degrees west of Greenwich, England. Degrees latitude are given the range -90º (south pole) to +90º (north pole), while degrees longitude are measured from 0º to -180º going west from Greenwich, through the Atlantic and North America to the antipodal opposite of Greenwich, the anti-meridian. The eastern hemisphere is measured in positive degree values from 0º to +180º. Two formats can be used, either decimal degrees as already described, or degree / minute / seconds (DMS), in which a directional indicator (N, S, E, W) stands in place of +/-, with minutes being 1/60 of a degree, and seconds being 1/60 of a minute.

We will be storing lat/long in point columns, but beware: human convention is to refer to these two in (y, x) order (latitude is the y value), but Cartesian coordinate convention is (x, y), so we must take care to reverse when storing, and then again when presenting as lat/long values.

Furthermore, since DMS is used in the EXIF standard, we’ll also need to be aware of how to convert those values to decimal degrees for storage — a job best done by whatever code you’d be using on the front- or middle-tier to process JPEGs and extract the EXIF. This blog post does a nice job covering how you can extract EXIF tags and then convert from DMS to decimal degrees in python. We’ll be skipping those parts and just talk PostgreSQL.

The PostgreSQL point datatype stores a pair of floating point values, interpreted as an X and a Y in some sort of Cartesian plane. You can extract or manipulate either component using array index notation, with X being [0], and Y being [1]:

poor_gis=# select point(0,3);
 point
-------
 (0,3)
(1 row)

poor_gis=# select '(3.234, 4.2343)'::point;
     point
----------------
 (3.234,4.2343)
(1 row)

To extract X or Y while also providing an inlined point, we’ve got to use an extra layer of parenthesis in order to appease the grammar, regardless of which constructor flavor used:

poor_gis=# select (point(0,3))[0];
 point
-------
     0
(1 row)

poor_gis=# select (point(0,3))[1];
 point
-------
     3
(1 row)


poor_gis=# select ('(0,3)'::point)[1];
 point
-------
     3
(1 row)

Let’s imagine we’re working on a photo gallery website. We’re going to store the filename as found in a 3-layer deep directory hierarchy, the image width and height, and then the image’s geocode as extracted from the EXIF using a point column. Imagine other metadata also stored, perhaps in a jsonb column corresponding to all of the rest of the EXIF …

poor_gis=# create table image
(
    id serial primary key,

    filename text not null unique,

    -- Primary metadata
    width int not null
        check (width > 0),
    height int not null
        check (height > 0),

    -- Imagine from EXIF metadata, converted from DMS to decimal degrees
    -- longitude (X) and latitude (Y) ...
    geocode_point point
        constraint sane_longitude check (
                    geocode_point[0] between -180 and 180
        ),
        constraint sane_latitude check (
                    geocode_point[1] between -90 and 90
        ),

    -- other / all metadata ...
    exif jsonb
);

CREATE TABLE

poor_gis=# insert into image
  (filename, width, height, geocode_point)
values
  ('/a/a/345sdf.jpg', 200, 500, '(1,4)' ), -- will get id 1,
  ('/a/b/654sdg.jpg', 600, 900, '(3,7)' ), -- 2, ...
  ('/a/c/765gdf.jpg', 1200, 1200,'(23.43, -34.44)' ),
  ('/a/d/8354gdfg.jpg', 1375, 966,'(-23, 55)' ),
  ('/a/e/76dfg.jpg', 3455, 5220, '(-129.234, -67.34)' )
;

INSERT 0 5

-- Row(s) where X is negative, i.e in the western hemisphere ...
poor_gis=# select * from image where geocode_point[0] < 0;
 id |     filename      | width | height |   geocode_point   | exif
----+-------------------+-------+--------+-------------------+------
  4 | /a/d/8354gdfg.jpg |  1375 |    966 | (-23,55)          |
  5 | /a/e/76dfg.jpg    |  3455 |   5220 | (-129.234,-67.34) |


-- Rows where Y is > 10, i.e. 10º north of equator ...
poor_gis=# select * from image where geocode_point[1] > 10;

 id |     filename      | width | height | geocode_point | exif
----+-------------------+-------+--------+---------------+------
  4 | /a/d/8354gdfg.jpg |  1375 |    966 | (-23,55)      |
(1 row)

Excellent, simple basics are sane.

Let’s make things more interesting by adding one million more rows with equally random metadata, which we’ll start to query against after a sidebar. The main complexity here is coming up with the random filenames …

do $$
    declare
        chars text[] =
            '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';

        num_chars int := array_length(chars, 1);
        num_nums int := 10;

        filename text;

        i int; j int;
    begin

        for i in 1..1000000
        loop

            -- Make realistic-ish filenames per above!
            -- Ala "/g/e/dfgdfger.jpg"
            filename :=    '/' || chars[1+random()*(num_chars-1)]
                        || '/' || chars[1+random()*(num_chars-1)]
                        || '/';

            for j in 1..7 loop
                filename := filename || chars[1+random()*(num_chars-1)];
            end loop;

            filename := filename || '.jpg';

            insert into image (filename, width, height, geocode_point)
            values (
                filename,

                -- Schizophrenic aspect ratios be damned!
                1+(random()*6400)::int, -- width
                1+(random()*3200)::int, -- height

                point(
                    random() * 360 - 180, -- [-180 -> +180]
                    random() * 180 - 90 -- [-90 -> +90]
                )
            );

        end loop;

    end;
$$;

We need to be sure to analyze the table after the 1M rows are inserted to ensure that the planner statistics are aware that this is a big table now. Otherwise, if the planner’s statistics still showed the table as having only a handful of rows in a single block, running the subsequent queries may cause indices to be ignored, given that a sequential scan through a handful of blocks is always going to be faster than doing random access through an index, then going back to the table (heap) file if the overall row / block count in the table is low. The autovacuum service is going to update the statistics for you sometime in the near future asynchronously, but you might well have pasted the subsequent index creation and queries before it ran, which would then confuse even the best of us.

poor_gis=# analyze image;
ANALYZE

PostgreSQL has a rich set of operators atop its geometric types, including points. For our examples, we’ll focus on a few binary operators:

  • Distance Between: <-> (Euclidean distance between operands at their closest points)
  • Overlaps: && (any point in common between left hand side and right hand side?)
  • Contains: @> (left side completely contains right side?)
  • Contained By: <@ (left side completely contained by right side?)
  • Is Strictly Right of: << (all X values in LHS less than all X values in RHS?)
  • Is Strictly Left of: >> (all X values in LHS greater than all X values in RHS?)
  • Is Strictly Below: <<| (all Y values in LHS less than all Y values in RHS?)
  • Is Strictly Above: >>| (all Y values in LHS greater than all Y values in RHS?)

When working with points as one of the two operands, there’s no semantic difference between using the overlaps or either flavor of the contains operators, since a point is either both or neither contained and overlapping, never only one of those.

Note that, since PostgreSQL lets users and extensions define new meanings for operators, you may find that the meaning of some of these built-in PostgreSQL operators vary from the meanings defined for PostGIS types, even using the same symbolic spelling.

Let’s rephrase our “Row(s) where X is negative” query, but this time comparing the stored points with a literal point, using the “is strictly left of” operator:

poor_gis=# select count(*) from image where geocode_point << '(0, 0)';
 count
--------
 499894
(1 row)

Cool, querying points given a point. (Your randomly built data will vary, of course, but we do expect around half a million rows)

Real-world Use Case #1: Points Within Box

We can do fancier queries — like, say, asking for points “within a box.” A native PostgreSQL box is a rectangle defined by the coordinates of two opposite corners:

poor_gis=# select count(*)
            from image
            where geocode_point <@ '(10, 10), (20, 20)'::box;

 count
-------
  1575
(1 row)

This is a very common GIS operation: “What geographic features do I have within arbitrary corners of a lat/long rectangle, such as the rectangle described by the viewport of a Google Maps window?”

Now, how did PostgreSQL come up with the ~1,500 out of 1M rows this time?

poor_gis=# explain select count(*)
            from image
            where geocode_point <@ '(10, 10), (20, 20)'::box;

                                      QUERY PLAN
---------------------------------------------------------------------------------
 Finalize Aggregate  (cost=15555.62..15555.63 rows=1 width=8)
  ->  Gather  (cost=15555.40..15555.61 rows=2 width=8)
      Workers Planned: 2
      ->  Partial Aggregate  (cost=14555.40..14555.41 rows=1 width=8)
            ->  Parallel Seq Scan on image  (cost=0.00..14554.36 rows=417 width=0)
                Filter: (geocode_point <@ '(20,20),(10,10)'::box)
(6 rows)

It did so by performing a parallel sequential scan — splitting the table up into chunks (2 in this case), then having workers filter every row in each assigned chunk against the operator. That is, it examined all 1M rows in the table. Boo! Inefficient!

We can make this efficient using the same technology PostGIS would: a space-partitioning GiST index. In the case of SP-GiST indexing the point type, we will create a multi-dimensional index similar to an R-tree, which optimizes queries for multi-dimensional datatypes. (To be more precise, a space-partitioning quad tree is created by PostgreSQL’s src/backend/utils/adt/geo_spgist.c)

poor_gis=# create index image_geocode_point_idx on image using gist(geocode_point);
CREATE INDEX

Repeating the query, we now get an entirely different plan:

poor_gis=# explain select count(*)
            from image
            where geocode_point <@ '(10, 10), (20, 20)'::box;

                            QUERY PLAN
-------------------------------------------------------------------------
 Aggregate  (cost=35.28..35.29 rows=1 width=8)
   ->  Index Only Scan using image_geocode_point_idx on image
              (cost=0.29..32.78 rows=1000 width=0)
        Index Cond: (geocode_point <@ '(20,20),(10,10)'::box)
(3 rows)

This plan uses an entirely different algorithm: a single worker accesses the SP-GiST index for the rows matching the “contained by box” expression. It just so happens that all of the requirements for only needing to consult the index, and not also the heap file are met, resulting in choosing to use the “index only scan” algorithm. This results in far, far less I/O, be it between storage and RAM, or even between RAM and L1/L2/L3 cache, and even between L1 cache and registers. Most database operations are I/O bound, so reducing any sort of I/O means reducing time to complete.

How do times compare? Here’s the EXPLAIN ANALYZE results, showing timings and actual visited rows, not just expected rowcounts, from before the index was created:

 Finalize Aggregate (cost=15555.62..15555.63 rows=1 width=8)
                        (actual time=49.587..49.587 rows=1 loops=1)
   ->  Gather  (cost=15555.40..15555.61 rows=2 width=8)
                    (actual time=49.482..50.770 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate
                    (cost=14555.40..14555.41 rows=1 width=8)
                        (actual time=44.629..44.629 rows=1 loops=3)
               ->  Parallel Seq Scan on image
                        (cost=0.00..14554.36 rows=417 width=0)
                            (actual time=0.134..44.534 rows=525 loops=3)
                     Filter: (geocode_point <@ '(20,20),(10,10)'::box)
                     Rows Removed by Filter: 332810
 Planning Time: 0.138 ms
 Execution Time: 50.831 ms
(10 rows)

Each of the two parallel seq scan workers visited hundreds of thousands of rows, removing the vast majority which fail the filter expression. Since the dataset sits inside RAM on this machine, it still ends up operating very quickly: 51ms.

After the index, though:

 Aggregate  (cost=35.28..35.29 rows=1 width=8)
                (actual time=0.865..0.865 rows=1 loops=1)
   ->  Index Only Scan using image_geocode_point_idx on image
            (cost=0.29..32.78 rows=1000 width=0)
                    (actual time=0.077..0.673 rows=1575 loops=1)
         Index Cond: (geocode_point <@ '(20,20),(10,10)'::box)
         Heap Fetches: 0
 Planning Time: 0.188 ms
 Execution Time: 0.917 ms
(6 rows)

Same results, but the SP-GiST index gives it to us in 1/55th of the execution time.

A SP-GiST index is exactly the same sort of technology PostGIS uses to “make spatial queries go fast.” As you can see here, it isn’t a feature limited to PostGIS! This is out-of-the-box PostgreSQL. In fact, the PostGIS indexing code is extremely similar to the PostgreSQL core implementation we saw earlier.

So, imagine the website includes a Google map window. Scrolling and zooming through the Google map window produces a new latitude / longitude bounding box. You could then fire off an AJAX request with the bounding box coordinates to your middleware, which then returns both the count of pictures and the count of distinct geocodes for said pictures taken within the viewport. If either number is sufficiently low, then your AJAX results could not only include the count, but also the list of metadata and filenames for the contained pictures. Your front-end code could then plot markers on the map for the pictures. Selecting a marker could display the picture at that point!

This SP-GiST index’s utility isn’t limited to “things inside or overlapping a box” (which is exactly what a bounding-box comparison is). We can do “points within a circle” easily also. Circles in vanilla PostgreSQL are described by a center point and a radius:

poor_gis=# explain analyze select count(*) from image
                       where geocode_point <@ '((15, 15), 5)'::circle;

 Aggregate  (cost=35.28..35.29 rows=1 width=8)
                 (actual time=0.609..0.609 rows=1 loops=1)
   ->  Index Only Scan using image_geocode_point_idx on image
                 (cost=0.29..32.78 rows=1000 width=0)
                              (actual time=0.076..0.488 rows=1212 loops=1)
         Index Cond: (geocode_point <@ '<(15,15),5>'::circle)
         Heap Fetches: 0
 Planning Time: 0.075 ms
 Execution Time: 0.636 ms
(6 rows)

Fast? Yes. Used our same index? Yes. But, meaningful? To determine that, we’ll first stroll through some GIS and coordinate system mapping details, as well as introduce the other sort of geographic query solvable within non-PostGIS PostgreSQL …

K Nearest Neighbors

Querying for “what is close to this point” is a very common need with spatial datasets, and would be especially useful in our semi-contrived photograph explorer web application. Let the user either type in an address (then to be geocoded), or select an arbitrary point on the map, and then, say, pull back the nearest 15 pictures to that point. A query to do that would just need an “order by distance-from-this-point limit K” clause. But alas, determining the distance between earth coordinate pairs isn’t quite so simple…

Lat/Long versus Aspect Ratio And Great Circle Distances

One thing the astute may notice is that the range of degrees east / west (-180º <-> +180º: 360º total) is twice that of north / south (-90º <-> +90º: 180º total). This means we’ve got sort of a non-square aspect ratio — even excluding that we’ve got degree coordinates naming points on a sphere. If we were dealing with purely Cartesian coordinates, one unit north/south is equivalent to two units east/west — both are exactly 1/90th of the range. Therefore an actual square box (again, currently interpreting as if Cartesian and not degree coordinates) would need to be twice as wide as it is tall in order to be actually “square.” A box that reads “square” actually casts a rectangle over this most-simple projection: '(0,0), (1, 1)'::box covers twice the latitudinal range as it does longitudinally (aka “is twice as tall as it is wide“). You could, if you wish, correct for this by casting for a pre-adjusted rectangle by dividing the width in half: '(0,0), (.5, 1)'::box.

Likewise, to cast a “true circle” over this warped projection, you’d need to actually cast a pre-adjusted ellipse. Unfortunately, here’s an area where vanilla PostgreSQL types for bull-headed non-PostGIS geographic database applications falls short: there’s no ellipse type!

To make matters worse, once you begin to consider additional literal real world factors like:

  1. The number of miles (or meters, or whatever linear measurement) in a degree longitude depends on what latitude you’re at, even with a purely spherical earth model. If you’re a hair’s breadth north of the south pole, walking a degree west is trivial (well, by trivial, I mean 100 meters in any direction is a real slog, very different from a walk in the park), but at the equator, traveling 1 degree due west means walking 69 miles (again, assuming the earth is truly spherical, with a radius of ~3,958 miles). Fortunately, the distorted warping is minimal where most of us live, the tropics of Cancer and Capricorn, so we can usually determine the conversion factor given a single latitude in a query and then use the single result when determining how to adjust the bounding box when accessing a spatial index.
  2. The earth isn’t truly spherical, but instead bulges at the equator due to centrifugal force. Modeling the earth much more accurately requires projecting the graticule coordinates onto the surface of an oblate spheroid.
  3. The Earth isn’t just an oblate spheroid, but has mountains and valleys. One mile west->east as the crow flies across Yosemite valley, from El Capitan across to Half Dome is a much shorter path than what a walking and rock-climbing human or ant must cover to travel in the crow’s exact shadow at noon. Meanwhile, a mile west->east in Kansas is much closer to a stroll in the park.
  4. The Earth isn’t static, but is made up of slowly moving continental plates. Features within the same plate tend to stay the same distance from each other, but will be drifting in unison relative to features in other plates. This drift is slow, but definitely happens. This makes large-scale earth measurements skew over time — the distance from New York to London is growing about an inch a year due to the split in the mid-Atlantic ridge, but New York is not getting closer to Los Angeles.

… then you realize that it isn’t quite so simple to ask for “all geocodes in database within some number of linear miles from this arbitrary geocode” if you want to be accurate.

PostgreSQL bundled extension earthdistance models factor 1’s spherical earth by implementing functions and operators for determining the great cicle distance (distance along the surface, instead of a straight line tunneling through the interior) between points on the surface of the earth for two different datatypes: the one we’ve been using up to now, point, interpreting the point’s X and Y as two-dimensional degrees longitude and latitude on the surface of a purely spherical earth, and a new datatype / representation provided by a separate bundled extension, cube. The earthdistance cube representation projects points on a purely spherical earth as three dimensional points in space centered at the middle of a perfectly spherical (and not spinning) earth. In the cube representation, a single 3-d point would either be in space, on the surface of the earth, or in the interior of the planet. The module provides a domain over cube, earth, which constraints a cube to be a 3-space point generally near the surface of the earth, as measured in whatever units the function named earth() returns. The module’s default implementation of the earth() function returns 6,378,168 (meters). If you would prefer to work in miles instead, you may just redefine this function to to return miles, as in:

create or replace function earth()
    returns float immutable parallel safe
    as 'select 3658.8::float' language sql;

We’ll leave things be here, however, and just make use of the default implementation returning approximately six million meters.

Ignoring all this talk about cube and so forth to start with, let’s try the one operator the extension provides for ‘great circle distance in miles between two points, point <@> point, described at the very bottom of the earthdistance extension’s documentation. This operator looks similar to the built-in point <-> point operator returning the Euclidean distance (you know, Pythagoras’s theorem and whatnot), but accounts for the spherical nature of the point coordinates, the “aspect ratio” disparity between longitude and latitude values, and for also traveling along the curvature of a sphere instead of a straight line through the earth. The original operator we played with, point <-> point, is only accurate for Cartesian coordinates on a flat plane or for certain freethinkers.

Let’s compare the results of asking for the closest pictures near a spot in Kansas. First, using the wrong operator, assuming a flat plane and Cartesian coordinates …

poor_gis=# select id, geocode_point,
    geocode_point <-> point(-89.8555855080485, 48.6482342518866)
                                        as flat_earth_graticule_distance
    from image
    order by geocode_point <-> point(-89.8555855080485, 48.6482342518866)
    limit 5;

   id   |            geocode_point             | flat_earth_graticule_distance
--------+--------------------------------------+----------------------------
 598209 | (-89.9676686525345,48.6067019216716) |          0.119530605833611
 515449 | (-89.9886091612279,48.6989216320217) |          0.142353443267662
 220796 | (-89.7144250385463,48.9332739450037) |          0.318078456991974
 141547 | (-89.6269739605486,48.8807080127299) |          0.326047985932906
 499558 | (-89.5203682221472,48.6719857994467) |          0.336057680731329

(5 rows)

And now the results when using the more accurate great circle distance operator, first installing both earthdistance and its dependency, cube:

poor_gis=# create extension cube;
CREATE EXTENSION
poor_gis=# create extension earthdistance;
CREATE EXTENSION
poor_gis=# select id, geocode_point,
    geocode_point <@> point(-89.8555855080485, 48.6482342518866)
                                    as spherical_earth_distance_miles
    from image
    order by geocode_point <@> point(-89.8555855080485, 48.6482342518866)
    limit 5;

   id   |            geocode_point             | spherical_earth_distance_miles
--------+--------------------------------------+--------------------------------
 598209 | (-89.9676686525345,48.6067019216716) |               5.86804759696775
 515449 | (-89.9886091612279,48.6989216320217) |               7.00722296641611
 499558 | (-89.5203682221472,48.6719857994467) |                15.386314146465
 141547 | (-89.6269739605486,48.8807080127299) |               19.1416237881716
 220796 | (-89.7144250385463,48.9332739450037) |               20.7159950259304
(5 rows)

Interesting! The difference in distance measurement algorithms produces different results against this random test data! Images 598209 and 515449 are the closest under either metric, but the ordering of the remainders vary according to the metric used. Again, the great-circle metric as performed by <@> is “better” and more accurate here.

The values for the distance operators certainly look different. In the first example, since our points are in decimal degrees graticule, the distance between points are also in graticule units, but in arbitrary (and unspecified by this query) directions. The second example gives us statute miles at least, a much more meaningful value. KNN queries often don’t care about the actual distance between neighbors, just who the neighbors are, but seeing the values is informative in this case.

Let’s see the plans, however …

# explain analyze select * from image
  order by geocode_point <-> point(-89.8555855080485, 48.6482342518866)
  limit 5;

 Limit  (cost=0.29..0.52 rows=5 width=53)(actual time=0.150..0.176 rows=5 loops=1)
   ->  Index Scan using image_geocode_point_idx on image
                    (cost=0.29..47670.89 rows=1000005 width=53)
                            (actual time=0.148..0.173 rows=5 loops=1)
    Order By: (geocode_point <-> '(-89.8555855080485,48.6482342518866)'::point)
 Planning Time: 0.171 ms
 Execution Time: 0.228 ms
(5 rows)

Awesome. The SP-GiST index we created previously has our back here.

What about the performance of the more accurate query?

# explain analyze select * from image
  order by geocode_point <@> point(-89.8555855080485, 48.6482342518866)
  limit 5;


 Limit  (cost=22475.11..22475.69 rows=5 width=53)
                (actual time=161.691..163.953 rows=5 loops=1)
   ->  Gather Merge  (cost=22475.11..119704.66 rows=833338 width=53)
                            (actual time=161.690..163.949 rows=5 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=21475.08..22516.75 rows=416669 width=53)
                            (actual time=156.645..156.646 rows=4 loops=3)
              Sort Key: ((geocode_point
                            <@> '(-89.8555855080485,48.6482342518866)'::point))
              Sort Method: top-N heapsort  Memory: 26kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
              ->  Parallel Seq Scan on image
                            (cost=0.00..14554.36 rows=416669 width=53)
                                (actual time=0.028..95.691 rows=333335 loops=3)
 Planning Time: 0.155 ms
 Execution Time: 163.994 ms
(12 rows)

Whaaaat? No fancy index support?? Well, given that the point <@> point operator did not even exist in our database when we created index image_geocode_point_idx, we ought not be too surprised. Perhaps all that talk about cube in earthdistance has some merit? Especially the part reading “… In addition, a cube GiST index can be used to find nearest neighbors using the metric operators <->, <#>, and <=> in ORDER BY clauses. For example, the nearest neighbor of the 3-D point (0.5, 0.5, 0.5) could be found efficiently with …

So — let’s try with a second index, a functional one cross-converting our stored points in latitude and longitude to cube / earth points in 3-space, then KNN querying using the cube <-> cube operator in the order by clause. Function ll_to_earth(longitude float, latitude float) converts lat/long to an earth aka cube point in 3-space. Sigh, ll_to_earth wants its parameters in latitude, longitude order, so invert our lovely (X, Y) pairs as we decompose …

# create index image_geocode_as_cube
    on image
    using gist(ll_to_earth(geocode_point[1], geocode_point[0]));

CREATE INDEX

Now let’s try an equivalent KNN query, but this time given the distance from converting our Kansas lat/long to an earth / cube (and passing lat, long in the right order both invocations):

# select id, geocode_point,
    ll_to_earth(geocode_point[1], geocode_point[0])
        <-> ll_to_earth(48.6482342518866, -89.8555855080485)
                as cube_threespace_distance_meters
 from image
  order by
    ll_to_earth(geocode_point[1], geocode_point[0])
        <-> ll_to_earth(48.6482342518866, -89.8555855080485)
  limit 5;

    id   |            geocode_point             | cube_threespace_distance_meters
--------+--------------------------------------+---------------------------------
 598209 | (-89.9676686525345,48.6067019216716) |                9454.35088669499
 515449 | (-89.9886091612279,48.6989216320217) |                 11289.741773158
 499558 | (-89.5203682221472,48.6719857994467) |                24789.7673198475
 141547 | (-89.6269739605486,48.8807080127299) |                 30840.150001606
 220796 | (-89.7144250385463,48.9332739450037) |                33376.7027558878
(5 rows)

Excellent, it matches the expected results as determined by fancy-pants operator point <@> point. What about the plan?

# explain analyze select * from image
  order by
    ll_to_earth(geocode_point[1], geocode_point[0])
        <-> ll_to_earth(48.6482342518866, -89.8555855080485)
  limit 5;


 Limit  (cost=0.41..1.91 rows=5 width=85)(actual time=0.517..0.715 rows=5 loops=1)
   ->  Index Scan using image_geocode_as_cube on image
                (cost=0.41..300280.76 rows=1000005 width=85)
                        (actual time=0.516..0.713 rows=5 loops=1)
    Order By: ((ll_to_earth(geocode_point[1],
            geocode_point[0]))::cube
                <-> '(10621.2320983345, -4213915.6225754, 4787883.5983624)'::cube)
 Planning Time: 0.361 ms
 Execution Time: 0.804 ms
(5 rows)

Booyah! Same order of magnitude as using the (wrong) Euclidean distance comparison! Our downside is that we’ve now got two different indices to store and update, plus a very different query style for KNN than we have for our bounding-box query. For these 1M rows, the indexes weigh …

# select relname, pg_size_pretty(pg_relation_size(oid))
  from pg_class
  where relname like 'image_geocode%';

         relname         | pg_size_pretty
-------------------------+----------------
 image_geocode_as_cube   | 83 MB
 image_geocode_point_idx | 71 MB
(2 rows)

Not too bad overall for 1M geocodes, but the table size itself is only 73MB, so we’ve tripled our storage costs. Your level of distaste for a pair of indices over two rather different representations of geocodes may vary, but:

  1. We will be using the geocodes of our images in lat/long form when we display. Keeping them immediately available is paramount.
  2. We have a major use case of displaying all image points which fall inside a reasonably-zoomed map window. Said map window will be in terms of a lat/long box, so being able to query using the box coordinates is simplest and fastest against the SP-GiST index atop those 2D points.
  3. Our other use case, determining an accurate-enough ordering of the actual pictures nearest to an arbitrary point requires a better, spherical, model of the earth for a better value of accurate-enough.

Let’s examine the difference between the simple Cartesian distance versus the great circle distance for a larger number of points to see where things start to break down in my sample dataset. The windowing function rank(), invoked with ...over (order by ...), is handy for showing the position number of this particular row against its peers when the peers are ordered by some sort of ordering.

Through some trial and error, here’s a spelling letting us compare the closest 30 by-way-of-great-circle versus where those points would have ranked if using the Cartesian distance, while also being fast (< 2ms). I used CTEs to separate out each separate plannable-by-KNN-search path for each different KNN-supporting index, then join the together by picture id to cross-compare. The limits within both the OVER and LIMIT clauses happened to give me complete data projected here. Your random dataset may require different values.

poor_gis=# with by_cube as (

    /* id, great-circle distance, rank (position of
                        row in the set) for closest 30
                        by great-circle metrics */

    select id,

    ll_to_earth(geocode_point[1], geocode_point[0])
       <-> ll_to_earth(48.6482342518866, -89.8555855080485)
            as cube_3d_meters,

    rank() over (order by
            ll_to_earth(geocode_point[1], geocode_point[0])
                <-> ll_to_earth(48.6482342518866, -89.8555855080485)
                rows between 30 preceding and current row)
       as cube_rank

    from image
    order by 2 limit 30

), by_cartesian as (

        /* id, flat-earth-degree distance,
                        rank for closest 60
                        by great-circle metrics */

        select id,

        geocode_point <-> point(-89.8555855080485, 48.6482342518866)
            as flat_degrees,

        rank() over (order by
                geocode_point <-> point(-89.8555855080485, 48.6482342518866)
                rows between 60 preceding and current row)
            as cartesian_rank

        from image
        order by 2 limit 60
)

/* Join both together, but handling if a by-great-circle top 30 point
    isn't in the top 60 by-flat-earth set */

select bcu.id, bcu.cube_3d_meters as g_c_dist, bcu.cube_rank as by_cube,
            bct.flat_degrees as deg_dist, bct.cartesian_rank as by_flat,
            cartesian_rank - cube_rank as rank_diff
from by_cube bcu
    left join by_cartesian bct
        using (id)

order by 3;

   id   |     g_c_dist     | by_cube |     deg_dist      | by_flat | rank_diff
--------+------------------+---------+-------------------+---------+-----------
 598209 | 9454.35088669499 |       1 | 0.119530605833611 |       1 |         0
 515449 |  11289.741773158 |       2 | 0.142353443267662 |       2 |         0
 499558 | 24789.7673198475 |       3 | 0.336057680731329 |       5 |         2
 141547 |  30840.150001606 |       4 | 0.326047985932906 |       4 |         0
 220796 | 33376.7027558878 |       5 | 0.318078456991974 |       3 |        -2
 780164 | 34126.7112106395 |       6 | 0.400569721968307 |       7 |         1
 139998 | 38427.3933242223 |       7 | 0.347431064471505 |       6 |        -1
 419038 | 42641.1099952202 |       8 | 0.414454739876302 |       8 |         0
 260518 | 46821.4967775541 |       9 | 0.635773466990049 |      19 |        10
 689067 | 49951.4480842041 |      10 | 0.677215642366855 |      21 |        11
  40360 | 50417.8640957375 |      11 | 0.486064134504151 |       9 |        -2
 170628 | 53237.0844513251 |      12 | 0.623723300321995 |      18 |         6
 732650 | 53492.9726637618 |      13 | 0.574323584575311 |      15 |         2
 787789 | 55461.1710907472 |      14 | 0.598400501558983 |      17 |         3
 968891 |  55778.540935383 |      15 | 0.503582023669556 |      10 |        -5
 695971 | 56695.5009533287 |      16 | 0.509681921765089 |      11 |        -5
 831892 | 57272.5646100655 |      17 | 0.550748722184749 |      13 |        -4
 238526 | 59694.1205504939 |      18 | 0.806711096834358 |      24 |         6
 476373 | 60509.6480280316 |      19 | 0.550708332205125 |      12 |        -7
 388590 | 60655.8889045443 |      20 | 0.570266337375791 |      14 |        -6
 346148 | 63336.0461109318 |      21 | 0.576721033876329 |      16 |        -5
 270798 |  63864.771530601 |      22 | 0.692437548795166 |      22 |         0
 340999 | 68912.4352427674 |      23 | 0.815154253360147 |      25 |         2
 747903 | 69399.6477783806 |      24 | 0.636585477224388 |      20 |        -4
 166103 | 72992.9160155629 |      25 | 0.965187191143913 |      39 |        14
  38988 | 73126.8620122719 |      26 | 0.817025498072814 |      26 |         0
 808750 | 73911.7287023302 |      27 | 0.906290409593806 |      35 |         8
 504684 | 74466.0850680456 |      28 | 0.886736739457095 |      33 |         5
 279756 | 74972.2174697155 |      29 | 0.941260697642388 |      38 |         9
 546620 | 78587.8861189611 |      30 | 0.940726308160501 |      37 |         7
(30 rows)

You can see that as the great circle distance grows, said point’s as-ordered-by flat-earth degree distance position has more and more errors. The largest observed in this sample was being off by fourteen positions —- the point deemed 25th closest by great circle distance was thought to be 39th closest by flat-earth distance.

End of Part One

We’ve seen how to tack on some GIS features onto a dataset / web application from the bottom up —- starting with the simplest possible representation allowing for fast spatial queries, then expanded our technological requirements minimally as we considered improving our not-quite-accurate KNN query results. We discussed the limitations in our implementation, namely needing two separate SP-GiST indices to support either “all points within a lat/long-specified rectangle” as well as relatively-accurate “nearest K points from an arbitrary point” queries. We’ve touched on many of the complexity issues cartographers and GIS systems face when projecting the surface of the Earth onto either a grid, or even as points in a simplified 3D space. We’ve demonstrated that PostGIS is not strictly needed for this sort of basic geospatial work.

That said, in the next post, we’ll discuss use cases which will require PostGIS, migrate our schema to support PostGIS modeling of our points, then compare performance, index sizes, etc. versus the additional concepts needing understanding in order to get basic PostGIS running.

About The Author

James was technical team lead for a rental real-estate site from the early 2000s through 2016. This geospatial series tracks a similar path to how the team first adopted primitive-but-fast GIS support within PostgreSQL proper, then later on adopting PostGIS when use cases required.