How PostGIS Finds Your Nearest Neighbor (Without Scanning Every Row)

How PostGIS Finds Your Nearest Neighbor (Without Scanning Every Row)
Page content

Finding the nearest X is easy to ask for. Getting it right at scale is another matter.

A lot of “find the nearest X” features (the coffee shop around the corner, the available driver closest to your pickup, the hospital within reach during an emergency) use a k-nearest neighbor (KNN) query under the hood. In PostGIS, that’s the <-> distance operator. On a small table it works fine without any special setup. On a table with millions of rows, without the right index, it silently becomes a full-table sort in disguise. The query looks fast until it isn’t.

Here’s how the operator works, where it came from, and what the GiST index does to make it fast.

The <-> Operator

PostGIS registers <-> as a distance ordering operator. To follow along, here’s a sample table and 1,000 (fictional) stores:

CREATE TABLE stores (
  store_id  SERIAL PRIMARY KEY,
  name      TEXT NOT NULL,
  location  GEOGRAPHY(POINT, 4326) NOT NULL
);

INSERT INTO stores (name, location)
SELECT
  'Store ' || generate_series,
  ST_MakePoint(-79 + random()*0.4, 36 + random()*0.4)::geography
FROM generate_series(1,1000)
ON CONFLICT DO NOTHING;

Let’s look at the KNN query:

WITH origin AS (
  SELECT
    -78.898613 AS durham_lon,
    35.994033  AS durham_lat
)
SELECT
  s.name,
  ST_AsText(s.location) AS coords,
  s.location <-> ST_MakePoint(o.durham_lon, o.durham_lat)::geography AS distance
FROM stores s, origin o
ORDER BY distance
LIMIT 5;

You’re asking PostgreSQL to sort all rows by their distance from the Durham, NC coordinates and return the five closest. The distance column is in meters, because the geography type computes distances on a spheroid rather than in the flat-plane units of the coordinate system.

What Happens Without an Index

Run EXPLAIN on the query above against an unindexed table and you’ll see a sequential scan followed by a sort. PostgreSQL has no choice: it computes the distance from every row to your query point, materializes the result, sorts it, and hands back the top five. The work grows linearly with table size. At a few thousand rows this is imperceptible. At a few million it becomes the slow query your on-call rotation gets paged about at 2 a.m.

                                QUERY PLAN
---------------------------------------------------------------------
 Limit  (cost=286.61..291.05 rows=5 width=49)
   ->  Result  (cost=286.61..1174.11 rows=1000 width=49)
         ->  Sort  (cost=286.61..289.11 rows=1000 width=49)
               Sort Key: ((s.location <-> '0101000020E6100000ABB019E082B953C088122D793CFF4140'::geography))
               ->  Seq Scan on stores s  (cost=0.00..270.00 rows=1000 width=49)

PostgreSQL computes the distance from every row to your query point, sorts the full result, and only then hands back the top five.

R-Trees, GiST, and Why They Exist

The solution predates PostGIS by decades. In 1984, Antonin Guttman published “R-Trees: A Dynamic Index Structure for Spatial Searching” in the proceedings of SIGMOD. The problem he was solving: relational databases had good index structures for one-dimensional data (B-trees), but nothing that handled multi-dimensional spatial data efficiently. His R-tree organized spatial objects into a hierarchy of minimum bounding rectangles (MBRs). You could prune entire branches of the tree during a search if a branch’s bounding rectangle was too far from your query region to contain a relevant result.

PostgreSQL’s GiST (Generalized Search Tree) framework, introduced by Joe Hellerstein, Jeff Naughton, and Avi Pfeffer in their 1995 VLDB paper, extended this idea. Rather than building one specialized index for spatial data, they built a pluggable index infrastructure where any data type could define its own tree structure by implementing a small set of methods (consistent, union, penalty, picksplit, and a few others). PostGIS’s geometry and geography types plug into GiST by implementing those methods around an R-tree layout.

How the GiST Index Stores Spatial Data

When you create a GiST index on a geography column, PostgreSQL builds a tree where each leaf node stores the MBR of one location, and each internal node stores the MBR that is the union of all its child MBRs. The result is a nested set of bounding rectangles that gets progressively more precise as you move down the tree toward individual rows.

During a KNN search, PostgreSQL works through the tree starting at the top. At each level, it looks at the bounding boxes of the child nodes and asks: “could anything inside this box be closer than what I’ve already found?” If the answer is no (because even the nearest corner of the box is farther away than the current best result) the entire subtree gets skipped. PostgreSQL never looks at the individual rows inside it. If the answer is yes, it opens that box and repeats the process one level deeper. By the time it reaches the leaf nodes, it has only examined the branches that could have contained a closer result.

Index-Assisted ORDER BY … LIMIT

Create the index: and run EXPLAIN again.

CREATE INDEX stores_location_idx ON stores USING GIST (location);

And run EXPLAIN again.

CREATE INDEX stores_location_idx ON stores USING GIST (location);

                               QUERY PLAN
-----------------------------------------------------------------------
 Limit  (cost=0.14..4.98 rows=5 width=49)
   ->  Index Scan using stores_location_idx on stores s  (cost=0.14..968.64 rows=1000 width=49)
         Order By: (location <-> '0101000020E6100000ABB019E082B953C088122D793CFF4140'::geography)

The Sort node is gone. The index scan itself produces rows in distance order, and the Limit node can stop as soon as it has five of them. Most of the table is never visited.

When to Use KNN

Use <-> when you have a fixed number of results to return and proximity is the ranking criterion. Examples: “Find the 10 nearest drivers,” “Show the 3 closest urgent care clinics,” or “Return the nearest 20 products in my warehouse by bin location.”

It is not the right tool when you want everything within a fixed radius. For radius queries, ST_DWithin is both semantically clearer and more efficient. It uses the GiST index as a filter rather than an ordering operator, pruning branches whose MBRs don’t intersect the search circle, and it stops as soon as the index confirms inclusion or exclusion. That is, it doesn’t need to rank anything.

-- Prefer this for "all stores within 5 km":
WITH origin AS (
  SELECT
    -78.898613 AS durham_lon,
    35.994033  AS durham_lat
)
SELECT
  name,
  ROUND((location <-> ST_MakePoint(durham_lon, durham_lat)::geography)::numeric) AS distance_m
FROM stores, origin
WHERE ST_DWithin(location::geography, ST_MakePoint(durham_lon, durham_lat)::geography, 5000)
ORDER BY distance_m;

-- Not this (takes about 3x longer)
WITH origin AS (
  SELECT
    -78.898613 AS durham_lon,
    35.994033  AS durham_lat
)
SELECT name, distance_m
FROM (
  SELECT
    name,
    ROUND((location <-> ST_MakePoint(durham_lon, durham_lat)::geography)::numeric) AS distance_m
  FROM stores, origin
  ORDER BY distance_m
) sub
WHERE distance_m <= 5000;

The index also helps less when your query point is far from all data, like querying a US store table from a point in Europe. The tree traversal still works correctly, but the pruning is less effective because more branches have overlapping MBRs relative to a distant query point.

For very small tables (a few hundred rows), you can skip the index entirely. The sequential scan is fast enough, and the index overhead isn’t worth it.

The Operator and the Index Are a Pair

The <-> operator is how you express “rank by distance.” But expressing it correctly is only half the story. The GiST index is what transforms that expression from a full-table sort into a tree traversal that visits a fraction of the data. Neither is sufficient alone: the operator without the index is a slow sort; the index without the operator provides spatial filtering but not ordering.

Guttman’s R-tree insight, that you could prune spatial branches using bounding-rectangle lower bounds, turned out to be durable enough that it’s still the mechanism underneath a PostGIS query you run today. The GiST framework just made it available to any type that could define the right set of methods. The result is that a nearest-neighbor query over millions of locations can return in milliseconds, not minutes.

If you’re using proximity searches in an interesting way, or if you hit a case where the index didn’t help as much as you expected, drop me an email. I’d love to hear about it.