What Lidar Data Looks Like and How to Query It with DuckDB

Page content

The last couple of posts in this series described navigation algorithms: robot vacuums covering a living room floor, then self-driving taxis doing the same at city scale. In that second post, I mentioned that Waymo vehicles sweep lidar continuously to sense the road around them. That’s lidar as a real-time perception tool. There’s another side to it: aerial terrain surveys that produce large static datasets, published and publicly available, and queryable with SQL.

Most coverage of lidar focuses on what it finds (jungle ruins under forest canopy, terrain damage after a wildfire), not on what it measures or what the data model looks like. This post covers the technology itself: what lidar is, what a lidar file contains, and how to run SQL queries against a public dataset using DuckDB. If you’ve read the PostGIS geography vs. geometry post, the coordinate system concepts here will be familiar, with a third dimension added.

What Lidar Is

Lidar’s mechanism is straightforward: emit a laser pulse, measure how long the reflected light takes to return, and convert that round-trip time to a distance. Do that millions of times per second while sweeping the sensor through a range of angles, and you get a three-dimensional picture of every surface the laser hit. That collection of georeferenced points is a point cloud.

The term shows up in research literature around 1963, likely coined by analogy with radar (which dates to the 1940s). Early lidar systems were stationary instruments aimed at the atmosphere to measure particle density and cloud height. Airborne terrain mapping came later, by the 1980s and 1990s, as GPS accuracy improved enough to geo-reference the sensor position during flight and assign real-world coordinates to each return.

What makes lidar different from radar is resolution at short range. A radar pulse bounces off large objects. A lidar pulse can thread through gaps in a forest canopy and return separate reflections from the treetop, an intermediate branch, and the ground below. That multi-return capability is what makes airborne lidar useful for terrain modeling under vegetation.

The Point Cloud Data Model

Each laser return becomes a single point with several attributes.

X and Y are horizontal position in a projected coordinate system, in consistent linear units from a defined origin rather than longitude and latitude. If you’ve read the PostGIS geography vs. geometry post, you know why: degrees of longitude aren’t a stable unit of measurement since they shrink at higher latitudes, but feet and meters are. Common choices are UTM (Universal Transverse Mercator), organized in global zones, and State Plane, which trades global coverage for higher accuracy within a single state by using a tighter projection.

Z is elevation in consistent linear units above a vertical reference like NAVD88. Intensity measures the strength of the reflected signal, which varies by surface material. ReturnNumber and NumberOfReturns describe where this pulse sits in the sequence: a pulse passing through tree canopy might generate three returns, one from the canopy, one from a lower branch, and one from the ground.

Classification is the most useful attribute for analysis. The ASPRS (American Society for Photogrammetry and Remote Sensing) LAS specification defines a standard scheme: class 2 is ground, classes 3 through 5 are low, medium, and high vegetation, class 6 is buildings, class 9 is water. Most analysis workflows start by filtering on this column.

Point density varies by collection quality level: the USGS 3DEP baseline is 2 points per square meter, while QL1 surveys reach 8 points per square meter. A single tile covering one square kilometer at 8 ppsm contains around 8 million points.

The LAS/LAZ Format

LAS is an open binary standard for point cloud data. ASPRS published the first version in 2003, building on earlier proprietary formats from airborne survey vendors. The file stores a header with coordinate reference system info, scale factors, and bounding box, followed by compact binary point records.

LAZ is the compressed variant. Martin Isenburg developed the LASzip compression algorithm around 2011, and it’s now available under an open-source license. LAZ files are reduced up to 70% of the size of LAS files, which matters when a lidar survey spans hundreds of gigabytes.

Getting Public Data

North Carolina Emergency Management runs a statewide QL1/QL2 lidar program in partnership with NCDOT, USGS, FEMA, and NOAA, collecting at 8 ppsm across five phases in the early 2020s. The data is available through the NC Spatial Data Download portal.

The NC dataset uses the NAD83 North Carolina State Plane coordinate system in US survey feet, with elevations referenced to NAVD88. I’ll be using tile 209393_0.las from this collection, covering a section of northwest Durham. The USGS also publishes free lidar data nationally through The National Map if you need coverage outside North Carolina.

Durham lidar tile (209393_0.las) colored by Classification in CloudCompare

Converting to Parquet with PDAL

The PDAL (Point Data Abstraction Library) is the standard open-source tool for reading, filtering, and translating point cloud data. It was initially developed with contributions from USGS staff and others. PDAL handles LAS/LAZ natively and can write to Parquet. DuckDB does not read LAS or LAZ directly; Parquet is the handoff format between the two tools.

Install it with Homebrew:

brew install pdal

Parquet support requires the Arrow library, which the Homebrew formula includes. Verify the installation before proceeding:

pdal --drivers | grep writers.arrow

If writers.arrow appears in the output, you’re ready. The simplest conversion is then a single command:

pdal translate 209393_0.las 209393_0.parquet

PDAL infers the output format from the file extension and uses its Arrow writer to produce a Parquet file. Apache Arrow is an open-source columnar in-memory data format; Parquet is its on-disk counterpart. PDAL routes point cloud data through Arrow’s columnar representation to write the Parquet output. You now have a flat, columnar file you can query with SQL.

Exploring the Data

Both the LAZ and Parquet files are in binary format, so let’s look at them in DuckDB. If you don’t have DuckDB installed yet, the installation page covers all platforms. Let’s open it and see what we have.

SELECT
  ROUND(xyz[1], 2) AS x,
  ROUND(xyz[2], 2) AS y,
  ROUND(xyz[3], 2) AS z,
  Intensity,
  ReturnNumber,
  NumberOfReturns,
  Classification
FROM '209393_0.parquet'
LIMIT 5;
┌────────────┬───────────┬────────┬───────────┬──────────────┬─────────────────┬────────────────┐
│     x      │     y     │   z    │ Intensity │ ReturnNumber │ NumberOfReturns │ Classification │
│   double   │  double   │ double │  uint16   │    uint8     │      uint8      │     uint8      │
├────────────┼───────────┼────────┼───────────┼──────────────┼─────────────────┼────────────────┤
│ 2040000.17 │ 834993.06 │ 332.65 │     51959 │            1 │               1 │              1 │
│ 2040001.72 │ 834992.68 │ 332.39 │     48215 │            1 │               1 │              2 │
│ 2040009.48 │ 834992.77 │ 332.16 │     44938 │            1 │               1 │              2 │
│ 2040014.62 │ 834992.82 │ 332.06 │     38501 │            2 │               2 │              2 │
│ 2040021.92 │ 834992.68 │ 331.89 │     41017 │            1 │               1 │              1 │
└────────────┴───────────┴────────┴───────────┴──────────────┴─────────────────┴────────────────┘

X and Y are in NC State Plane (US survey feet). Easting is the X coordinate: distance east from the projection’s central meridian (79°W longitude for NC State Plane). Northing is the Y coordinate: distance north from the latitude of origin (33°45’N). To keep all coordinates within North Carolina positive, the projection applies a false easting of 2,000,000 feet, so the ~2,040,000 easting here means this tile is about 40,000 feet east of that meridian, and the northing of ~834,000 puts it roughly 834,000 feet north of the latitude of origin. Z is elevation in feet above NAVD88.

Row 4 is worth noting: ReturnNumber=2, NumberOfReturns=2. That point is the second return from a single laser pulse: the pulse hit something on the way down (the first return) and then hit the ground (this return, classified as ground, class 2). That two-return sequence is the multi-return mechanism in a single row. To find the matching first return, filter on the same approximate X/Y with ReturnNumber=1.

Useful Queries

What’s in this tile?

Let’s count the points by surface type to understand what the tile contains.

SELECT Classification, COUNT(*) AS num_points
FROM '209393_0.parquet'
GROUP BY Classification
ORDER BY num_points DESC;
┌────────────────┬────────────┐
│ Classification │ num_points │
│     uint8      │   int64    │
├────────────────┼────────────┤
│              5 │    4709437 │
│              1 │    4051263 │
│              2 │    2906084 │
│              4 │    1553949 │
│              3 │     208727 │
│              6 │     140372 │
│             11 │      32582 │
│              7 │       7981 │
│             18 │       1224 │
└────────────────┴────────────┘

High vegetation (class 5) dominates this tile, which covers a forested section of northwest Durham. Class 1 (unclassified) is the second-largest group, a normal residual in any collection; ground processing algorithms leave some points unresolved. Ground (class 2) comes third. Buildings (class 6) are a small slice, consistent with a mostly wooded residential tile.

Class 11 is worth noting: it is used in this NC collection for road surface and not part of the base ASPRS core codes. Class 7 is low point noise, and class 18 is high noise. Both are small residuals from the quality filtering pass.

What’s the terrain elevation in a subregion?

Next, let’s extract ground points in a bounding box. This tile covers roughly 2500 × 2500 feet. I’ll use the lower-left half.

SELECT ROUND(xyz[1], 2) AS X, ROUND(xyz[2], 2) AS Y, ROUND(xyz[3], 2) AS Z
FROM '209393_0.parquet'
WHERE Classification = 2
  AND xyz[1] BETWEEN 2040000 AND 2041250
  AND xyz[2] BETWEEN 832500 AND 833750
ORDER BY xyz[3] ASC
LIMIT 10;
┌────────────┬───────────┬────────┐
│     X      │     Y     │   Z    │
│   double   │  double   │ double │
├────────────┼───────────┼────────┤
│ 2041248.38 │ 833222.49 │ 269.13 │
│ 2041249.91 │ 833218.32 │ 269.16 │
│ 2041234.52 │  833240.7 │ 269.21 │
│ 2041207.53 │ 833254.93 │ 269.22 │
│ 2041152.35 │ 833256.98 │ 269.23 │
│ 2041208.82 │ 833255.56 │ 269.24 │
│ 2041212.88 │ 833253.16 │ 269.24 │
│ 2041242.12 │ 833228.91 │ 269.26 │
│ 2041194.04 │ 833258.64 │ 269.26 │
│ 2041211.66 │ 833254.62 │ 269.27 │
└────────────┴───────────┴────────┘

The lowest ground returns in that subregion come in around 269 feet, with values rising as the terrain climbs. The Z range tells you the local relief: subtract the minimum from the maximum and you have the vertical drop across that patch. This is a quick terrain profile, no GIS software required.

How tall are the trees and buildings?

Now let’s compare average elevation for buildings versus high vegetation.

SELECT Classification, ROUND(AVG(xyz[3]), 2) AS avg_z, ROUND(MAX(xyz[3]), 2) AS max_z
FROM '209393_0.parquet'
WHERE Classification IN (5, 6)
GROUP BY Classification;
┌────────────────┬────────┬────────┐
│ Classification │ avg_z  │ max_z  │
│     uint8      │ double │ double │
├────────────────┼────────┼────────┤
│              5 │ 342.37 │ 425.90 │
│              6 │ 330.27 │ 354.97 │
└────────────────┴────────┴────────┘

High vegetation (class 5) averages 342 feet absolute elevation, with a maximum of 425.9 feet. Buildings (class 6) average 330 feet, with a maximum of 354.9 feet. The spread between the two classifications is small, which is normal for a densely wooded residential area. This kind of comparison is useful as a quick sanity check before doing anything more involved.

How much of the canopy does the laser penetrate?

The multi-return mechanism deserves a closer look. Let’s count how many pulses produced one return versus two or three. Filtering on ReturnNumber = 1 counts each pulse exactly once.

SELECT
  NumberOfReturns,
  COUNT(*) AS pulse_count
FROM '209393_0.parquet'
WHERE ReturnNumber = 1
GROUP BY NumberOfReturns
ORDER BY NumberOfReturns;
┌─────────────────┬─────────────┐
│ NumberOfReturns │ pulse_count │
│      uint8      │    int64    │
├─────────────────┼─────────────┤
│               1 │     4019542 │
│               2 │     2726436 │
│               3 │     1038807 │
│               4 │      221586 │
│               5 │       29247 │
│               6 │        2175 │
│               7 │          91 │
│               8 │           5 │
└─────────────────┴─────────────┘

About half the pulses in this tile produce a single return. The other half thread through the canopy and generate two or more. At QL1 density (8 ppsm), that split is meaningful: the collection is dense enough that a large share of pulses find a gap. In a sparser survey the single-return fraction would be higher. Pulses reaching four or five returns are threading through multiple canopy layers before hitting the ground.

How thick is the canopy?

To measure how deep those penetrating pulses actually go, compare the average Z of first returns versus last returns across all multi-return pulses.

SELECT
  CASE WHEN ReturnNumber = 1 THEN 'first' ELSE 'last' END AS position,
  ROUND(AVG(xyz[3]), 2) AS avg_z,
  COUNT(*) AS point_count
FROM '209393_0.parquet'
WHERE NumberOfReturns > 1
  AND (ReturnNumber = 1 OR ReturnNumber = NumberOfReturns)
GROUP BY position
ORDER BY position;
┌──────────┬────────┬─────────────┐
│ position │ avg_z  │ point_count │
│ varchar  │ double │    int64    │
├──────────┼────────┼─────────────┤
│ first    │ 333.15 │     4018347 │
│ last     │ 293.46 │     3998982 │
└──────────┴────────┴─────────────┘

First returns average 333 feet; last returns average 293 feet. That 40-foot gap is the average vertical depth the laser traveled through vegetation before hitting a lower surface. It lines up well with the mature hardwood canopy in this tile: the first returns cluster near the vegetation avg_z of 342 feet, and the last returns cluster near the ground avg_z of around 296 feet.

Which surfaces reflect the laser most strongly?

Different surface materials reflect lidar with different intensity. Pavement and building roofs tend to be strong reflectors; dense wet vegetation absorbs more of the pulse. Let’s check average intensity by classification for the main surface types.

SELECT
  Classification,
  ROUND(AVG(Intensity), 0) AS avg_intensity,
  ROUND(STDDEV(Intensity), 0) AS stddev_intensity,
  COUNT(*) AS point_count
FROM '209393_0.parquet'
WHERE Classification IN (2, 5, 6, 11)
GROUP BY Classification
ORDER BY Classification;
┌────────────────┬───────────────┬──────────────────┬─────────────┐
│ Classification │ avg_intensity │ stddev_intensity │ point_count │
│     uint8      │    double     │      double      │    int64    │
├────────────────┼───────────────┼──────────────────┼─────────────┤
│              2 │       40071.0 │          17075.0 │     2906084 │
│              5 │        8535.0 │          12027.0 │     4709437 │
│              6 │       20676.0 │          14893.0 │      140372 │
│             11 │       23611.0 │           8379.0 │       32582 │
└────────────────┴───────────────┴──────────────────┴─────────────┘

Ground returns (class 2) have the highest average intensity at 40,071, nearly double that of road surfaces (23,611) and buildings (20,676). NC Piedmont soils are iron-rich red clay, and that mineral composition likely contributes to the intensity. Vegetation (class 5) sits at the opposite end, averaging 8,535, with a standard deviation of 12,027 that exceeds the mean. That spread indicates a highly multimodal distribution: leaf angle, moisture, and species all affect how much of the pulse returns, and they vary enormously across 4.7 million canopy points. Road surface (pavement) has the most consistent intensity of any class here, with a stddev of 8,379 against a mean of 23,611.

What does the terrain look like across the tile?

Finally, let’s produce a coarse elevation model by binning ground returns into 100-foot grid cells and averaging Z within each cell. This is the DEM query: no GIS software, no raster format, just SQL.

SELECT
  (FLOOR(xyz[1] / 100) * 100)::INTEGER AS cell_x,
  (FLOOR(xyz[2] / 100) * 100)::INTEGER AS cell_y,
  ROUND(AVG(xyz[3]), 2) AS avg_ground_z,
  COUNT(*) AS ground_points
FROM '209393_0.parquet'
WHERE Classification = 2
GROUP BY cell_x, cell_y
ORDER BY cell_x, cell_y
LIMIT 10;
┌─────────┬────────┬──────────────┬───────────────┐
│ cell_x  │ cell_y │ avg_ground_z │ ground_points │
│  int32  │ int32  │    double    │     int64     │
├─────────┼────────┼──────────────┼───────────────┤
│ 2040000 │ 832500 │        321.5 │          5461 │
│ 2040000 │ 832600 │       322.21 │          5322 │
│ 2040000 │ 832700 │       326.27 │          4889 │
│ 2040000 │ 832800 │       331.33 │          5002 │
│ 2040000 │ 832900 │       331.87 │          4038 │
│ 2040000 │ 833000 │        325.9 │          5067 │
│ 2040000 │ 833100 │        325.7 │          5265 │
│ 2040000 │ 833200 │       331.08 │          4907 │
│ 2040000 │ 833300 │       332.67 │          5446 │
│ 2040000 │ 833400 │       330.32 │          4525 │
└─────────┴────────┴──────────────┴───────────────┘

Each row is one 100-foot grid cell. The first ten cells, all at the western edge of the tile, show terrain rising from 321 feet at the southwest corner to around 332 feet a few hundred feet north, then dropping back slightly. That’s about 11 feet of local relief across 1,000 feet of northing, a gentle slope consistent with the Piedmont. Ground point counts are uniformly high at 4,000 to 5,500 per cell, which is what you’d expect from a QL1 collection: even under dense canopy, enough pulses reached the ground to produce reliable elevation averages. In a sparser collection, cells under heavy cover would show much lower counts and noisier Z values. Export the full result to CSV and load it into QGIS or any mapping tool that accepts a point layer with Z values to see the terrain across the whole tile.

When to Use This Approach

Use the DuckDB approach when you want to explore a lidar tile quickly: run aggregations, filter down to a classification subset, or join point data to another dataset without standing up a database server. This process highlights one of the advantages of DuckDB: no schema to define, no daemon to start. For ad hoc exploration, this is perfect.

Don’t use it for production queries over large datasets. DuckDB’s spatial extension has no spatial index, so every query scans the full table. A single tile is manageable. A county-scale lidar survey with hundreds of millions of points is not.

For larger stored datasets, pgPointCloud adds point cloud storage to PostgreSQL with compression and patch-based spatial indexing. PDAL’s own pipeline system handles more complex workflows: ground classification, surface interpolation, thinning, and multi-tile processing. For deep 3D work (building reconstruction, mesh generation, autonomous vehicle perception pipelines), domain-specific tooling has far richer support.

What Lidar Data Is Actually Used For

Self-propelling robots get the press around lidar. But flood modeling, wildfire spread prediction, and infrastructure planning account for far more lidar data volume, and the USGS 3DEP program was funded primarily for these applications. Forest inventory and timber volume estimation are major use cases. Archaeologists have used airborne lidar to reveal structures hidden under jungle canopy: stripping out the vegetation returns and examining the bare-earth model has identified dozens of Maya sites in the Yucatan and Guatemala that were invisible in aerial photography. Coastal agencies use it to track erosion and storm surge damage. Urban planners use rooftop returns to model solar panel potential at neighborhood scale.

The data structure is the same in all these cases. A point cloud is a point cloud. Once it’s in Parquet it’s a wide, flat table with X, Y, Z, and a handful of integer attributes. The spatial questions you’d ask of any 2D dataset (bounding box crops, classification aggregations, comparisons between surface types) apply directly here, just with a Z column that now matters. DuckDB makes that first round of exploration fast and easy.

Next Steps

The PDAL documentation covers the full pipeline API, including batch processing, custom filter chains, and the complete list of supported formats. The NC Spatial Data Download portal is the starting point for the North Carolina statewide collection; the statewide lidar program overview describes the five collection phases and the quality levels. The USGS 3DEP program page covers the national dataset.