Parquet in the Stdlib — Why and How

April 24, 2026 analyticscolumnarwave-8

Parquet is the file format under almost every modern data lake — Snowflake, Databricks, Athena, DuckDB, ClickHouse's external tables. Arrow is the in-memory and IPC format that sits next to it. Between them they define how a terabyte of tabular data moves through a 2026 analytics stack.

There is no scripting language whose standard library ships either. Python has pyarrow as a ~60 MB binary wheel that vendors a pile of C++. Node has nothing. Go needs a third-party package. Rust has arrow-rs, which is excellent, but Rust is not a scripting language.

The Wave 8 "columnar lane" adds both to the Lateralus stdlib, written in pure Lateralus, in under 1,500 lines.

◉ The five primitives behind columnar

Before Parquet itself there's a family of encodings that give it its compression. Each is a small stdlib module:

run_length — [1,1,1,2,2,3] → [(1,3),(2,2),(3,1)]. Two payload modes: pairs-of-ints, and the Parquet-packed byte form.
dict_encoding — take a column of repeated strings, replace each with an integer index into a dictionary.
frame_of_reference — store a reference value and deltas from it. For monotonically-increasing timestamps, a delta-of-delta variant (the Gorilla paper) packs to near zero bytes.
bitpack — pack N-bit integers tightly into a byte stream.
xxhash / lz4 — the hashes and codecs Parquet actually uses for checksums and page compression.

If you're wondering how Parquet on a 100 GB log table compresses down to 3 GB, it's these five primitives stacked.

◉ Thrift Compact in forty lines

Parquet's footer is encoded in Apache Thrift Compact Protocol. Which sounds scary until you read the spec; it's a variable-length integer encoding with zig-zag for signed values. Here's the whole implementation from stdlib/parquet.ltl:

// Zig-zag encoding: -1 → 1, 1 → 2, -2 → 3, 2 → 4 …
pub fn thrift_zz_i32(n: int) -> int {
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF
}

// Unsigned varint: 7 bits per byte, high bit is "more to come"
pub fn thrift_uvarint(n: int) -> list {
    let out = []
    while n >= 0x80 {
        out = out + [(n & 0x7F) | 0x80]
        n = n >> 7
    }
    out = out + [n & 0x7F]
    return out
}

pub fn thrift_varint_i32(n: int) -> list {
    return thrift_uvarint(thrift_zz_i32(n))
}

Pinned: zz(0)=0, zz(-1)=1, zz(1)=2; uvarint(300) = [0xAC, 0x02]. Those two lines alone catch 80% of the bugs you'd make. They're in the Wave 8 tests.

◉ Arrow IPC in another ninety

Arrow's IPC (streaming) format is refreshingly simple: a continuation marker (0xFFFFFFFF), a 4-byte length prefix, a Flatbuffers message, padding to 8 bytes, and the data buffers. Validity bitmaps use LSB-first ordering — a little unusual, but fine once you've pinned a test vector:

// 9-bit mask [T F T T F F T T T] → two bytes: [0xCD, 0x01]
let mask = [true, false, true, true, false, false, true, true, true]
let bitmap = arrow_ipc_validity_bitmap(mask)
assert bitmap == [0xCD, 0x01]

UTF-8 arrays are two buffers: an offsets buffer (int32) and a concatenated data buffer. Variable-width but trivial to zero-copy read. Everything else is a small elaboration on that pattern.

◉ Building an actual Parquet file

The wave 8 example composes every module into one pipeline:

import dict_encoding
import run_length
import frame_of_reference
import arrow_ipc
import parquet

// Dictionary-encode a column of repeated city names.
let encoded = dict_encoding_encode([
    "NY", "NY", "SF", "SF", "NY", "LA", "LA", "NY"
])
// RLE the resulting integer indices.
let rle_bytes = run_length_encode_packed(encoded["indices"])

// Delta-of-delta encode a clock column.
let clocks = [1000, 1010, 1020, 1030, 1045, 1060, 1075, 1090]
let dod = frame_of_reference_encode_dod(clocks)

// Frame it as one Arrow IPC message.
let arrow_bytes = arrow_ipc_frame_message(
    arrow_ipc_buffers_utf8(encoded["dictionary"]) + rle_bytes + dod
)

// Wrap as a minimal Parquet file: PAR1 + body + footer + PAR1.
let body = arrow_bytes
let footer = parquet_data_page_v1_header(len(body), len(body), 8)
let parq = parquet_assemble_file(body, footer)

write_file("city_clocks.parq", parq)

That file opens in any Parquet reader. Not because we cheated — because we followed the spec.

◉ The honest limits

The stdlib modules cover the encoding surface completely. They don't yet cover the full schema definition language (parquet.thrift has a lot of message types), and they don't yet implement page-level statistics emission. For a log-rollup or telemetry use-case, the current surface is enough; for a general-purpose warehouse writer, it's a starting point, not a finish line.

◉ The point isn't parity

The point is: when you need to emit Parquet from a shipping language — a thing your tooling vendor told you was impossible without Arrow's C++ — it's now a three-import one-file job in Lateralus, auditable from top to bottom, transpilable to a native binary.

Every byte of the format lives in one 200-line file.

Wave 8 is one of five "spiral" releases that added these lanes in 2026. Read the lane positioning or see the stdlib comparison matrix for the bigger picture.