Parquet in the Stdlib — Why and How
Parquet is the file format under almost every modern data lake — Snowflake, Databricks, Athena, DuckDB, ClickHouse's external tables. Arrow is the in-memory and IPC format that sits next to it. Between them they define how a terabyte of tabular data moves through a 2026 analytics stack.
There is no scripting language whose standard library ships either. Python has pyarrow as a ~60 MB binary wheel that vendors a pile of C++. Node has nothing. Go needs a third-party package. Rust has arrow-rs, which is excellent, but Rust is not a scripting language.
The Wave 8 "columnar lane" adds both to the Lateralus stdlib, written in pure Lateralus, in under 1,500 lines.
◉ The five primitives behind columnar
Before Parquet itself there's a family of encodings that give it its compression. Each is a small stdlib module:
run_length—[1,1,1,2,2,3]→[(1,3),(2,2),(3,1)]. Two payload modes: pairs-of-ints, and the Parquet-packed byte form.dict_encoding— take a column of repeated strings, replace each with an integer index into a dictionary.frame_of_reference— store a reference value and deltas from it. For monotonically-increasing timestamps, adelta-of-deltavariant (the Gorilla paper) packs to near zero bytes.bitpack— pack N-bit integers tightly into a byte stream.xxhash/lz4— the hashes and codecs Parquet actually uses for checksums and page compression.
If you're wondering how Parquet on a 100 GB log table compresses down to 3 GB, it's these five primitives stacked.
◉ Thrift Compact in forty lines
Parquet's footer is encoded in Apache Thrift Compact Protocol. Which sounds scary until you read the spec; it's a variable-length integer encoding with zig-zag for signed values. Here's the whole implementation from stdlib/parquet.ltl:
// Zig-zag encoding: -1 → 1, 1 → 2, -2 → 3, 2 → 4 …
pub fn thrift_zz_i32(n: int) -> int {
return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF
}
// Unsigned varint: 7 bits per byte, high bit is "more to come"
pub fn thrift_uvarint(n: int) -> list {
let out = []
while n >= 0x80 {
out = out + [(n & 0x7F) | 0x80]
n = n >> 7
}
out = out + [n & 0x7F]
return out
}
pub fn thrift_varint_i32(n: int) -> list {
return thrift_uvarint(thrift_zz_i32(n))
}
Pinned: zz(0)=0, zz(-1)=1, zz(1)=2; uvarint(300) = [0xAC, 0x02]. Those two lines alone catch 80% of the bugs you'd make. They're in the Wave 8 tests.
◉ Arrow IPC in another ninety
Arrow's IPC (streaming) format is refreshingly simple: a continuation marker (0xFFFFFFFF), a 4-byte length prefix, a Flatbuffers message, padding to 8 bytes, and the data buffers. Validity bitmaps use LSB-first ordering — a little unusual, but fine once you've pinned a test vector:
// 9-bit mask [T F T T F F T T T] → two bytes: [0xCD, 0x01]
let mask = [true, false, true, true, false, false, true, true, true]
let bitmap = arrow_ipc_validity_bitmap(mask)
assert bitmap == [0xCD, 0x01]
UTF-8 arrays are two buffers: an offsets buffer (int32) and a concatenated data buffer. Variable-width but trivial to zero-copy read. Everything else is a small elaboration on that pattern.
◉ Building an actual Parquet file
The wave 8 example composes every module into one pipeline:
import dict_encoding
import run_length
import frame_of_reference
import arrow_ipc
import parquet
// Dictionary-encode a column of repeated city names.
let encoded = dict_encoding_encode([
"NY", "NY", "SF", "SF", "NY", "LA", "LA", "NY"
])
// RLE the resulting integer indices.
let rle_bytes = run_length_encode_packed(encoded["indices"])
// Delta-of-delta encode a clock column.
let clocks = [1000, 1010, 1020, 1030, 1045, 1060, 1075, 1090]
let dod = frame_of_reference_encode_dod(clocks)
// Frame it as one Arrow IPC message.
let arrow_bytes = arrow_ipc_frame_message(
arrow_ipc_buffers_utf8(encoded["dictionary"]) + rle_bytes + dod
)
// Wrap as a minimal Parquet file: PAR1 + body + footer + PAR1.
let body = arrow_bytes
let footer = parquet_data_page_v1_header(len(body), len(body), 8)
let parq = parquet_assemble_file(body, footer)
write_file("city_clocks.parq", parq)
That file opens in any Parquet reader. Not because we cheated — because we followed the spec.
◉ The honest limits
The stdlib modules cover the encoding surface completely. They don't yet cover the full schema definition language (parquet.thrift has a lot of message types), and they don't yet implement page-level statistics emission. For a log-rollup or telemetry use-case, the current surface is enough; for a general-purpose warehouse writer, it's a starting point, not a finish line.
◉ The point isn't parity
The point is: when you need to emit Parquet from a shipping language — a thing your tooling vendor told you was impossible without Arrow's C++ — it's now a three-import one-file job in Lateralus, auditable from top to bottom, transpilable to a native binary.
Every byte of the format lives in one 200-line file.
Wave 8 is one of five "spiral" releases that added these lanes in 2026. Read the lane positioning or see the stdlib comparison matrix for the bigger picture.