172 million rows per second. 5 queries. 167M rows. Under 1 second — on a single machine. I connect directly to your S3, Azure Blob, or GCP bucket and deliver results — no cluster, no vendor lock-in, no $15k/month bill.
48 Apache Parquet files — every NYC Yellow Cab trip from January 2022 through December 2025. Five analytical queries. Cold NVMe. One consumer workstation. No JVM. No executor heap. No cluster. No warmup.
| Query | Description | Time |
|---|---|---|
| Q1 — Row Count | COUNT(*) per year across all 4 years | 29ms |
| Q2 — Fare & Distance YoY | 6 aggregates (avg fare, distance, tip, total, passengers, trips) per year · 143M filtered rows · full column decompression | 401ms |
| Q3 — Monthly Pivot | Trip volume by year×month — 48 cells · native DuckDB PIVOT · 2025 hottest year on record | 312ms |
| Q4 — Payment Type Shift | Cash collapse: 19.6% → 9.6% · Credit card peak 2023 at 77.9% · full 4-year scan | 158ms |
| Q5 — CBD Congestion Fee | 2025-only schema column · NYC congestion pricing live Jan 2025 · 72.8% of trips charged · $25.03M captured | 64ms |
| Total | 167,858,646 rows · 48 Parquet files · ~685M row-scans across all 5 queries | 971ms |
DuckDB here is CPU- and NVMe-bound. The workstation number is measured and verified. Bare-metal and cloud NVMe figures are informed estimates based on published hardware specs — not yet run.
| Configuration | Hardware | DuckDB Throughput | Est. Monthly Cost |
|---|---|---|---|
| Workstation (measured ✓) | Intel Ultra 7 265KF · NVMe · 64GB RAM | 172M rows/sec · 971ms | $0 (owned) |
| Hetzner AX102 bare-metal | AMD Ryzen 9950X · NVMe · 192GB RAM | ~200–250M rows/sec (est.) | ~$250/mo |
| AWS i4i.4xlarge | Intel Xeon · 3.75TB local NVMe SSD · 128GB RAM | ~250–350M rows/sec (est.) | ~$900/mo on-demand |
| Databricks cluster | Spark · JVM heap · DBU licensing · S3 | ~6–40M rows/sec (measured Spark baseline: 6.88M) | $5,000–$25,000/mo |
Workstation result is verified — measured 2026-04-08 on local NVMe, cold reads, no warmup. Hetzner and AWS i4i estimates based on published NVMe throughput specs vs measured workstation baseline. Spark baseline of 6.88M rows/sec measured on identical workstation hardware, 50GB heap pre-warmed.
Queried live from 48 Parquet files by DuckDB. 167,858,646 rows. Four charts. All computed on a single workstation in 971ms.
Data: NYC TLC Trip Record Data (open dataset) · Engine: DuckDB 1.4.4 · Hardware: Intel Ultra 7 265KF · NVMe
You grant scoped read access to your cloud storage. I run DuckDB on a high-performance node, deliver results as Parquet, CSV, or a live dashboard — then recommend the right long-term architecture for your data volume and budget.
Most analytics workloads at mid-market companies fit on a single modern server. DuckDB processes columnar Parquet in-process — no serialization, no cluster coordination, no $15k/month bill.
| Capability | Databricks / Spark | DuckDB (single node) |
|---|---|---|
| 685M row scan + 5 queries | $8–40 cluster cost per run | $0 — local process |
| Read from S3 / Azure / GCP | Yes (cluster required) | Yes — httpfs / azure / gcs extension, no cluster |
| Monthly platform cost | $3,000–$25,000+ | $20–200 (VPS or local workstation) |
| Time to first query | 5–15 min (cluster startup) | < 1 second |
| SQL compatibility | SparkSQL (HiveQL dialect) | Standard SQL + PIVOT, ASOF JOIN, LIST agg, UNPIVOT |
| Python / Streamlit integration | PySpark (heavyweight) | Native Python API — duckdb.query(sql).df() |
| Operational complexity | Cluster mgmt, DBUs, autoscale, networking | Zero — one binary, embed anywhere |
skr8tr is a sovereign, masterless distributed systems framework built on post-quantum cryptography. Every node authenticates via ML-DSA-65 signed tokens — no certificate authorities, no central broker, no single point of failure. Commands propagate across a UDP mesh, each packet carrying a post-quantum signature verified on arrival. Designed from first principles for air-gapped, regulated, and adversarial environments where you cannot afford to trust the network.
Five queries. Real data from the NYC TLC open dataset. No proprietary runtime. Run it yourself.
Full benchmark: scripts/benchmark_nyc_4year.sh
Remote consulting. Fixed-scope engagements. AWS, Azure, and GCP. Results you can verify.
Review your Databricks / Synapse / BigQuery spend. Identify workloads that move to DuckDB on a single node. Written cost-reduction plan within 48 hours.
End-to-end columnar pipeline — ingest from S3, Azure Blob, or GCP Storage, transform with DuckDB SQL, write results back to your bucket. Reproducible scripts you own.
Interactive analytics dashboard backed by DuckDB. Reads live from your cloud storage. Deployed on a $20/month VPS or your existing infra. Zero cluster dependency.
S3 + Glue + Athena + DuckDB hybrid pipelines. Leveraging AWS Solutions Architect experience to build data lakes that scale without surprise bills.
ADLS Gen2, Azure Blob, Synapse, and DuckDB integration. Migrate heavyweight Spark jobs to single-node DuckDB where the data volume allows it.
GCP Cloud Storage + DuckDB pipelines. BigQuery cost reduction — identify queries that run faster and cheaper outside BigQuery on a local DuckDB node.
Port SparkSQL jobs to standard DuckDB SQL. Remove cluster dependency for batch workloads. Works with AWS EMR, Azure HDInsight, and GCP Dataproc migrations.
Async review of your current data pipelines — any cloud. Written recommendations with specific, actionable improvements. Delivered in 48 hours.
Audit your S3 data lake for workloads that belong on NVMe. Design the hot/cold split — NVMe for compute, object storage for archival. Get DuckDB throughput you can actually feel.
Every file I deliver — Parquet, CSV, report — is signed with ML-DSA-65 (NIST FIPS 204), the post-quantum digital signature standard. You get the data file, a detached .sig file, and a one-command verifier. Run it against my public key and know the file is authentic and untampered.
Designed for healthcare, finance, and government workloads that need provable chain-of-custody today and quantum resistance tomorrow.
All engagements start with a free 30-minute scoping call.
I respond to all serious inquiries within 24 hours.
GitHub: NixOSDude/DuckDB_Master