The 1B-Key Benchmark — Methodology & Raw Data

Why we benchmark

If you can't reproduce it, it didn't happen.

Most database benchmarks are marketing in a lab coat. They run on hand-tuned hardware, with cherry-picked workloads, and they never release the harness. We did the opposite — and we'll send you the test rig.

VeltrixDB sells itself on three numbers: 4.9 ms P99, 1 billion keys, 2M ops/sec per node. Each one is a claim that breaks if you change the hardware or the workload. So before we go any further: every number on the performance page traces back to a Prometheus scrape, a flamegraph, or a histogram bucket we'd be happy to walk you through line by line.

This article documents the exact methodology — the cluster topology, the kernel tuning, the workload generator, the percentile distributions, and the surprises we found along the way. By the end, you should be able to set up your own rig in an afternoon and reproduce every number we publish to within ±3%.

If anything below smells like a marketing benchmark, we want to hear about it. Send us your questions — we'll either fix the harness or write a follow-up explaining why a number is the way it is.

Hardware

Commodity NVMe, no exotic silicon.

We deliberately picked the most boring infrastructure the cloud has to offer. If the numbers required Optane or custom FPGAs, they wouldn't be reproducible — and they wouldn't be cheap.

The cluster runs on Google Cloud N2 instances with local NVMe SSDs. Local NVMe is the only storage class we recommend for production VeltrixDB — EBS, Azure managed disks, and any other network-attached storage will break the latency model entirely. That's not us being precious; it's physics.

Instance type	n2-standard-64-lssd · 64 vCPU, 256 GB RAM, 8 × 375 GB local NVMe
Cluster topology	3 nodes, single region (us-central1), three availability zones, anti-affinity enforced
Storage layout	8 NVMe drives per node, `mdadm` RAID-0 striped, `ext4` with `noatime,nodiratime,discard=async`
Kernel	Linux 6.1.0-21-cloud-amd64 · io_uring SQPOLL · AVX2 · transparent hugepages set to `madvise`
Network	gVNIC · 100 Gbps placement group · jumbo frames enabled (MTU 8896)
VeltrixDB build	v0.9.4-beta · commit `cb44a1f` · 1024 shards, 256 GB LIRS cache, group-commit WAL fsync
Workload host	4 × n2-standard-32 instances generating load in parallel, same region, same placement group

Total cluster spend at on-demand pricing: ~$5,800/month. For comparison, DynamoDB at the same blended ops rate would have invoiced around $28,500/month for the same 30-day window — and that's before storage and egress.

Workload

Realistic, not flattering.

A benchmark that runs only point reads against a fully-cached working set will lie to you. Ours doesn't — we keep the writes coming, the cache pressure real, and the key distribution Zipfian.

We use a 70/30 read/write mix with Zipfian key access (α = 0.99) — the same distribution you see in production session stores, ad-tech profile lookups, and rate-limiter buckets. About 6% of reads miss the cache and fall through to NVMe, which is the worst case we care about.

Workload profile

Key space · 1,000,000,000 unique 16-byte keys
Value size · 128 bytes (Zipfian P95: 256 B) — representative of session/profile blobs
Read/write mix · 70% GET, 25% SET, 5% DEL
Access distribution · Zipfian α = 0.99 (heavy tail, realistic)
Concurrency · 1024 client connections per load host × 4 hosts = 4,096 total
Protocol · TCP binary, pipelined, no batching tricks
Warm-up · 30 minutes — long enough for the LIRS cache to converge
Measurement window · 60 minutes of steady state, sampled at 1 Hz
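
To make the access pattern concrete, here is a minimal sketch of a generator that combines the 70/25/5 operation mix with Zipfian key selection at α = 0.99. It is not the harness code (which is Rust and private). The function names are hypothetical, and the key space is shrunk to 10,000 keys so the cumulative table fits comfortably in memory; a 1B-key generator would need a more memory-efficient sampler.

```python
import bisect
import itertools
import random
from collections import Counter


def make_zipf_sampler(n_keys, alpha, rng):
    """Return a function drawing ranks 0..n_keys-1 with P(rank k) proportional to 1/(k+1)^alpha."""
    weights = [1.0 / (k + 1) ** alpha for k in range(n_keys)]
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]

    def sample():
        # min() guards against floating-point edge cases at the top of the CDF
        return min(bisect.bisect_left(cdf, rng.random() * total), n_keys - 1)

    return sample


def next_op(rng, sample_key):
    """Pick an operation using the 70% GET / 25% SET / 5% DEL mix."""
    r = rng.random() * 100
    op = "GET" if r < 70 else "SET" if r < 95 else "DEL"
    return op, sample_key()


rng = random.Random(42)                              # fixed seed for repeatability
sample_key = make_zipf_sampler(10_000, 0.99, rng)    # demo-sized key space (assumption)
ops = [next_op(rng, sample_key) for _ in range(100_000)]

op_counts = Counter(op for op, _ in ops)
key_counts = Counter(key for _, key in ops)
print(op_counts)
print(key_counts.most_common(3))
```

The heavy tail shows up immediately: a handful of low-numbered ranks absorb a large share of the traffic, which is what keeps the cache hit rate high while still leaving a steady trickle of misses.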

What we did not do. We did not pre-warm the cache with a "perfect" hit pattern, we did not pin clients to specific shards, and we did not disable background GC for the measurement window. The numbers below are with compaction and garbage collection running at full throttle, exactly as they would in production.

Load generator

The harness is a Rust-based generator that talks the VeltrixDB binary protocol natively. It records every operation's wall-clock latency, the kernel's tcp_rtt at submission, and the cache-hit flag returned by the server. Histograms are HdrHistogram-based and merged across all four load hosts every 10 seconds.
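
Merging matters because percentiles cannot be averaged across hosts; the bucket counts have to be summed first and the percentile read off the combined distribution. The sketch below illustrates that idea with a simplified fixed-width bucket histogram. It is a stand-in for illustration only, not the HdrHistogram API, and the bucket width and sample values are made up.

```python
from collections import Counter

BUCKET_US = 10  # bucket width in microseconds (illustrative choice)


def record(hist, latency_us):
    """Add one latency sample to the histogram."""
    hist[latency_us // BUCKET_US] += 1


def merge(hists):
    """Merge per-host histograms by summing bucket counts."""
    merged = Counter()
    for h in hists:
        merged.update(h)  # Counter.update adds counts rather than replacing them
    return merged


def percentile(hist, pct):
    """Return the upper edge (in µs) of the bucket containing the pct-th percentile."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    target = total * pct / 100.0
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_US
    return (max(hist) + 1) * BUCKET_US


host_a, host_b = Counter(), Counter()
for _ in range(990):
    record(host_a, 400)    # a quiet host: 990 fast operations
for _ in range(10):
    record(host_b, 9000)   # a struggling host: 10 slow operations

merged = merge([host_a, host_b])
p50 = percentile(merged, 50)
p999 = percentile(merged, 99.9)
print(p50, p999)
```

Averaging the two hosts' own P99.9 values would give a number that describes neither host nor the cluster; the merged histogram keeps the slow tail in proportion to how often it actually occurred.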

load-gen.toml · harness config

# 1B-key benchmark · 70/30 mix · Zipfian α=0.99
[cluster]
endpoints  = ["10.142.0.11:7100", "10.142.0.12:7100", "10.142.0.13:7100"]
protocol   = "binary"
keepalive  = true

[workload]
key_space     = 1_000_000_000
value_size_b  = 128
distribution  = "zipfian"
zipf_alpha    = 0.99
read_pct      = 70
write_pct     = 25
delete_pct    = 5

[concurrency]
connections   = 1024
pipeline_depth = 16

[run]
warmup_min    = 30
measure_min   = 60
sample_hz     = 1

Results

The full percentile distribution.

We don't publish an average. Averages hide tail latency, and tail latency is where databases earn — or lose — their keep.

Every number below is the merged percentile across the full 60-minute measurement window, sampled at 1 Hz, with all four load hosts running. P99 is the headline. P99.9 is the number your SREs will actually page on.

VeltrixDB · 70/30 mix · 1B keys · cache-miss path · latency in ms, lower is better

P50	0.42 ms
P90	1.12 ms
P95	1.94 ms
P99	4.9 ms
P99.9	8.7 ms
P99.99	14.2 ms
Max	22.8 ms

Sustained throughput	6.2M ops/s · across 3 nodes · 70/30 mix
Cache hit rate	94.2% · post-warmup, steady state
Write amplification	1.0× · measured, not estimated

Cache-hit reads — the easy case

About 94% of reads in steady state never touch NVMe. They're served from the LIRS cache in roughly 0.3 ms end-to-end. The CDF is so steep here it's almost not worth charting: P99 cache-hit latency is 0.31 ms, P99.9 is 0.42 ms.

Cache-miss reads — the only number that matters

The interesting number is what happens when the LIRS cache misses and we have to go to disk. io_uring SQPOLL + O_DIRECT + 1024 dedicated NVMe submission queues keep P99 at 4.9 ms even when 6% of reads fall through and compaction is actively churning in the background.

Writes — the durability path

Group-commit WAL fsync amortizes the cost of durability across concurrent writers. P99 write latency was 0.21 ms, of which roughly 80 µs is the actual fsync. Disk fsync latency is the floor; we just made sure not to add anything on top of it.
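
The amortization idea can be sketched in a few lines: append every pending record, then issue a single fsync that makes the whole batch durable. This is an illustration of group commit in general, not VeltrixDB's WAL; the length-prefixed framing, the record contents, and the `group_commit` name are all assumptions.

```python
import os
import tempfile


def group_commit(wal, pending):
    """Append every pending record, then make the whole batch durable with one fsync."""
    for record in pending:
        # 4-byte little-endian length prefix, then the payload (assumed framing)
        wal.write(len(record).to_bytes(4, "little") + record)
    wal.flush()              # push Python's userspace buffer to the OS
    os.fsync(wal.fileno())   # one fsync covers every record in the batch
    return len(pending)


FSYNC_US = 80  # approximate fsync cost quoted in the text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    with open(path, "ab") as wal:
        committed = group_commit(
            wal, [b"set k1 v1", b"set k2 v2", b"del k3", b"set k4 v4"]
        )
    wal_bytes = os.path.getsize(path)

amortized_us = FSYNC_US / committed  # fsync cost shared by the batch
print(committed, wal_bytes, amortized_us)
```

With four writers sharing one fsync, each pays a quarter of the sync cost; larger batches under heavier concurrency push the per-write share lower still, at the price of each writer waiting for the batch to close.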

How to reproduce

Set up your own rig in an afternoon.

Every detail you need to validate these numbers on your own hardware — the Helm values, the load generator, the kernel sysctls. Email us and we'll ship the binaries.

You'll need a cloud account, three n2-standard-64-lssd instances, a Linux ≥ 5.10 kernel, and about three hours of patience. The full harness lives in a private repo today — drop us a note and we'll add you to it.

Step 1 — bring up the cluster

The Helm chart sets up the StorageClass, StatefulSet, anti-affinity rules, and ServiceMonitor in a single command. With the right values file, you're 94 seconds from a healthy 3-node cluster:

~/veltrixdb · zsh

$ helm repo add veltrixdb https://charts.veltrixdb.com
"veltrixdb" has been added to your repositories

$ helm install bench veltrixdb/veltrixdb \
    --namespace veltrixdb --create-namespace \
    -f bench-values.yaml

✓ StorageClass · veltrixdb-nvme
✓ StatefulSet · 3 replicas, anti-affinity
✓ ServiceMonitor · 50+ metrics
✓ PodDisruptionBudget · minAvailable 2
✓ Ready in 94s

Step 2 — tune the host

VeltrixDB needs a few host-level tunings to deliver these numbers. The Helm chart applies the pod-level pieces automatically; the node-level ones you have to set yourself:

Transparent hugepages · echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
Swap · off · the ART index lives in RAM and must not be paged
CPU governor · performance on all cores — don't let the kernel race-to-idle on tail latency
NUMA · pin the VeltrixDB pod to a single NUMA node when possible
Kernel · ≥ 5.10 for full io_uring SQPOLL support
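
A small preflight script can catch a missed setting before a 90-minute run. The sketch below checks three of the items above by reading standard Linux interfaces: the transparent hugepage mode, `/proc/swaps`, and the per-CPU `scaling_governor` files. The helper names are hypothetical, and the governor files may be absent on some virtual machines, in which case that check simply finds nothing to report.

```python
import glob
from pathlib import Path


def active_thp_mode(text):
    """Extract the bracketed mode from a line such as 'always [madvise] never'."""
    start, end = text.index("["), text.index("]")
    return text[start + 1:end]


def swap_enabled(proc_swaps_text):
    """/proc/swaps has one header line; any further non-empty line is an active swap device."""
    lines = [line for line in proc_swaps_text.splitlines() if line.strip()]
    return len(lines) > 1


def check_host():
    """Return a list of human-readable problems found on this node."""
    problems = []
    thp = Path("/sys/kernel/mm/transparent_hugepage/enabled").read_text()
    if active_thp_mode(thp) != "madvise":
        problems.append(f"transparent hugepages: {thp.strip()} (want madvise)")
    if swap_enabled(Path("/proc/swaps").read_text()):
        problems.append("swap is enabled")
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(glob.glob(pattern)):
        governor = Path(gov_file).read_text().strip()
        if governor != "performance":
            problems.append(f"{gov_file}: {governor} (want performance)")
    return problems
```

Calling `check_host()` on each node and printing the returned list gives a quick pass/fail before the cluster takes load; an empty list means the three checked settings match the list above.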

Step 3 — load 1 billion keys

The harness ships with a populator that streams 1B keys at ~3M writes/sec sustained. Total fill time is around 5.5 minutes on our reference cluster, well under a coffee break.
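
The fill time follows directly from the key count and the write rate, and the same arithmetic gives a rough lower bound on the raw data volume. The figures below come from the workload profile; the volume estimate counts only key and value bytes and deliberately ignores index, WAL, and any replication overhead, which this article does not quantify.

```python
KEYS = 1_000_000_000
FILL_RATE = 3_000_000            # writes/sec, the approximate sustained rate
KEY_BYTES, VALUE_BYTES = 16, 128

fill_seconds = KEYS / FILL_RATE
fill_minutes = fill_seconds / 60

# Raw payload only: excludes index, WAL, and replication overhead
raw_gb = KEYS * (KEY_BYTES + VALUE_BYTES) / 1e9

print(f"fill time: {fill_seconds:.0f} s ({fill_minutes:.1f} min)")
print(f"raw key+value payload: {raw_gb:.0f} GB")
```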

Step 4 — measure

Run the workload for 30 minutes warm-up, then 60 minutes measurement. The harness prints the merged HdrHistogram at every 10-second tick. If your P99 is more than 5% off from 4.9 ms, something is wrong — open the Grafana dashboard the chart installs at /grafana and check the per-shard tail latency.
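
The 5% check is easy to automate at the end of a run. A minimal sketch, with a hypothetical `within_tolerance` helper and the reference value taken from the results above:

```python
def within_tolerance(measured_ms, reference_ms=4.9, tolerance=0.05):
    """True if the measured P99 is within the given fraction of the reference."""
    return abs(measured_ms - reference_ms) / reference_ms <= tolerance


for measured in (4.7, 5.1, 5.4):
    status = "ok" if within_tolerance(measured) else "investigate"
    print(f"P99 {measured} ms: {status}")
```

At a 5% tolerance the acceptable band is roughly 4.66 ms to 5.14 ms; anything outside it is the cue to open the per-shard dashboard.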

We'll help. If you set this up and the numbers don't match, that's a bug we want to fix. Email us with your hardware spec and the histogram output — we will turn around a root-cause analysis within 48 hours.

Harness: veltrixdb/bench-harness (private) · Hardware: GCP n2-standard-64-lssd × 3 · Snapshot: May 21, 2026 · 14:42 UTC
