If you can't reproduce it, it didn't happen.
Most database benchmarks are marketing in a lab coat. They run on hand-tuned hardware, with cherry-picked workloads, and they never release the harness. We did the opposite — and we'll send you the test rig.
VeltrixDB sells itself on three numbers: 4.9 ms P99, 1 billion keys, and 2M ops/sec per node. Each one is a claim that breaks if you change the hardware or the workload. So before we go any further: every number on the performance page traces back to a Prometheus scrape, a flamegraph, or a histogram bucket we'd be happy to walk you through line by line.
This article documents the exact methodology — the cluster topology, the kernel tuning, the workload generator, the percentile distributions, and the surprises we found along the way. By the end, you should be able to set up your own rig in an afternoon and reproduce every number we publish to within ±3%.
If anything below smells like a marketing benchmark, we want to hear about it. Send us your questions — we'll either fix the harness or write a follow-up explaining why a number is the way it is.
Commodity NVMe, no exotic silicon.
We deliberately picked the most boring infrastructure the cloud has to offer. If the numbers required Optane or custom FPGAs, they wouldn't be reproducible — and they wouldn't be cheap.
The cluster runs on Google Cloud N2 instances with local NVMe SSDs. Local NVMe is the only storage class we recommend for production VeltrixDB — EBS, Azure managed disks, and any other network-attached storage will break the latency model entirely. That's not us being precious; it's physics.
| Component | Configuration |
|---|---|
| Instance type | n2-standard-64-lssd · 64 vCPU, 256 GB RAM, 8 × 375 GB local NVMe |
| Cluster topology | 3 nodes, single region (us-central1), three availability zones, anti-affinity enforced |
| Storage layout | 8 NVMe drives per node, mdadm RAID-0 striped, ext4 with noatime,nodiratime,discard=async |
| Kernel | Linux 6.1.0-21-cloud-amd64 · io_uring SQPOLL · AVX2 · transparent hugepages set to madvise |
| Network | gVNIC · 100 Gbps placement group · jumbo frames enabled (MTU 8896) |
| VeltrixDB build | v0.9.4-beta · commit cb44a1f · 1024 shards, 256 GB LIRS cache, group-commit WAL fsync |
| Workload host | 4 × n2-standard-32 instances generating load in parallel, same region, same placement group |
Total cluster spend at on-demand pricing: ~$5,800/month. For comparison, DynamoDB at the same blended ops rate would have invoiced around $28,500/month for the same 30-day window — and that's before storage and egress.
Realistic, not flattering.
A benchmark that runs only point reads against a fully-cached working set will lie to you. Ours doesn't — we keep the writes coming, the cache pressure real, and the key distribution Zipfian.
We use a 70/30 read/write mix with Zipfian key access (α = 0.99) — the same distribution you see in production session stores, ad-tech profile lookups, and rate-limiter buckets. About 6% of reads miss the cache and fall through to NVMe, which is the worst case we care about.
Workload profile
- Key space · 1,000,000,000 unique 16-byte keys
- Value size · 128 bytes (Zipfian P95: 256 B) — representative of session/profile blobs
- Read/write mix · 70% GET, 25% SET, 5% DEL
- Access distribution · Zipfian α = 0.99 (heavy tail, realistic)
- Concurrency · 1024 client connections per load host × 4 hosts = 4,096 total
- Protocol · TCP binary, pipelined, no batching tricks
- Warm-up · 30 minutes — long enough for the LIRS cache to converge
- Measurement window · 60 minutes of steady state, sampled at 1Hz
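To make the profile above concrete, here is a minimal Python sketch of a bounded Zipfian key sampler with the 70/25/5 operation mix. It is an illustration only, not the harness itself: the names `make_zipf_sampler`, `pick_op`, and `encode_key` are hypothetical, the 16-byte big-endian key encoding is an assumption, and the key space is shrunk to 100,000 so the cumulative table fits comfortably in memory.

```python
import bisect
import itertools
import random


def make_zipf_sampler(key_space, alpha, rng):
    """Return a function that draws 0-based key ranks from a bounded Zipfian."""
    # Weight of rank r (1-based) is 1 / r**alpha; the CDF is the running sum.
    cdf = list(itertools.accumulate(1.0 / rank ** alpha
                                    for rank in range(1, key_space + 1)))
    total = cdf[-1]

    def sample():
        # Inverse-CDF lookup: first rank whose cumulative weight covers the draw.
        idx = bisect.bisect_left(cdf, rng.random() * total)
        return min(idx, key_space - 1)  # guard against float rounding at the top

    return sample


def pick_op(rng):
    """Choose an operation with the 70/25/5 GET/SET/DEL mix."""
    roll = rng.random() * 100
    if roll < 70:
        return "GET"
    if roll < 95:
        return "SET"
    return "DEL"


def encode_key(rank):
    """Hypothetical key encoding: the rank as a 16-byte big-endian integer."""
    return rank.to_bytes(16, "big")


rng = random.Random(42)  # fixed seed so runs are repeatable
sample_key = make_zipf_sampler(key_space=100_000, alpha=0.99, rng=rng)
for _ in range(5):
    print(pick_op(rng), encode_key(sample_key()).hex())
```

A precomputed table like this does not scale to 1 billion keys; a real generator would need a constant-memory Zipfian sampler instead.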
What we did not do. We did not pre-warm the cache with a "perfect" hit pattern, we did not pin clients to specific shards, and we did not disable background GC for the measurement window. The numbers below are with compaction and garbage collection running at full throttle, exactly as they would in production.
Load generator
The harness is a Rust-based generator that talks the VeltrixDB binary protocol natively. It records every operation's wall-clock latency, the kernel's tcp_rtt at submission, and the cache-hit flag returned by the server. Histograms are HdrHistogram-based and merged across all four load hosts every 10 seconds.
```toml
# 1B-key benchmark · 70/30 mix · Zipfian α=0.99
[cluster]
endpoints = ["10.142.0.11:7100", "10.142.0.12:7100", "10.142.0.13:7100"]
protocol = "binary"
keepalive = true

[workload]
key_space = 1_000_000_000
value_size_b = 128
distribution = "zipfian"
zipf_alpha = 0.99
read_pct = 70
write_pct = 25
delete_pct = 5

[concurrency]
connections = 1024
pipeline_depth = 16

[run]
warmup_min = 30
measure_min = 60
sample_hz = 1
```
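Merging histograms across load hosts matters because percentiles cannot be averaged: the cluster-wide P99 has to come from the combined bucket counts. The sketch below shows the idea with a deliberately simplified fixed-width bucket histogram. It is a stand-in for HdrHistogram, not its API, and the 10 µs bucket width and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

BUCKET_US = 10  # assumed bucket width in microseconds (illustrative)


def record(hist, latency_us):
    """Count one sample in the bucket that contains its latency."""
    hist[latency_us // BUCKET_US] += 1


def merge(hists):
    """Merge per-host histograms by summing bucket counts."""
    merged = Counter()
    for hist in hists:
        merged.update(hist)  # Counter.update adds counts
    return merged


def percentile(hist, pct):
    """Return the upper edge (µs) of the bucket holding the pct-th percentile."""
    total = sum(hist.values())
    target = math.ceil(total * pct / 100)
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_US
    raise ValueError("empty histogram")


# Two hypothetical load hosts with synthetic samples.
host_a, host_b = Counter(), Counter()
for _ in range(99):
    record(host_a, 100)
record(host_a, 5000)          # one slow outlier on host A
for _ in range(100):
    record(host_b, 200)

combined = merge([host_a, host_b])
print("merged P99 (µs):", percentile(combined, 99))
print("merged max (µs):", percentile(combined, 100))
```

The single outlier on one host dominates that host's own tail but falls beyond the merged P99, which is why per-host percentiles should not be averaged.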
The full percentile distribution.
We don't publish an average. Averages hide tail latency, and tail latency is where databases earn — or lose — their keep.
Every number below is the merged percentile across the full 60-minute measurement window, sampled at 1Hz, with all four load generators running. P99 is the headline. P99.9 is the number your SREs will actually page on.
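A small synthetic example shows why an average is the wrong summary here. The latencies below are made up for illustration (94% of samples at 0.3 ms, 6% at 4.9 ms) and are not measured data; `nearest_rank_percentile` is a hypothetical helper using the nearest-rank definition.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic, illustrative latencies in milliseconds (not measured data).
latencies_ms = [0.3] * 940 + [4.9] * 60

mean_ms = statistics.fmean(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.3f} ms, P50 = {p50_ms} ms, P99 = {p99_ms} ms")
```

The mean lands near 0.58 ms, a latency no request in this sample actually experienced, while the P99 reports the slow path directly.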
Cache-hit reads — the easy case
About 94% of reads in steady state never touch NVMe. They're served from the LIRS cache in roughly 0.3 ms end-to-end. The CDF is so steep here it's almost not worth charting: P99 cache-hit latency is 0.31 ms, and P99.9 is 0.42 ms.
Cache-miss reads — the only number that matters
The interesting number is what happens when the LIRS cache misses and we have to go to disk. io_uring SQPOLL + O_DIRECT + 1024 dedicated NVMe submission queues keep P99 at 4.9 ms even when 6% of reads fall through and compaction is actively churning in the background.
Writes — the durability path
Group-commit WAL fsync amortizes the cost of durability across concurrent writers. P99 write latency was 0.21 ms, of which roughly 80 µs is the actual fsync. Disk fsync latency is the floor; we just made sure not to add anything on top of it.
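The group-commit idea can be sketched in a few lines of Python: buffer records from several writers, then make the whole batch durable with one write and one fsync. This is a single-threaded illustration under stated assumptions, not VeltrixDB's WAL; the class name `GroupCommitLog` and the record format are hypothetical.

```python
import os
import tempfile


class GroupCommitLog:
    """Toy append-only log that makes a batch of records durable with one fsync."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._pending = []
        self.fsync_calls = 0

    def append(self, record):
        """Queue a record; it is not durable until commit() returns."""
        self._pending.append(record)

    def commit(self):
        """Write all queued records, fsync once, and return the batch size."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        view = memoryview(b"".join(batch))
        while view:  # os.write may write fewer bytes than requested
            written = os.write(self._fd, view)
            view = view[written:]
        os.fsync(self._fd)  # one durability barrier for the whole batch
        self.fsync_calls += 1
        return len(batch)

    def close(self):
        os.close(self._fd)


with tempfile.TemporaryDirectory() as tmp:
    log = GroupCommitLog(os.path.join(tmp, "wal.log"))
    for i in range(8):
        log.append(b"record-%d\n" % i)
    committed = log.commit()
    print(f"{committed} records made durable with {log.fsync_calls} fsync")
    log.close()
```

If one fsync of roughly 80 µs covers 8 writers, the device cost per write drops to about 10 µs, although each writer still waits for the shared fsync to finish.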
Set up your own rig in an afternoon.
Every detail you need to validate these numbers on your own hardware — the Helm values, the load generator, the kernel sysctls. Email us and we'll ship the binaries.
You'll need a cloud account, three n2-standard-64-lssd instances for the cluster, four n2-standard-32 instances to generate load, a Linux ≥ 5.10 kernel, and about three hours of patience. The full harness lives in a private repo today. Drop us a note and we'll add you to it.
Step 1 — bring up the cluster
The Helm chart sets up the StorageClass, StatefulSet, anti-affinity rules, and ServiceMonitor in a single command. With the right values file, you're 94 seconds from a healthy 3-node cluster:
```shell
$ helm repo add veltrixdb https://charts.veltrixdb.com
"veltrixdb" has been added to your repositories
$ helm install bench veltrixdb/veltrixdb \
    --namespace veltrixdb --create-namespace \
    -f bench-values.yaml
✓ StorageClass · veltrixdb-nvme
✓ StatefulSet · 3 replicas, anti-affinity
✓ ServiceMonitor · 50+ metrics
✓ PodDisruptionBudget · minAvailable 2
✓ Ready in 94s
```
Step 2 — tune the host
VeltrixDB needs a few host-level tunings to deliver these numbers. The Helm chart applies the pod-level pieces automatically; the node-level ones you have to set yourself:
- Transparent hugepages · echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
- Swap · off · the ART index lives in RAM and must not be paged
- CPU governor · performance on all cores; don't let the kernel race-to-idle on tail latency
- NUMA · pin the VeltrixDB pod to a single NUMA node when possible
- Kernel · ≥ 5.10 for full io_uring SQPOLL support
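Before a run, it can help to confirm the node-level settings actually took effect. The following preflight script is a sketch, not part of the harness: it reads the standard Linux procfs and sysfs files for transparent hugepages, swap, and the CPU governor, and compares the kernel release against 5.10. The function names are hypothetical, it inspects only cpu0 for the governor, and it does not check NUMA pinning.

```python
import platform
import re
from pathlib import Path


def parse_thp_mode(text):
    """Extract the active mode from a line such as 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def swap_active(proc_swaps_text):
    """/proc/swaps has one header line, then one line per active swap device."""
    return len(proc_swaps_text.strip().splitlines()) > 1


def kernel_at_least(release, minimum=(5, 10)):
    """Compare the major.minor prefix of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def read_text(path):
    """Return file contents, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def preflight():
    """Run each check and return a dict of check name -> bool."""
    thp = read_text("/sys/kernel/mm/transparent_hugepage/enabled")
    swaps = read_text("/proc/swaps")
    gov = read_text("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
    return {
        "thp_madvise": thp is not None and parse_thp_mode(thp) == "madvise",
        "swap_off": swaps is not None and not swap_active(swaps),
        "governor_performance": gov is not None and gov.strip() == "performance",
        "kernel_ok": kernel_at_least(platform.release()),
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{'ok  ' if ok else 'FAIL'}  {check}")
```

A missing file is reported as a failure. Some virtual machines do not expose a cpufreq governor at all, so treat that check as a prompt to investigate, not as proof of misconfiguration.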
Step 3 — load 1 billion keys
The harness ships with a populator that streams 1B keys at ~3M writes/sec sustained. Total fill time is around 5.5 minutes on our reference cluster, well under a coffee break.
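The fill time follows directly from the key count and the write rate, which makes it easy to re-estimate for a slower rig. A quick back-of-the-envelope check, using only figures stated in this article (the payload estimate counts just the 16-byte keys and 128-byte values, before any index or WAL overhead):

```python
KEYS = 1_000_000_000
WRITES_PER_SEC = 3_000_000  # "~3M writes/sec sustained"
KEY_BYTES = 16
VALUE_BYTES = 128

fill_seconds = KEYS / WRITES_PER_SEC
raw_payload_gb = KEYS * (KEY_BYTES + VALUE_BYTES) / 1e9

print(f"fill time: {fill_seconds:.0f} s ({fill_seconds / 60:.1f} min)")
print(f"raw payload: {raw_payload_gb:.0f} GB of keys and values")
```

If your populator sustains a lower write rate, change `WRITES_PER_SEC` and the estimate scales accordingly.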
Step 4 — measure
Run the workload for 30 minutes warm-up, then 60 minutes measurement. The harness prints the merged HdrHistogram at every 10-second tick. If your P99 is more than 5% off from 4.9 ms, something is wrong — open the Grafana dashboard the chart installs at /grafana and check the per-shard tail latency.
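The 5% rule translates to an acceptance window of roughly 4.66 ms to 5.15 ms. A small helper makes the check explicit; the function names are hypothetical and the defaults simply restate the figures above.

```python
def relative_deviation_pct(measured, reference):
    """Absolute deviation of measured from reference, as a percentage."""
    return abs(measured - reference) / reference * 100


def within_tolerance(measured_p99_ms, reference_ms=4.9, tolerance_pct=5.0):
    """True if the measured P99 is within tolerance_pct of the reference."""
    return relative_deviation_pct(measured_p99_ms, reference_ms) <= tolerance_pct


for measured in (4.7, 5.1, 5.2):
    verdict = "ok" if within_tolerance(measured) else "investigate"
    print(f"P99 = {measured} ms -> {verdict} "
          f"({relative_deviation_pct(measured, 4.9):.1f}% off)")
```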
We'll help. If you set this up and the numbers don't match, that's a bug we want to fix. Email us with your hardware spec and the histogram output — we will turn around a root-cause analysis within 48 hours.