Moving from DynamoDB to VeltrixDB in two sprints

Why teams migrate

Two reasons, always the same.

Every team that walked us through a DynamoDB-to-VeltrixDB migration showed up with the same two concerns. The bill, and the tail.

If you're reading this guide, you probably already know your why. But for the engineering manager you'll be selling this migration to, the case lands cleanest as two numbers:

Cost · DynamoDB on-demand bills at roughly $1.00 per million blended ops. At 10 B ops/month that's $10K. Most VeltrixDB customers land 3–5× lower on the same workload, which at that volume is roughly $2K to $3.3K a month. The cost calculator is the fastest way to see the gap.
Tail latency · DynamoDB's published P99 is ~10 ms for single-key reads. In practice it's closer to 12–18 ms once throttling and burst-bucket effects kick in. VeltrixDB sustains a 4.9 ms P99, with no throttling tier.
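
For the manager who wants to check the arithmetic, here is a minimal sketch of the cost comparison. The price and volume are the figures quoted above. The function name is illustrative, and the 3–5× range is applied as a plain divisor, which is an assumption about how the saving scales.

```go
package main

import "fmt"

// monthlyCostUSD returns the bill for one month of traffic at a flat
// blended price per million operations.
func monthlyCostUSD(opsPerMonth, usdPerMillionOps float64) float64 {
	return opsPerMonth / 1e6 * usdPerMillionOps
}

func main() {
	// 10 B ops/month at roughly $1.00 per million blended ops.
	ddb := monthlyCostUSD(10e9, 1.00)
	fmt.Printf("DynamoDB on-demand: $%.0f/month\n", ddb)

	// The 3-5x range is the guide's claim, applied here as a simple divisor.
	fmt.Printf("VeltrixDB estimate: $%.0f to $%.0f/month\n", ddb/5, ddb/3)
}
```

Swap in your own monthly op count and blended price. The cost calculator remains the better tool for anything beyond a first estimate.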

Median migration time · 2 sprints · across 4 prod migrations
P99 latency drop · 12 ms → 4.9 ms · ~60% reduction
Cost reduction · 3–5× · on equivalent ops

The rest of this guide is the actual playbook — the same one our solutions architects walk Enterprise customers through. Two sprints. No big-bang cutover. Rollback at every stage.

Pre-flight

Confirm you're in scope — before you book the project.

VeltrixDB is a scalpel, not a Swiss Army knife. Spend an hour validating fit before you spend two sprints validating performance.

Before you start, run the table through this checklist. If any row is a "no," talk to us before booking the migration — there's a good chance VeltrixDB is the wrong tool for that workload.

Workload fit

Your hot path is point lookups — GET / SET / DEL / HGET-style. No Query calls with sort-key range conditions or Scan.
No global secondary indexes (GSIs) on the critical path. Or you can model the GSI access patterns as separate keys.
You have a partition key in DynamoDB that can map 1:1 to a VeltrixDB key. Composite (PK + SK) keys map to pk:sk string concatenations.
Values are ≤ 400 KB (DynamoDB's own limit). VeltrixDB supports up to 64 MB, but smaller values keep your RAM budget honest.
You can run on bare-metal NVMe or local SSD. EBS, Azure managed disks, and any network-attached storage are not supported.
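
The composite-key mapping in the checklist can be sketched as below. The guide only specifies a pk:sk concatenation. The backslash escaping of the delimiter is an assumption added here so that keys containing a colon cannot collide, and the function name is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// keyEscaper backslash-escapes the delimiter so that ("a:b", "c") and
// ("a", "b:c") cannot collapse into the same key. This escaping step is an
// assumption of this sketch, not something the guide prescribes.
var keyEscaper = strings.NewReplacer(`\`, `\\`, `:`, `\:`)

// compositeKey maps a DynamoDB partition key and sort key to one "pk:sk" string.
func compositeKey(pk, sk string) string {
	return keyEscaper.Replace(pk) + ":" + keyEscaper.Replace(sk)
}

func main() {
	// Keys without a colon come out as a plain concatenation.
	fmt.Println(compositeKey("user#42", "2026-01-15"))
	// A colon inside the partition key is escaped before joining.
	fmt.Println(compositeKey("a:b", "c"))
}
```

If your partition and sort keys can never contain a colon, the escaping is a no-op and the result is the plain concatenation described in the checklist.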

If you're using DynamoDB Streams · we ship a sidecar that consumes a CDC stream from your existing table during the migration window — you keep the rest of your event-driven architecture intact. Drop us a note before sprint 1 and we'll provision the right consumer for your event volume.

The plan

Two sprints. Five stages.

No big-bang cutover. Every stage runs DynamoDB in parallel until the very last hour. Rollback is one config flip away at every checkpoint.

Migration timeline · 2 sprints / 14 working days

Days 1–2 Sprint 1 · Stage 1

Stand up the cluster

Helm install in your VPC. 94-second deploy. Wire up Prometheus + Grafana. Validate health checks against an empty cluster. No production traffic yet.

Days 3–5 Sprint 1 · Stage 2

Bulk-load the snapshot

Use DynamoDB's ExportTableToPointInTime to S3, then run our bulk-importer. ~3M writes/sec sustained — a 1B-key table fills in about 5.5 minutes. Validate row count and a sample of values match.

Days 6–9 Sprint 1 · Stage 3

Enable dual-writes

Route all writes to both DynamoDB and VeltrixDB synchronously. Failure handling: log mismatches, alert on divergence > 0.1%, but do not roll back individual writes. DynamoDB remains the source of truth.

Days 10–11 Sprint 2 · Stage 4

Shadow reads

For each read served by DynamoDB, fire an async read against VeltrixDB. Compare responses, log mismatches, surface P99 latency on a side-by-side Grafana dashboard. Run for a full 48-hour weekend cycle.

Days 12–13 Sprint 2 · Stage 5

Gradual cutover

Flip the read source one cohort at a time — 5%, 25%, 50%, 100%. Each step requires the previous step's P99 to be stable for 4 hours. Rollback is a feature-flag flip, not a deploy.

Day 14 Wrap-up

Decommission DynamoDB writes

Drop the dual-write fan-out from your app. Optionally keep the DynamoDB table read-only for 7 days as cold backup. Save your final invoice — it's the receipt you bring to the QBR.

Infrastructure

Terraform you can paste into your repo.

Two resources, one module, eight inputs. Drop them into your existing EKS module, change the instance type to a local-NVMe SKU, and run terraform apply.

This snippet provisions a 3-node EKS-hosted VeltrixDB cluster on i3en.6xlarge instances (the AWS equivalent of our reference GCP rig). Adjust node count, instance class, or replace the EKS bits with GKE/AKS modules as needed.

terraform · veltrixdb.tf

# Helm release wrapping the VeltrixDB chart
resource "helm_release" "veltrixdb" {
  name       = "veltrixdb"
  repository = "https://charts.veltrixdb.com"
  chart      = "veltrixdb"
  version    = "0.9.4"
  namespace  = "veltrixdb"
  create_namespace = true

  values = [templatefile("./values.yaml.tpl", {
    nodes          = 3
    cache_gb       = 256
    nvme_gb        = 3000
    storage_class  = "veltrixdb-nvme"
    region         = "us-east-1"
  })]
}

# Local NVMe StorageClass (i3en family)
resource "kubernetes_storage_class" "nvme" {
  metadata { name = "veltrixdb-nvme" }
  storage_provisioner = "kubernetes.io/no-provisioner"
  volume_binding_mode = "WaitForFirstConsumer"
  parameters = {
    type = "local-ssd"
    fsType = "ext4"
  }
}

# Node group: i3en.6xlarge (24 vCPU, 192 GiB RAM, 2 × 7.5 TB local NVMe)
module "veltrix_nodes" {
  source           = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  cluster_name     = var.eks_cluster_name
  name             = "veltrixdb-nodes"
  instance_types   = ["i3en.6xlarge"]
  desired_size     = 3
  min_size         = 3
  max_size         = 6
  labels = { workload = "veltrixdb" }
  taints = [{ key = "workload", value = "veltrixdb", effect = "NO_SCHEDULE" }]
}

Bulk-import from DynamoDB

The import runs as a Kubernetes Job that streams from your DynamoDB S3 export directly into the new VeltrixDB cluster. It's idempotent, so it's safe to re-run if it crashes partway through. The export itself requires point-in-time recovery (PITR) to be enabled on the source table.

~/migration · zsh

$ aws dynamodb export-table-to-point-in-time \
    --table-arn arn:aws:dynamodb:us-east-1:123456789012:table/sessions \
    --s3-bucket veltrix-migration-staging \
    --export-format DYNAMODB_JSON

$ kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata: { name: veltrix-import, namespace: veltrixdb }
spec:
  template:
    spec:
      containers:
      - name: importer
        image: ghcr.io/veltrixdb/ddb-importer:0.9.4
        args: ["--s3-prefix", "s3://veltrix-migration-staging/AWSDynamoDB/.../"]
      restartPolicy: OnFailure
EOF

$ kubectl logs -f -n veltrixdb job/veltrix-import
imported 247,308,124 rows · 5m23s · 0 mismatches · ✓

Dual writes

Fan out writes, keep DynamoDB as the source of truth.

From stage 3 until decommission, every write goes to both databases synchronously. If divergence climbs above 0.1%, an alarm fires and you fix it before you ever flip reads.

The dual-write step is where most botched migrations go wrong. The temptation is to fan out writes asynchronously to keep latency down. Don't. Async fan-out hides bugs that only surface during cutover. Run both writes synchronously inside the request, DynamoDB first and VeltrixDB second, and accept the latency cost during the migration window.

Reference implementation

The pattern, in pseudocode, looks like this. The key invariants: write to DynamoDB first (still source of truth), then write to VeltrixDB, then log any inconsistencies — but never roll back a successful DynamoDB write because the VeltrixDB write failed.

repo/sessions/dual_writer.go

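// Put writes to DynamoDB (still the source of truth), then mirrors the write
// to VeltrixDB synchronously. A mirror failure is counted and logged but
// never returned, so it cannot fail or roll back the DynamoDB write.
func (r *Repo) Put(ctx context.Context, k string, v []byte) error {
    // 1. Source of truth still wins
    if err := r.ddb.Put(ctx, k, v); err != nil {
        return err
    }
    // 2. Mirror to VeltrixDB synchronously. No goroutine: an async fan-out
    //    would hide failures until cutover, and the request context may be
    //    cancelled as soon as this function returns.
    cctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
    defer cancel()
    if err := r.veltrix.Put(cctx, k, v); err != nil {
        metrics.DualWriteFail.WithLabelValues("veltrix").Inc()
        log.Warn("veltrix mirror failed", "key", k, "err", err)
    }
    return nil
}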

Watch the divergence metric. A healthy dual-write phase should see <0.01% mirror failures, almost entirely network blips. If you see >0.1%, something is wrong — most often a payload that exceeds 400 KB in DynamoDB but fits in VeltrixDB's 64 MB envelope and gets corrupted during your encoder's roundtrip. Fix before proceeding.
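
One way to turn those thresholds into a check is sketched below. In production this would more likely be a Prometheus alert rule over the DualWriteFail counter. The in-process tracker, its type and function names, and the "watch" label for the band between the two thresholds are all assumptions of this sketch.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Thresholds quoted in the guide, expressed as fractions.
const (
	healthyBelow = 0.0001 // 0.01% mirror failures
	haltAbove    = 0.001  // 0.1% divergence
)

// divergence counts dual writes and mirror failures; safe for concurrent use.
type divergence struct {
	writes, failures atomic.Int64
}

// record notes one dual write and whether the VeltrixDB mirror succeeded.
func (d *divergence) record(mirrorOK bool) {
	d.writes.Add(1)
	if !mirrorOK {
		d.failures.Add(1)
	}
}

// rate returns mirror failures as a fraction of all recorded writes.
func (d *divergence) rate() float64 {
	w := d.writes.Load()
	if w == 0 {
		return 0
	}
	return float64(d.failures.Load()) / float64(w)
}

// status classifies the current rate. "watch" is this sketch's own label
// for the band between the two thresholds.
func (d *divergence) status() string {
	switch r := d.rate(); {
	case r > haltAbove:
		return "halt"
	case r < healthyBelow:
		return "healthy"
	default:
		return "watch"
	}
}

// simulate records total writes, of which the first failed mirrors fail.
func simulate(total, failed int) string {
	var d divergence
	for i := 0; i < total; i++ {
		d.record(i >= failed)
	}
	return d.status()
}

func main() {
	fmt.Println(simulate(10000, 0))  // no mirror failures
	fmt.Println(simulate(10000, 20)) // 0.2% mirror failures
}
```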

Shadow reads

Read both. Trust neither. Compare everything.

Before flipping the read path, you spend 48 hours reading from both databases and comparing the answers byte-for-byte. This is where you catch the bugs that bulk-import didn't.

The shadow-read phase runs on the same dual-write pattern, inverted. For every read your application makes, fire a second read against VeltrixDB, compare, log mismatches, and surface the side-by-side P99 chart in Grafana. Your users still see DynamoDB's answer — VeltrixDB is on probation.
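
A minimal sketch of that shadow-read wrapper follows. The Store interface, the ShadowReader type, the in-memory mapStore, and the 200 ms shadow timeout are all hypothetical stand-ins. Your real DynamoDB and VeltrixDB clients will have their own signatures.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Store is a hypothetical read interface that both clients would satisfy.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// ShadowReader serves reads from Primary and compares them against Shadow
// in the background. Callers only ever see Primary's answer.
type ShadowReader struct {
	Primary, Shadow Store
	OnMismatch      func(key string) // e.g. bump a counter and log the key
	wg              sync.WaitGroup   // lets shutdown and tests wait for comparisons
}

// Get returns the primary value and fires an asynchronous shadow comparison.
func (s *ShadowReader) Get(ctx context.Context, key string) ([]byte, error) {
	v, err := s.Primary.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	want := bytes.Clone(v) // private copy, so the caller may mutate v freely
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		// Detached context: the comparison should outlive the request.
		sctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		got, serr := s.Shadow.Get(sctx, key)
		if serr != nil || !bytes.Equal(want, got) {
			s.OnMismatch(key)
		}
	}()
	return v, nil
}

// Wait blocks until all in-flight comparisons have finished.
func (s *ShadowReader) Wait() { s.wg.Wait() }

// mapStore is an in-memory stand-in used only for the demo.
type mapStore map[string][]byte

func (m mapStore) Get(_ context.Context, key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

// demo reads two keys, one of which differs between the stores, and
// returns the keys reported as mismatched.
func demo() []string {
	var mu sync.Mutex
	var mismatched []string
	r := &ShadowReader{
		Primary: mapStore{"a": []byte("1"), "b": []byte("2")},
		Shadow:  mapStore{"a": []byte("1"), "b": []byte("x")},
		OnMismatch: func(k string) {
			mu.Lock()
			defer mu.Unlock()
			mismatched = append(mismatched, k)
		},
	}
	for _, k := range []string{"a", "b"} {
		if _, err := r.Get(context.Background(), k); err != nil {
			panic(err)
		}
	}
	r.Wait()
	return mismatched
}

func main() { fmt.Println(demo()) }
```

Unlike the dual write, the shadow read is asynchronous on purpose: users are still served by DynamoDB, so the comparison should never add to their latency.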

Shadow-read exit criteria

Mismatch rate < 0.001% over a full weekend cycle (catches any time-zone or weekend-batch effects)
VeltrixDB P99 read latency < 8ms for 4 hours straight under peak prod load
Cache hit rate > 90% in steady state — the LIRS cache should be warm by now
Zero 5xx errors on the VeltrixDB endpoint during the window
An incident runbook merged that documents the rollback procedure for the next stage
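
The exit criteria above can be encoded as a single gate, sketched here. The thresholds come from the list. The shadowStats struct, its field names, and the idea of feeding it from your metrics are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// shadowStats is a hypothetical summary of the shadow-read window.
type shadowStats struct {
	MismatchRate  float64       // fraction of compared reads that differed
	P99           time.Duration // VeltrixDB P99 read latency under peak load
	P99StableFor  time.Duration // how long P99 has stayed under the limit
	CacheHitRate  float64       // steady-state fraction, 0 to 1
	ServerErrors  int           // 5xx responses from the VeltrixDB endpoint
	RunbookMerged bool          // rollback runbook for the next stage is merged
}

// unmet returns the exit criteria that are not yet satisfied.
func unmet(s shadowStats) []string {
	var out []string
	if s.MismatchRate >= 0.00001 { // 0.001%
		out = append(out, "mismatch rate must be < 0.001%")
	}
	if s.P99 >= 8*time.Millisecond || s.P99StableFor < 4*time.Hour {
		out = append(out, "P99 must be < 8 ms for 4 hours straight")
	}
	if s.CacheHitRate <= 0.90 {
		out = append(out, "cache hit rate must be > 90%")
	}
	if s.ServerErrors != 0 {
		out = append(out, "5xx errors must be zero")
	}
	if !s.RunbookMerged {
		out = append(out, "rollback runbook must be merged")
	}
	return out
}

// passing returns sample stats that satisfy every criterion.
func passing() shadowStats {
	return shadowStats{
		MismatchRate:  0.000002,
		P99:           5 * time.Millisecond,
		P99StableFor:  6 * time.Hour,
		CacheHitRate:  0.95,
		ServerErrors:  0,
		RunbookMerged: true,
	}
}

func main() {
	s := passing()
	fmt.Println("unmet criteria:", len(unmet(s)))
	s.RunbookMerged = false
	fmt.Println(unmet(s))
}
```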

Cutover

5 → 25 → 50 → 100. No big-bang flips.

Move the read traffic in four feature-flag-controlled steps. Each step must be stable for four hours before the next. Rollback at any step is a single config change.

We've never seen a migration fail at this stage if shadow-reads were clean — but we still walk through the cohorts. Discipline here is what makes the postmortem boring.

Cutover cohorts · 16-hour walk

0h–4h

5% of read traffic → VeltrixDB

Flag read_source=veltrix applied to a 5% user-id hash range. Watch P99, error rate, customer-support tickets. No deploys during this window.

4h–8h

25%

Same hash-range scheme, expanded. If the 5% window showed no anomalies, the 25% should be uneventful. Halt if you see anything you don't recognise.

8h–12h

50%

Half your traffic now reads from VeltrixDB. This is the step where any cold-cache effects would surface in production — they shouldn't, because shadow reads warmed the cache for 48 hours, but watch the hit rate anyway.

12h–16h

100%

All traffic on VeltrixDB. DynamoDB still receiving writes (you haven't decommissioned the dual-write yet — that's the next step). Save the Grafana snapshot. Send it to your engineering manager.

Rollback

If anything goes wrong at any step, the rollback is a single feature-flag flip: read_source=dynamodb. Because writes are still going to both databases, DynamoDB never went stale. You lose nothing, and you debug at leisure.
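
The hash-range cohorts and the rollback flip can be sketched together. The FNV hash, the 100-bucket scheme, and the function names are assumptions chosen for illustration. Your feature-flag system may already provide stable percentage rollouts.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a user ID to a stable value in [0, 100).
func bucket(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32() % 100
}

// readSource returns "veltrix" for users whose bucket falls below the
// rollout percentage, and "dynamodb" for everyone else. Because buckets are
// stable, a user moved to VeltrixDB at 5% stays there at 25%, 50%, and 100%.
func readSource(userID string, rolloutPercent uint32) string {
	if bucket(userID) < rolloutPercent {
		return "veltrix"
	}
	return "dynamodb"
}

func main() {
	// Walk the four cohorts from the plan for one sample user.
	for _, pct := range []uint32{5, 25, 50, 100} {
		fmt.Printf("%3d%% -> %s\n", pct, readSource("user-12345", pct))
	}
	// Rollback: setting the percentage to 0 sends every read to DynamoDB.
	fmt.Println("rollback ->", readSource("user-12345", 0))
}
```

Keeping the percentage in a flag rather than in code is what makes the rollback a config change instead of a deploy.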

In four migrations to date, zero have needed a rollback at the cutover stage. The two times shadow-reads surfaced a problem (a serialization quirk and a Unicode normalization edge case), the team caught it during the 48-hour shadow window, fixed it, and proceeded clean. That's the entire point of running shadow-reads.

Wrap-up

Decommission, save the receipts.

A week after cutover, drop the dual-write fan-out and put the DynamoDB table into read-only mode. Keep it around for a month as a cold backup, then archive it.

The decommission step is procedural — no surprises. Drop the dual_writer.go path from your repo, point the DynamoDB IAM role at a read-only policy, and schedule a calendar reminder for 30 days out to delete the table. Most teams keep the table around longer than they need to — that's fine, and the cost is trivial.

What's not trivial: save the DynamoDB invoice from the month before cutover. Compare it to the VeltrixDB invoice three months later. That delta is the receipt you bring to the next budget review. We've seen customers cut their annual database spend by $340K to $1.2M, and that's the number your CFO will want to see in writing.

Playbook based on four production migrations between Q4 2025 and Q1 2026 · median duration 11 working days · zero rollbacks.

Want a solutions architect on the migration?

Enterprise tier includes a named solutions architect end-to-end — Terraform, schema mapping, cutover plan, and the on-call escalation during cutover weekend. No additional cost.
