Backup and restore

The state that matters lives in four volumes. Everything else (Prometheus metrics, Redis cache) is rebuilt automatically and not worth backing up.

What to back up

Volume	What is in it	Frequency	RPO target
`postgres-data`	Gateway keys, budgets; trace metadata	Nightly	24 h
`mongodb-data`	Chat conversations, users, agents	Nightly	24 h
`clickhouse-data`	Langfuse traces	Weekly	7 d
`minio-data`	Langfuse blob payloads	Weekly	7 d

Postgres + MongoDB are the business-critical stores. Without them you lose user accounts and conversation history. ClickHouse + MinIO are forensic — losing them costs you observability history, not operational state.

Postgres

Logical dump on a schedule:

docker compose exec -T postgres \
  pg_dumpall -U "$POSTGRES_USER" \
  | gzip > "/backups/pg_$(date +%F).sql.gz"

Run that nightly from cron or a systemd timer, and keep 30 days of dumps. Restore:

gunzip < pg_2026-05-25.sql.gz | \
  docker compose exec -T postgres psql -U "$POSTGRES_USER"
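The nightly dump and the 30-day retention above could be wrapped in one script. What follows is a minimal sketch, not a definitive implementation: it assumes bash and a `find` that supports `-maxdepth` and `-delete` (GNU and BSD both do), and the names `BACKUP_DIR`, `RETENTION_DAYS`, `dump_postgres` and `prune_postgres_dumps` are illustrative rather than part of the stack. `set -o pipefail` matters here because, without it, a failed `pg_dumpall` still leaves a small but valid-looking gzip file behind.

```shell
#!/usr/bin/env bash
# Sketch: nightly Postgres dump with retention. All names below are illustrative.
set -euo pipefail   # pipefail: a failing pg_dumpall fails the whole pipeline

BACKUP_DIR="${BACKUP_DIR:-/backups}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"

# Dump every database to a dated, gzip-compressed file.
dump_postgres() {
  local out tmp
  out="$BACKUP_DIR/pg_$(date +%F).sql.gz"
  tmp="$out.partial"
  # Write under a temporary name so an interrupted dump never looks complete.
  docker compose exec -T postgres pg_dumpall -U "$POSTGRES_USER" | gzip > "$tmp"
  mv "$tmp" "$out"
}

# Delete Postgres dumps older than RETENTION_DAYS; other files are left alone.
prune_postgres_dumps() {
  find "$BACKUP_DIR" -maxdepth 1 -type f -name 'pg_*.sql.gz' \
    -mtime "+$RETENTION_DAYS" -print -delete
}
```

Ending the script with `dump_postgres && prune_postgres_dumps` means old dumps are only pruned after a new one has been written successfully.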

MongoDB

docker compose exec -T mongodb \
  mongodump --archive --gzip \
  > "/backups/mongo_$(date +%F).gz"

Restore:

docker compose exec -T mongodb \
  mongorestore --archive --gzip --drop < mongo_2026-05-25.gz

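--drop drops each collection on the target before restoring it, so anything written there since the backup is lost. Use it only when you intend to overwrite the target, for example on a fresh restore host.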

ClickHouse + MinIO

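Both are too large for daily logical dumps. Snapshot the underlying volumes weekly instead. A tar of a live data directory is not guaranteed to be consistent, so stop the service first if you can afford the downtime: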

docker run --rm \
  -v npuops_clickhouse-data:/data \
  -v "$PWD/backups:/out" \
  alpine tar czf "/out/clickhouse_$(date +%F).tar.gz" /data

…and the same for npuops_minio-data. Restore by extracting back into the volume while the service is stopped.
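The restore described above (stop the service, extract, start it again) might look like the sketch below. It is an illustration under stated assumptions: the compose service name `clickhouse` is a guess, and `restore_volume` and the `DOCKER` override are hypothetical helpers, not part of the stack. Because the backup command archives `/data`, tar stores members under `data/`, so extracting with `-C /` puts them back into the mounted volume.

```shell
#!/usr/bin/env bash
# Sketch: restore a volume tarball while its service is stopped.
set -euo pipefail

# Override with DOCKER=echo for a dry run that only prints the commands.
DOCKER="${DOCKER:-docker}"

# restore_volume SERVICE VOLUME ARCHIVE
restore_volume() {
  local service="$1" volume="$2" archive="$3" dir file
  dir="$(cd "$(dirname "$archive")" && pwd)"   # absolute path for the bind mount
  file="$(basename "$archive")"

  "$DOCKER" compose stop "$service"
  # Archive members are stored as data/..., so -C / extracts into /data.
  "$DOCKER" run --rm \
    -v "$volume:/data" \
    -v "$dir:/in:ro" \
    alpine tar xzf "/in/$file" -C /
  "$DOCKER" compose start "$service"
}
```

For example, `restore_volume clickhouse npuops_clickhouse-data backups/clickhouse_2026-05-25.tar.gz`. Extraction overlays existing files rather than clearing the volume, so a clean restore is best done into a freshly created volume.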

For larger deployments, replicate to off-host storage — S3, NFS, restic, borg, your IT team's preferred tool. Don't keep backups on the same disk as production data.

Ship backups off-host

Whatever scheduler you use, end the script with a copy to off-host storage. Examples:

# Rsync to a separate VM
rsync -av /backups/ backup-vm:/srv/nufi/

# S3 / MinIO
aws s3 sync /backups s3://nufi-backups/nufi/

# B2 / Wasabi etc.
restic backup /backups

Restore drill

Schedule a drill once a quarter. The drill:

Spin up a parallel compose stack on a separate host.
Stop it. Replace its volumes with the latest backups.
Start it.
Sign in. Pick a recent conversation and confirm its history loads. Issue a chat and confirm it gets a response.

Untested backups are not backups. The drill is the only way you learn the gotchas (e.g. ClickHouse migration mismatches if the schema moved forward) before an actual incident.
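Between drills, a cheap integrity check can at least catch empty or truncated files. The sketch below is not a substitute for the drill: it only confirms that a file is non-empty and a valid gzip stream, which applies to the `.sql.gz` dumps and `.tar.gz` snapshots above. The function name `check_archive` is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a gzip-compressed backup file.
set -euo pipefail

# check_archive FILE: succeed only if FILE is non-empty and a valid gzip stream.
check_archive() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "empty or missing: $file" >&2
    return 1
  fi
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt or truncated gzip: $file" >&2
    return 1
  fi
}
```

Running `check_archive "/backups/pg_$(date +%F).sql.gz"` right after the dump, and failing the backup job when it returns non-zero, surfaces a broken dump the same night rather than at the next drill.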

What you do not back up

Prometheus metrics. Cheap to lose; would only impact graphs.
Redis. Rate-limit counters and cache. Reset on restart.
Grafana data. Dashboards are provisioned from monitoring/grafana/dashboards/ (in git), so a fresh container reads them on first boot.
MinIO langfuse bucket bootstrap. Recreated by minio-init on startup.

On Postgres tablespace growth

Langfuse stores metadata in Postgres and traces in ClickHouse, which keeps Postgres small (~1–5 GB at moderate scale). If your Postgres dump grows past ~10 GB, something else is wrong, typically the gateway's spend logs not being purged. Set spend_logs_max_size in litellm/config.yaml to enforce retention.
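The ~10 GB threshold can be checked automatically after each dump. This is a rough sketch with assumptions: it compares the on-disk size of the dump file (which is gzip-compressed in the commands above, so the uncompressed data is larger), and `dump_exceeds` and `LIMIT_BYTES` are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: flag a Postgres dump that has grown past a size threshold.
set -euo pipefail

# 10 GiB, taken from the ~10 GB rule of thumb; adjust to taste.
LIMIT_BYTES="${LIMIT_BYTES:-$((10 * 1024 * 1024 * 1024))}"

# dump_exceeds FILE LIMIT: succeed if FILE is larger than LIMIT bytes.
dump_exceeds() {
  local file="$1" limit="$2" size
  size=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
  [ "$size" -gt "$limit" ]
}
```

A backup script could then log a warning with something like `if dump_exceeds "$out" "$LIMIT_BYTES"; then echo "Postgres dump is unusually large" >&2; fi`, where `$out` is the path of the dump just written.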