Troubleshooting
The most common failure modes and how to recover.
"Model dropdown is empty / stuck on loading…"
The chat could not reach the gateway, or the gateway has zero models.
# 1. Is the gateway up?
docker compose ps litellm-proxy
curl -s http://localhost:4000/health/liveliness
# 2. Does the gateway know about any models?
curl -s -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
http://localhost:4000/v1/models | jq '.data | length'
# 3. Can the chat reach the gateway from inside the network?
docker compose exec librechat \
wget -qO- http://litellm-proxy:4000/health/liveliness
If (2) returns 0, register a model with ./scripts/add-model.sh.
If (3) returns bad address, the Docker network is broken. Restart
the stack: docker compose down && docker compose up -d.
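The three checks above can be chained so that the first failure is reported with its remedy. This is a minimal sketch, not part of the stack: the function name `triage_model_dropdown` and the `GATEWAY_URL` variable are made up for illustration, and it assumes `jq` is installed on the host.

```shell
# Hypothetical helper: run the three checks in order, stop at the first failure.
GATEWAY_URL="${GATEWAY_URL:-http://localhost:4000}"

triage_model_dropdown() {
    # 1. Is the gateway answering on the host?
    if ! curl -sf "$GATEWAY_URL/health/liveliness" >/dev/null; then
        echo "gateway down: check 'docker compose ps litellm-proxy'"
        return 1
    fi

    # 2. Does the gateway list at least one model?
    count=$(curl -sf -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
        "$GATEWAY_URL/v1/models" | jq '.data | length')
    if [ "${count:-0}" -eq 0 ]; then
        echo "no models registered: run ./scripts/add-model.sh"
        return 2
    fi

    # 3. Can the chat container resolve and reach the gateway?
    if ! docker compose exec -T librechat \
        wget -qO- http://litellm-proxy:4000/health/liveliness >/dev/null; then
        echo "chat cannot reach gateway: restart the stack"
        return 3
    fi

    echo "gateway reachable with $count model(s)"
}
```

Call `triage_model_dropdown` from the directory that holds the compose file; the return code (1, 2 or 3) matches the numbered check that failed.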
"Server listening" never appears in chat logs
The usual cause is a missing or malformed .env. Stop the stack, re-run
./scripts/bootstrap.sh, and restart.
docker compose down
./scripts/bootstrap.sh
docker compose logs -f librechat
Login works on chat, console redirects to /unauthorized
The session cookie is not reaching the console origin.
Check, in order:
- COOKIE_DOMAIN is set in .env (e.g. .nufi.me).
- COOKIE_SAMESITE=lax (Strict blocks cross-subdomain).
- The reverse proxy is rewriting Set-Cookie with Domain=.nufi.me.
- Both chat and console are HTTPS, not mixed.
- The shared JWT_REFRESH_SECRET is identical in the chat and console env.
See SSO + reverse proxy.
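To verify the Domain and SameSite items of the checklist, it can help to test the actual Set-Cookie header the browser received. The sketch below is illustrative only: `check_set_cookie` is a hypothetical helper, and `.nufi.me` is just the example domain used above.

```shell
# Hypothetical helper: inspect one Set-Cookie header line.
# $1 = the header line, $2 = the expected cookie domain (your COOKIE_DOMAIN).
check_set_cookie() {
    header=$1
    want_domain=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    rc=0
    # Lower-case the header so attribute matching is case-insensitive.
    lower=$(printf '%s' "$header" | tr '[:upper:]' '[:lower:]')

    case "$lower" in
        *"domain=$want_domain"*) ;;
        *) echo "missing Domain=$want_domain"; rc=1 ;;
    esac
    case "$lower" in
        *"samesite=strict"*) echo "SameSite=Strict set; expected Lax"; rc=1 ;;
    esac
    return $rc
}
```

Copy the Set-Cookie line of the login response from the browser's network tab and pass it as the first argument, for example `check_set_cookie "$line" .nufi.me`. No output and a zero exit code mean both attributes look right.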
network <name> not found (nufi-chat shared-network mode)
Shared-network mode is on but the named external network does not exist. Either:
- Create the network (or its owning stack).
- Change SHARED_DOCKER_NETWORK in .env to a network that does exist.
- Disable shared-network mode: rm docker-compose.override.yml && docker compose up -d.
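Before picking one of these options, it is worth confirming which network name the stack is actually looking for. A minimal sketch, assuming .env uses plain KEY=value lines; `network_from_env` and `check_shared_network` are hypothetical helper names.

```shell
# Hypothetical helper: print the SHARED_DOCKER_NETWORK value from an env file.
network_from_env() {
    # Take the last assignment in the file and strip optional quotes.
    grep -E '^SHARED_DOCKER_NETWORK=' "${1:-.env}" | tail -n 1 \
        | cut -d= -f2- | tr -d '"'"'"
}

# Hypothetical helper: report whether that network exists on this Docker host.
check_shared_network() {
    net=$(network_from_env "${1:-.env}")
    if [ -z "$net" ]; then
        echo "SHARED_DOCKER_NETWORK is not set"
        return 1
    fi
    if docker network inspect "$net" >/dev/null 2>&1; then
        echo "network '$net' exists"
    else
        echo "network '$net' not found; create it with: docker network create $net"
        return 1
    fi
}
```

Run `check_shared_network` next to the .env file. If the network is normally created by another stack, starting that stack is usually preferable to creating the network by hand.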
denied: denied when pulling images
Not logged in to GHCR, or your PAT expired.
echo <new-pat> | docker login ghcr.io -u <gh-username> --password-stdin
docker compose pull
Gateway healthcheck never green
docker compose logs litellm-proxy --tail 200
Common causes:
- Postgres still warming up: wait 30 seconds and retry. If it persists, check docker compose logs postgres for crashes.
- Bad model in litellm/config.yaml: typo in model:, missing env var. The gateway logs the offending entry.
- LLM_GUARD_API_BASE unreachable: LLM Guard took longer than expected to come up (DeBERTa cold-start). Wait, or set start_period: 180s on its healthcheck.
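Where the advice is to wait, a polling loop avoids guessing how long. The sketch below reads the container health status that Docker reports; `wait_healthy` is a hypothetical helper, and the 180-second default simply mirrors the start_period mentioned above.

```shell
# Hypothetical helper: poll a compose service until Docker reports it healthy.
# $1 = service name, $2 = timeout in seconds, $3 = poll interval in seconds.
wait_healthy() {
    service=$1
    timeout=${2:-180}
    interval=${3:-5}
    elapsed=0
    state=""
    while [ "$elapsed" -lt "$timeout" ]; do
        cid=$(docker compose ps -q "$service")
        state=$(docker inspect --format '{{.State.Health.Status}}' "$cid" 2>/dev/null)
        if [ "$state" = "healthy" ]; then
            echo "$service healthy after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$service still '$state' after ${timeout}s"
    return 1
}
```

For example, `wait_healthy litellm-proxy 180` returns non-zero if the gateway never turns green, at which point the logs command above is the next step.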
Langfuse traces stop appearing
docker compose logs langfuse-worker --tail 200
Usually one of:
- ClickHouse out of disk.
- MinIO out of disk.
- The gateway is not configured to emit traces (check the LANGFUSE_* env on the gateway).
High RAM, OOM kills
ClickHouse and Langfuse worker are the heavy hitters. Cap them in
compose (deploy.resources.limits.memory). If you cap ClickHouse, you
may also want to lower its max_memory_usage and
max_bytes_before_external_group_by query settings.
Disk filling fast
ClickHouse and MinIO grow with traffic. Plan for ~50 MB per 1000 traces. See Backup and restore for the retention story and Infra sizing for split points.
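The rule of thumb above can be turned into a rough capacity estimate. This is a back-of-the-envelope sketch using integer shell arithmetic; `estimate_trace_mb` is a hypothetical helper, and the result is only as good as the ~50 MB per 1000 traces figure.

```shell
# Hypothetical helper: estimate trace storage in MB.
# $1 = traces per day, $2 = retention in days (default 30).
MB_PER_1000_TRACES=50

estimate_trace_mb() {
    traces_per_day=$1
    days=${2:-30}
    echo $(( traces_per_day * days * MB_PER_1000_TRACES / 1000 ))
}
```

For instance, `estimate_trace_mb 20000 90` prints 90000, so 20 000 traces a day kept for 90 days needs roughly 90 GB.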
Admin panel sessions keep expiring
Default idle timeout is 30 minutes. Raise it via
ADMIN_SESSION_IDLE_TIMEOUT_MS in the admin-panel container env. The
session revalidates against the chat backend every 60 s. If the chat
backend is briefly unhealthy, the revalidation fails and the panel
signs the user out. Check docker compose logs chat.
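Because the timeout is expressed in milliseconds, it is easy to be off by a factor of 1000. A small sketch for producing the value; `minutes_to_ms` is a hypothetical helper, and the 120-minute figure is only an example.

```shell
# Hypothetical helper: convert minutes to milliseconds.
minutes_to_ms() {
    echo $(( $1 * 60 * 1000 ))
}

# Print an env line for a two-hour idle timeout (example value).
echo "ADMIN_SESSION_IDLE_TIMEOUT_MS=$(minutes_to_ms 120)"
```

The default of 30 minutes corresponds to 1800000.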
"Conversation not found" after an upgrade
Usually a chat schema change. Tail the chat logs while loading the conversation:
docker compose logs -f librechat
If you see schema-migration errors, you crossed a major NUFI version. See Upgrade chat backend. Worst case: restore the previous MongoDB backup and re-do the upgrade following the upstream migration notes.
Console JIT-provision fails
docker compose logs console --tail 100 | grep -i provision
Common causes:
- Master key mismatch between console env and gateway env.
- The gateway is down or unhealthy.
- The gateway lost its database connection (look in the gateway logs).
JIT-provision is idempotent — once the gateway is back, the next console visit completes the provision.
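A master key mismatch can be checked without printing the secret by comparing what each container sees. This is a sketch under assumptions: `same_env` is a hypothetical helper, `printenv` must exist in both images, and the variable name inside the console container may differ from the gateway's `LITELLM_MASTER_KEY`.

```shell
# Hypothetical helper: compare one env variable across two compose services.
# $1 = variable name, $2 and $3 = service names. The value is never printed.
same_env() {
    var=$1
    svc_a=$2
    svc_b=$3
    a=$(docker compose exec -T "$svc_a" printenv "$var") || a=""
    b=$(docker compose exec -T "$svc_b" printenv "$var") || b=""
    if [ -z "$a" ] || [ -z "$b" ]; then
        echo "$var is unset in at least one service"
        return 2
    fi
    if [ "$a" = "$b" ]; then
        echo "$var matches"
    else
        echo "$var differs between $svc_a and $svc_b"
        return 1
    fi
}
```

For example, `same_env LITELLM_MASTER_KEY console litellm-proxy`, adjusting the variable name if the console uses a different one.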
Where to look first
Three commands cover 80 % of incidents:
docker compose ps # any service unhealthy?
docker compose logs -f --tail 200 # last lines of every service
curl -s http://localhost:9090/api/v1/alerts | jq # firing alerts