RAG Integration
How Retrieval-Augmented Generation fits into NUFI Chat today, and the path to the Qdrant + NuFi Flow vision.
This is a design/research note, not a how-to. It maps what RAG (Retrieval-Augmented Generation) already exists inside NUFI Chat, what it would take to turn on, and how that near-term capability lines up with the longer-term NUFI platform vision (Qdrant vector store + NuFi Flow no-code pipelines). Read it before committing to a RAG architecture so we pick the path that doesn't have to be unwound later.
TL;DR
- NUFI Chat is a fork of LibreChat, which already ships a complete RAG
subsystem — a
rag_apisidecar backed by pgvector. It is off by default and turns on with a few environment variables and two extra containers. - For a demo we can enable in-product document Q&A today with no application code changes. The only NPU-specific wiring is pointing the embeddings call at an NPU-served, OpenAI-compatible embeddings model.
- The platform vision (Qdrant + NuFi Flow + Langfuse) is a different layer, not a replacement. NuFi Flow authors pipelines and exposes them as an OpenAI-compatible endpoint that Chat consumes through the Gateway like any other model. The two can coexist; we recommend starting with native RAG and adding the Flow path as pipelines mature.
Two meanings of "RAG" in NUFI
The platform matrix lists RAG in two places, and they are not the same feature:
| Native Chat RAG (L3 file Q&A) | NuFi Flow RAG (L2 pipeline authoring) | |
|---|---|---|
| Who uses it | End users — drop a file into a chat | Engineers / DS — build a pipeline visually |
| Vector store | pgvector (ships with LibreChat) | Qdrant |
| Trigger | File upload in the chat composer | A designed Langflow graph |
| Surface | Built into NUFI Chat | NuFi Flow (Langflow fork) → REST endpoint |
| Status | Present, just disabled | Roadmap (Phase 3) |
Confusing the two leads to wasted work (e.g. trying to make Chat talk to Qdrant directly). Keep them as distinct integration paths — described below as Path A and Path B.
What already exists (LibreChat native RAG)
LibreChat's RAG is a separate microservice the app talks to over HTTP. Nothing in
the chat backend embeds or stores vectors itself — it delegates to rag_api.
┌──────────────┐ upload ┌───────────────┐ embed ┌──────────────────┐
│ NUFI Chat │────────────▶│ rag_api │───────────▶│ embeddings model │
│ (api) │ /embed │ (FastAPI) │ OpenAI │ (OpenAI-compat) │
│ │◀────────────│ │ /v1/embeddings │
└──────┬───────┘ citations └──────┬────────┘ └──────────────────┘
│ ask question │ similarity search
▼ ▼
┌──────────────┐ ┌───────────────┐
│ LLM (gateway)│ │ pgvector │
└──────────────┘ │ (Postgres) │
└───────────────┘Where this lives in the fork (dudaji-vn/nufichat):
api/server/services/Files/VectorDB/crud.js— therag_apiclient.uploadVectors()POSTs toRAG_API_URL/embed;deleteVectors()DELETEs fromRAG_API_URL/documents. Calls are authenticated with the user's JWT.api/server/services/Files/process.js— the "dual storage" router: RAG-eligible files go to embedding and object storage; everything else uses plain storage.rag.yml— a ready-made Compose stack withvectordb(pgvector) andrag_api..env.example(RAG section) —RAG_API_URL,RAG_OPENAI_BASEURL,RAG_OPENAI_API_KEY,EMBEDDINGS_PROVIDER,EMBEDDINGS_MODEL.
The retrieval flow at query time: when a conversation references an embedded file,
the chat backend asks rag_api for the most similar chunks, injects them into the
prompt, and the answer carries file citations back to the UI.
Path A — turn on native RAG (near-term, demo-ready)
This is the fast path. It needs no application code — only deployment config in
the nufi-chat deployment repo.
-
Add the RAG services. Bring
rag.yml'svectordb+rag_apiinto the deployment (either by merging intodocker-compose.ymlor running the extra Compose file alongside it). Put them on the same network as theapicontainer. -
Point Chat at the RAG API — set on the
apicontainer:RAG_API_URL=http://rag_api:8000 -
Choose where embeddings run.
rag_apicalls an OpenAI-compatible/v1/embeddingsendpoint. Point it at the NUFI gateway / an NPU-served embeddings model:EMBEDDINGS_PROVIDER=openai EMBEDDINGS_MODEL=<npu-embeddings-model-id> RAG_OPENAI_BASEURL=https://nufi.<domain>/v1 # NUFI Gateway RAG_OPENAI_API_KEY=<gateway key>This is the only NPU-specific step: the embeddings model is migrated with NuFi Migrate, served by NuFi Serve, and consumed here exactly like any other OpenAI-compatible model. No SDK changes.
-
Verify. Upload a PDF in a chat, ask a question about it, and confirm the answer comes back with citations. (
docker compose logs -f rag_apishows the embed + search calls.)
Trade-offs. pgvector is perfectly adequate for single-node, moderate corpora. It is not the Qdrant store named in the platform vision, so a later move to Path B means re-embedding into Qdrant — but the user-facing behaviour is identical, so Path A is a safe demo and pilot vehicle.
Path B — NuFi Flow + Qdrant (strategic, no-code pipelines)
The platform vision puts RAG authoring in NuFi Flow (a Langflow fork) with Qdrant as the vector store and Langfuse for prompt tracing. The key insight that keeps this clean: a Flow pipeline is exported as an OpenAI-compatible REST endpoint, so NUFI Chat consumes it through the Gateway as if it were just another model — no bespoke Chat ↔ Qdrant integration required.
Engineer builds in NuFi Flow ──▶ pipeline (PDF → chunk → embed → Qdrant → LLM)
│ exported as /v1/chat/completions
▼
NUFI Chat ──▶ NUFI Gateway ──▶ "flow-rag-pipeline" model ──▶ Qdrant + NPU LLM
│
└──▶ Langfuse (prompt tracing)What this buys us that Path A does not:
- No-code iteration on retrieval strategy (chunking, rerankers, multi-step) by DS/engineers, without touching Chat.
- Qdrant as a first-class, horizontally-scalable store shared across apps.
- Langfuse tracing of each retrieval + generation step.
What it costs: standing up Qdrant, NuFi Flow, and Langfuse, plus defining the pipeline→endpoint contract. This is Phase 3 work, not a same-day demo.
How the two paths relate
They are complementary layers, not competitors:
- Path A is the built-in, "just upload a file" experience for end users.
- Path B is the authored, governed pipeline experience for builders, surfaced back into Chat as a selectable model.
A reasonable rollout: ship Path A now for in-product document Q&A and the demo; stand up Path B as NuFi Flow matures; once a Qdrant-backed Flow pipeline is the canonical path, decide per-tenant whether native upload RAG stays on.
Audit & compliance tie-in
The enterprise story requires that RAG source documents are recorded alongside each answer ("model version + RAG source documents recorded" in the WORM audit spec). The admin audit log added for NUFI Chat is the natural home for this: extend the audit entry to capture, per message, the retrieved document ids / file names that fed the answer. Path A already knows these (the citation set); Path B can pass them back in the pipeline response metadata. See the admin audit log work for the entry shape to extend.
Recommendation
- Now / demo: enable Path A (native pgvector RAG) with NPU-served embeddings via the Gateway. Zero app-code, ships today.
- Next: prototype a single Path B pipeline in NuFi Flow against Qdrant, exposed as one Gateway model, to validate the export→consume contract.
- Then: wire RAG source documents into the audit log for the compliance story.
- Decide later: per-tenant, whether native upload RAG and Flow pipelines both stay enabled, or Flow becomes the single path.
References
rag.yml,.env.example(RAG section) — deployment config in the fork.api/server/services/Files/VectorDB/crud.js,api/server/services/Files/process.js— the native RAG client + routing.- LibreChat RAG API docs: librechat.ai/docs/configuration/rag_api
- NUFI platform matrix — Phase 3 (NuFi Studio): Qdrant, NuFi Flow, Langfuse.