RAG pipeline: document ingestion and retrieval

Retrieval-Augmented Generation (RAG) is a technique that grounds LLM responses in your own documents. Instead of relying solely on what the model learned during training, the pipeline first searches your document collection for relevant passages, then passes those passages to the LLM as context. The result is more accurate, citation-backed answers that stay within the scope of your data. Shipfastai’s RAG pipeline handles document ingestion, chunking, embedding, semantic search, and augmented generation, all behind a simple REST API. All RAG endpoints live under /api/rag.

The RAG pipeline requires the Pro or Enterprise tier. Basic tier accounts cannot access these endpoints.

Ingesting documents

Before you can query your documents, you need to ingest them into the vector store. Shipfastai supports two ingestion endpoints.

Ingest plain text

Send raw text content to POST /api/rag/ingest/text. The pipeline splits the text into overlapping chunks, embeds each chunk using OpenAI embeddings, and stores the result. Every chunk is tagged with your user_id for automatic isolation.

Request

POST /api/rag/ingest/text
Authorization: Bearer <access_token>
Content-Type: application/json

{
  "content": "FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.8+ based on standard Python type hints. The key features are: fast, fast to code, fewer bugs, intuitive, easy, short, robust, and standards-based.",
  "metadata": { "source": "fastapi-overview", "category": "framework-docs" },
  "chunk_size": 1000,
  "chunk_overlap": 200
}

Response — 200 OK

{
  "document_ids": ["a3f1b2c4_0", "d9e8f7g6_1"],
  "chunks_created": 2
}
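The overlapping-chunk step can be pictured with a small sketch. This is not Shipfastai's actual splitter, which is not documented here; it is a naive character-window splitter, and the function name chunk_text is hypothetical. It only illustrates how chunk_size and chunk_overlap interact.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 2500 characters of sample text, split with the default parameters.
sample = "".join(chr(97 + i % 26) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.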

Ingest a file

Upload a .txt, .pdf, or .docx file using a multipart POST /api/rag/ingest/file request. The pipeline extracts text from the file and then follows the same chunking and embedding process.

cURL example

curl -X POST "https://<your-host>/api/rag/ingest/file" \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@handbook.pdf" \
  -F "chunk_size=1000" \
  -F "chunk_overlap=200"

The ingestion parameters are:

Field	Type	Default	Description
`content`	`string`	required	Raw text to ingest (text endpoint only).
`metadata`	`object`	`{}`	Arbitrary key-value pairs attached to every chunk.
`chunk_size`	`int`	`1000`	Maximum characters per chunk (100–10000).
`chunk_overlap`	`int`	`200`	Characters of overlap between adjacent chunks (0–2000).
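
As a rough sketch, a client could check the documented ranges before sending a text ingestion request. The helper name build_ingest_text_request and the base URL https://api.example.com are assumptions for illustration. The endpoint path, headers, and field names come from the request example above. The request is built but not sent, so the sketch runs offline.

```python
import json
import urllib.request


def build_ingest_text_request(base_url, access_token, content, metadata=None,
                              chunk_size=1000, chunk_overlap=200):
    """Build (but do not send) a POST /api/rag/ingest/text request."""
    if not content:
        raise ValueError("content is required")
    if not 100 <= chunk_size <= 10000:
        raise ValueError("chunk_size must be between 100 and 10000")
    if not 0 <= chunk_overlap <= 2000:
        raise ValueError("chunk_overlap must be between 0 and 2000")
    body = {
        "content": content,
        "metadata": metadata or {},
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/rag/ingest/text",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host and token, for illustration only.
req = build_ingest_text_request(
    "https://api.example.com",
    "my-token",
    "FastAPI is a modern, fast web framework.",
    metadata={"source": "fastapi-overview"},
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req) would send it. Validating on the client side is optional, but it catches out-of-range values before a network round trip.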

Semantic search

Use POST /api/rag/search to find document chunks that are semantically similar to a query string, without involving the LLM. This is useful for debugging your knowledge base or building custom retrieval logic.

Request

POST /api/rag/search
Authorization: Bearer <access_token>
Content-Type: application/json

{
  "query": "What are the key features of FastAPI?",
  "top_k": 3,
  "filter": { "category": "framework-docs" }
}

Response — 200 OK

{
  "results": [
    {
      "id": "a3f1b2c4_0",
      "content": "FastAPI is a modern, fast (high-performance) web framework...",
      "score": 0.94,
      "metadata": {
        "source": "fastapi-overview",
        "category": "framework-docs",
        "chunk_index": 0,
        "total_chunks": 2,
        "user_id": "a1b2c3d4-0000-0000-0000-000000000001"
      }
    }
  ]
}

The filter field supports any metadata key-value pair you attached during ingestion. Results are automatically filtered to only include chunks belonging to your account.
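
Conceptually, semantic search ranks stored chunks by how similar their vectors are to the query vector, after discarding chunks whose metadata does not match the filter. The sketch below shows that idea with cosine similarity over a tiny in-memory list. It is an assumption about the general technique, not the pipeline's actual code: real embeddings have far more dimensions, and the search function and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(store, query_vector, top_k=3, metadata_filter=None):
    """Return the top_k chunks that match the filter, best score first."""
    metadata_filter = metadata_filter or {}
    scored = []
    for chunk in store:
        meta = chunk["metadata"]
        # Skip chunks whose metadata disagrees with any filter key.
        if any(meta.get(k) != v for k, v in metadata_filter.items()):
            continue
        score = cosine_similarity(query_vector, chunk["vector"])
        scored.append({"id": chunk["id"], "score": score, "metadata": meta})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors standing in for real embeddings.
store = [
    {"id": "a_0", "vector": [1.0, 0.0], "metadata": {"category": "framework-docs"}},
    {"id": "b_0", "vector": [0.6, 0.8], "metadata": {"category": "framework-docs"}},
    {"id": "c_0", "vector": [1.0, 0.0], "metadata": {"category": "blog"}},
]
results = search(store, [1.0, 0.0], top_k=3,
                 metadata_filter={"category": "framework-docs"})
print([(r["id"], round(r["score"], 2)) for r in results])
```

Chunk c_0 is as similar to the query as a_0 but is excluded because its category does not match the filter.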

RAG queries

Send a natural-language question to POST /api/rag/query. The pipeline embeds your question, retrieves the most relevant chunks, passes them to the LLM as context, and returns both the synthesized answer and the source documents used.

Non-streaming query

Request

POST /api/rag/query
Authorization: Bearer <access_token>
Content-Type: application/json

{
  "question": "What makes FastAPI fast?",
  "top_k": 5,
  "min_score": 0.5,
  "stream": false,
  "chat_history": [
    { "role": "user", "content": "Tell me about Python web frameworks." },
    { "role": "assistant", "content": "There are many Python web frameworks..." }
  ],
  "filter": { "category": "framework-docs" }
}

Response — 200 OK

{
  "answer": "FastAPI achieves high performance through its use of Starlette for the web parts and Pydantic for the data parts. It is one of the fastest Python frameworks available, on par with NodeJS and Go.",
  "sources": [
    {
      "id": "a3f1b2c4_0",
      "content": "FastAPI is a modern, fast (high-performance) web framework...",
      "score": 0.94,
      "metadata": { "source": "fastapi-overview" }
    }
  ],
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 45,
    "total_tokens": 357
  }
}
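
The step between retrieval and generation can be sketched as follows: chunks below min_score are dropped, the rest are joined into a context block, and the chat history and question are appended. The prompt wording and the build_messages helper are assumptions for illustration. The actual prompt template used by the pipeline is not documented here.

```python
def build_messages(question, retrieved, min_score=0.5, chat_history=None):
    """Assemble an LLM message list from retrieved chunks (illustrative only)."""
    # Keep only chunks that meet the similarity threshold.
    kept = [c for c in retrieved if c["score"] >= min_score]
    context = "\n\n".join(f"[{c['id']}] {c['content']}" for c in kept)
    system_prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}"
    )
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(chat_history or [])  # prior turns, oldest first
    messages.append({"role": "user", "content": question})
    return messages, kept


retrieved = [
    {"id": "a3f1b2c4_0", "content": "FastAPI is a modern, fast web framework...", "score": 0.94},
    {"id": "x9y8z7_0", "content": "Unrelated text", "score": 0.31},
]
history = [{"role": "user", "content": "Tell me about Python web frameworks."}]
messages, sources = build_messages("What makes FastAPI fast?", retrieved,
                                   chat_history=history)
print([m["role"] for m in messages], [s["id"] for s in sources])
```

The kept chunks correspond to the sources array in the response, which is why a higher min_score can shrink both the context and the list of sources.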

Streaming query

Set stream: true to receive the answer as a Server-Sent Events (SSE) stream, identical in format to the chat streaming endpoint. Each event contains a { "token": "..." } payload, and the stream ends with data: [DONE].

The full RAGQueryRequest schema:

Field	Type	Default	Description
`question`	`string`	required	The natural-language question to answer.
`top_k`	`int`	`5`	Number of document chunks to retrieve (1–50).
`min_score`	`float`	`0.5`	Minimum similarity score to include a chunk (0.0–1.0).
`stream`	`bool`	`false`	Stream the answer token by token.
`chat_history`	`array`	`null`	Prior conversation turns to provide context.
`filter`	`object`	`null`	Metadata filter applied during retrieval.
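
For streaming queries, a client has to reassemble the answer from the individual events. The following is a minimal sketch of that parsing step, assuming each event arrives as a data: line carrying the { "token": "..." } payload and that data: [DONE] ends the stream, as described above. The iter_tokens helper is hypothetical, and the list of lines stands in for a real HTTP response body.

```python
import json


def iter_tokens(lines):
    """Yield answer tokens from SSE lines until the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)["token"]


# Simulated stream; a real client would iterate over the response lines.
stream = [
    'data: {"token": "Fast"}',
    "",
    'data: {"token": "API"}',
    "",
    "data: [DONE]",
    "",
]
answer = "".join(iter_tokens(stream))
print(answer)  # FastAPI
```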

Vector store options

The RAG pipeline uses a pluggable vector store backend. Select the backend with the VECTOR_STORE_PROVIDER environment variable. Three backends are available:

FAISS (default)
Pinecone (managed cloud)
Chroma (self-hosted)

FAISS is the default vector store and requires no external service. It stores all vectors in memory and optionally persists them to disk. It is ideal for local development and small-to-medium datasets.

Environment

VECTOR_STORE_PROVIDER=faiss
FAISS_INDEX_PATH=./data/faiss.index  # optional persistence

No additional services are required. FAISS starts in-process alongside your FastAPI application.

Pinecone is a fully managed vector database suitable for production deployments with large datasets and high query volumes.

Environment

VECTOR_STORE_PROVIDER=pinecone
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX_NAME=shipfastai-prod

Create your Pinecone index with a dimension that matches your embedding model (1536 for OpenAI text-embedding-ada-002).

Chroma is an open-source, self-hosted vector database that you can run alongside your stack using Docker.

Environment

VECTOR_STORE_PROVIDER=chroma
CHROMA_HOST=localhost
CHROMA_PORT=8001

Add a chroma service to your docker-compose.yml to run it locally alongside the backend.
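
To make the configuration concrete, here is a hedged sketch of how an application might read these variables at startup. The variable names come from the examples above. The load_vector_store_config function and the shape of the returned dictionary are assumptions, not Shipfastai's actual settings code.

```python
import os


def load_vector_store_config(env=None):
    """Read vector store settings from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    # FAISS is the documented default when no provider is set.
    provider = env.get("VECTOR_STORE_PROVIDER", "faiss").lower()
    if provider == "faiss":
        # FAISS_INDEX_PATH is optional; None means in-memory only.
        return {"provider": "faiss", "index_path": env.get("FAISS_INDEX_PATH")}
    if provider == "pinecone":
        return {
            "provider": "pinecone",
            "api_key": env["PINECONE_API_KEY"],
            "index_name": env["PINECONE_INDEX_NAME"],
        }
    if provider == "chroma":
        return {
            "provider": "chroma",
            "host": env["CHROMA_HOST"],
            "port": int(env["CHROMA_PORT"]),
        }
    raise ValueError(f"Unknown VECTOR_STORE_PROVIDER: {provider!r}")


config = load_vector_store_config({
    "VECTOR_STORE_PROVIDER": "chroma",
    "CHROMA_HOST": "localhost",
    "CHROMA_PORT": "8001",
})
print(config)
```

Failing fast on a missing or unknown value at startup is usually easier to debug than an error on the first ingestion request.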

Document isolation

Every document chunk is stored with a user_id metadata field automatically set to the ID of the authenticated user who ingested it. All search and query endpoints inject a user_id filter into every vector store query, so users can never retrieve each other’s documents, even if they use the same metadata keys. You do not need to add any user_id filter yourself; it is applied automatically.

To delete a specific document chunk, call:

DELETE /api/rag/documents/{document_id}
Authorization: Bearer <access_token>
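
The automatic user_id filter can be pictured as a merge step that runs on the server before every vector store query. This is a sketch under assumptions: the scoped_filter and matches helpers are hypothetical, and the real implementation is not shown in this documentation. The point is that the authenticated user's ID is written last, so a user_id supplied in the request filter cannot override it.

```python
def scoped_filter(user_filter, user_id):
    """Merge the caller's filter with a mandatory user_id constraint."""
    merged = dict(user_filter or {})
    # Set last so a client-supplied user_id is always overwritten.
    merged["user_id"] = user_id
    return merged


def matches(metadata, flt):
    """True when the metadata satisfies every key-value pair in the filter."""
    return all(metadata.get(k) == v for k, v in flt.items())


chunks = [
    {"id": "a_0", "metadata": {"user_id": "user-a", "category": "framework-docs"}},
    {"id": "b_0", "metadata": {"user_id": "user-b", "category": "framework-docs"}},
]

# user-a tries to read user-b's chunks by passing a user_id in the filter.
flt = scoped_filter({"category": "framework-docs", "user_id": "user-b"}, "user-a")
visible = [c["id"] for c in chunks if matches(c["metadata"], flt)]
print(visible)  # ['a_0']
```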
