RAG API: ingest documents and query with context

The RAG (Retrieval-Augmented Generation) API lets you build knowledge-grounded AI features. You ingest text or files into a per-user vector store, then search or query that store to get LLM answers backed by your own documents. All endpoints are mounted under /api/rag/.

RAG endpoints are available on Pro and Enterprise plans only. The vector store is scoped per user: each user can search and retrieve only their own ingested documents.

POST /api/rag/ingest/text

Ingest raw text content into the vector store. The text is automatically split into overlapping chunks, embedded, and stored. Returns the generated document IDs and the number of chunks created.

Headers:

Authorization (string, required): Bearer <access_token>

Request body:

content (string, required): The raw text to ingest. There is no enforced length limit, but very large documents will produce many chunks.

metadata (object, default: {}): Arbitrary key-value pairs attached to every chunk from this document. Useful for filtering later (e.g., {"source": "faq", "topic": "billing"}).

chunk_size (number, default: 1000): Target character length of each chunk. Must be between 100 and 10000. The chunker attempts to break at sentence boundaries near this length.

chunk_overlap (number, default: 200): Number of characters of overlap between adjacent chunks. Must be between 0 and 2000. Overlap improves recall by ensuring context is not lost at chunk boundaries.
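To get a feel for how chunk_size and chunk_overlap interact, here is a minimal sketch of a fixed-window chunker. It is a simplification, not the server's implementation: the real chunker is described as breaking near sentence boundaries, which this version ignores. The function name chunk_text and the rule that the overlap must be smaller than the chunk size are assumptions made for the sketch.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified illustration only: it does not look for sentence boundaries.
    """
    if not 100 <= chunk_size <= 10000:
        raise ValueError("chunk_size must be between 100 and 10000")
    if not 0 <= chunk_overlap <= 2000:
        raise ValueError("chunk_overlap must be between 0 and 2000")
    if chunk_overlap >= chunk_size:
        # Guard needed by this sketch so the window always moves forward.
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


chunks = chunk_text("x" * 1200, chunk_size=500, chunk_overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

A larger overlap means a smaller step, so the same text yields more chunks and more stored embeddings.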

import requests

access_token = "<access_token>"  # placeholder: substitute a real bearer token

response = requests.post(
    "http://localhost:8000/api/rag/ingest/text",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "content": "Shipfastai is an AI-ready SaaS boilerplate for Python developers. It includes authentication, billing, and built-in RAG support out of the box.",
        "metadata": {"source": "product-docs", "topic": "overview"},
        "chunk_size": 500,
        "chunk_overlap": 100,
    },
)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())

curl --request POST \
  --url http://localhost:8000/api/rag/ingest/text \
  --header "Authorization: Bearer <access_token>" \
  --header "Content-Type: application/json" \
  --data '{
    "content": "Shipfastai is an AI-ready SaaS boilerplate for Python developers...",
    "metadata": {"source": "product-docs"},
    "chunk_size": 500,
    "chunk_overlap": 100
  }'

Response — IngestResponse:

document_ids (string[], required): List of IDs assigned to each stored chunk. Each ID is a 12-character MD5 hash prefix plus the chunk index (e.g., "a1b2c3d4e5f6_0").

chunks_created (number, required): Total number of chunks the text was split into and stored.

{
  "document_ids": [
    "a1b2c3d4e5f6_0",
    "b2c3d4e5f6a1_1"
  ],
  "chunks_created": 2
}
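The ID format can be illustrated with a short sketch. It assumes the MD5 hash is taken over each chunk's text, which the reference does not state. The helper names make_chunk_id and split_chunk_id are hypothetical.

```python
import hashlib


def make_chunk_id(chunk_text: str, chunk_index: int) -> str:
    """Build an ID shaped like '<12 hex chars>_<index>'.

    Assumption: the hash input is the chunk text; the server may hash something else.
    """
    digest = hashlib.md5(chunk_text.encode("utf-8")).hexdigest()
    return f"{digest[:12]}_{chunk_index}"


def split_chunk_id(document_id: str) -> tuple:
    """Split an ID into its hash prefix and integer chunk index."""
    prefix, _, index = document_id.rpartition("_")
    return prefix, int(index)


print(split_chunk_id("a1b2c3d4e5f6_0"))  # ('a1b2c3d4e5f6', 0)
```

Parsing the index out of the ID can be handy when you want to restore the original chunk order on the client.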

POST /api/rag/ingest/file

Ingest a file directly into the vector store. The file is parsed to plain text, then processed identically to /api/rag/ingest/text. Supported formats: .txt, .pdf, .docx.

Headers:

Authorization (string, required): Bearer <access_token>

Form data (multipart/form-data):

file (required): The file to upload. Must be a .txt, .pdf, or .docx file.

chunk_size (number, default: 1000): Target character length of each chunk. Must be between 100 and 10000.

chunk_overlap (number, default: 200): Character overlap between adjacent chunks. Must be between 0 and 2000.

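import requests

access_token = "<access_token>"  # placeholder: substitute a real bearer token

# Open in binary mode so the file bytes are uploaded unmodified.
with open("documentation.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/rag/ingest/file",
        headers={"Authorization": f"Bearer {access_token}"},
        files={"file": ("documentation.pdf", f, "application/pdf")},
        data={"chunk_size": 800, "chunk_overlap": 150},
    )
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())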

curl --request POST \
  --url http://localhost:8000/api/rag/ingest/file \
  --header "Authorization: Bearer <access_token>" \
  --form file=@documentation.pdf \
  --form chunk_size=800 \
  --form chunk_overlap=150

Response — IngestResponse (same shape as /ingest/text):

{
  "document_ids": [
    "c3d4e5f6a1b2_0",
    "d4e5f6a1b2c3_1",
    "e5f6a1b2c3d4_2"
  ],
  "chunks_created": 3
}

The metadata for file ingests automatically includes {"source": "<filename>"} in addition to any user-supplied metadata.
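Because only three formats are accepted, it can be worth checking the extension before uploading. The following is a small client-side sketch; the helper name is_supported_upload is hypothetical, and treating the extension as case-insensitive is an assumption about server behavior.

```python
from pathlib import Path

# Formats listed as supported by /api/rag/ingest/file.
SUPPORTED_SUFFIXES = {".txt", ".pdf", ".docx"}


def is_supported_upload(filename: str) -> bool:
    """Return True if the file extension is one the ingest endpoint accepts."""
    # Assumption: extension matching is case-insensitive.
    return Path(filename).suffix.lower() in SUPPORTED_SUFFIXES


print(is_supported_upload("documentation.pdf"))  # True
print(is_supported_upload("notes.md"))           # False
```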

POST /api/rag/search

Perform a pure semantic search over the vector store without involving an LLM. Returns the most relevant chunks ranked by similarity score. Useful for building your own retrieval logic or debugging what is in the store.

Headers:

Authorization (string, required): Bearer <access_token>

Request body:

query (string, required): The search query. The query is embedded and compared against stored chunk embeddings.

top_k (number, default: 5): Maximum number of results to return. Must be between 1 and 50.

filter (object, optional): Metadata filter to narrow results. Key-value pairs are matched against chunk metadata (e.g., {"source": "product-docs"}). The user_id filter is applied automatically, so you do not need to include it.
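One plausible reading of the filter semantics is sketched below: a chunk matches when every key in the filter is present in its metadata with an equal value. Exact-equality, all-keys-must-match behavior is an assumption here, since the reference only says key-value pairs are matched. The function name matches_filter is hypothetical.

```python
from typing import Optional


def matches_filter(metadata: dict, filter_: Optional[dict]) -> bool:
    """Return True if metadata satisfies every key-value pair in filter_.

    Assumption: values are compared with exact equality and all pairs must match.
    """
    if not filter_:
        return True  # no filter means every chunk is a candidate
    return all(
        key in metadata and metadata[key] == value
        for key, value in filter_.items()
    )


chunk_metadata = {"source": "product-docs", "topic": "billing"}
print(matches_filter(chunk_metadata, {"topic": "billing"}))  # True
print(matches_filter(chunk_metadata, {"topic": "auth"}))     # False
```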

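import requests

access_token = "<access_token>"  # placeholder: substitute a real bearer token

response = requests.post(
    "http://localhost:8000/api/rag/search",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "query": "How does billing work?",
        "top_k": 3,
        "filter": {"topic": "billing"},
    },
)
response.raise_for_status()  # raise on 4xx/5xx before reading the body
# Print each hit as "[score] first 120 characters of the chunk".
for result in response.json()["results"]:
    print(f"[{result['score']:.2f}] {result['content'][:120]}")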

curl --request POST \
  --url http://localhost:8000/api/rag/search \
  --header "Authorization: Bearer <access_token>" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "How does billing work?",
    "top_k": 3,
    "filter": {"topic": "billing"}
  }'

Response — SearchResponse:

results (object[], required): Ordered list of matching chunks, most similar first. Each result item has the following properties:

id (string, required): The chunk’s document ID.

content (string, required): The text content of the chunk.

score (number, required): Cosine similarity score between 0.0 and 1.0. Higher is more relevant.

metadata (object, required): The metadata attached to this chunk at ingestion time, plus user_id and chunking info.

{
  "results": [
    {
      "id": "a1b2c3d4e5f6_0",
      "content": "Shipfastai integrates with Stripe for subscription billing...",
      "score": 0.91,
      "metadata": {
        "source": "product-docs",
        "topic": "billing",
        "chunk_index": 0,
        "total_chunks": 2,
        "user_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
      }
    },
    {
      "id": "b2c3d4e5f6a1_1",
      "content": "You can manage your subscription through the Stripe customer portal...",
      "score": 0.84,
      "metadata": {
        "source": "product-docs",
        "topic": "billing",
        "chunk_index": 1,
        "total_chunks": 2,
        "user_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
      }
    }
  ]
}
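For reference, the score is described as a cosine similarity, which can be computed as follows. This is the textbook formula in plain Python, not the service's code. Raw cosine similarity can be negative for some vector pairs; how the API arrives at its documented 0.0 to 1.0 range is not specified here.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


print(round(cosine_similarity([1.0, 1.0], [1.0, 0.0]), 4))  # 0.7071
```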

POST /api/rag/query

Ask a natural language question. The API retrieves the most relevant chunks from the vector store and passes them to the LLM as context, returning a grounded answer along with the source chunks used. Supports streaming and optional conversation history for multi-turn sessions.

Headers:

Authorization (string, required): Bearer <access_token>

Request body:

question (string, required): The natural language question to answer using the ingested documents.

top_k (number, default: 5): Number of document chunks to retrieve as context. Must be between 1 and 50.

min_score (number, default: 0.5): Minimum similarity score threshold between 0.0 and 1.0. Chunks scoring below this value are excluded from the context passed to the LLM.

stream (boolean, default: false): When true, the answer is streamed as server-sent events (SSE), using the same format as the AI Chat streaming endpoint.

chat_history (object[], optional): Conversation history for multi-turn queries. Each item must have a role ("user" or "assistant") and content. Providing history allows the model to resolve follow-up questions against prior context.

filter (object, optional): Metadata filter applied during retrieval (e.g., {"source": "faq"}). The user_id filter is applied automatically.
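The combined effect of top_k and min_score can be reproduced on the client against /api/rag/search results, which helps when tuning the threshold. The sketch below assumes results shaped like SearchResponse.results. The function name select_context is hypothetical, and the server's internal order of operations is not documented.

```python
def select_context(results: list, top_k: int = 5, min_score: float = 0.5) -> list:
    """Keep at most top_k results whose score is at least min_score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    # With a descending sort, truncating then thresholding gives the same
    # result as thresholding then truncating.
    return [r for r in ranked[:top_k] if r["score"] >= min_score]


hits = [
    {"id": "a1b2c3d4e5f6_0", "score": 0.91},
    {"id": "b2c3d4e5f6a1_1", "score": 0.84},
    {"id": "c3d4e5f6a1b2_2", "score": 0.42},
]
print([h["id"] for h in select_context(hits, top_k=4, min_score=0.6)])
# ['a1b2c3d4e5f6_0', 'b2c3d4e5f6a1_1']
```

If a query returns an answer with no sources, lowering min_score is usually the first thing to try.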

import requests

access_token = "<access_token>"  # placeholder: substitute a real bearer token

response = requests.post(
    "http://localhost:8000/api/rag/query",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "question": "How do I cancel my subscription?",
        "top_k": 4,
        "min_score": 0.6,
        "filter": {"source": "product-docs"},
    },
)
response.raise_for_status()  # raise on 4xx/5xx before reading the body
data = response.json()
print(data["answer"])
print(f"\nSources used: {len(data['sources'])}")

curl --request POST \
  --url http://localhost:8000/api/rag/query \
  --header "Authorization: Bearer <access_token>" \
  --header "Content-Type: application/json" \
  --data '{
    "question": "How do I cancel my subscription?",
    "top_k": 4,
    "min_score": 0.6,
    "filter": {"source": "product-docs"}
  }'
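When stream is true, the response arrives as server-sent events. The payload format is defined by the AI Chat streaming endpoint and is not reproduced here, so the sketch below only handles generic SSE framing: it collects "data:" lines and yields one payload per event. The function name iter_sse_data is hypothetical. Yielding a final event that lacks a trailing blank line is a lenient choice made for this sketch.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of lines."""
    buffer = []
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")  # requests may yield bytes
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # SSE framing drops one leading space
            buffer.append(value)
        elif line == "" and buffer:
            yield "\n".join(buffer)  # a blank line ends the event
            buffer = []
        # Other lines (comments, event:, id:) are ignored in this sketch.
    if buffer:
        yield "\n".join(buffer)  # lenient: flush an unterminated final event


# Possible usage with requests (not executed here):
# with requests.post(url, headers=headers, json={**body, "stream": True}, stream=True) as resp:
#     for payload in iter_sse_data(resp.iter_lines()):
#         print(payload)
```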

Response — RAGQueryResponse:

answer (string, required): The LLM-generated answer, grounded in the retrieved document chunks.

sources (object[], required): The document chunks that were retrieved and used as context. Same shape as SearchResponse.results. Each source item has the following properties:

id (string): Chunk document ID.

content (string): Text content of the chunk.

score (number): Similarity score used for retrieval.

metadata (object): Metadata attached to the chunk.

usage (object): Token usage for the LLM call. May be null if the provider does not return usage data.

{
  "answer": "You can cancel your subscription at any time through the Stripe customer portal. Navigate to your account settings and click 'Manage Subscription' to open the portal, where you can cancel, downgrade, or update your payment method.",
  "sources": [
    {
      "id": "b2c3d4e5f6a1_1",
      "content": "You can manage your subscription through the Stripe customer portal...",
      "score": 0.88,
      "metadata": {
        "source": "product-docs",
        "chunk_index": 1,
        "total_chunks": 2,
        "user_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 58,
    "total_tokens": 370
  }
}
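Since usage may be null, client code that tracks token consumption should not index into it directly. A minimal defensive sketch follows. The helper name total_tokens is hypothetical, and the total_tokens field name is taken from the example response above.

```python
def total_tokens(response_body: dict) -> int:
    """Return usage.total_tokens, or 0 when usage is missing or null."""
    usage = response_body.get("usage") or {}  # covers both absent and null
    return usage.get("total_tokens", 0)


print(total_tokens({"answer": "...", "sources": [], "usage": None}))  # 0
print(total_tokens({"usage": {"prompt_tokens": 312, "completion_tokens": 58, "total_tokens": 370}}))  # 370
```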

Multi-turn query example

Use chat_history to maintain context across follow-up questions:

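import requests

access_token = "<access_token>"  # placeholder: substitute a real bearer token
history = []  # prior turns, oldest first


def ask(question: str) -> str:
    # Send the history as it stood before this question was asked.
    response = requests.post(
        "http://localhost:8000/api/rag/query",
        headers={"Authorization": f"Bearer {access_token}"},
        json={"question": question, "chat_history": history},
    )
    response.raise_for_status()  # a failed call leaves the history unchanged
    data = response.json()
    # Record both sides of the turn only after a successful response.
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": data["answer"]})
    return data["answer"]


print(ask("What is included in the Pro plan?"))
print(ask("And how much does it cost?"))  # resolved against prior context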
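In long sessions the history list grows with every turn. The reference does not state a limit on chat_history, so the following is only a precaution one might take on the client: keep the most recent turns before sending. The helper name trim_history and the default of five turns are arbitrary choices for illustration.

```python
def trim_history(history: list, max_turns: int = 5) -> list:
    """Return the last max_turns user/assistant pairs from history."""
    if max_turns <= 0:
        return []
    return history[-2 * max_turns:]  # each turn is two messages


log = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
]
print([m["content"] for m in trim_history(log, max_turns=1)])  # ['q2', 'a2']
```

The trimmed list would then be passed as chat_history in place of the full history.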

DELETE /api/rag/documents/{document_id}

Delete a specific document chunk from the vector store by its ID. The ID is passed as the final path segment.

document_id (string, required): The document chunk ID to delete, as returned by the ingest endpoints.

curl --request DELETE \
  --url http://localhost:8000/api/rag/documents/a1b2c3d4e5f6_0 \
  --header "Authorization: Bearer <access_token>"

Response:

{
  "status": "deleted",
  "document_id": "a1b2c3d4e5f6_0"
}
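An ingest call returns one ID per chunk, so removing a whole document means one DELETE request per ID. The sketch below builds the URLs and loops over the IDs. The HTTP call is injected as a function so the loop can be exercised without a server. The names document_url and delete_all are hypothetical, and treating any 2xx status as success is an assumption, since the reference does not list status codes.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # same host as the examples in this reference


def document_url(document_id: str, base_url: str = BASE_URL) -> str:
    """Build the DELETE URL for one chunk ID, escaping unsafe characters."""
    return f"{base_url}/api/rag/documents/{quote(document_id, safe='')}"


def delete_all(document_ids: list, send_delete) -> list:
    """Call send_delete(url) for each ID and return the IDs that failed.

    send_delete must return an HTTP status code.
    """
    failed = []
    for doc_id in document_ids:
        status = send_delete(document_url(doc_id))
        if not 200 <= status < 300:  # assumption: 2xx means deleted
            failed.append(doc_id)
    return failed


# Possible send_delete built on requests (not executed here):
# send = lambda url: requests.delete(
#     url, headers={"Authorization": f"Bearer {access_token}"}
# ).status_code
print(document_url("a1b2c3d4e5f6_0"))
# http://localhost:8000/api/rag/documents/a1b2c3d4e5f6_0
```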
