AI chat and completions API endpoints

The AI Chat API provides two endpoints for interacting with large language models: a multi-turn chat endpoint that supports streaming, and a single-prompt completions endpoint. Both endpoints are rate limited and are mounted under /api/ai/.

These endpoints are available on Pro and Enterprise plans only. Requests from free-tier users will be rejected with a 403 response.
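Because free-tier requests fail with 403 and all requests are rate limited, a client can handle the two cases differently: fail fast on 403, and back off and retry when throttled. The sketch below illustrates this under stated assumptions. The documentation does not give the rate-limit status code, so HTTP 429 with an optional Retry-After header (in seconds) is assumed. `post_with_retry`, `PlanNotEligibleError`, and the `send` callable are hypothetical names, not part of this API.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical transport: takes a JSON payload and returns
# (status_code, headers, parsed_json_body).
Sender = Callable[[dict], Tuple[int, Mapping[str, str], dict]]


class PlanNotEligibleError(Exception):
    """Raised on 403: the account is not on a Pro or Enterprise plan."""


def post_with_retry(
    send: Sender,
    payload: dict,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Send a payload, retrying throttled requests with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send(payload)
        if status == 403:
            # Documented behavior: free-tier requests get 403, so retrying cannot help.
            raise PlanNotEligibleError("AI endpoints require a Pro or Enterprise plan")
        if status == 429 and attempt < max_attempts - 1:
            # Assumption: throttled requests return 429 with an optional
            # Retry-After header expressed in seconds.
            retry_after = headers.get("Retry-After")
            delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
            sleep(delay)
            continue
        if status >= 400:
            raise RuntimeError(f"request failed with status {status}: {body}")
        return body
    raise AssertionError("unreachable: the last attempt always returns or raises")
```

With requests, `send` could wrap `requests.post(...)` and return `(r.status_code, r.headers, r.json())`. Injecting `sleep` keeps the backoff logic testable without real delays.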

POST /api/ai/chat

Send a conversation to the configured LLM provider and receive a response. You can choose the provider (openai, anthropic, or gemini) and optionally stream the response as server-sent events.

Headers:

Authorization

string

required

Bearer <access_token>

Request body:

messages

object[]

required

An ordered list of messages representing the conversation history. Each message must have a role and content.


role

string

required

The speaker role. One of "system", "user", or "assistant".

content

string

required

The text content of the message.

provider

string

default:"openai"

The LLM provider to use. One of "openai", "anthropic", or "gemini". The provider must be configured with a valid API key in your backend environment.

model

string

The specific model to use (e.g., "gpt-4o", "claude-3-5-sonnet-20241022", "gemini-1.5-pro"). If omitted, the provider’s default model is used.

temperature

number

default:"0.7"

Sampling temperature between 0.0 and 2.0. Lower values produce more deterministic output; higher values produce more varied output.

max_tokens

number

default:"1000"

Maximum number of tokens to generate. Must be between 1 and 16384.

stream

boolean

default:"false"

When true, the response is streamed as server-sent events (SSE). Each event contains a token field with the next piece of text. The stream ends with data: [DONE].
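The request body constraints above (allowed roles and providers, the temperature range, the max_tokens range) can be checked on the client before a request is sent. The following is a minimal sketch; `build_chat_payload` is a hypothetical helper, and it only encodes the documented limits, so the server may still reject a body for other reasons (for example, a provider without a configured API key).

```python
from typing import Optional

VALID_ROLES = {"system", "user", "assistant"}
VALID_PROVIDERS = {"openai", "anthropic", "gemini"}


def build_chat_payload(
    messages: list,
    provider: str = "openai",
    model: Optional[str] = None,
    temperature: float = 0.7,
    max_tokens: int = 1000,
    stream: bool = False,
) -> dict:
    """Check a chat request body against the documented constraints."""
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"messages[{i}].role must be one of {sorted(VALID_ROLES)}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"messages[{i}].content must be a string")
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"provider must be one of {sorted(VALID_PROVIDERS)}")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if not 1 <= max_tokens <= 16384:
        raise ValueError("max_tokens must be between 1 and 16384")
    payload = {
        "messages": messages,
        "provider": provider,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    if model is not None:
        # Leave "model" out entirely so the provider's default model is used.
        payload["model"] = model
    return payload
```

The returned dictionary can be passed directly as the `json=` argument to `requests.post`.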

Non-streaming example

curl --request POST \
  --url http://localhost:8000/api/ai/chat \
  --header "Authorization: Bearer <access_token>" \
  --header "Content-Type: application/json" \
  --data '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the difference between async and sync Python in one sentence."}
    ],
    "provider": "openai",
    "model": "gpt-4o",
    "temperature": 0.5,
    "max_tokens": 200
  }'

import requests

access_token = "<access_token>"  # replace with a valid access token

response = requests.post(
    "http://localhost:8000/api/ai/chat",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the difference between async and sync Python in one sentence."},
        ],
        "provider": "openai",
        "model": "gpt-4o",
        "temperature": 0.5,
        "max_tokens": 200,
    },
)
response.raise_for_status()  # raise on 403 (plan not eligible) and other HTTP errors
print(response.json()["content"])

Response — ChatResponse:

content

string

required

The full generated text response from the model.

model

string

required

The model identifier that was used to generate the response.

usage

object

required

Token consumption breakdown for the request.


prompt_tokens

number

Number of tokens in the input messages.

completion_tokens

number

Number of tokens in the generated response.

total_tokens

number

Total tokens consumed by the request.

{
  "content": "Synchronous Python executes code line by line and blocks until each operation completes, while asynchronous Python uses `async`/`await` to pause and resume coroutines, allowing other tasks to run during waiting periods.",
  "model": "gpt-4o-2024-08-06",
  "usage": {
    "prompt_tokens": 38,
    "completion_tokens": 42,
    "total_tokens": 80
  }
}
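Since messages is the ordered conversation history, continuing a multi-turn chat presumably means resending the whole history with the model's reply and the next user turn appended (the documentation does not describe any server-side session storage). A minimal sketch, with `continue_conversation` as a hypothetical helper:

```python
def continue_conversation(messages: list, chat_response: dict, next_user_message: str) -> list:
    """Return a new message list with the model's reply and the next user turn appended."""
    return [
        *messages,
        # The reply becomes an "assistant" turn so the model sees its earlier answer.
        {"role": "assistant", "content": chat_response["content"]},
        {"role": "user", "content": next_user_message},
    ]
```

The returned list is used as messages in the next POST /api/ai/chat request. The original list is left unmodified, so earlier states of the conversation remain available.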

Streaming example

Set "stream": true to receive the response token by token as server-sent events. Each event's data line carries a JSON object with a token field, and the final event's data is the literal string [DONE].

curl --request POST \
  --url http://localhost:8000/api/ai/chat \
  --header "Authorization: Bearer <access_token>" \
  --header "Content-Type: application/json" \
  --no-buffer \
  --data '{
    "messages": [{"role": "user", "content": "Count to five."}],
    "stream": true
  }'

import json

import requests

access_token = "<access_token>"  # replace with a valid access token

with requests.post(
    "http://localhost:8000/api/ai/chat",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"messages": [{"role": "user", "content": "Count to five."}], "stream": True},
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # blank lines separate SSE events
        raw = line.decode("utf-8")
        if raw.startswith("data: "):
            payload = raw[len("data: "):]
            if payload == "[DONE]":
                break
            data = json.loads(payload)
            print(data["token"], end="", flush=True)

SSE stream format:

data: {"token": "One"}

data: {"token": ","}

data: {"token": " two"}

data: {"token": ", three, four, five."}

data: [DONE]
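The stream format above can be reassembled into the full response text by concatenating each token field until the [DONE] sentinel. The sketch below separates that parsing step from the network code so it can be tested on its own; `collect_sse_tokens` is a hypothetical name, and the parser only handles the `data: ` lines shown above.

```python
import json
from typing import Iterable


def collect_sse_tokens(lines: Iterable[str]) -> str:
    """Concatenate the token fields of an SSE stream until the [DONE] sentinel."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and any non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        parts.append(json.loads(payload)["token"])
    return "".join(parts)
```

With requests, the lines can be supplied as `(line.decode("utf-8") for line in response.iter_lines())`. Decoding explicitly as UTF-8 avoids relying on the encoding requests guesses for a `text/event-stream` response.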

POST /api/ai/completions

Generate a single completion from a plain text prompt, without a conversation history. Useful for summarization, classification, code generation, and other single-turn tasks.

Headers:

Authorization

string

required

Bearer <access_token>

Request body:

prompt

string

required

The user’s input prompt.

system_prompt

string

An optional system message that sets the model’s behavior for this request (e.g., "You are a JSON formatter.").

provider

string

default:"openai"

The LLM provider to use. One of "openai", "anthropic", or "gemini".

model

string

The specific model to use. If omitted, the provider’s default model is used.

temperature

number

default:"0.7"

Sampling temperature between 0.0 and 2.0.

max_tokens

number

default:"1000"

Maximum number of tokens to generate. Must be between 1 and 16384.
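Conceptually, a completions request is a one-turn conversation: system_prompt plays the role of a leading "system" message and prompt the role of a single "user" message. The sketch below shows that correspondence, which may help when moving a prompt from one endpoint to the other. It is an assumption that the backend builds exactly this message list internally; the documentation does not say how system_prompt is forwarded to each provider. `completion_to_messages` is a hypothetical helper.

```python
from typing import Optional


def completion_to_messages(prompt: str, system_prompt: Optional[str] = None) -> list:
    """Express a completions-style request as a chat-style message list."""
    messages = []
    if system_prompt:
        # system_prompt corresponds to a leading "system" message.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages
```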

curl --request POST \
  --url http://localhost:8000/api/ai/completions \
  --header "Authorization: Bearer <access_token>" \
  --header "Content-Type: application/json" \
  --data '{
    "prompt": "Summarize the following in one sentence: FastAPI is a modern, fast web framework for building APIs with Python 3.7+ based on standard Python type hints.",
    "system_prompt": "You are a concise technical writer.",
    "provider": "openai",
    "max_tokens": 100
  }'

import requests

access_token = "<access_token>"  # replace with a valid access token

response = requests.post(
    "http://localhost:8000/api/ai/completions",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "prompt": "Summarize the following in one sentence: FastAPI is a modern, fast web framework...",
        "system_prompt": "You are a concise technical writer.",
        "provider": "openai",
        "max_tokens": 100,
    },
)
response.raise_for_status()  # raise on 403 (plan not eligible) and other HTTP errors
print(response.json()["completion"])

Response:

completion

string

required

The generated text response.

model

string

required

The model identifier that produced the response.

usage

object

required

Token consumption breakdown (same shape as ChatResponse.usage).

{
  "completion": "FastAPI is a high-performance Python web framework for building APIs using standard type hints.",
  "model": "gpt-4o-2024-08-06",
  "usage": {
    "prompt_tokens": 52,
    "completion_tokens": 17,
    "total_tokens": 69
  }
}
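Both endpoints return a usage object of the same shape but put the generated text under different keys (content for chat, completion for completions). A client that calls both may want one place to extract the text and keep a running token count. A minimal sketch, assuming every response includes all three usage fields; `UsageTally` is a hypothetical class.

```python
from dataclasses import dataclass


@dataclass
class UsageTally:
    """Running token totals across several chat or completions responses."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def add(self, response: dict) -> str:
        """Record a response's usage and return its generated text."""
        usage = response["usage"]
        self.prompt_tokens += usage["prompt_tokens"]
        self.completion_tokens += usage["completion_tokens"]
        self.total_tokens += usage["total_tokens"]
        # /api/ai/chat returns "content"; /api/ai/completions returns "completion".
        return response["content"] if "content" in response else response["completion"]
```

Feeding in the two example responses from this page gives totals of 90 prompt tokens, 59 completion tokens, and 149 total tokens.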
