
Chat Agent

OctoCMS includes an optional RAG chat agent that lets editors talk to their content in natural language. Ask it to find copy, summarize entries, draft edits, or pull copy out of an uploaded PDF / DOCX. Every change the agent suggests appears as an approval card — nothing is written to your repo until you click Accept.

The chat lives at /cms/chat. The feature is opt-in, but the Chat nav link and page are always available in the admin UI: until you finish the setup below, /cms/chat shows an in-page checklist (linked to this guide) instead of the composer. The chat API route at /api/octocms/agent returns 404 until isAgentEnabled passes, and credentials never reach the browser.

The agent works with three chat providers — pick one:

Anthropic Claude (default — recommended for the best tool-use quality)
OpenAI (GPT-4.1 / GPT-4o / etc.)
Local model behind an OpenAI-compatible HTTP endpoint (Ollama, LM Studio, vLLM, llama.cpp server)

Embeddings always run locally (free, in-process) regardless of which chat provider you pick.

What you need#

An API key for your chosen hosted provider — or a running local server.
An OctoCMS install v0.x or later with the chat agent shipped.

You do not need:

A vector database. Embeddings are stored as a committed JSON file in your repo.
A hosted embedding provider. Embeddings run locally (no per-query cost).
A separate deploy or worker. The agent runs inside your existing Next.js app.

1. Install the optional packages#

The chat agent's dependencies are declared as optional peer dependencies of octocms. Install only what your provider needs.

# Anthropic
npm install @anthropic-ai/sdk @huggingface/transformers

# OpenAI or local (Ollama / LM Studio / vLLM)
npm install openai @huggingface/transformers

# Plus, if you want DOCX uploads (any provider):
npm install mammoth

# Plus, if you want PDF uploads on OpenAI / local providers
# (Anthropic gets native PDF support via its SDK — skip this for Anthropic):
npm install pdfjs-dist

Package	Used by	Why
`@anthropic-ai/sdk`	`'anthropic'`	Claude SDK: chat + tool use + native PDF support
`openai`	`'openai'`, `'local'`	OpenAI SDK, also drives any OpenAI-compatible local endpoint
`@huggingface/transformers`	embeddings	Local embedding model, required for retrieval inside chat
`mammoth`	DOCX uploads (any provider)	DOCX → text extractor
`pdfjs-dist`	PDF uploads on OpenAI / local	Server-side text extraction (Anthropic does not need it; it uses native pass-through)

If any required package is missing when chat is invoked, the route returns a clear error pointing at the missing package — octocms itself keeps working as a normal CMS.

2. Get a provider API key (or start a local server)#

Anthropic#

Sign in at console.anthropic.com.
Open Settings → API Keys → Create Key. Copy the value (starts with sk-ant-…).
Make sure billing is enabled.

OpenAI#

Sign in at platform.openai.com.
Open API Keys → Create new secret key. Copy the value (starts with sk-…).
Make sure billing is enabled.

Local (Ollama example)#

# Install: https://ollama.com
ollama pull llama3.2:3b
ollama serve              # exposes OpenAI-compatible API at http://localhost:11434/v1

LM Studio, vLLM, and llama.cpp's server all expose the same OpenAI-compatible API — just point baseURL at whichever you run.

One key per deploy. The key is shared by every editor signed into that deploy of the CMS. There is no per-user key UI yet.

3. Add the key to your environment#

Local development#

Add the key for your provider to .env.local at the root of your project:

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
OPENAI_API_KEY=sk-...

# Local — usually no key needed

Restart npm run dev so Next.js picks up the new variable.

Production (Vercel)#

In the Vercel dashboard, open your project → Settings → Environment Variables.
Add ANTHROPIC_API_KEY (or OPENAI_API_KEY) for the Production environment (and Preview if you want chat on previews).
Redeploy — Vercel only injects new env vars on the next build.

Production (other hosts)#

Set the env var however your platform configures secrets (Fly.io secrets, Railway variables, Docker --env, systemd Environment=, etc.).

4. Configure the agent in `cms/octocms.config.ts`#

All tunable knobs — provider, model, spend cap, per-conversation token limits, attachment limits — live in your project's existing cms/octocms.config.ts, alongside the schema export. The package only ships defaults; you override anything you want.

// cms/octocms.config.ts
import type { Config } from '../octocms/types';
// Import from `agent/types` + `agent/defaults` (not the `octocms/agent` barrel) so
// `configInit` / instrumentation do not pull embeddings, search, or admin file I/O.
import type { AgentConfig } from '../octocms/agent/types';
import { defineAgentConfig } from '../octocms/agent/defaults';
import { schema } from './__generated__/schema';

const _typedConfigOctoCMS = schema;
export const configOctoCMS: Config = _typedConfigOctoCMS as Config;
export type OctoConfig = typeof _typedConfigOctoCMS;

export const agentConfig: AgentConfig = defineAgentConfig({
  // Override any defaults here. With no overrides you get:
  // Anthropic Claude Haiku 4.5, $5 budget cap, 100k input / 10k output per conversation.
});

defineAgentConfig({...}) shallow-merges your overrides into the package defaults; the provider field is replaced wholesale (it's a discriminated union, so partial merges across types would be invalid).
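
The merge rule above can be sketched as follows. This is a minimal illustration, not the package's code: `AgentConfigSketch`, `defaultsSketch`, and `mergeAgentConfigSketch` are hypothetical names, and the config type is cut down to three fields.

```typescript
// Hypothetical, simplified types: the real AgentConfig has more fields.
type ProviderSketch =
  | { type: 'anthropic' | 'openai'; model: string; pricing: { inputPerM: number; outputPerM: number } }
  | { type: 'local'; model: string; baseURL: string };

type AgentConfigSketch = {
  provider: ProviderSketch;
  totalBudgetUSD: number;
  maxProposalsPerTurn: number;
};

const defaultsSketch: AgentConfigSketch = {
  provider: { type: 'anthropic', model: 'claude-haiku-4-5-20251001', pricing: { inputPerM: 1, outputPerM: 5 } },
  totalBudgetUSD: 5,
  maxProposalsPerTurn: 20,
};

// Shallow merge: top-level keys from `overrides` win, nested objects are NOT merged.
// That is why `provider` is replaced as a whole rather than field by field.
function mergeAgentConfigSketch(overrides: Partial<AgentConfigSketch>): AgentConfigSketch {
  return { ...defaultsSketch, ...overrides };
}

const merged = mergeAgentConfigSketch({
  provider: { type: 'local', model: 'llama3.2:3b', baseURL: 'http://localhost:11434/v1' },
  totalBudgetUSD: 0,
});
```

Here `merged.provider` is the local provider exactly as written (no leftover `pricing` from the Anthropic default), while `maxProposalsPerTurn` keeps its default of 20.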

Provider examples#

// The three exports below are alternatives: keep exactly one in your config.
// Anthropic: uses ANTHROPIC_API_KEY
export const agentConfig = defineAgentConfig({
  provider: {
    type: 'anthropic',
    model: 'claude-haiku-4-5-20251001',
    pricing: { inputPerM: 1, outputPerM: 5, cachedInputPerM: 0.1 },
  },
  totalBudgetUSD: 5,
});

// OpenAI — uses OPENAI_API_KEY
export const agentConfig = defineAgentConfig({
  provider: {
    type: 'openai',
    model: 'gpt-4.1-mini',
    pricing: { inputPerM: 0.4, outputPerM: 1.6, cachedInputPerM: 0.1 },
  },
  totalBudgetUSD: 10,
});

// Local — Ollama / LM Studio / vLLM (no API key, no budget cap)
export const agentConfig = defineAgentConfig({
  provider: {
    type: 'local',
    model: 'llama3.2:3b',
    baseURL: 'http://localhost:11434/v1',
  },
  totalBudgetUSD: 0, // 0 disables the budget cap
});

Full default set#

Field	Default	Meaning
`provider.type`	`'anthropic'`	`'anthropic'` | `'openai'` | `'local'`
`provider.model`	`'claude-haiku-4-5-20251001'`	Model ID for the chosen provider
`provider.pricing.inputPerM`	`1`	USD per million input tokens (omit for local)
`provider.pricing.outputPerM`	`5`	USD per million output tokens (omit for local)
`provider.pricing.cachedInputPerM`	`0.1`	USD per million cached input tokens (omit for local)
`provider.apiKeyEnv`	`'ANTHROPIC_API_KEY'` / `'OPENAI_API_KEY'`	Override the env var name if needed
`provider.baseURL`	—	Required for `'local'`; optional override for `'openai'`
`maxInputTokens`	`100_000`	Hard cap per conversation
`maxOutputTokens`	`10_000`	Hard cap per conversation
`maxProposalsPerTurn`	`20`	Max approval cards in a single turn
`maxAttachmentBytes`	`26_214_400` (25 MB)	Per-attachment size limit
`maxAttachmentsPerTurn`	`3`	Attachments per chat turn
`totalBudgetUSD`	`5`	Cumulative deploy spend cap. Set to `0` to disable.

If you change the model, also update provider.pricing to match — the spend counter relies on those numbers.

5. Verify it's working#

Sign into the CMS at /cms.
You should see a new Chat link in the header nav.
Click it — the page loads at /cms/chat. Try "Show me posts about caching" or any question about your content.

If /cms/chat shows the setup screen instead of the composer:

Click Open setup guide on that page (or continue with the sections below).
Confirm agentConfig is exported from cms/octocms.config.ts and picked up by cms/__generated__/configInit.ts.
Confirm the optional packages are installed (npm ls @anthropic-ai/sdk or npm ls openai).
Confirm the right env var is set for your provider and you redeployed / restarted.
Next.js only reads .env.local on dev-server start. Stop and restart npm run dev.

5a. The chat page (Phase 3)#

The chat itself is wired up at /cms/chat. It's a streaming SSE page powered by a Route Handler at src/app/api/octocms/agent/route.ts, but the actual handler lives inside the npm package at octocms/agent/chatApi.ts (chatRoute for POST, chatStatusRoute for GET). Your app's route file is a thin re-export, scaffolded by octocms init / octocms update:

ts// src/app/api/octocms/agent/route.ts (also app/api/octocms/agent/route.ts depending on your layout)
import '../../../../cms/__generated__/configInit';
export { chatRoute as POST, chatStatusRoute as GET } from 'octocms/agent';

This keeps the user app as thin as possible, while the package owns provider dispatch, attachment normalisation, abort propagation, and the SSE wire format. The pieces below are worth knowing about:

What the agent can do today#

Tool	What the model uses it for
`searchContent(query, k?, collection?)`	Semantic search over `cms/__generated__/embeddings.json` — top-K hits with title / score / excerpt. Always called first.
`listCollections()`	Returns the schema with field types — handy when the user asks "what kind of content do I have?"
`getEntry(id, collection?)`	Reads one entry's full JSON payload (so the model can answer detailed questions).
`proposeEdit(entryId, collection, fieldChanges, reasoning)`	Mutating — emits an approval card with a per-field diff. Nothing is written until you click Accept.
`proposeNewEntry(collection, fields, reasoning)`	Mutating — emits an approval card with a read-only field preview for a brand-new entry.

See "Proposals & approval cards" below for how mutations work.

Provider-agnostic by design#

The same UI works against three chat providers, picked by agentConfig.provider.type:

'anthropic' — uses @anthropic-ai/sdk messages.stream(). Native PDF support.
'openai' — uses openai SDK chat.completions.create({ stream: true }).
'local' — same openai SDK with a custom baseURL (Ollama, LM Studio, vLLM, llama.cpp). No API key required by default.

Each provider has a tiny adapter in octocms/agent/providers/. The adapters lazy-load their SDK so a missing optional peer dep surfaces as a clear chat-stream error, not a module-load crash.

The agent loop#

octocms/agent/chat.ts is a single async generator. Per request it:

Builds a fresh system prompt (octocms/agent/systemPrompt.ts) — today's date + collection list + (optional) recent post bodies as style exemplars.
Streams provider events into normalised ChatEvents the client can render directly (text_delta, tool_use_start/_input_delta/_complete, tool_result, usage, turn_stop, done).
Runs each requested tool sequentially, feeds results back as the next turn's tool message, and loops until the model says end_turn (or hits maxOutputTokens / maxInputTokens / maxTurns).
Records token counts via recordTurn(agentConfig, …) — when cumulative spend crosses totalBudgetUSD, the agent disables itself exactly as if no key were set.
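
The loop above can be sketched as an async generator. This is a simplified illustration under stated assumptions: `agentLoopSketch`, `FakeProvider`, `ModelTurn`, and `ToolMap` are hypothetical, only the event names come from the list above, and the real loop also streams deltas, tracks usage, and enforces the token caps.

```typescript
// Hypothetical event and provider shapes; only the event names come from the text above.
type ChatEventSketch =
  | { type: 'text_delta'; text: string }
  | { type: 'tool_result'; name: string; result: string }
  | { type: 'turn_stop'; reason: 'end_turn' | 'tool_use' | 'max_turns' }
  | { type: 'done' };

type ModelTurn = { text: string; toolCalls: { name: string; input: string }[] };
type FakeProvider = (history: string[]) => Promise<ModelTurn>;
type ToolMap = Record<string, (input: string) => Promise<string>>;

async function* agentLoopSketch(
  provider: FakeProvider,
  tools: ToolMap,
  userMessage: string,
  maxTurns = 4,
): AsyncGenerator<ChatEventSketch> {
  const history = [`user: ${userMessage}`];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await provider(history);
    if (reply.text) yield { type: 'text_delta', text: reply.text };
    history.push(`assistant: ${reply.text}`);

    // No tool calls means the model is finished with this request.
    if (reply.toolCalls.length === 0) {
      yield { type: 'turn_stop', reason: 'end_turn' };
      yield { type: 'done' };
      return;
    }

    // Tools run one at a time; each result is fed back for the next model turn.
    for (const call of reply.toolCalls) {
      const tool = tools[call.name];
      const result = tool ? await tool(call.input) : `unknown tool: ${call.name}`;
      history.push(`tool ${call.name}: ${result}`);
      yield { type: 'tool_result', name: call.name, result };
    }
    yield { type: 'turn_stop', reason: 'tool_use' };
  }
  // Safety valve: stop looping even if the model keeps asking for tools.
  yield { type: 'turn_stop', reason: 'max_turns' };
  yield { type: 'done' };
}
```

A caller consumes it with `for await (const event of agentLoopSketch(...))` and forwards each event to the client as it arrives.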

Stateless on the server#

Conversation history lives in the client — every request POSTs the full message array. There is no server-side session state, which means the route works inside a Vercel Function with no extra storage. (Spend is the one exception — counted in-memory per process and reset on cold start; for a hard cap, set a workspace alert in your provider's console alongside totalBudgetUSD.)

Conversation history (browser storage)#

The chat page keeps a conversation list in the left sidebar and saves transcripts to localStorage in this browser only (not synced across devices, users, or GitHub accounts).

Key	Contents
`octocms:chat-sessions`	Up to 50 saved chats (newest first in the sidebar), each with title, wire `history`, transcript `entries`, proposal card state, and usage totals
`octocms:chat-active-id`	Which saved chat is open after refresh

Sidebar rules:

A chat appears in the list only after you send at least one message (empty drafts from New chat / New conversation stay off the list until then).
New chat in the sidebar and New conversation in the top bar both start a fresh draft; your previous chats remain saved.
Click a row to switch chats; use the trash icon to delete (with confirmation). Deleting the open chat opens the newest remaining chat, or a blank draft if none remain.

Resume behaviour: Reloading /cms/chat restores the active chat’s transcript and wire history so the next message continues the thread. Attachment filenames show on old user bubbles, but files are not stored — you cannot re-send uploads from a saved chat without attaching again. If a session ended with budget exceeded, the composer stays disabled until you start New conversation (same as before persistence).

Clearing site data or using a private window removes history. Implementation: octocms/components/Chat/chatStorage.ts, useChatHistory.ts, ChatSidebar.tsx.

Stop button#

While the assistant is streaming, the New conversation button in the top bar is replaced with a red Stop button. Clicking it:

Aborts the in-flight fetch via an AbortController — the SSE connection drops immediately on the wire.
The package-level chatRoute listens to request.signal and short-circuits the agent loop, so no further model calls are made (and no further tokens are billed) for that turn.
The hook keeps whatever assistant text + tool calls have already streamed back, folds them into the wire history, and flips status to 'stopped'. The next message you send picks up the conversation from that point — the model sees the partial assistant reply and can either continue it or take a different direction.
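
The stop behaviour can be sketched as a loop that checks an abort flag before each step and keeps whatever was already produced. This is an assumption-laden illustration: `runTurnSketch` and `SignalLike` are hypothetical, and `steps` stands in for successive model or tool calls within one turn.

```typescript
// Minimal structural stand-in for AbortSignal, so the sketch needs no DOM or Node typings.
type SignalLike = { aborted: boolean };
type StopResult = { kept: string[]; status: 'done' | 'stopped' };

function runTurnSketch(steps: Array<() => string>, signal: SignalLike): StopResult {
  const kept: string[] = [];
  for (const step of steps) {
    // Check before every step so no further (billable) work starts after Stop.
    if (signal.aborted) return { kept, status: 'stopped' };
    kept.push(step());
  }
  return { kept, status: 'done' };
}
```

The `signal` property of a standard `AbortController` satisfies `SignalLike`, so the same check works with `request.signal` in a Route Handler.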

Use this when a small local model is going in circles, when a long tool chain is running over budget, or just to interrupt and rephrase. The same abort also fires automatically on New conversation (so a reset doesn't leak a connection) and on page navigation (useChatStream cleans up on unmount).

Switching providers#

Open cms/octocms.config.ts and change agentConfig.provider. A common setup is a local LM Studio config (no key, no cost) for fast dev iteration, with a hosted provider for production:

// dev: local LM Studio with Qwen 2.5 Coder 14B
provider: {
  type: 'local',
  model: 'qwen/qwen2.5-coder-14b',
  baseURL: 'http://localhost:1234/v1',
}

// prod — switch to Anthropic before deploying to Vercel
provider: {
  type: 'anthropic',
  model: 'claude-haiku-4-5-20251001',
  pricing: { inputPerM: 1, outputPerM: 5, cachedInputPerM: 0.1 },
}

Local providers are dev-only on Vercel. A Vercel function can't reach http://localhost:… — production deploys must use Anthropic or OpenAI.

Proposals & approval cards (Phase 4)#

The agent never writes content directly. When you ask it to fix a typo, rewrite a paragraph, or create a new entry, it calls one of two mutating tools and emits a proposal. A card appears in the chat with the proposed change — you click Accept or Reject (with optional reason), and only then does the actual write happen.

How it flows

You ask: "Fix the typo 'recieve' in any post that uses it."
The agent runs searchContent → finds the post → reads it with getEntry → calls proposeEdit with the fixed text.
A Proposed edit card appears inline in the chat:
- One-sentence reasoning from the model (verbatim).
- A per-field diff (DiffHunk for long values, side-by-side for short ones).
- Accept / Reject buttons.
Click Accept — the client calls the acceptProposalAction Server Action (no public endpoint). The server re-validates against the schema (we trust nothing across the SSE → accept boundary), then runs the same saveFile action your editor uses. Embeddings update automatically (Phase 1 hook), public caches revalidate (buildJsons), and the card flips to Accepted ✓ with the saved path. In the admin UI, useChatStream also runs invalidateAfterMutationAsync for the entries domain (awaited fan-out + BroadcastChannel to sibling tabs) so /cms/content and the entry editor refetch before the auto-follow-up turn without router.refresh().
The chat continues with a synthetic system message telling the model the result, so the next turn references reality.

For new entries, proposeNewEntry works the same way but the card shows a read-only field-by-field preview. Acceptance calls newFile(collection) then saveFile with the proposed values.

What the server actually does on Accept

acceptProposal (in octocms/agent/proposals.ts) is stateless:

For an edit: re-reads the live entry, merges proposed fieldChanges over the existing fields, runs validateEntryFields, and calls saveFile(payload, entryPath). If validation fails or saveFile returns fieldErrors, the card surfaces them and the model gets a chance to self-correct.
For a create: calls newFile(collection) for the UUID + draft skeleton, re-reads it, merges proposed fields, then calls saveFile. Returns the new entry path on success.

There is no in-memory queue, no per-conversation server state — the entire proposal payload is shipped over SSE and replayed back on accept. This keeps the flow Vercel-safe.
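
The edit path (merge proposed changes over the live entry, validate, then save) can be sketched as below. The names `acceptEditSketch` and `validateSketch` are hypothetical, and the validator is a toy stand-in for validateEntryFields that only checks required strings.

```typescript
type Fields = Record<string, unknown>;
type EditProposalSketch = { entryId: string; fieldChanges: Fields };
type AcceptResultSketch =
  | { ok: true; fields: Fields }
  | { ok: false; fieldErrors: Record<string, string> };

// Hypothetical validator: every field listed in `required` must be a non-empty string.
function validateSketch(fields: Fields, required: string[]): Record<string, string> {
  const errors: Record<string, string> = {};
  for (const name of required) {
    const value = fields[name];
    if (typeof value !== 'string' || value.trim() === '') errors[name] = 'required';
  }
  return errors;
}

// Merge proposed changes over the live entry, then validate before any write happens.
function acceptEditSketch(live: Fields, proposal: EditProposalSketch, required: string[]): AcceptResultSketch {
  const merged = { ...live, ...proposal.fieldChanges };
  const fieldErrors = validateSketch(merged, required);
  if (Object.keys(fieldErrors).length > 0) return { ok: false, fieldErrors };
  return { ok: true, fields: merged }; // a real implementation would call saveFile here
}
```

Because the merge starts from the freshly re-read entry, fields the proposal does not mention keep their current values, and a failed validation returns errors without touching the repo.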

Where Accept / Reject live

Accept and Reject are Server Actions, not Route Handlers. They live in the package at octocms/admin/actions/agent.ts as acceptProposalAction(proposal) and rejectProposalAction(reason?). useChatStream imports them directly and calls them via the Server Action transport:

import { acceptProposalAction } from 'octocms/admin/actions/agent';
const result = await acceptProposalAction(proposal); // { ok, entryPath } | { ok: false, error, fieldErrors? }

No public /api/octocms/agent/proposals/* endpoint, no thin re-export file in the user app. Each action re-validates the payload (isProposal from octocms/agent/proposals.ts), checks the CMS session, then runs acceptProposal. The chat SSE route (/api/octocms/agent) is still a Route Handler — streaming text/event-stream is what Route Handlers are for.

Per-turn cap

The chat loop enforces agentConfig.maxProposalsPerTurn (default 20). A model that tries to emit more proposals in a single turn gets back an error result on the over-limit calls; the cards already on screen still work. The cap protects against runaway tool-use loops on small models.

Reject

Reject calls the rejectProposalAction Server Action (essentially a no-op acknowledgment — there's nothing to roll back) and continues the conversation with a system note so the model knows not to re-propose the same change. Optionally include a one-line reason — it's surfaced verbatim to the model.

Accept all pending

When ≥ 2 proposals are pending in the same assistant turn, an Accept all pending button appears below the transcript. It accepts each in order, halting on the first failure (so a validation error doesn't avalanche).
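
A minimal sketch of the halt-on-first-failure behaviour, assuming an `accept` callback shaped like the Server Action result shown earlier. `acceptAllPendingSketch` and `AcceptOutcome` are hypothetical names.

```typescript
type AcceptOutcome = { ok: true; entryPath: string } | { ok: false; error: string };

async function acceptAllPendingSketch<P>(
  pending: P[],
  accept: (proposal: P) => Promise<AcceptOutcome>,
): Promise<{ accepted: string[]; failedAt: number | null }> {
  const accepted: string[] = [];
  let index = 0;
  for (const proposal of pending) {
    // Sequential on purpose: each accept finishes before the next one starts.
    const outcome = await accept(proposal);
    if (!outcome.ok) return { accepted, failedAt: index }; // later proposals stay pending
    accepted.push(outcome.entryPath);
    index++;
  }
  return { accepted, failedAt: null };
}
```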

What the agent cannot do

Edit cms/schema.json — schema changes stay in the Visual Schema Editor at /cms/model.
Delete entries.
Bulk-update across many entries in one tool call (use repeated proposeEdit instead).

These are out of scope for v1 and can be added later without changing the proposal protocol.

5b. Document upload (PDF + DOCX) — Phase 5#

The chat composer can attach PDF, DOCX, .txt, and .md files. The agent uses them to draft edits to existing entries (e.g. "update the homepage hero copy to match this press release") or to create new ones.

How an attachment travels#

The user picks a file (or drops it on the composer). The Composer enforces agentConfig.maxAttachmentBytes and agentConfig.maxAttachmentsPerTurn client-side.
The chat hook switches to multipart/form-data for that turn — messages is JSON-serialised in one form field, files are appended under files.
The Route Handler at octocms/agent/chatApi.ts (re-exported by your app/api/octocms/agent/route.ts) re-validates the size + count caps, then calls normalizeAttachments(rawAttachments, { supportsNativePdf }). Each provider adapter exposes supportsNativePdf so the route picks the right path.
Per-file dispatch:
- PDF on Anthropic (supportsNativePdf: true) → wrapped as a normalised document_pdf block. The Anthropic adapter maps it to the SDK's { type: 'document', source: { type: 'base64', media_type: 'application/pdf', data: … } } so Claude reads images, tables, and layout natively.
- PDF on OpenAI / local → extracted server-side via pdfjs-dist (legacy build, no worker) and inlined as a text block prefixed with [Attached document: filename.pdf]. Loses images and complex layout — text fidelity only.
- DOCX (any provider) → extracted server-side via mammoth (extractRawText) → text block, same prefix.
- .txt / .md → text block, same prefix.
- Unsupported (PNG, ZIP, etc.) → skipped with a diagnostics entry surfaced inline to the user.
The blocks are appended to the last user message in the wire history. The agent loop sees them as part of that turn — no special path through chat.ts.
The route emits an event: attachments SSE block with the per-file diagnostics; the chat UI shows the OK/skipped list inline above the assistant turn.
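
The per-file dispatch above can be sketched as a small routing function. This is an illustration only: `routeAttachmentSketch` is hypothetical, and deciding by file extension is an assumption (the real normalizeAttachments may inspect MIME types instead).

```typescript
type AttachmentRoute =
  | { kind: 'document_pdf' } // native pass-through
  | { kind: 'text'; extractor: 'pdfjs-dist' | 'mammoth' | 'none' }
  | { kind: 'skipped'; reason: string };

function routeAttachmentSketch(filename: string, supportsNativePdf: boolean): AttachmentRoute {
  const ext = filename.toLowerCase().split('.').pop() ?? '';
  switch (ext) {
    case 'pdf':
      // Providers with native PDF support get the raw document; others get extracted text.
      return supportsNativePdf ? { kind: 'document_pdf' } : { kind: 'text', extractor: 'pdfjs-dist' };
    case 'docx':
      return { kind: 'text', extractor: 'mammoth' };
    case 'txt':
    case 'md':
      return { kind: 'text', extractor: 'none' };
    default:
      return { kind: 'skipped', reason: `unsupported file type: .${ext}` };
  }
}
```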

Optional peer dependencies#

Both extractors are optional — install only what you need.

Package	Required for
`mammoth`	DOCX uploads (any provider)
`pdfjs-dist`	PDF uploads on OpenAI / local providers (Anthropic uses native pass-through and needs no library)

# Anthropic only: PDFs ride native, only DOCX needs a library
npm install mammoth

# OpenAI or local — both PDF + DOCX need text extraction
npm install mammoth pdfjs-dist

If a peer dep is missing at request time the Route Handler reports the file as skipped with a clear "Install …" message — the chat keeps going, the model just doesn't see that file.

Page matching — find the right entry to update#

When a user uploads a document without naming a target entry, the agent uses findEntryForDocument(documentText, hintUrl?, k?) to suggest candidates:

URL hint match — when the user pasted a URL or path, the tool walks every collection's optional routeTemplate field (e.g. '/blog/[slug]'). If the placeholders extract field values that match an existing entry, that entry is the highest-confidence candidate (matchedBy: 'routeTemplate').
Search fallback — embeds the document text and runs searchContent over the existing index. Returns the top-K hits as matchedBy: 'search'.
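
The URL hint match can be sketched as plain segment-by-segment template matching. Both helpers are hypothetical, and the real findEntryForDocument goes on to look up an entry whose fields equal the extracted values, which is not shown here.

```typescript
// Reduce a pasted URL or bare path to its pathname (no origin, query, or hash).
function pathFromHintSketch(hint: string): string {
  return hint.replace(/^https?:\/\/[^/]+/, '').split(/[?#]/)[0] ?? '';
}

// Match a path against a template such as '/blog/[slug]' and return the placeholder values,
// or null when the path does not fit the template.
function matchRouteTemplateSketch(template: string, path: string): Record<string, string> | null {
  const templateParts = template.split('/').filter(Boolean);
  const pathParts = path.split('/').filter(Boolean);
  if (templateParts.length !== pathParts.length) return null;

  const values: Record<string, string> = {};
  for (let i = 0; i < templateParts.length; i++) {
    const placeholder = /^\[(.+)\]$/.exec(templateParts[i]);
    if (placeholder) values[placeholder[1]] = pathParts[i]; // '[slug]' captures this segment
    else if (templateParts[i] !== pathParts[i]) return null; // literal segment must match
  }
  return values;
}
```

For the template '/blog/[slug]', a hint of 'https://example.com/blog/hello-world?ref=x' yields { slug: 'hello-world' }, which can then be compared against each post's slug field.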

routeTemplate is optional and hand-edited in cms/schema.json:

{
  "collections": {
    "post": {
      "label": "Posts",
      "hasMany": true,
      "routeTemplate": "/blog/[slug]",
      "fields": { "slug": { "label": "Slug", "format": "slug" }, /* … */ }
    }
  }
}

The Visual Schema Editor doesn't surface routeTemplate yet (out of scope for v1) — set it by hand. Without it, the agent falls back to search-based candidates and asks the user which one to update.

Limits#

Cap	Default	Override
Per-attachment size	25 MB	`agentConfig.maxAttachmentBytes`
Attachments per turn	3	`agentConfig.maxAttachmentsPerTurn`

Over-cap files are rejected client-side (composer chip) and re-checked server-side (HTTP 400). Anthropic bills PDFs by page (~1,000–3,000 tokens / page), so a 100-page PDF (roughly 100k–300k tokens) can exhaust the default 100,000-token conversation input cap on its own. Focused 1–10 page documents are the sweet spot.

What's lost in the OpenAI / local PDF path#

pdfjs-dist extracts the visible text layer — that's everything needed for text-heavy editorial documents (briefs, press releases, articles). What it does not preserve:

Images and figures (no OCR — scanned PDFs come out empty).
Table structure (cells get joined with whitespace; rows aren't reliably separated).
Page-level layout cues (headers, footers, footnotes flow into the body text).
Annotations / form fields.

If image and layout fidelity matter, pick the Anthropic provider — Claude reads PDFs as binary documents and reasons over the visual layout directly.

Tool-use quality on small models#

Verified on Qwen 2.5 Coder 14B (single-tool prompts produce valid JSON tool calls). Sub-7B models (Llama 3.2 3B, Phi-3 mini, Qwen 2.5 1.5B) often return malformed JSON or skip tool calls on multi-tool prompts — keep to ≥ 7B coder-tuned models for local. For production editorial work, Claude Haiku 4.5 / GPT-4.1-mini both handle the loop cleanly.

6. Embeddings pipeline#

The chat agent retrieves content via cosine similarity over a committed JSON index: no vector database, no hosted embedding API. Embeddings run locally and offline via @huggingface/transformers using Xenova/bge-small-en-v1.5 (384-dim, ~30 MB ONNX, MIT license).

Search vs. embeddings#

OctoCMS ships four search surfaces, and only the chat agent uses embeddings:

Surface	Index	Embeddings?	Implementation
Public `/api/search`	MiniSearch JSON (built on demand)	No	`octocms/admin/searchRoute.ts` → `octocms/lib/publicSearchIndex.ts`
Admin CommandK palette	Same MiniSearch index	No	`octocms/components/CommandK/CommandK.tsx` → `octocms/admin/actions/search.ts`
MediaManager search input	Client-side `String.includes`	No	`octocms/components/MediaManager/MediaManager.tsx`
Chat agent `searchContent` tool	`cms/__generated__/embeddings.json` (cosine)	Yes	`octocms/agent/search.ts`

In practice:

You only need to run npx octocms embeddings:gen (and install @huggingface/transformers) if you enable the chat agent.
cms/__generated__/embeddings.json only needs to stay fresh for chat retrieval — the other three indexes rebuild themselves automatically.
Skipping the chat agent? Drop @huggingface/transformers from your dependencies and ignore this section entirely.

What gets indexed#

One vector per entry. The text fed to the model is every leaf field in entry.fields plus any companion .md / .mdx content, flattened into a single string with field-name: value lines. Reference fields keep their raw key strings (e.g. "author-abc.json") — no recursive resolution, so indexing stays cheap.

Media entries are indexed too (since the move to a top-level cms/media/ folder). Their title, originalName, and folder fields become searchable text — so a query like "sunset" or "hero image for blog" surfaces matching media assets alongside editorial entries. Image pixel data is not embedded (text only); the binary file isn't read by the indexer.
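
The flattening described above can be sketched as follows. The function names are hypothetical, and the dot-path naming for nested fields and array items (`seo.description`, `tags.0`) is an assumption; the source only states that leaves become `field-name: value` lines.

```typescript
function flattenFieldsSketch(fields: Record<string, unknown>, prefix = ''): string[] {
  const lines: string[] = [];
  for (const [key, value] of Object.entries(fields)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (value === null || value === undefined || value === '') continue; // skip empty leaves
    if (typeof value === 'object') {
      // Objects and arrays recurse; array items get index-based names (tags.0, tags.1, ...).
      lines.push(...flattenFieldsSketch(value as Record<string, unknown>, name));
    } else {
      // Reference fields are plain strings here, so they stay as raw keys (no resolution).
      lines.push(`${name}: ${String(value)}`);
    }
  }
  return lines;
}

// One string per entry: flattened fields first, then any companion markdown body.
function embeddingTextSketch(fields: Record<string, unknown>, companionMarkdown = ''): string {
  return [...flattenFieldsSketch(fields), companionMarkdown].filter(Boolean).join('\n');
}
```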

When it updates#

On save / create / delete — saveFile, newFile, and removeFile call the embeddings hook automatically when the agent is configured. Media writes (uploadMedia, updateMediaMetadata, moveMedia, deleteMedia) call the same hook. Failures are best-effort (logged, never fail the content write); a CI run of octocms embeddings:gen repairs any drift.
Manual rebuild — run npx octocms embeddings:gen after a bulk content import or after enabling the agent for the first time:
```bash
npx octocms embeddings:gen
```
Re-running on unchanged content is a fast no-op (sha256 hash skip per entry).
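
The per-entry hash skip can be sketched with Node's built-in crypto module. `pathsToEmbedSketch` is a hypothetical helper; it assumes the stored `hash` is the hex SHA-256 of the entry's embedding text, as the store format in the Storage section suggests.

```typescript
import { createHash } from 'node:crypto';

type StoreEntry = { hash: string; vec: string };

function sha256Hex(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Returns the paths whose text is new or changed and therefore needs re-embedding.
// Entries whose stored hash still matches are skipped without touching the model.
function pathsToEmbedSketch(texts: Record<string, string>, store: Record<string, StoreEntry>): string[] {
  return Object.keys(texts)
    .filter((path) => store[path]?.hash !== sha256Hex(texts[path]))
    .sort();
}
```

On unchanged content the filter returns an empty list, so the expensive step (loading the model and embedding) never runs.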

Storage#

The store lives at cms/__generated__/embeddings.json and is committed to your repo. At ~5,000 entries it weighs in around 11 MB.

{
  "model": "Xenova/bge-small-en-v1.5",
  "dim": 384,
  "entries": {
    "cms/content/post/post-abc.json": {
      "hash": "<sha256 of embedding text>",
      "vec": "<base64-encoded Float32Array>"
    }
  }
}

Vectors are base64 to keep the file ASCII-safe; entries are sorted by path for stable Git diffs.
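
A minimal sketch of the base64 round trip for a `Float32Array`, using Node's `Buffer`. The helper names are hypothetical, and the byte order (the platform's native order, little-endian on common hardware) is an assumption about the on-disk format.

```typescript
// Encode the vector's raw float32 bytes as base64.
function encodeVecSketch(vec: Float32Array): string {
  return Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength).toString('base64');
}

function decodeVecSketch(b64: string): Float32Array {
  const bytes = Buffer.from(b64, 'base64');
  // Copy into a fresh, aligned ArrayBuffer before viewing it as float32.
  const copy = new Uint8Array(bytes.byteLength);
  copy.set(bytes);
  return new Float32Array(copy.buffer);
}
```

A 384-dimension vector is 384 × 4 = 1,536 bytes, which base64 expands to 2,048 characters. At 5,000 entries that is about 10.2 MB of vector data, which lines up with the ~11 MB figure above once paths, hashes, and JSON punctuation are added.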

Performance#

Step	Cost
Model cold load (first call after process start)	3–10 s
Embed one entry (warm)	<50 ms
Re-run with no changes	<100 ms (hash skip per entry)
Cosine search over 5,000 vectors	~10 ms

On Vercel, only query embedding runs in the function (one short call per chat turn). Indexing is offline — embeddings.json is built on the developer's machine and committed.

Switching to a different embedding model#

The Embedder interface in octocms/agent/embedder.ts is intentionally tiny — embed(texts) → Float32Array[], plus dim and modelId metadata. Drop in a hosted provider (Voyage, OpenAI) by implementing the interface and registering it via setDefaultEmbedder(). Changing dim or modelId automatically invalidates the on-disk store (records with the wrong dim are re-embedded on the next run).

7. Retrieval — `searchContent`#

The chat agent's searchContent tool is a thin wrapper around a public helper you can also call from your own code (scripts, route handlers, future custom tools). It runs cosine similarity over the committed embeddings store and returns ranked content hits — no vector database, no network calls beyond the one local query embedding.

import { searchContent } from 'octocms/agent';

const hits = await searchContent('caching strategy', {
  k: 10,             // top-K — default 10
  collection: 'post', // optional: restrict to one collection
});

for (const hit of hits) {
  console.log(`[${hit.score.toFixed(3)}] ${hit.collection}/${hit.id} — ${hit.title}`);
  if (hit.excerpt) console.log(`  ${hit.excerpt}`);
}
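
Under the hood, the ranking step is a brute-force cosine scan. The sketch below shows that technique in isolation; `cosineSketch` and `topKSketch` are hypothetical helpers, not the package's exports, and the index is a plain in-memory map rather than the decoded embeddings store.

```typescript
function cosineSketch(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as "no similarity"
}

// Score every vector against the query, sort by score descending, keep the first k.
function topKSketch(
  query: Float32Array,
  index: Record<string, Float32Array>,
  k = 10,
): { path: string; score: number }[] {
  return Object.entries(index)
    .map(([path, vec]) => ({ path, score: cosineSketch(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this is cheap at the sizes quoted in this guide, which is why no vector database is needed.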

Return shape#

type SearchHit = {
  id: string;          // filename stem (or sys.id for media)
  path: string;        // 'cms/content/post/post-abc.json'
  collection: string;  // sys.type — e.g. 'post'
  score: number;       // cosine similarity in [-1, 1]; higher = better
  title: string;       // entry's entryTitle field, or filename stem fallback
  excerpt: string;     // first non-title text-like field, ≤ 200 chars
};

title mirrors what the entry list UI shows. excerpt is built from entry.fields only (no companion .md read per hit) — fast, but a markdown-only post may have an empty excerpt.

Options#

Option	Default	Meaning
`k`	`10`	Top-K hits to return
`collection`	—	Restrict to a single collection (`sys.type`)
`branch`	active branch	Forwarded to the store reader and per-hit entry reader so cross-branch search is consistent
`embedder`	local singleton	Test seam — defaults to the same `LocalTransformersEmbedder` used by indexing
`noCache`	`false`	Bypass the in-process store cache
`excerptLength`	`200`	Max excerpt length in characters

Caching#

The embeddings store is cached in module scope for ~30 s, keyed by branch. That covers the back-and-forth of a single chat turn (multiple searchContent tool calls share one load) without holding stale data across long-running processes. To pick up just-committed embeddings inside a long-lived process — e.g. a script that edits and immediately re-searches — pass noCache: true or call clearSearchCache().
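
The caching behaviour above can be sketched as a module-scope map with a time-to-live, keyed by branch. `createBranchCacheSketch` is hypothetical; the loader is synchronous and the clock is injectable purely to keep the example small and testable.

```typescript
type CacheEntry<T> = { value: T; loadedAt: number };

function createBranchCacheSketch<T>(
  load: (branch: string) => T,
  ttlMs = 30_000,
  now: () => number = Date.now,
) {
  const cache = new Map<string, CacheEntry<T>>();
  return {
    get(branch: string, noCache = false): T {
      const hit = cache.get(branch);
      // Serve from cache only while the entry is younger than the TTL.
      if (!noCache && hit && now() - hit.loadedAt < ttlMs) return hit.value;
      const value = load(branch);
      cache.set(branch, { value, loadedAt: now() });
      return value;
    },
    clear(): void {
      cache.clear();
    },
  };
}
```

Several lookups within one chat turn share a single load, while `noCache` (or `clear()`) forces a fresh read, mirroring the `noCache: true` option and clearSearchCache() described above.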

Latency#

Step	Cost
Query embedding (cold start, first call after process start)	3–10 s
Query embedding (warm)	< 100 ms
Cosine over 5,000 vectors	~10 ms
Per-hit entry payload read	1 disk / 1 GitHub fetch

On Vercel, the model cold-start happens on the first chat request after a function instance spins up — subsequent turns hit the warm path.

What if there are no embeddings?#

If cms/__generated__/embeddings.json doesn't exist (e.g. you've enabled the agent but never run octocms embeddings:gen), searchContent returns []. The chat agent's tool wrapper turns that into a "no results — try running npx octocms embeddings:gen" message instead of hallucinating answers.

How the key is used#

Server-side only. The provider's API key is read by the Next.js server (Route Handler at src/app/api/octocms/agent). It is never sent to the browser.
Feature flag. isAgentEnabled(agentConfig) gates the API routes only. It checks whether the configured provider has a key and whether the deploy is under budget. The key value itself is never serialized.
Admin UI when off. /cms/chat always renders. Without a key (or once the budget is exceeded), you see a short in-admin checklist that links to octocms.gunkin.dev/docs/chat-agent (octocms/agent/chatSetup.ts). /api/octocms/agent still returns 404 until enabled.

See octocms/agent/featureFlag.ts for the check.

Costs and limits#

The default config targets Claude Haiku 4.5 (claude-haiku-4-5-20251001) — the cheapest current Claude tier with full tool-use support — at Anthropic's standard per-token pricing ($1 / M input, $5 / M output, $0.10 / M cached input). OpenAI and local providers use the pricing you configure in provider.pricing (or zero, for local without pricing).

Three layers of caps (per deploy, per conversation, and per turn) protect you from runaway usage:

Cap	Default	Override in `cms/octocms.config.ts`
Total spend for the deploy (chat disables when exceeded)	$5	`agentConfig.totalBudgetUSD` (`0` disables cap)
Input tokens per conversation	100,000	`agentConfig.maxInputTokens`
Output tokens per conversation	10,000	`agentConfig.maxOutputTokens`
Proposals per turn	20	`agentConfig.maxProposalsPerTurn`
Attachment size	25 MB	`agentConfig.maxAttachmentBytes`
Attachments per turn	3	`agentConfig.maxAttachmentsPerTurn`

Hitting a per-conversation cap ends that conversation with a "Budget reached" notice. Click New conversation to start over.

The total spend cap is harder: when cumulative spend on the deploy exceeds totalBudgetUSD, the chat API is disabled exactly as if no API key were set — /cms/chat shows the budget setup state and /api/octocms/agent returns 404. To re-enable, raise totalBudgetUSD (or restart / redeploy to reset the in-memory counter — see below).
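For orientation, overriding the caps might look like the fragment below. The field names come from the table above. The plain-object export shape and the chosen values are assumptions for illustration, and the other `agentConfig` fields (such as the provider settings) are omitted.

```typescript
// cms/octocms.config.ts (fragment, illustrative values)
// Assumption: agentConfig is a plain exported object; provider settings are omitted here.
export const agentConfig = {
  totalBudgetUSD: 10,                    // default 5; 0 disables the deploy-wide cap
  maxInputTokens: 100_000,               // default, per conversation
  maxOutputTokens: 10_000,               // default, per conversation
  maxProposalsPerTurn: 20,               // default
  maxAttachmentBytes: 10 * 1024 * 1024,  // 10 MiB, tighter than the 25 MB default
  maxAttachmentsPerTurn: 3,              // default
};
```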

What $5 buys you on Haiku 4.5

Scenario	Input	Output	Cost
Hit conversation cap (worst case)	100k	10k	~$0.15
Typical 5-turn conversation	~30k	~3k	~$0.045
Quick "find me posts about X"	~8k	~500	~$0.011
10-page PDF + edit proposal	~25k	~2k	~$0.035

That is roughly 33 worst-case conversations, or 100+ typical editorial sessions, before the deploy hits its budget. Local providers bypass the cap entirely (totalBudgetUSD: 0).
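The figures in the table follow directly from the per-token prices quoted earlier ($1 / M input, $5 / M output). A small sketch of the arithmetic, using hypothetical helper names and ignoring the cheaper cached-input rate:

```typescript
// Illustrative helpers; the names are not part of OctoCMS.
type Pricing = { inputPerMTok: number; outputPerMTok: number }; // USD per million tokens

// Haiku 4.5 list prices as quoted in this guide.
const HAIKU_4_5: Pricing = { inputPerMTok: 1, outputPerMTok: 5 };

// cost = (input tokens * input price + output tokens * output price) / 1,000,000
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  pricing: Pricing = HAIKU_4_5,
): number {
  return (inputTokens * pricing.inputPerMTok + outputTokens * pricing.outputPerMTok) / 1_000_000;
}

// How many whole conversations of a given cost fit inside a budget.
function conversationsWithinBudget(budgetUSD: number, costPerConversationUSD: number): number {
  return Math.floor(budgetUSD / costPerConversationUSD);
}

const worstCase = estimateCostUSD(100_000, 10_000); // 0.10 input + 0.05 output = 0.15 USD
const typical = estimateCostUSD(30_000, 3_000);     // 0.03 input + 0.015 output = 0.045 USD
```

With the default $5 budget, `conversationsWithinBudget(5, worstCase)` gives 33 and `conversationsWithinBudget(5, typical)` gives 111, matching the estimate above.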

About the spend counter

It's in-memory and per-process — every Vercel function instance owns its own counter, and it resets on cold start. This is intentional: a tiny safety net against runaway loops, not a hard accounting guarantee. For a real hard cap, set a workspace budget alert in your provider's console alongside totalBudgetUSD.
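To see why a per-process counter is only a safety net, consider the sketch below. `createSpendCounter` is a hypothetical stand-in for the real counter, and each call to it plays the role of one function instance with its own memory.

```typescript
// Hypothetical stand-in for the in-memory counter; not the OctoCMS implementation.
function createSpendCounter(totalBudgetUSD: number) {
  let spentUSD = 0; // lives only in this instance's memory; gone after a cold start

  return {
    record(costUSD: number): void {
      spentUSD += costUSD;
    },
    spent(): number {
      return spentUSD;
    },
    // A budget of 0 disables the cap.
    overBudget(): boolean {
      return totalBudgetUSD !== 0 && spentUSD > totalBudgetUSD;
    },
  };
}

// Two concurrent function instances, each with its own counter and a $5 budget.
const instanceA = createSpendCounter(5);
const instanceB = createSpendCounter(5);
instanceA.record(4);
instanceB.record(4);

// Neither instance sees itself as over budget, yet the deploy has spent $8 in total.
const combinedSpend = instanceA.spent() + instanceB.spent();
```

This is the gap a provider-side budget alert closes, since the provider sees the combined spend.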

Notes on PDFs

With the Anthropic provider, PDF uploads use Claude's native PDF support and are charged per page (~1,000–3,000 tokens per page). A 100-page PDF can fill the conversation budget on its own, so keep documents focused. OpenAI and local providers do not have native PDF support; PDFs are converted to text first.
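A quick way to sanity-check a document before uploading it is to multiply its page count by the per-page range above. The helper names below are hypothetical, and the 100,000-token default is the per-conversation input cap from the caps table.

```typescript
// Illustrative estimate based on the ~1,000 to 3,000 tokens-per-page range quoted above.
const TOKENS_PER_PDF_PAGE = { low: 1_000, high: 3_000 };

function estimatePdfTokens(pages: number): { low: number; high: number } {
  return { low: pages * TOKENS_PER_PDF_PAGE.low, high: pages * TOKENS_PER_PDF_PAGE.high };
}

// Classifies a PDF against the per-conversation input cap:
// "fits" if even the high estimate is below the cap,
// "exceeds" if even the low estimate reaches it,
// "may exceed" otherwise.
function checkPdfAgainstCap(
  pages: number,
  maxInputTokens: number = 100_000,
): "fits" | "may exceed" | "exceeds" {
  const { low, high } = estimatePdfTokens(pages);
  if (high < maxInputTokens) return "fits";
  if (low >= maxInputTokens) return "exceeds";
  return "may exceed";
}
```

Under these assumptions a 10-page PDF fits comfortably (10k to 30k tokens), a 50-page PDF may exceed the cap, and a 100-page PDF reaches it even at the low estimate.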

Notes on local providers

Small models (Llama 3.2 1B–3B, Qwen 2.5 1.5B, Phi-3-mini) are usable for short retrieval queries but unreliable at multi-turn structured tool calling. Expect more "fall back to text" responses and noticeably worse edit-proposal quality than Claude. Use local for development / privacy-sensitive deployments; pick a hosted provider for production editorial work.

Disabling the agent

Pick whichever fits:

Soft disable: remove or blank out the provider's API env var and restart / redeploy. /cms/chat shows the setup guide; /api/octocms/agent returns 404.
Hard disable: in addition, run npm uninstall @anthropic-ai/sdk openai @huggingface/transformers mammoth (plus pdfjs-dist, if you installed it) to drop the dependencies from your bundle entirely.

Vercel deployment notes

Works on the free (Hobby) plan with the default 60-second function duration. No vercel.json and no route-handler runtime / maxDuration exports — cacheComponents (enabled in next.config.ts) rejects route-segment runtime, and committing maxDuration: 300 in vercel.json triggers a plan-limit redirect on Hobby. Node is the default runtime, so no override is needed.
Need longer than 60 s? Pro lifts the ceiling to 300 s and Enterprise to 900 s. Raise it per-project in the Vercel dashboard (Project → Settings → Functions → Function Max Duration). The agent UI surfaces a clear message when a turn hits the timeout, so users on the free plan see the limit rather than a stuck page.
The first chat request after a cold start loads a small embedding model (~30 MB) into the function. Expect a one-time 3–10 second delay; subsequent queries finish in <100 ms.
The agent runs entirely in-process. No background workers, no separate service.
Local providers don't work on Vercel out of the box — baseURL would have to point at a reachable model server.

CI guard: `embeddings:check`

The committed cms/__generated__/embeddings.json is checked in CI by npx octocms embeddings:check, which is part of npm run checks:

```bash
npx octocms embeddings:check
# Equivalent to: octocms embeddings:gen && git diff --exit-code -- cms/__generated__/embeddings.json
```

The check is offline (uses the same local Xenova/bge-small-en-v1.5 model as indexing — no API key) and idempotent: embedAll skips re-embedding entries whose content hash is unchanged, so a clean run is a fast no-op. The first run downloads the ONNX model (~30 MB) into the @huggingface/transformers cache; subsequent runs reuse it.
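The idempotence described above comes from comparing content hashes before embedding. The sketch below shows the general pattern, not the actual `embedAll` code: the types, the `embedChanged` name, and the small FNV-1a hash (a stand-in for whatever content hash OctoCMS really uses) are all assumptions.

```typescript
// Hypothetical sketch of hash-based skipping; not the real embedAll.
type ContentEntry = { id: string; content: string };
type StoredEmbedding = { hash: string; vector: number[] };

// FNV-1a (32-bit) as a dependency-free stand-in for a real content hash.
function contentHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

// Re-embeds only entries whose hash differs from the previous run.
// Returns the new store plus the ids that were actually re-embedded.
function embedChanged(
  entries: ContentEntry[],
  previous: Map<string, StoredEmbedding>,
  embed: (text: string) => number[],
): { next: Map<string, StoredEmbedding>; reembedded: string[] } {
  const next = new Map<string, StoredEmbedding>();
  const reembedded: string[] = [];
  for (const entry of entries) {
    const hash = contentHash(entry.content);
    const prev = previous.get(entry.id);
    if (prev !== undefined && prev.hash === hash) {
      next.set(entry.id, prev); // unchanged content: reuse the stored vector
    } else {
      next.set(entry.id, { hash, vector: embed(entry.content) });
      reembedded.push(entry.id);
    }
  }
  return { next, reembedded };
}
```

Running it twice over unchanged entries re-embeds nothing on the second pass, which is why a clean check finishes quickly.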

If the check fails after a content edit, run npx octocms embeddings:gen locally and commit the regenerated file. The save server actions (saveFile / newFile / removeFile) update the embeddings file in the same commit as the content change — drift typically only happens after manual edits to cms/content/ or schema migrations that bypass the editor.

Troubleshooting

Symptom	Likely cause
`/cms/chat` shows the setup checklist	The configured provider's API env var is unset, or the optional packages aren't installed in this environment.
`/cms/chat` shows setup even with the key set	Server didn't pick up the new env var — restart `npm run dev` or redeploy. Confirm `agentConfig` is exported and `configInit` runs.
Error mentioning a missing package	An optional peer dep for the chosen provider isn't installed. Run the install command above.
Chat loads but every reply errors	Provider key invalid, expired, or workspace has no billing. Check the key and the provider's console.
`'local'` provider can't connect	`baseURL` is wrong, the local server isn't running, or it's not reachable from where Next.js is running.
First reply takes ~10 s	Cold-start embedding-model load. Subsequent replies are fast.
"Budget reached" mid-conversation	Conversation hit the input/output token cap. Start a new conversation, or raise the caps in `cms/octocms.config.ts`.
Chat switched back to the setup checklist after heavy use	Total spend on the deploy crossed `totalBudgetUSD`. Raise it (or set it to `0` to disable the cap), restart / redeploy to reset the in-memory counter, or set a real budget alert in the provider's console.
