RAG
When you configure a RAG source in cagent, your agent automatically gains a search tool for that knowledge base. The agent decides when to search, retrieves only relevant information, and uses it to answer questions or complete tasks - all without you manually managing what goes in the prompt.
This guide explains how cagent's RAG system works, when to use it, and how to configure it effectively for your content.
Note: RAG is an advanced feature that requires configuration and tuning. The defaults work well for getting started, but tailoring the configuration to your specific content and use case significantly improves results.
The problem: too much context
Your agent can work with your entire codebase, but it can't fit everything in its context window. Even with 200K token limits, medium-sized projects are too large. Finding relevant code buried in hundreds of files wastes context.
Filesystem tools help agents read files, but the agent has to guess which files to read. It can't search by meaning, only by filename. Ask "find the retry logic" and the agent reads files hoping to stumble on the right code.
Grep finds exact text matches but misses related concepts. Searching "authentication" won't find code using "auth" or "login." You either get hundreds of matches or zero, and grep doesn't understand code structure - it just matches strings anywhere they appear.
RAG indexes your content ahead of time and enables semantic search. The agent searches pre-indexed content by meaning, not exact words. It retrieves only relevant chunks that respect code structure. No wasted context on exploration.
How RAG works in cagent
Configure a RAG source in your cagent config:
```yaml
rag:
  codebase:
    docs: [./src, ./pkg]
    strategies:
      - type: chunked-embeddings
        embedding_model: openai/text-embedding-3-small
        vector_dimensions: 1536
        database: ./code.db

agents:
  root:
    model: openai/gpt-5
    instruction: You are a coding assistant. Search the codebase when needed.
    rag: [codebase]
```
When you reference rag: [codebase], cagent:
- At startup - Indexes your documents (first run only, blocks until complete)
- During conversation - Gives the agent a search tool
- When the agent searches - Retrieves relevant chunks and adds them to context
- On file changes - Automatically re-indexes modified files
The agent decides when to search based on the conversation. You don't manage what goes in context - the agent does.
The indexing process
On first run, cagent:
- Reads files from configured paths
- Respects .gitignore patterns (can be disabled)
- Splits documents into chunks
- Creates searchable representations using your chosen strategy
- Stores everything in a local database
Subsequent runs reuse the index. If files change, cagent detects this and re-indexes only what changed, keeping your knowledge base up to date without manual intervention.
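The details of cagent's indexer aren't documented here, but the shape of the pipeline is easy to picture. Below is a minimal conceptual sketch in Go (not cagent's actual code) that splits a document into chunks, fingerprints each chunk, and skips chunks already in the store - the same idea that makes re-indexing incremental:

```go
package main

import (
    "crypto/sha256"
    "fmt"
)

// chunkRecord is one stored chunk, keyed by a hash of its content so a
// re-run can tell which chunks are already indexed. (Conceptual sketch only,
// not cagent's actual data model.)
type chunkRecord struct {
    Path    string
    Content string
}

// indexDocument splits text into fixed-size chunks and stores only the ones
// the index hasn't seen before, returning how many new chunks were added.
func indexDocument(path, text string, size int, store map[[32]byte]chunkRecord) int {
    added := 0
    for start := 0; start < len(text); start += size {
        end := start + size
        if end > len(text) {
            end = len(text)
        }
        content := text[start:end]
        key := sha256.Sum256([]byte(path + content))
        if _, seen := store[key]; !seen {
            // In a real indexer this is where the chunk would be embedded
            // (or tokenized for BM25) and persisted to the database.
            store[key] = chunkRecord{Path: path, Content: content}
            added++
        }
    }
    return added
}

func main() {
    store := map[[32]byte]chunkRecord{}
    doc := "RAG indexes your content ahead of time and enables semantic search."
    fmt.Println(indexDocument("docs/rag.md", doc, 32, store)) // first run: all chunks added
    fmt.Println(indexDocument("docs/rag.md", doc, 32, store)) // second run: 0, nothing changed
}
```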
Retrieval strategies
Different content requires different retrieval approaches. cagent supports three strategies, each optimized for different use cases. The defaults work well, but understanding the trade-offs helps you choose the right approach.
Semantic search (chunked-embeddings)
Converts text to vectors that represent meaning, enabling search by concept rather than exact words:
```yaml
strategies:
  - type: chunked-embeddings
    embedding_model: openai/text-embedding-3-small
    vector_dimensions: 1536
    database: ./docs.db
    chunking:
      size: 1000
      overlap: 100
```
During indexing, documents are split into chunks and each chunk is converted to a 1536-dimensional vector by the embedding model. These vectors are essentially coordinates in a high-dimensional space where similar concepts are positioned close together.
When you search for "how do I authenticate users?", your query becomes a vector and the database finds chunks with nearby vectors using cosine similarity (measuring the angle between vectors). The embedding model learned that "authentication," "auth," and "login" are related concepts, so searching for one finds the others.
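To make "nearby vectors" concrete, here is a small illustrative Go sketch of cosine similarity. The vectors and their values are made up for readability; real embeddings have 1536 dimensions:

```go
package main

import (
    "fmt"
    "math"
)

// cosineSimilarity returns the cosine of the angle between two vectors:
// values near 1.0 mean the texts point in the same semantic direction,
// values near 0.0 mean they are unrelated.
func cosineSimilarity(a, b []float64) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
    // Hypothetical 3-dimensional embeddings, purely for illustration.
    query := []float64{0.9, 0.1, 0.3}      // "how do I authenticate users?"
    authChunk := []float64{0.8, 0.2, 0.25} // "Token-based auth validates requests"
    otherChunk := []float64{0.05, 0.9, 0.1} // an unrelated chunk
    fmt.Printf("auth chunk:  %.3f\n", cosineSimilarity(query, authChunk))
    fmt.Printf("other chunk: %.3f\n", cosineSimilarity(query, otherChunk))
}
```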
Example: The query "how do I authenticate users?" finds both "User authentication requires a valid API token" and "Token-based auth validates requests" despite different wording. It won't find "The authentication tests are failing" because that's a different meaning despite containing the word.
This works well for documentation, where users ask questions using different terminology than your docs. The downside is that it may miss exact technical terms - sometimes you want literal matches, not semantic ones - and it requires embedding API calls during indexing.
Keyword search (BM25)
Statistical algorithm that matches and ranks by term frequency and rarity:
```yaml
strategies:
  - type: bm25
    database: ./bm25.db
    k1: 1.5
    b: 0.75
    chunking:
      size: 1000
      overlap: 100
```
During indexing, documents are tokenized and the algorithm calculates how often each term appears (term frequency) and how rare it is across all documents (inverse document frequency). The scoring index is stored in a local SQLite database.
When you search for "HandleRequest function", the algorithm finds chunks containing these exact terms and scores them based on term frequency, term rarity, and document length. Finding "HandleRequest" is scored as more significant than finding common words like "function". Think of it as grep with statistical ranking.
Example: Searching "HandleRequest function" finds func HandleRequest(w http.ResponseWriter, r *http.Request) and "The HandleRequest function processes incoming requests", but not "process HTTP requests" despite that being semantically similar.
The k1 parameter (default 1.5) controls how much repeated terms matter - higher values emphasize repetition more. The b parameter (default 0.75) controls length normalization - higher values penalize longer documents more.
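For reference, the standard BM25 term score looks like the sketch below (illustrative only, not cagent's internal code); it shows exactly where k1 and b enter the ranking. The corpus statistics in main are hypothetical:

```go
package main

import (
    "fmt"
    "math"
)

// bm25Term scores one term against one chunk. tf is the term's count in the
// chunk, docLen/avgDocLen drive length normalization (scaled by b), k1 damps
// the effect of repeated terms, and idf rewards rare terms.
func bm25Term(tf, docLen, avgDocLen float64, nDocs, docsWithTerm int, k1, b float64) float64 {
    idf := math.Log(1 + (float64(nDocs)-float64(docsWithTerm)+0.5)/(float64(docsWithTerm)+0.5))
    norm := k1 * (1 - b + b*docLen/avgDocLen)
    return idf * (tf * (k1 + 1)) / (tf + norm)
}

func main() {
    // Hypothetical corpus: "HandleRequest" appears in 3 of 1000 chunks (rare),
    // "function" in 700 (common). Rarity dominates the final score.
    fmt.Printf("HandleRequest: %.3f\n", bm25Term(2, 180, 200, 1000, 3, 1.5, 0.75))
    fmt.Printf("function:      %.3f\n", bm25Term(2, 180, 200, 1000, 700, 1.5, 0.75))
}
```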
This is fast, local (no API costs), and predictable for finding function names, class names, API endpoints, and any identifier that appears verbatim. The trade-off is zero understanding of meaning - "RetryHandler" and "retry logic" won't match despite being related. It's an essential complement to semantic search.
LLM-enhanced semantic search (semantic-embeddings)
Generates semantic summaries with an LLM before embedding, enabling search by what code does rather than what it's called:
```yaml
strategies:
  - type: semantic-embeddings
    embedding_model: openai/text-embedding-3-small
    chat_model: openai/gpt-5-mini
    vector_dimensions: 1536
    database: ./code.db
    ast_context: true
    chunking:
      size: 1000
      code_aware: true
```
During indexing, code is split using AST structure (functions stay intact), then the chat_model generates a semantic summary of each chunk. The summary gets embedded, not the raw code. When you search, your query matches against these summaries, but the original code is returned.
This solves a problem with regular embeddings: raw code embeddings are dominated by variable names and implementation details. A function called processData that implements retry logic won't semantically match "retry". But when the LLM summarizes it first, the summary explicitly mentions "retry logic," making it findable.
Example: Consider this code:
```go
func (c *Client) Do(req *Request) (*Response, error) {
    for i := 0; i < 3; i++ {
        resp, err := c.attempt(req)
        if err == nil {
            return resp, nil
        }
        time.Sleep(time.Duration(1<<i) * time.Second)
    }
    return nil, errors.New("max retries exceeded")
}
```
The LLM summary is: "Implements exponential backoff retry logic for HTTP requests, attempting up to 3 times with delays of 1s, 2s, 4s before failing."
Searching "retry logic exponential backoff" now finds this code, despite the code never using those words. The ast_context: true option includes AST metadata in prompts for better understanding. The code_aware: true chunking prevents splitting functions mid-implementation.
This approach excels at finding code by behavior in large codebases with inconsistent naming. The trade-off is significantly slower indexing (LLM call per chunk) and higher API costs (both chat and embedding models). Often overkill for well-documented code or simple projects.
Combining strategies with hybrid retrieval
Each strategy has strengths and weaknesses. Combining them captures both semantic understanding and exact term matching:
```yaml
rag:
  knowledge:
    docs: [./documentation, ./src]
    strategies:
      - type: chunked-embeddings
        embedding_model: openai/text-embedding-3-small
        vector_dimensions: 1536
        database: ./vector.db
        limit: 20
      - type: bm25
        database: ./bm25.db
        limit: 15
    results:
      fusion:
        strategy: rrf
        k: 60
      deduplicate: true
      limit: 5
```
How fusion works
Both strategies run in parallel, each returning its top candidates (20 and 15 in this example). Fusion combines results using rank-based scoring, removes duplicates, and returns the top 5 final results. Your agent gets results that work for both semantic queries ("how do I...") and exact term searches ("find configure_auth function").
Fusion strategies
RRF (Reciprocal Rank Fusion) is recommended. It combines results based on rank rather than absolute scores, which works reliably when strategies use different scoring scales. No tuning required.
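As an illustration of rank-based scoring, here is a minimal sketch of the standard RRF formula (not cagent's implementation): each document earns 1/(k + rank) from every strategy that returns it, so items ranked highly by multiple strategies float to the top regardless of each strategy's raw scoring scale. The document names are hypothetical:

```go
package main

import (
    "fmt"
    "sort"
)

// reciprocalRankFusion merges ranked result lists from several strategies.
// A document's fused score is the sum of 1/(k + rank) over every list it
// appears in, with ranks starting at 1.
func reciprocalRankFusion(rankings [][]string, k float64) []string {
    scores := map[string]float64{}
    for _, ranking := range rankings {
        for rank, doc := range ranking {
            scores[doc] += 1.0 / (k + float64(rank+1))
        }
    }
    docs := make([]string, 0, len(scores))
    for doc := range scores {
        docs = append(docs, doc)
    }
    sort.Slice(docs, func(i, j int) bool { return scores[docs[i]] > scores[docs[j]] })
    return docs
}

func main() {
    semantic := []string{"auth.md", "tokens.md", "sessions.md"} // top chunked-embeddings hits
    keyword := []string{"tokens.md", "login.go", "auth.md"}     // top bm25 hits
    fmt.Println(reciprocalRankFusion([][]string{semantic, keyword}, 60))
}
```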
For weighted fusion, you give more importance to one strategy:
```yaml
fusion:
  strategy: weighted
  weights:
    chunked-embeddings: 0.7
    bm25: 0.3
```
This requires tuning for your content. Use it when you know one approach works better for your use case.
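Conceptually, weighted fusion is just a weighted sum of per-strategy scores, which is why it only behaves well once the weights are tuned for your content. A minimal sketch (illustrative only, with made-up scores assumed to be on comparable scales):

```go
package main

import "fmt"

// weightedFusion combines per-strategy scores for one document using fixed
// weights. The scores are assumed to already be on a comparable 0-1 scale;
// that assumption is exactly what makes this mode need tuning, unlike RRF.
func weightedFusion(scores, weights map[string]float64) float64 {
    total := 0.0
    for strategy, score := range scores {
        total += weights[strategy] * score
    }
    return total
}

func main() {
    weights := map[string]float64{"chunked-embeddings": 0.7, "bm25": 0.3}
    doc := map[string]float64{"chunked-embeddings": 0.82, "bm25": 0.40}
    fmt.Printf("fused score: %.2f\n", weightedFusion(doc, weights)) // 0.7*0.82 + 0.3*0.40 = 0.69
}
```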
Max score fusion takes the highest score across strategies:
```yaml
fusion:
  strategy: max
```
This only works if strategies use comparable scoring scales. Simple but less sophisticated than RRF.
Improving retrieval quality
Reranking results
Initial retrieval optimizes for speed. Reranking rescores results with a more sophisticated model for better relevance:
```yaml
results:
  reranking:
    model: openai/gpt-5-mini
    threshold: 0.3
    criteria: |
      When scoring relevance, prioritize:
      - Official documentation over community content
      - Recent information over outdated material
      - Practical examples over theoretical explanations
      - Code implementations over design discussions
  limit: 5
```
The criteria field is powerful - use it to encode domain knowledge about what makes results relevant for your specific use case. The more specific your criteria, the better the reranking.
Trade-off: Significantly better results but adds latency and API costs (LLM call for scoring each result).
Chunking configuration
How you split documents dramatically affects retrieval quality. Tailor chunking to your content type. Chunk size is measured in characters (Unicode code points), not tokens.
For documentation and prose, use moderate chunks with overlap:
```yaml
chunking:
  size: 1000
  overlap: 100
  respect_word_boundaries: true
```
Overlap preserves context at chunk boundaries. Respecting word boundaries prevents cutting words in half.
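To see what overlap buys you at a boundary, here is a tiny illustrative chunker (operating on bytes for brevity, whereas cagent counts Unicode code points): each window re-includes the tail of the previous one, so text near a chunk boundary appears in both neighboring chunks instead of being cut off in either.

```go
package main

import "fmt"

// chunkText splits text into windows of `size` characters where each window
// starts `overlap` characters before the previous one ended. (Illustrative
// sketch; cagent additionally respects word boundaries when
// respect_word_boundaries is enabled.)
func chunkText(text string, size, overlap int) []string {
    var chunks []string
    step := size - overlap
    for start := 0; start < len(text); start += step {
        end := start + size
        if end > len(text) {
            end = len(text)
        }
        chunks = append(chunks, text[start:end])
        if end == len(text) {
            break
        }
    }
    return chunks
}

func main() {
    text := "Overlap preserves context at chunk boundaries so retrieval sees complete sentences."
    for i, c := range chunkText(text, 40, 10) {
        fmt.Printf("chunk %d: %q\n", i, c)
    }
}
```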
For code, use larger chunks with AST-based splitting:
```yaml
chunking:
  size: 2000
  code_aware: true
```
This keeps functions intact. The code_aware setting uses tree-sitter to respect code structure.
Note: Currently only Go is supported; support for additional languages is planned.
For short, focused content like API references:
```yaml
chunking:
  size: 500
  overlap: 50
```
Brief sections need less overlap since they're naturally self-contained.
Experiment with these values. If retrieval misses context, increase chunk size or overlap. If results are too broad, decrease chunk size.
Making decisions about RAG
When to use RAG
Use RAG when:
- Your content is too large for the context window
- You want targeted retrieval, not everything at once
- Content changes and needs to stay current
- Agent needs to search across many files
Don't use RAG when:
- Content is small enough to include in agent instructions
- Information rarely changes (consider prompt engineering instead)
- You need real-time data (RAG uses pre-indexed snapshots)
- Content is already in a searchable format the agent can query directly
Choosing retrieval strategies
Use semantic search (chunked-embeddings) for user-facing documentation, content with varied terminology, and conceptual searches where users phrase questions differently than your docs.
Use keyword search (BM25) for code identifiers, function names, API endpoints, error messages, and any content where exact term matching matters. Essential for technical jargon and proper nouns.
Use LLM-enhanced semantic (semantic-embeddings) for code search by functionality, finding implementations by behavior rather than name, or complex technical content requiring deep understanding. Choose this when accuracy matters more than indexing speed.
Use hybrid (multiple strategies) for general-purpose search across mixed content, when you're unsure which approach works best, or for production systems where quality matters most. Maximum coverage at the cost of complexity.
Tuning for your project
Start with defaults, then adjust based on results.
If retrieval misses relevant content:
- Increase limit in strategies to retrieve more candidates
- Adjust threshold to be less strict
- Increase chunk size to capture more context
- Add more retrieval strategies
If retrieval returns irrelevant content:
- Decrease limit to retrieve fewer candidates
- Increase threshold to be more strict
- Add reranking with specific criteria
- Decrease chunk size for more focused results
If indexing is too slow:
- Increase batch_size for fewer API calls
- Increase max_embedding_concurrency for more parallelism
- Consider BM25 instead of embeddings (local, no API)
- Use smaller embedding models
If results lack context:
- Increase chunk overlap
- Increase chunk size
- Use return_full_content: true to return entire documents
- Add neighboring chunks to results
Further reading
- Configuration reference - Complete RAG options and parameters
- RAG examples - Working configurations for different scenarios
- Tools reference - How RAG search tools work in agent workflows