What RAG solves and what it doesn't
RAG (Retrieval-Augmented Generation) connects a language model to an external body of knowledge — your documentation, your policies, your history. Instead of retraining the model with your data, you retrieve the relevant fragments for each question and inject them into the prompt as context. The model answers grounded in those fragments rather than in its internal memory.
This solves three problems at once: it reduces hallucinations (the model cites your sources rather than inventing), it keeps information current (change a document and the answer changes, with no retraining), and it gives traceability (you can show which fragment each claim came from). What RAG doesn't solve: if your data is a mess, RAG amplifies the mess. Retrieval quality is the ceiling on answer quality.
The architecture of a RAG pipeline
A RAG system has two phases: an indexing phase (offline, when you ingest documents) and a query phase (online, when the user asks). The components:
- Ingestion and chunking: documents are split into manageable fragments (chunks). It's the decision that most impacts final quality.
- Embeddings: each chunk is turned into a numeric vector that captures its semantic meaning via an embedding model.
- Vector database: the vectors are stored in an index that allows efficient similarity search.
- Retrieval: on each query, the question is searched against the index and the most relevant chunks are pulled; ideally combining semantic and keyword search.
- Reranking: a second model reorders the retrieved candidates by actual relevance to the question, raising precision.
- Grounded generation: the retrieved chunks are assembled into the prompt and the language model writes the answer grounded in them, with citations.
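To make the two phases concrete, here is a minimal sketch of how the components might be wired together. Every function name and parameter below is hypothetical: the chunker, embedder, search, reranker, and generator are passed in as plain callables so the skeleton stays independent of any specific library.

```python
def index_documents(documents, chunk_fn, embed_fn):
    """Indexing phase (offline): chunk each document and embed every chunk.

    `documents` is assumed to be a dict of {doc_id: text}.
    """
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_fn(text):
            index.append({"doc_id": doc_id, "text": chunk, "vector": embed_fn(chunk)})
    return index


def answer_query(question, index, embed_fn, search_fn, rerank_fn, generate_fn):
    """Query phase (online): retrieve candidates, rerank them, then generate."""
    candidates = search_fn(embed_fn(question), index)  # recall-oriented retrieval
    best = rerank_fn(question, candidates)              # precision-oriented reordering
    return generate_fn(question, best)                  # grounded generation
```

Keeping each stage behind its own callable also makes it easier to evaluate retrieval and generation separately.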
Ingestion and chunking: the decision that matters most
Chunking seems trivial and is where most systems break. If fragments are too large, they dilute the signal and waste context; too small, and they lose meaning and split ideas in half. The goal is for each chunk to be a self-contained unit of meaning.
Chunking with overlap and structure awareness
Splitting by a fixed number of characters cuts sentences in half. The strategy that works: respect the document's natural boundaries (paragraphs, sections, headings) and add overlap between chunks so context isn't lost at the edges.
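The chunking code in this section calls a splitter helper that the text never defines. One hypothetical way to write it is sketched here, assuming Markdown-style `#` headings and blank-line paragraph breaks (both are assumptions, not something the article specifies). A heading is kept attached to the paragraph that follows it, so the chunk keeps its section context.

```python
import re


def split_on_headings_and_paragraphs(text):
    """Split text into paragraph-level units on blank lines.

    Assumption: a heading is a single line starting with '#'.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    units, pending_heading = [], None
    for block in blocks:
        is_heading = block.startswith("#") and "\n" not in block
        if is_heading:
            # Hold the heading (stacking consecutive ones) for the next paragraph.
            pending_heading = block if pending_heading is None else pending_heading + "\n" + block
            continue
        if pending_heading is not None:
            block = pending_heading + "\n" + block
            pending_heading = None
        units.append(block)
    if pending_heading is not None:
        units.append(pending_heading)  # trailing heading with no body
    return units
```

Real documents (HTML, PDF, wiki exports) would need a format-specific parser in place of the regular expression.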
# Structure-aware chunking, with overlap
def chunk_document(text, target_size=800, overlap=150):
    # Split first on semantic boundaries (paragraphs), not characters.
    # split_on_headings_and_paragraphs is a helper assumed to return a list of strings.
    paragraphs = split_on_headings_and_paragraphs(text)
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) <= target_size:
            current += "\n\n" + para
        else:
            # Skip the empty buffer when the very first paragraph is oversized
            if current.strip():
                chunks.append(current.strip())
            # Carry the tail of the previous chunk as overlap
            # (guarded: current[-0:] would copy the whole chunk)
            tail = current[-overlap:] if overlap > 0 else ""
            current = tail + "\n\n" + para
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Attach metadata to each chunk: which document and section it came from.
# It's the basis of citations and of permission filtering.
def to_records(chunks, doc_id, source, section):
    return [
        {"text": c, "doc_id": doc_id, "source": source, "section": section}
        for c in chunks
    ]

Embeddings and the vector database
An embedding model turns each chunk into a vector of hundreds or thousands of dimensions that places the text in a semantic space: two texts with similar meaning land close together. To store and search them you don't need an exotic database — PostgreSQL with the pgvector extension is enough for the vast majority of cases, and avoids adding another piece of infrastructure.
-- PostgreSQL + pgvector: storage and similarity search
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    source    text NOT NULL,
    section   text,
    content   text NOT NULL,
    embedding vector(1536)  -- embedding model dimension
);

-- ANN index for fast approximate search at scale
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Retrieve the 8 chunks closest to the (already embedded) question
SELECT id, content, source, section
FROM chunks
ORDER BY embedding <=> $1  -- cosine distance against the query vector
LIMIT 8;

The distance operator (cosine here) measures how close each chunk is to the question in the semantic space. The HNSW index turns that search into something that scales to millions of vectors with millisecond latency. For very large volumes or specific requirements there are dedicated databases (Qdrant, Weaviate, Milvus), but starting with pgvector is the pragmatic call.
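To see what that ordering computes, here is a small pure-Python sketch of cosine distance (1 minus cosine similarity) and a brute-force nearest-neighbour search. The function names and the toy two-dimensional vectors are illustrative only. A real index avoids the full scan, and the sketch assumes no zero-length vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 1 orthogonal, 2 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, rows, limit=8):
    """Brute-force equivalent of ordering by distance and keeping `limit` rows."""
    return sorted(rows, key=lambda r: cosine_distance(query_vec, r["embedding"]))[:limit]


rows = [
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "a", "embedding": [0.9, 0.1]},
    {"id": "c", "embedding": [-1.0, 0.0]},
]
closest = nearest([1.0, 0.0], rows, limit=2)  # rows "a" then "b"
```

Because cosine distance ignores vector length, a chunk's embedding magnitude does not affect its rank, only its direction does.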
Retrieval: hybrid search and reranking
Semantic search alone has a blind spot: it fails on exact terms with no synonyms — product codes, proper names, internal acronyms. Keyword search (BM25, full-text) is the opposite: excellent with exact terms, blind to meaning. Combining both — hybrid search — retrieves better than either alone.
# Hybrid search: combine semantic and keyword retrieval.
# embed, vector_search, fulltext_search and load_chunk are assumed helpers.
RRF_K = 60  # smoothing constant commonly used with reciprocal rank fusion

def hybrid_retrieve(query, k=20):
    q_vector = embed(query)
    semantic = vector_search(q_vector, limit=k)  # pgvector, cosine distance
    keyword = fulltext_search(query, limit=k)    # e.g. BM25 or Postgres full-text search
    # Reciprocal rank fusion (RRF): a chunk ranking high in both
    # lists rises; independent of incompatible score scales.
    scores = {}
    for results in (semantic, keyword):
        for rank, item in enumerate(results, start=1):  # ranks are 1-based
            scores[item.id] = scores.get(item.id, 0) + 1 / (RRF_K + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [load_chunk(cid) for cid in ranked_ids[:k]]

Reranking: the step that raises precision most
Initial retrieval prioritizes recall: it brings many candidates (20-30) so nothing relevant is left out. But stuffing 30 chunks into the prompt is expensive and dilutes the signal. A reranking model (cross-encoder) takes the question and each candidate together and produces a relevance score far more precise than vector distance. You reorder and keep only the best 4-6 for the prompt. This step, cheap to add, is one of the biggest boosts to perceived system quality.
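A minimal sketch of the reorder-and-trim step follows. The scoring function is injected so that a real cross-encoder can be plugged in. The `token_overlap_score` function here is only a toy stand-in used to keep the example runnable, not a real relevance model, and all names are hypothetical.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Score each (question, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_fn(question, c["text"]), c) for c in candidates]
    # Sort on the score only; ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]


def token_overlap_score(question, passage):
    """Toy scorer: fraction of question tokens that also appear in the passage."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    {"id": 1, "text": "billing cycles are monthly"},
    {"id": 2, "text": "to reset my password open settings"},
    {"id": 3, "text": "password rules"},
]
best = rerank("how do I reset my password", candidates, token_overlap_score, top_n=2)
```

In production, `score_fn` would wrap the cross-encoder call; the surrounding logic stays the same.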
Assembling the prompt and generating with grounding
With the final chunks selected, you assemble the prompt: clear instructions, the retrieved context, and the question. The key instruction is to require the model to answer only based on the context and to cite its sources — this turns the model into a reader of your documents, not an oracle that improvises.
SYSTEM:
You are an assistant that answers only based on the provided CONTEXT.
Rules:
- If the answer isn't in the context, say exactly: "I don't have that information in the available documentation." Do not invent.
- Cite the source of each claim with [source: <name>].
- Be concise and direct.

CONTEXT:
{retrieved_chunks_with_their_source}

QUESTION:
{user_question}

For generation, use a capable, current language model (for example, Anthropic's Claude models) via its API, without retraining it: all the specific knowledge enters through the context on each query. The model provides language understanding and writing; your data provides the truth.
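As an illustration, the template can be filled in with a small helper like the one below. This is a sketch under assumptions: the chunk fields (`source`, `text`) and the returned dict shape are hypothetical, and the result would still need to be mapped onto whatever request format your model provider's API expects.

```python
SYSTEM_PROMPT = (
    "You are an assistant that answers only based on the provided CONTEXT.\n"
    "Rules:\n"
    '- If the answer isn\'t in the context, say exactly: "I don\'t have that '
    'information in the available documentation." Do not invent.\n'
    "- Cite the source of each claim with [source: <name>].\n"
    "- Be concise and direct."
)


def build_prompt(chunks, question):
    """Assemble the system and user text from retrieved chunks and the question."""
    # Label every chunk with its source so the model can cite it.
    context = "\n\n".join(f"[source: {c['source']}]\n{c['text']}" for c in chunks)
    user = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"
    return {"system": SYSTEM_PROMPT, "user": user}
```

Placing the source label directly above each chunk gives the model an unambiguous name to reuse in its citations.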
Evaluation: how to know your RAG works
A RAG system without evaluation is a demo that got lucky. Because the system has two stages, it's measured on two levels: how well it retrieves and how well it answers. Without separating them, you don't know whether an error comes from retrieving the wrong context or from writing poorly over the right context.
- Retrieval metrics: with a set of questions and their known correct chunks, you measure context recall (did it bring the chunk that contained the answer?) and context precision (what proportion of what was retrieved was relevant?). It's the first thing to measure — if retrieval fails, nothing downstream is saved.
- Faithfulness: does the answer hold up on the retrieved context alone, or did the model add things of its own? It's the direct metric against hallucination.
- Answer relevance: does it actually answer what was asked? A faithful answer that doesn't address the question is also a failure.
- LLM-as-judge evaluation: to scale, an LLM scores each answer against the context and the question following a rubric. Calibrate the judge against a hand-annotated set first; once its scores agree with the human labels, it can automate regression testing on every pipeline change.
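The retrieval metrics above can be computed with a few lines of code. The sketch below uses the simple set-based definitions given in the list (some evaluation frameworks use rank-weighted variants). The evaluation-set format and the `retrieve` callable, assumed to return chunk ids in ranked order, are hypothetical.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Share of the known-correct chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


def context_precision(retrieved_ids, relevant_ids):
    """Share of the retrieved chunks that are known to be relevant."""
    retrieved = list(dict.fromkeys(retrieved_ids))  # drop duplicates, keep order
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def evaluate_retrieval(eval_set, retrieve, k=8):
    """Average recall and precision at k over a labelled question set."""
    if not eval_set:
        return {"recall": 0.0, "precision": 0.0}
    recalls, precisions = [], []
    for example in eval_set:
        ids = retrieve(example["question"])[:k]
        recalls.append(context_recall(ids, example["relevant_ids"]))
        precisions.append(context_precision(ids, example["relevant_ids"]))
    n = len(eval_set)
    return {"recall": sum(recalls) / n, "precision": sum(precisions) / n}
```

Running this on every pipeline change shows whether a new chunk size or retrieval setting helped or hurt before any generation is involved.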
Common mistakes building RAG
- Skipping structure-aware chunking and splitting on a fixed character count: it cuts ideas in half and is the number-one cause of poor retrieval. Respect document structure.
- Semantic search only: fails on acronyms, codes, and exact names. Hybrid search with reranking should be your default.
- Not measuring retrieval separately: without retrieval metrics you optimize blind and blame the model for errors that are retrieval's.
- Ignoring data governance: without permission filtering on retrieval, your RAG can leak information to a user who shouldn't see it. Access metadata isn't optional.
- Trusting only the "do not invent" instruction: combine it with a relevance threshold that skips generation and returns the fallback answer when the retrieved context isn't relevant enough.
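The last two points can be sketched together: filter chunks by the user's permissions, then refuse to generate when nothing clears a relevance threshold. The field names (`allowed_groups`, `score`), the threshold value, and the `generate` callable are all assumptions for illustration. In a real system the permission filter would ideally run inside the retrieval query itself, so restricted chunks never leave the database.

```python
NO_ANSWER = "I don't have that information in the available documentation."


def filter_by_permission(chunks, user_groups):
    """Keep only chunks that at least one of the user's groups may read."""
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c["allowed_groups"])]


def select_context(scored_chunks, user_groups, min_score=0.3, top_n=5):
    """Apply permissions first, then the relevance threshold, then trim."""
    visible = filter_by_permission(scored_chunks, user_groups)
    strong = [c for c in visible if c["score"] >= min_score]
    strong.sort(key=lambda c: c["score"], reverse=True)
    return strong[:top_n]


def answer(question, scored_chunks, user_groups, generate):
    """Call the model only when there is permitted, sufficiently relevant context."""
    context = select_context(scored_chunks, user_groups)
    if not context:
        return NO_ANSWER  # skip generation entirely
    return generate(question, context)
```

The threshold needs tuning against your own evaluation set, since score scales differ between rerankers.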