IQS | How to Build a RAG System: Technical Guide

What RAG solves and what it doesn't

RAG (Retrieval-Augmented Generation) connects a language model to an external body of knowledge — your documentation, your policies, your history. Instead of retraining the model with your data, you retrieve the relevant fragments for each question and inject them into the prompt as context. The model answers grounded in those fragments rather than in its internal memory.

This solves three problems at once: it reduces hallucinations (the model cites your sources rather than inventing), it keeps information current (change a document and the answer changes, with no retraining), and it gives traceability (you can show which fragment each claim came from). What RAG doesn't solve: if your data is a mess, RAG amplifies the mess. Retrieval quality is the ceiling on answer quality.

The architecture of a RAG pipeline

A RAG system has two phases: an indexing phase (offline, when you ingest documents) and a query phase (online, when the user asks). The components:

Ingestion and chunking: documents are split into manageable fragments (chunks). It's the decision that most impacts final quality.
Embeddings: each chunk is turned into a numeric vector that captures its semantic meaning via an embedding model.
Vector database: the vectors are stored in an index that allows efficient similarity search.
Retrieval: on each query, the question is searched against the index and the most relevant chunks are pulled; ideally combining semantic and keyword search.
Reranking: a second model reorders the retrieved candidates by actual relevance to the question, raising precision.
Grounded generation: the retrieved chunks are assembled into the prompt and the language model writes the answer grounded in them, with citations.

The mental rule: the indexing phase defines what your system can find; the query phase defines how well it finds it. Most RAGs that fail do so at retrieval, not generation — the model wrote well, but you passed it the wrong context.

Ingestion and chunking: the decision that matters most

Chunking seems trivial and is where most systems break. If fragments are too large, they dilute the signal and waste context; too small, and they lose meaning and split ideas in half. The goal is for each chunk to be a self-contained unit of meaning.

Chunking with overlap and structure awareness

Splitting by a fixed number of characters cuts sentences in half. The strategy that works: respect the document's natural boundaries (paragraphs, sections, headings) and add overlap between chunks so context isn't lost at the edges.

python

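# Structure-aware chunking, with overlap
import re

def split_on_headings_and_paragraphs(text):
    # Minimal stand-in: blank lines separate paragraphs and headings.
    # Swap in a real parser for Markdown, HTML, or PDF sources.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_document(text, target_size=800, overlap=150):
    # Split first on semantic boundaries (paragraphs), not characters
    paragraphs = split_on_headings_and_paragraphs(text)
    chunks, current = [], ""
    for para in paragraphs:
        if not current:
            # First paragraph of a chunk; one longer than target_size
            # is kept whole rather than cut mid-sentence.
            current = para
        elif len(current) + len(para) + 2 <= target_size:
            current += "\n\n" + para
        else:
            chunks.append(current)
            # Carry the tail of the previous chunk as overlap
            # (guarded: current[-0:] would copy the whole chunk)
            tail = current[-overlap:] if overlap > 0 else ""
            current = (tail + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks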

# Attach metadata to each chunk: which document and section it came from.
# It's the basis of citations and of permission filtering.
def to_records(chunks, doc_id, source, section):
    return [
        {"text": c, "doc_id": doc_id, "source": source, "section": section}
        for c in chunks
    ]

Always store source metadata (document, section, date, permissions) alongside each chunk. You need it for three critical things: showing verifiable citations, filtering by what each user may see, and discarding outdated content. A chunk without provenance is an answer you can't defend.
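As a rough illustration of two of those uses, the sketch below discards outdated chunks and renders a citation from the stored provenance. The `updated_at` field and both helper names are assumptions made for this example; they are not part of the `to_records` shape shown above, and the ingestion step would have to supply the date.

```python
from datetime import date, timedelta

# Hypothetical record shape: the fields from to_records plus an assumed
# "updated_at" date supplied at ingestion time.
def drop_outdated(records, max_age_days, today=None):
    """Keep only records updated within the last max_age_days."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r["updated_at"] >= cutoff]

def format_citation(record):
    """Render a chunk's provenance in the [source: ...] citation style."""
    section = f", {record['section']}" if record.get("section") else ""
    return f"[source: {record['source']}{section}]"

records = [
    {"text": "Refunds take 5 days.", "doc_id": "d1",
     "source": "refund-policy.md", "section": "Timelines",
     "updated_at": date(2024, 5, 1)},
    {"text": "Refunds take 10 days.", "doc_id": "d0",
     "source": "refund-policy-old.md", "section": None,
     "updated_at": date(2021, 1, 15)},
]

# Only the recently updated policy survives a one-year freshness window.
fresh = drop_outdated(records, max_age_days=365, today=date(2024, 6, 1))
```

In a real system the freshness rule would more likely be a `WHERE` clause on the index query than a Python filter; the point is only that neither step is possible without the metadata.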

Embeddings and the vector database

An embedding model turns each chunk into a vector of hundreds or thousands of dimensions that places the text in a semantic space: two texts with similar meaning land close together. To store and search them you don't need an exotic database — PostgreSQL with the pgvector extension is enough for the vast majority of cases, and avoids adding another piece of infrastructure.

sql

-- PostgreSQL + pgvector: storage and similarity search
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  doc_id    text NOT NULL,
  source    text NOT NULL,
  section   text,
  content   text NOT NULL,
  embedding vector(1536)            -- must match your embedding model's output dimension
);

-- ANN index for fast approximate search at scale
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Retrieve the 8 chunks closest to the (already embedded) question
SELECT id, content, source, section
FROM chunks
ORDER BY embedding <=> $1        -- cosine distance against the query vector
LIMIT 8;

The distance operator (<=>, cosine distance here) measures how close each chunk is to the question in the semantic space: the smaller the value, the closer the meaning. The HNSW index makes that search approximate instead of exhaustive, which is what lets it scale to millions of vectors, typically at millisecond-level latency depending on hardware and index settings. For very large volumes or specific requirements there are dedicated databases (Qdrant, Weaviate, Milvus), but starting with pgvector is the pragmatic call.
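To make the distance concrete, here is a small pure-Python sketch of cosine distance (one minus cosine similarity) applied to toy two-dimensional vectors. The function and the sample vectors are illustrative only; real embeddings have hundreds or thousands of dimensions and the database computes the distance for you.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 2-D "embeddings"; the names only describe their geometry.
query = [1.0, 0.0]
chunks = {
    "same_direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

# Smallest distance first, mirroring ORDER BY embedding <=> $1.
ranked = sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))
```

Note that vector length does not matter: `same_direction` is twice as long as the query but still has distance zero, because only the angle between the vectors counts.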

Retrieval: hybrid search and reranking

Semantic search alone has a blind spot: it fails on exact terms with no synonyms — product codes, proper names, internal acronyms. Keyword search (BM25, full-text) is the opposite: excellent with exact terms, blind to meaning. Combining both — hybrid search — retrieves better than either alone.

python

# Hybrid search: combine semantic and keyword retrieval
# embed, vector_search, fulltext_search and load_chunk are assumed to be
# your own helpers; each result only needs an .id attribute.
RRF_K = 60  # smoothing constant commonly used for reciprocal rank fusion

def hybrid_retrieve(query, k=20):
    q_vector = embed(query)
    semantic = vector_search(q_vector, limit=k)   # pgvector, cosine distance
    keyword  = fulltext_search(query, limit=k)    # BM25 / Postgres tsvector

    # Reciprocal rank fusion (RRF): a chunk ranking high in both
    # lists rises; only ranks are used, so the two score scales
    # never need to be comparable.
    scores = {}
    for results in (semantic, keyword):
        for rank, item in enumerate(results, start=1):   # ranks are 1-based
            scores[item.id] = scores.get(item.id, 0.0) + 1 / (RRF_K + rank)

    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [load_chunk(cid) for cid in ranked_ids[:k]]

Reranking: the step that raises precision most

Initial retrieval prioritizes recall: it brings many candidates (20-30) so nothing relevant is left out. But stuffing 30 chunks into the prompt is expensive and dilutes the signal. A reranking model (cross-encoder) takes the question and each candidate together and produces a relevance score far more precise than vector distance. You reorder and keep only the best 4-6 for the prompt. This step, cheap to add, is one of the biggest boosts to perceived system quality.
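The rerank-and-truncate step can be sketched as follows. This is a minimal outline, not a specific library's API: `score_fn` stands in for whatever cross-encoder you use, and `toy_overlap_score` is a deliberately naive word-overlap scorer included only so the example runs without a model.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Score each (question, chunk) pair and keep the top_n chunks.

    score_fn(question, text) -> float is an assumed interface; in
    practice it would wrap a cross-encoder reranking model.
    """
    scored = [(score_fn(question, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def toy_overlap_score(question, text):
    """Stand-in scorer: fraction of question words present in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

candidates = [
    {"id": 1, "text": "Our office is open Monday to Friday"},
    {"id": 2, "text": "The refund window is 30 days from purchase"},
    {"id": 3, "text": "Refund requests go through the billing portal"},
]

best = rerank("what is the refund window", candidates,
              toy_overlap_score, top_n=2)
best_ids = [c["id"] for c in best]
```

Keeping the scorer as a parameter means the retrieval code does not change when you swap the toy function for a real reranking model.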

Assembling the prompt and generating with grounding

With the final chunks selected, you assemble the prompt: clear instructions, the retrieved context, and the question. The key instruction is to require the model to answer only based on the context and to cite its sources — this turns the model into a reader of your documents, not an oracle that improvises.

text

SYSTEM:
You are an assistant that answers only based on the provided CONTEXT.
Rules:
- If the answer isn't in the context, say exactly: "I don't have that
  information in the available documentation." Do not invent.
- Cite the source of each claim with [source: <name>].
- Be concise and direct.

CONTEXT:
{retrieved_chunks_with_their_source}

QUESTION:
{user_question}

For generation, use a capable, current language model — for example, Anthropic's Claude models — via its API, without retraining it: all the specific knowledge enters through the context on each query. The model provides language understanding and writing; your data provides the truth.

The don't-invent instruction is necessary but not a guarantee on its own. Reinforce it in the design: if retrieval found nothing above a relevance threshold, don't call the model — answer directly that there's no information. A RAG that knows how to say 'I don't know' is more trustworthy than one that always tries to answer.
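A minimal sketch of that gate, assuming each retrieved chunk arrives with a relevance score (for example from the reranker). The `generate` callable, the `min_score` default, and the helper names are placeholders for this example; the right threshold depends on your scorer and has to be tuned against your own data.

```python
NO_ANSWER = "I don't have that information in the available documentation."

def build_prompt(question, chunks):
    """Assemble the CONTEXT and QUESTION sections, tagging each chunk
    with its source so the model can cite it."""
    context = "\n\n".join(
        f"[source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"

def answer(question, scored_chunks, generate, min_score=0.5):
    """scored_chunks: list of (relevance_score, chunk) pairs.
    generate: assumed callable that sends a prompt to the language
    model and returns its text."""
    relevant = [c for score, c in scored_chunks if score >= min_score]
    if not relevant:
        # Nothing close enough was retrieved: skip the model call.
        return NO_ANSWER
    return generate(build_prompt(question, relevant))
```

The system rules from the template above would travel separately as the system message; this function only builds the per-query part.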

Evaluation: how to know your RAG works

A RAG without evaluation is a demo that got lucky. Because the system has two stages, it's measured on two levels: how well it retrieves and how well it answers. Without separating them, you don't know whether an error comes from bringing the wrong context or from writing poorly over the right context.

Retrieval metrics: with a set of questions and their known correct chunks, you measure context recall (did it bring the chunk that contained the answer?) and context precision (what proportion of what was retrieved was relevant?). It's the first thing to measure — if retrieval fails, nothing downstream is saved.
Faithfulness: does the answer hold up on the retrieved context alone, or did the model add things of its own? It's the direct metric against hallucination.
Answer relevance: does it actually answer what was asked? A faithful answer that doesn't address the question is also a failure.
LLM-as-judge evaluation: to scale, an LLM scores each answer against the context and question following a rubric. Calibrate the judge against a hand-annotated set first; once its scores agree with the human labels, it can run as an automated regression check on every pipeline change.
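The two retrieval metrics can be computed with a few lines once you have a labeled question set. The sketch below follows the definitions given above (some evaluation frameworks use a rank-aware variant of context precision instead); the `eval_set` shape and the `retrieve` callable are assumptions for the example.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Share of the known-relevant chunks that retrieval brought back."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

def context_precision(retrieved_ids, relevant_ids):
    """Share of the retrieved chunks that were actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)

def evaluate_retrieval(eval_set, retrieve):
    """eval_set: list of {"question": str, "relevant_ids": [...]}.
    retrieve: assumed callable mapping a question to a list of chunk ids."""
    recalls, precisions = [], []
    for case in eval_set:
        retrieved = retrieve(case["question"])
        recalls.append(context_recall(retrieved, case["relevant_ids"]))
        precisions.append(context_precision(retrieved, case["relevant_ids"]))
    n = len(eval_set) or 1
    return {
        "context_recall": sum(recalls) / n,
        "context_precision": sum(precisions) / n,
    }

# Toy run with a canned retriever standing in for the real pipeline.
canned = {"q1": [1, 3, 4, 2], "q2": [6, 7]}
eval_set = [
    {"question": "q1", "relevant_ids": [1, 2]},
    {"question": "q2", "relevant_ids": [5]},
]
report = evaluate_retrieval(eval_set, lambda q: canned[q])
```

Running this on every pipeline change gives you the retrieval half of the picture before any answer is generated.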

Common mistakes building RAG

Skipping structure-aware chunking and splitting on a fixed character count: it cuts ideas in half and is the number-one cause of poor retrieval. Respect document structure.
Semantic search only: fails on acronyms, codes, and exact names. Hybrid search with reranking is the standard you should have by default.
Not measuring retrieval separately: without retrieval metrics you optimize blind and blame the model for errors that are retrieval's.
Ignoring data governance: without permission filtering on retrieval, your RAG can leak information to a user who shouldn't see it. Access metadata isn't optional.
Trusting only the don't-hallucinate instruction: combine it with a relevance threshold that cuts generation when there isn't enough context.

Access control is part of the retrieval architecture, not a later add-on. The user-permission filter must be applied on the query to the index, before a chunk ever reaches the prompt. A RAG that retrieves without respecting permissions is a data leak waiting to happen, and a compliance risk under Law 172-13, the Dominican Republic's personal data protection law.
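One way to apply the filter at query time is to put it in the same SQL statement as the similarity search. The sketch below only builds the statement and its parameters; it assumes a hypothetical `allowed_groups text[]` column on the `chunks` table (not part of the schema shown earlier) and psycopg-style `%s` placeholders. Executing it is left to your database driver.

```python
def build_permitted_search(query_vector, user_groups, limit=8):
    """Return (sql, params) for a similarity search restricted to chunks
    that at least one of the user's groups may read.

    Assumes a hypothetical allowed_groups text[] column on chunks;
    && is PostgreSQL's array-overlap operator.
    """
    sql = (
        "SELECT id, content, source, section "
        "FROM chunks "
        "WHERE allowed_groups && %s::text[] "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    # pgvector accepts a '[x,y,...]' text literal cast to vector.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    return sql, (list(user_groups), vector_literal, limit)

sql, params = build_permitted_search([0.1, 0.2], ["hr", "staff"], limit=5)
```

Because the restriction lives in the `WHERE` clause, a chunk the user may not see is never returned by the database, so it cannot reach the reranker or the prompt.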

Frequently Asked Questions

RAG or fine-tuning a model with my data?

For the vast majority of enterprise cases, RAG. Fine-tuning changes the model's style or behavior, but it's expensive, requires retraining every time your data changes, and gives no traceability. RAG keeps your data out of the model, updates instantly when you change a document, and lets you cite sources. Always start assuming RAG; consider fine-tuning only if you need very specific behavior or formatting that prompting can't achieve.

Which vector database should I use?

Start with PostgreSQL + pgvector if you already use Postgres — it covers most cases without adding new infrastructure, supports hybrid search alongside Postgres full-text, and scales to millions of vectors with the HNSW index. Migrate to a dedicated database (Qdrant, Weaviate, Milvus) only when you have specific scale, advanced filtering, or performance requirements pgvector doesn't cover. Don't start with the most complex option.

How do I keep the system from hallucinating?

With three combined layers: an explicit instruction to answer only from the context and admit when it doesn't know; a relevance threshold that cuts generation when retrieval brought nothing close enough; and verifiable citations that make visible which source each claim came from. No single layer is enough, but together they reduce hallucination to manageable levels. Good retrieval remains the best defense: given the right context, the model rarely invents.

How much context (how many chunks) should I pass the model?

Less than intuition suggests. You retrieve many candidates (20-30) so nothing relevant is lost, but after reranking you pass only the best 4-6 to the prompt. More context isn't better: it dilutes the signal, increases per-query cost, and can worsen the answer by burying the key fragment in noise. Reranking quality matters more than chunk count.

How long does it take to build a production RAG?

A working prototype over a bounded set of documents can be up and running in days. Getting from that prototype to a production system (automated ingestion, hybrid search, reranking, permission control, continuous evaluation, and edge-case handling) takes weeks, not months, if scoped well. The expensive mistake is taking the prototype to production without the evaluation layer: without metrics you don't know whether each change improves or degrades the system.

Want to put AI to work on your company's knowledge, with grounded and traceable answers? We design and implement production RAG systems — from ingestion to evaluation.
