Intermediate · 22 min read · Module 5: RAG Systems

RAG Systems Tutorial: Build Retrieval-Augmented Generation from Scratch

Retrieval-Augmented Generation (RAG) is the most practical technique for making LLMs work with your own data. This tutorial takes you from zero to a fully functional RAG pipeline, covering embeddings, vector stores, chunking strategies, and the retrieval-generation loop that powers modern AI applications.

Last updated: 2026-03-01

Understanding RAG: Why LLMs Need External Knowledge

Large language models are trained on massive text corpora, but their knowledge has a cutoff date and they cannot access your private documents, databases, or real-time information. When you ask GPT-4o about your company's internal policies or last quarter's sales data, it will either hallucinate an answer or honestly say it does not know. Retrieval-Augmented Generation solves this by injecting relevant context into the LLM's prompt at query time.

The RAG architecture has three core stages. First, during the ingestion phase, you process your documents by splitting them into chunks, converting each chunk into a numerical embedding vector, and storing those vectors in a specialized database. Second, during the retrieval phase, when a user asks a question you convert their query into an embedding and search the vector database for the most similar chunks. Third, during the generation phase, you pass those retrieved chunks as context to the LLM alongside the user's question, and the model generates an answer grounded in your actual data.

This pattern is enormously powerful because it separates knowledge storage from reasoning. The LLM handles language understanding and generation while the vector database handles knowledge retrieval. You can update your knowledge base without retraining the model, and you can swap models without rebuilding your retrieval pipeline. RAG has become the default architecture for enterprise AI applications, including customer support bots, internal knowledge assistants, legal research tools, and medical information systems.
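The three stages can be sketched in a few lines of plain Python. This is a toy illustration only: the bag-of-words "embedding" below stands in for a real neural embedding model (such as text-embedding-3-small), and the final prompt would be sent to an LLM rather than printed. The chunk texts and query are invented examples.

import math
from collections import Counter

def embed(text):
    # Toy stand-in for a neural embedding model: a bag-of-words count
    # vector. Real pipelines call a model like text-embedding-3-small.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Stage 1: ingestion -- embed each chunk and store it in an index
chunks = [
    "Customers may request a full refund within 30 days.",
    "Standard shipping takes 5 to 7 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2: retrieval -- embed the query, pick the most similar chunk
query = "May I get a full refund?"
q_vec = embed(query)
context = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# Stage 3: generation -- the retrieved chunk grounds the LLM prompt
prompt = f"Answer using this context.\nContext: {context}\nQuestion: {query}"

Because knowledge lives in the index rather than the model, updating the system means re-embedding changed chunks, not retraining anything.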

Document Processing and Chunking Strategies

The quality of your RAG system depends heavily on how you process and chunk your source documents. Garbage in, garbage out applies with full force here. Start by loading your documents using appropriate parsers: PyPDF for PDFs, python-docx for Word files, BeautifulSoup for HTML, and Unstructured for complex layouts with tables and images.

Chunking is where most beginners make their first mistake. Naive approaches split text at fixed character counts, say every 500 characters. This often cuts sentences in half and separates related information. Better strategies include recursive character splitting, which tries to split at paragraph boundaries first, then sentences, then words. Semantic chunking groups text by meaning using embedding similarity, keeping conceptually related content together even if it crosses paragraph boundaries.

Chunk size significantly affects performance. Smaller chunks (200-400 tokens) give more precise retrieval but may lack context. Larger chunks (800-1500 tokens) provide more context but may include irrelevant information that dilutes the signal. A common production pattern is to retrieve with smaller chunks but expand to the surrounding parent context before passing to the LLM, a technique called parent-child (small-to-big) retrieval; prepending contextual chunk headers, such as the section title, serves a related purpose.

Overlap between chunks is also important. Setting a 50-100 token overlap ensures that information at chunk boundaries is not lost. Additionally, include metadata with each chunk: the source document name, page number, section heading, and creation date. This metadata enables filtered retrieval, for example searching only within a specific document or date range, and helps the LLM cite its sources accurately.
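A minimal sliding-window chunker shows how overlap and per-chunk metadata fit together. This sketch splits on characters for simplicity; production splitters work on tokens and respect paragraph and sentence boundaries. The source filename is a made-up placeholder.

def chunk_text(text, chunk_size=200, overlap=50, source="handbook.pdf"):
    # Sliding-window splitter: consecutive chunks share `overlap`
    # characters so sentences that straddle a boundary survive intact.
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": {"source": source, "chunk_id": i, "start_char": start},
        })
    return chunks

document = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(document)
# The tail of each chunk repeats at the head of the next one:
tail, head = chunks[0]["text"][-50:], chunks[1]["text"][:50]

With a 500-character document, a 200-character window, and 50 characters of overlap, the window advances 150 characters at a time and produces four chunks; the metadata dictionary is what later enables filtered retrieval and source citation.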

Embeddings and Vector Databases: The Retrieval Engine

Embeddings are the mathematical backbone of RAG. An embedding model converts text into a dense numerical vector, typically 768 to 3072 dimensions, where semantically similar texts have vectors that are close together in the vector space. When a user asks 'What is the refund policy?' and your document contains 'Customers may request a full refund within 30 days,' the embedding vectors for these two texts will be highly similar even though they share few exact words.

Popular embedding models include OpenAI's text-embedding-3-small and text-embedding-3-large, Cohere's embed-v3, and open-source options like BGE-M3 and Nomic Embed. For most applications, OpenAI's small model offers an excellent balance of quality and cost. If you need to run embeddings locally without API calls, BGE-M3 via sentence-transformers is a strong choice.

Vector databases store these embeddings and enable fast similarity search. Chroma is the simplest option for prototyping, running entirely in-process with no external dependencies. Pinecone offers a managed cloud service with excellent scalability. Weaviate provides hybrid search combining vector similarity with keyword matching. Qdrant is a high-performance open-source option written in Rust. For production systems at scale, Pinecone or Qdrant are the most common choices.

The similarity search algorithm matters too. Most vector databases use approximate nearest neighbors (ANN) algorithms like HNSW, which trades a tiny amount of accuracy for dramatically faster search. For a million-document corpus, HNSW can return results in milliseconds compared to seconds for brute-force search. Configure the ef_construction and M parameters based on your accuracy-speed tradeoff requirements.
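To make the ANN tradeoff concrete, here is the brute-force baseline that HNSW replaces: score every stored vector against the query and sort. The three-dimensional vectors and chunk IDs are invented for illustration; real embeddings have hundreds to thousands of dimensions.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    # Exact nearest-neighbor search: scores every stored vector, O(n)
    # per query. ANN indexes like HNSW skip the full scan and return
    # near-identical results in milliseconds on million-vector corpora.
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

index = [
    ("refund-policy", [0.9, 0.1, 0.0]),
    ("shipping-info", [0.0, 0.2, 0.9]),
    ("returns-faq",   [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)  # ["refund-policy", "returns-faq"]

The linear scan is exact but scales with corpus size; HNSW's ef_construction and M parameters control how much recall you trade away for that speedup.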

Building the Retrieval-Generation Pipeline

With documents chunked and embedded, you can build the complete RAG pipeline. The retrieval step converts the user query into an embedding, searches the vector database, and returns the top-k most similar chunks. A typical value for k is 3 to 5 chunks, though this depends on chunk size and context window limits.

The retrieved chunks are formatted into a context string and inserted into a prompt template. A well-structured RAG prompt includes a system message instructing the LLM to answer only based on the provided context, the context chunks themselves, and the user's question. For example:

'You are a helpful assistant. Answer the user question based only on the following context. If the context does not contain enough information, say so. Context: {context} Question: {question}'

Using LangChain, this pipeline can be expressed as a chain: retriever | format_context | prompt_template | llm | output_parser. The retriever is your vector store's as_retriever() method, format_context joins the retrieved documents into a string, and the rest is standard LangChain Expression Language (LCEL).

Critical optimizations for this pipeline include query transformation, where you rephrase the user's question to improve retrieval, for example converting 'Why is my order late?' to 'order delivery delay causes.' Re-ranking is another powerful technique: retrieve a larger set of candidates, say 20, then use a cross-encoder model to re-score them and keep only the top 5 most relevant. Cohere Rerank and BGE-Reranker are popular choices. These two techniques alone can improve answer quality by 20-30% in production systems.
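The retrieve-then-rerank pattern can be sketched as a generic two-stage function. Both stages here are toy stand-ins: `search` would be your vector store query and `rerank_score` would be a cross-encoder such as Cohere Rerank or BGE-Reranker; the candidate pool, word-overlap scorer, and query are invented for illustration.

def retrieve_then_rerank(query, search, rerank_score, n_candidates=20, k=5):
    # Stage 1: fast vector search casts a wide net of candidates.
    candidates = search(query, n_candidates)
    # Stage 2: a slower, more accurate scorer re-scores each
    # (query, chunk) pair jointly and keeps only the best k.
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c),
                      reverse=True)
    return reranked[:k]

# Toy stand-ins: `search` returns a fixed candidate pool, and the
# "reranker" scores by word overlap with the query.
pool = ["order delivery delay causes", "holiday gift ideas", "late order refunds"]
search = lambda q, n: pool[:n]
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
best = retrieve_then_rerank("why is my order late", search, overlap, k=2)

The key design point is the asymmetry: the cheap first stage keeps recall high over the whole corpus, while the expensive second stage only ever scores a handful of candidates.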

Evaluating and Improving Your RAG System

Building a RAG system is step one. Making it work well requires systematic evaluation. The three metrics that matter most are faithfulness, relevance, and completeness. Faithfulness measures whether the generated answer is actually supported by the retrieved context, detecting hallucinations. Relevance measures whether the retrieved chunks actually relate to the question. Completeness measures whether the answer addresses all aspects of the question.

The RAGAS framework provides automated evaluation of these metrics. You create a test dataset of questions with ground-truth answers, run your RAG pipeline on them, and RAGAS scores each dimension. Aim for faithfulness above 0.85 and relevance above 0.80 as starting targets.

Common failure modes have predictable fixes. Low retrieval relevance usually means your chunking strategy is wrong or your embedding model is too weak; try semantic chunking or a stronger embedding model. Hallucinations despite good retrieval mean your prompt template is not constraining the LLM enough; add explicit instructions to refuse to answer when the context is insufficient. Incomplete answers often mean you are retrieving too few chunks; increase k or use parent-child retrieval.

For ongoing monitoring in production, log every retrieval and generation step. Track retrieval latency, embedding costs, LLM token usage, and user feedback. Tools like LangSmith, Langfuse, and Phoenix provide tracing dashboards specifically designed for RAG pipelines. Set up alerts for retrieval quality degradation, which can happen when your knowledge base grows and older embeddings become stale. Regular re-indexing and evaluation against a golden test set will keep your system performing well over time.
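A golden-test-set check for retrieval relevance can be as simple as a top-k hit rate: the fraction of test questions whose known-relevant chunk appears among the retrieved results. The keyword-based retriever, chunk IDs, and questions below are invented stand-ins; in practice `retrieve` would be your real pipeline's retrieval step, run alongside LLM-judged metrics like RAGAS faithfulness.

def retrieval_hit_rate(golden_set, retrieve, k=5):
    # Fraction of test questions whose known-relevant chunk ID appears
    # in the top-k retrieved results -- a cheap retrieval-relevance
    # check to run on every re-index.
    hits = sum(
        1 for question, expected_id in golden_set
        if expected_id in retrieve(question, k)
    )
    return hits / len(golden_set)

# Toy retriever keyed on a single word, for illustration only.
def retrieve(question, k):
    return ["refund-policy"] if "refund" in question.lower() else ["shipping-info"]

golden = [
    ("How do refunds work?", "refund-policy"),
    ("How long does shipping take?", "shipping-info"),
    ("Can I get a refund after 60 days?", "refund-policy"),
    ("Why is my order late?", "delivery-status"),
]
rate = retrieval_hit_rate(golden, retrieve, k=1)  # 3 of 4 questions hit

Tracking this number against the same golden set over time is what surfaces the quiet degradation that comes with a growing knowledge base.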

Code Example

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Chunk documents (loaded earlier, e.g. with a PDF loader) and create a vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Join the retrieved Document objects into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain with LCEL: retrieve, format, prompt, generate, parse
template = "Answer based only on the following context:\n{context}\nQuestion: {question}"
prompt = ChatPromptTemplate.from_template(template)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)
response = chain.invoke("What is the refund policy?")

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

RAG retrieves external knowledge at query time and injects it into the prompt, while fine-tuning modifies the model's weights with new training data. RAG is better for factual knowledge that changes often; fine-tuning is better for teaching the model a new style or behavior. Most production systems use RAG because it is cheaper and easier to update.

How many documents can a RAG system handle?

Modern vector databases can handle millions of document chunks efficiently. Pinecone and Qdrant scale to billions of vectors. The practical limit is usually cost: embedding and storing a million documents costs roughly $5-20 depending on your embedding model. Retrieval speed remains fast thanks to ANN algorithms.

Which vector database should I use for my first RAG project?

Start with Chroma for local development and prototyping since it requires no infrastructure setup. When you are ready for production, Pinecone offers the simplest managed experience while Qdrant gives you more control with self-hosting options. Choose based on your operational preferences.

How do I handle multi-language documents in RAG?

Use a multilingual embedding model like BGE-M3 or Cohere embed-v3 multilingual. These models map text in different languages to the same vector space, so a question in English can retrieve relevant chunks from Hindi or French documents. Test retrieval quality for each language pair you need to support.

Master This Topic in the GritPaw Masterclass

This tutorial covers the basics. The full Module 5: RAG Systems in our 16-week GenAI & Agentic AI Masterclass goes deeper with hands-on projects, AI-powered tutoring, and voice-based assessment.