Understanding RAG: Why LLMs Need External Knowledge
Large language models are trained on massive text corpora, but their knowledge has a cutoff date and they cannot access your private documents, databases, or real-time information. When you ask GPT-4o about your company's internal policies or last quarter's sales data, it will either hallucinate an answer or honestly say it does not know. Retrieval-Augmented Generation solves this by injecting relevant context into the LLM's prompt at query time.
The RAG architecture has three core stages. First, during the ingestion phase, you process your documents by splitting them into chunks, converting each chunk into a numerical embedding vector, and storing those vectors in a specialized database. Second, during the retrieval phase, when a user asks a question you convert their query into an embedding and search the vector database for the most similar chunks. Third, during the generation phase, you pass those retrieved chunks as context to the LLM alongside the user's question, and the model generates an answer grounded in your actual data.
This pattern is enormously powerful because it separates knowledge storage from reasoning. The LLM handles language understanding and generation while the vector database handles knowledge retrieval. You can update your knowledge base without retraining the model, and you can swap models without rebuilding your retrieval pipeline. RAG has become the default architecture for enterprise AI applications including customer support bots, internal knowledge assistants, legal research tools, and medical information systems.
Document Processing and Chunking Strategies
The quality of your RAG system depends heavily on how you process and chunk your source documents. Garbage in, garbage out applies with full force here. Start by loading your documents using appropriate parsers: PyPDF for PDFs, python-docx for Word files, BeautifulSoup for HTML, and Unstructured for complex layouts with tables and images.
Chunking is where most beginners make their first mistake. Naive approaches split text at fixed character counts, say every 500 characters. This often cuts sentences in half and separates related information. Better strategies include recursive character splitting, which tries to split at paragraph boundaries first, then sentences, then words. Semantic chunking groups text by meaning using embedding similarity, keeping conceptually related content together even if it crosses paragraph boundaries.
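The recursive idea can be sketched in a few lines of plain Python. This is a toy stand-in for LangChain's RecursiveCharacterTextSplitter: it counts characters rather than tokens and tries only three separator levels, but the fall-through logic is the same.

```python
def recursive_split(text, max_len=500, separators=("\n\n", ". ", " ")):
    """Split at the coarsest boundary that keeps pieces under max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                piece = part if not current else current + sep + part
                if len(piece) <= max_len:
                    current = piece  # keep packing parts into this chunk
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse on any chunk still too long for this separator level
            return [c for chunk in chunks
                      for c in recursive_split(chunk, max_len, separators)]
    # No separator found: fall back to a hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Note how paragraph boundaries are tried before sentence boundaries, so related sentences stay together whenever the budget allows.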
Chunk size significantly affects performance. Smaller chunks (200-400 tokens) give more precise retrieval but may lack context. Larger chunks (800-1500 tokens) provide more context but may include irrelevant information that dilutes the signal. A common production pattern is to use smaller chunks for retrieval but expand to surrounding context before passing to the LLM, a technique called parent-child (or small-to-big) retrieval.
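A minimal sketch of parent-child retrieval: small child chunks are searched, but the larger parent passage is what reaches the LLM. Word overlap stands in for embedding similarity here, and all names and texts are made up for illustration.

```python
# Parents hold full passages; children are the small searchable pieces.
parents = {
    "doc1-p1": "Refunds are available within 30 days. A receipt is required. "
               "Contact support to start the process.",
}
children = [
    {"text": "Refunds are available within 30 days.", "parent": "doc1-p1"},
    {"text": "A receipt is required.", "parent": "doc1-p1"},
]

def retrieve_with_parent(query_words, children, parents):
    # Stand-in scorer: word overlap instead of embedding similarity
    def score(chunk):
        return len(set(chunk["text"].lower().split()) & set(query_words))
    best = max(children, key=score)          # precise match on the small chunk
    return parents[best["parent"]]           # expand to the full parent context
```

The small chunk pins down *where* the answer lives; the parent supplies the surrounding context the LLM needs to answer well.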
Overlap between chunks is also important. Setting a 50-100 token overlap ensures that information at chunk boundaries is not lost. Additionally, include metadata with each chunk: the source document name, page number, section heading, and creation date. This metadata enables filtered retrieval, for example searching only within a specific document or date range, and helps the LLM cite its sources accurately.
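Metadata-filtered retrieval can be illustrated with plain dictionaries; real vector stores expose the same idea through filter arguments on their search calls. The field names and documents below are invented for the example.

```python
chunks = [
    {"text": "Refund window is 30 days.", "source": "policy.pdf",
     "page": 3, "section": "Returns"},
    {"text": "Shipping takes 5 days.", "source": "faq.pdf",
     "page": 1, "section": "Delivery"},
]

def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every criterion.
    In a real system this filter runs alongside the vector search."""
    return [c for c in chunks
            if all(c.get(key) == value for key, value in criteria.items())]
```

Because each chunk carries its source and page, the LLM can also be prompted to cite them, e.g. "(policy.pdf, p. 3)".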
Embeddings and Vector Databases: The Retrieval Engine
Embeddings are the mathematical backbone of RAG. An embedding model converts text into a dense numerical vector, typically 768 to 3072 dimensions, where semantically similar texts have vectors that are close together in the vector space. When a user asks 'What is the refund policy?' and your document contains 'Customers may request a full refund within 30 days,' the embedding vectors for these two texts will be highly similar even though they share few exact words.
Popular embedding models include OpenAI's text-embedding-3-small and text-embedding-3-large, Cohere's embed-v3, and open-source options like BGE-M3 and Nomic Embed. For most applications, OpenAI's small model offers an excellent balance of quality and cost. If you need to run embeddings locally without API calls, BGE-M3 via sentence-transformers is a strong choice.
Vector databases store these embeddings and enable fast similarity search. Chroma is the simplest option for prototyping, running entirely in-process with no external dependencies. Pinecone offers a managed cloud service with excellent scalability. Weaviate provides hybrid search combining vector similarity with keyword matching. Qdrant is a high-performance open-source option written in Rust. For production systems at scale, Pinecone or Qdrant are the most common choices.
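Under the hood, every option above implements some version of "store vectors, rank by similarity." A deliberately tiny in-memory sketch (brute-force search, no index, no persistence) shows the core contract that Chroma, Pinecone, Weaviate, and Qdrant all fulfill:

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store: linear scan, cosine ranking."""
    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.entries.append((vector, text))

    def search(self, query_vec, k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(y * y for y in b)))
        ranked = sorted(self.entries,
                        key=lambda entry: cos(query_vec, entry[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

Real databases replace the linear scan with an ANN index and add persistence, filtering, and sharding, but the interface stays this simple.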
The similarity search algorithm matters too. Most vector databases use approximate nearest neighbor (ANN) algorithms like HNSW, which trade a tiny amount of accuracy for dramatically faster search. For a million-document corpus, HNSW can return results in milliseconds compared to seconds for brute-force search. Configure the ef_construction and M parameters based on your accuracy-speed tradeoff requirements.
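As one concrete example, Chroma exposes HNSW parameters through collection metadata using its "hnsw:" key convention. The specific keys and values below are a sketch based on Chroma's documented settings; verify them against the version you have installed before relying on them.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="docs",
    metadata={
        "hnsw:space": "cosine",           # distance metric for similarity
        "hnsw:construction_ef": 200,      # higher = better index, slower build
        "hnsw:M": 32,                     # more graph links = better recall,
                                          # more memory per vector
    },
)
```

Raising construction_ef and M improves recall at the cost of build time and memory; the defaults are usually fine until your corpus grows large.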
Building the Retrieval-Generation Pipeline
With documents chunked and embedded, you can build the complete RAG pipeline. The retrieval step converts the user query into an embedding, searches the vector database, and returns the top-k most similar chunks. A typical value for k is 3 to 5 chunks, though this depends on chunk size and context window limits.
The retrieved chunks are formatted into a context string and inserted into a prompt template. A well-structured RAG prompt includes a system message instructing the LLM to answer only based on the provided context, the context chunks themselves, and the user's question. For example: 'You are a helpful assistant. Answer the user question based only on the following context. If the context does not contain enough information, say so. Context: {context} Question: {question}'
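Assembling that prompt is just string formatting. The sketch below joins retrieved chunks with blank lines and follows the template quoted above; the wording is illustrative, not a canonical prompt.

```python
def build_rag_prompt(chunks, question):
    """Format retrieved chunks and the user question into a grounded prompt."""
    context = "\n\n".join(chunks)
    return (
        "You are a helpful assistant. Answer the user question based only on "
        "the following context. If the context does not contain enough "
        "information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Separating chunks with blank lines (or numbered headers) helps the model treat them as distinct sources rather than one run-on passage.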
Using LangChain, this pipeline can be expressed as a chain: retriever | format_context | prompt_template | llm | output_parser. The retriever is your vector store's as_retriever() method, format_context joins the retrieved documents into a string, and the rest is standard LangChain Expression Language (LCEL).
Critical optimizations for this pipeline include query transformation, where you rephrase the user's question to improve retrieval, for example converting 'Why is my order late?' into 'order delivery delay causes'. Re-ranking is another powerful technique: retrieve a larger set of candidates, say 20, then use a cross-encoder model to re-score them and keep only the top 5 most relevant. Cohere Rerank and BGE-Reranker are popular choices. These two techniques alone can improve answer quality by 20-30% in production systems.
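The re-ranking stage reduces to "score wide, keep narrow." In this sketch, simple word overlap stands in for a real cross-encoder such as Cohere Rerank or BGE-Reranker, which would score each query-candidate pair with a neural model instead.

```python
def rerank(query, candidates, top_n=5):
    """Two-stage retrieval: re-score a wide candidate set, keep the best few.
    Word overlap is a stand-in for a cross-encoder's relevance score."""
    query_words = set(query.lower().split())

    def cross_score(text):
        return len(query_words & set(text.lower().split()))

    return sorted(candidates, key=cross_score, reverse=True)[:top_n]
```

The expensive scorer only ever sees the ~20 first-stage candidates, which is what makes cross-encoders affordable despite being far slower than embedding lookup.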
Evaluating and Improving Your RAG System
Building a RAG system is step one. Making it work well requires systematic evaluation. The three metrics that matter most are faithfulness, relevance, and completeness. Faithfulness measures whether the generated answer is actually supported by the retrieved context, detecting hallucinations. Relevance measures whether the retrieved chunks actually relate to the question. Completeness measures whether the answer addresses all aspects of the question.
The RAGAS framework provides automated evaluation of these metrics. You create a test dataset of questions with ground-truth answers, run your RAG pipeline on them, and RAGAS scores each dimension. Aim for faithfulness above 0.85 and relevance above 0.80 as starting targets.
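To build intuition for what a faithfulness score measures, here is a deliberately crude proxy: the fraction of substantive answer words that appear in the retrieved context. RAGAS does this far more carefully, using an LLM to extract and verify individual claims, so treat this only as a mental model.

```python
def faithfulness_proxy(answer, context):
    """Crude faithfulness proxy: share of answer words found in the context.
    A real evaluator (e.g. RAGAS) checks claims, not word overlap."""
    answer_words = [w for w in answer.lower().split() if len(w) > 3]
    if not answer_words:
        return 1.0
    context_words = set(context.lower().split())
    supported = sum(1 for w in answer_words if w in context_words)
    return supported / len(answer_words)
```

An answer that invents details the context never mentioned scores low, which is exactly the hallucination signal faithfulness is designed to catch.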
Common failure modes have characteristic fixes. Low retrieval relevance usually means your chunking strategy is wrong or your embedding model is too weak; try semantic chunking or a better embedding model. Hallucinations despite good retrieval mean your prompt template is not constraining the LLM enough; add explicit instructions to refuse when context is insufficient. Incomplete answers often mean you are retrieving too few chunks; increase k or use parent-child retrieval.
For ongoing monitoring in production, log every retrieval and generation step. Track retrieval latency, embedding costs, LLM token usage, and user feedback. Tools like LangSmith, Langfuse, and Phoenix provide tracing dashboards specifically designed for RAG pipelines. Set up alerts for retrieval quality degradation, which can happen when your knowledge base grows and older embeddings become stale. Regular re-indexing and evaluation against a golden test set will keep your system performing well over time.
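The simplest version of that logging is a wrapper that records latency for each pipeline stage; tools like LangSmith and Langfuse do this automatically with much richer traces, so this sketch only shows the underlying idea.

```python
import time

def traced(stage_name, fn, log):
    """Wrap a pipeline stage so every call appends a latency record to log."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        log.append({"stage": stage_name,
                    "seconds": time.perf_counter() - start})
        return result
    return wrapper
```

Wrapping the retriever and the LLM call separately lets you see at a glance whether a slow response came from search or from generation.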
Code Example
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Chunk documents (assumed already loaded) and create the vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

def format_context(docs):
    # Join retrieved Document objects into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain: retrieve -> format -> prompt -> generate -> parse
template = (
    "Answer based only on the following context. "
    "If the context is insufficient, say so.\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
prompt = ChatPromptTemplate.from_template(template)
chain = (
    {"context": retriever | format_context, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)
response = chain.invoke("What is the refund policy?")