LLM Fundamentals and Prompt Engineering Questions
Interviewers assess your understanding of LLM internals to gauge whether you can debug and optimize AI systems effectively. Here are key questions and what strong answers include.
'Explain the transformer attention mechanism and why it matters for LLM applications.' A strong answer explains that self-attention computes relevance scores between all token pairs, enabling the model to understand context. It matters for applications because attention patterns explain why models struggle with very long contexts (quadratic complexity in sequence length), why the position of information in the prompt affects quality (models tend to use the beginning and end of the context more reliably than the middle), and why some tasks require careful prompt structure.
'What is the difference between temperature 0 and temperature 1, and when would you use each?' Temperature 0 produces deterministic, focused outputs ideal for factual tasks, classification, and extraction where consistency matters. Temperature 0.7-1.0 introduces diversity, useful for creative tasks, brainstorming, and generating multiple alternative outputs. In production, use temperature 0 for tool-calling agents and structured extraction, and moderate temperature for content generation.
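A quick way to show you understand what temperature does mechanically is to sketch the logit scaling yourself. This is a pure-Python illustration, not any provider's implementation; real samplers operate on the model's raw logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax.

    Lower temperature sharpens the distribution toward the top logit;
    temperature 1.0 leaves the model's native distribution unchanged.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.1)   # near-deterministic
smooth = softmax_with_temperature(logits, 1.0)  # native spread
```

Note that temperature 0 is handled as a special case in practice: implementations take the argmax directly rather than dividing by zero.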
'How do you prevent prompt injection attacks?' Layer your defenses: validate and sanitize user input before it enters the prompt, use system messages to establish the model's role and boundaries, implement input classifiers that detect injection patterns, and validate outputs to ensure the model did not deviate from its instructions. No single technique is sufficient; defense in depth is necessary.
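The first of those layers can be as simple as a rule-based screen. A minimal sketch follows; the patterns are illustrative only, and production systems pair rules like these with a trained classifier:

```python
import re

# Illustrative patterns only -- a real filter combines rules with a
# trained injection classifier and output validation.
INJECTION_PATTERNS = [
    r"ignore (all |the )?(previous|above) instructions",
    r"you are now",
    r"reveal (your )?(system )?prompt",
]

def looks_like_injection(user_input: str) -> bool:
    """First defense layer: flag inputs matching known injection phrasings."""
    lowered = user_input.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

A flagged input would then be rejected or routed to stricter handling before it ever reaches the prompt.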
'Explain chain-of-thought prompting and when it helps.' Chain-of-thought (CoT) asks the model to reason step by step before giving a final answer. It significantly improves accuracy on math, logic, and multi-step reasoning tasks. Use structured CoT in production where the model outputs reasoning in a separate field that you can log and analyze without showing it to the user. CoT adds token cost but improves reliability on complex tasks.
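Separating the reasoning field from the user-facing answer can be sketched like this, assuming the model has been prompted to return JSON with `reasoning` and `answer` keys (the field names are an assumption, not a standard):

```python
import json

def split_reasoning(raw_model_output: str) -> tuple[str, str]:
    """Parse a structured CoT response of the form
    {"reasoning": "...", "answer": "..."}.

    The reasoning is kept for logging and analysis; only the answer
    is shown to the user.
    """
    parsed = json.loads(raw_model_output)
    return parsed["reasoning"], parsed["answer"]

# Simulated model output for illustration
raw = '{"reasoning": "3 boxes x 4 items = 12", "answer": "12"}'
reasoning, answer = split_reasoning(raw)
```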
'How do you evaluate prompt quality in production?' Use automated evaluation with LLM-as-judge against a golden test set, track production metrics like task completion rate and user satisfaction, A/B test prompt changes, and version-control prompts alongside code. Never change a production prompt without evaluation.
RAG Architecture and Implementation Questions
RAG questions are the most common category in AI engineering interviews because RAG is the most deployed AI pattern. Interviewers want to see depth beyond the basics.
'Walk me through how you would design a RAG system for a company with 100,000 documents.' Start with document processing: classify documents by type, implement appropriate parsers, choose chunk sizes based on document structure (smaller for FAQs, larger for technical manuals). Use semantic chunking with metadata (source, date, category). Embed with a model appropriate for your languages. Store in a scalable vector database (Pinecone or Qdrant). Implement hybrid search with BM25 for exact matching plus vector search for semantic matching. Add a re-ranking layer. Design the generation prompt with citation requirements. Set up evaluation and monitoring.
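For the hybrid-search step, a common way to merge the BM25 and vector rankings is reciprocal rank fusion (RRF). A self-contained sketch, with illustrative document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists (e.g. one from BM25, one from vector search).

    Each document scores sum(1 / (k + rank)) across rankings; k=60 is the
    value from the original RRF paper and damps any single ranker's pull.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]     # exact-match ranking
vector_hits = ["doc1", "doc9", "doc3"]   # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear high in both lists (here `doc1` and `doc3`) rise to the top of the fused ranking.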
'How do you handle hallucination in RAG systems?' Explain the multi-layered approach: strong retrieval (so the model has accurate context), explicit prompt instructions to answer only from context, citation enforcement, post-generation faithfulness checking with an LLM grader, and confidence scoring that routes uncertain answers to human review. Mention that CRAG and Self-RAG patterns automate this self-correction.
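The confidence-routing layer reduces to a threshold check. A minimal sketch, assuming a faithfulness score produced upstream by an LLM grader; the 0.8 threshold is illustrative:

```python
def route_answer(answer: str, faithfulness_score: float, threshold: float = 0.8):
    """Route answers below a faithfulness threshold to human review.

    The score would come from an LLM grader that checks the answer
    against the retrieved context; the threshold must be tuned.
    """
    if faithfulness_score >= threshold:
        return {"action": "respond", "answer": answer}
    return {"action": "escalate", "answer": answer, "reason": "low faithfulness"}
```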
'What is the difference between naive RAG, advanced RAG, and agentic RAG?' Naive RAG is the basic retrieve-generate pipeline. Advanced RAG adds query transformation, hybrid search, re-ranking, and post-processing. Agentic RAG adds decision-making: the system evaluates retrieval quality, reformulates queries, routes between sources, and self-corrects. Describe when each level is appropriate based on accuracy requirements and complexity.
'How do you evaluate RAG system quality?' Describe the three dimensions: retrieval quality (context precision and recall), generation quality (faithfulness and relevance), and end-to-end quality (answer correctness and completeness). Use RAGAS for automated evaluation, maintain a golden test set of 100+ examples, run evaluations in CI/CD, and track metrics in production. Mention that online evaluation using user feedback complements offline evaluation.
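The retrieval-quality metrics are easy to define concretely. A simplified sketch over chunk IDs (RAGAS computes these with LLM judgments rather than ID matching, so treat this as the intuition, not the library's method):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that retrieval actually found."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = {"c1", "c3", "c9"}          # golden-set labels
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant
```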
'How do you optimize RAG system latency?' Profile each stage, embed queries in advance if possible, use async and parallel retrieval, cache frequent queries semantically, pre-compute re-rankings for common query types, stream the generation response, and consider using faster models for the generation step when appropriate.
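Semantic caching is worth being able to whiteboard. A minimal in-memory sketch using cosine similarity over query embeddings; the linear scan and the 0.95 threshold are illustrative (production systems use a vector index and a tuned threshold):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Return a cached answer when a new query's embedding is close
    enough to a previously answered one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query_embedding: list[float]):
        for cached_embedding, answer in self.entries:
            if cosine(query_embedding, cached_embedding) >= self.threshold:
                return answer
        return None  # cache miss -- fall through to the full pipeline

    def put(self, query_embedding: list[float], answer: str) -> None:
        self.entries.append((query_embedding, answer))

# Tiny 2-d embeddings for illustration
cache = SemanticCache()
cache.put([1.0, 0.0], "42")
hit = cache.get([0.99, 0.01])  # near-duplicate query
```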
Agent Architecture and Framework Questions
Agent questions test your understanding of agentic design patterns and your ability to choose the right architecture for different problems.
'Explain the ReAct pattern and its limitations.' ReAct alternates between reasoning (the model thinks about what to do) and acting (the model calls a tool). The model observes the tool result and decides whether to act again or provide a final answer. Limitations include potential infinite loops, difficulty with tasks requiring long-horizon planning, sensitivity to tool descriptions, and the fact that reasoning and action happen in the same context window, which can grow large for complex tasks.
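The loop itself, including the step cap that guards against the infinite-loop failure mode, can be sketched without any framework. The `model_step` callable stands in for the LLM; its tuple protocol here is an assumption for illustration:

```python
def react_loop(model_step, tools: dict, max_steps: int = 5):
    """Minimal ReAct skeleton with a step cap to prevent infinite loops.

    `model_step(history)` stands in for an LLM call and must return
    either ("act", tool_name, tool_input) or ("finish", answer).
    """
    history = []
    for _ in range(max_steps):
        decision = model_step(history)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)  # act, then observe
        history.append((tool_name, tool_input, observation))
    return "Step limit reached -- escalating to a human."

# Stub model: look something up once, then answer from the observation.
def stub_model(history):
    if not history:
        return ("act", "search", "capital of France")
    return ("finish", history[-1][2])

result = react_loop(stub_model, {"search": lambda q: "Paris"})
```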
'When would you choose LangGraph over CrewAI?' LangGraph for workflows requiring precise control flow, custom state management, human-in-the-loop, persistent state across sessions, and deterministic routing. CrewAI for rapid prototyping, workflows that map naturally to team collaboration, and situations where the high-level role-playing abstraction accelerates development. Mention that they are complementary and can be used together.
'How do you design tools for an AI agent?' Keep tools focused (one responsibility per tool), write descriptive docstrings that explain when and how to use the tool, use Pydantic models for complex inputs, implement proper error handling that returns informative messages (not stack traces), validate inputs for security, and test tools independently before integrating with the agent. The quality of tool descriptions is often the biggest factor in agent performance.
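Those principles fit in a few lines. A sketch of a hypothetical tool (the function, the ID format, and the lookup table are all invented for illustration):

```python
def lookup_order_status(order_id: str) -> str:
    """Look up the shipping status for an order.

    Use this when the user asks where their order is. `order_id` must
    be an 8-digit string, e.g. "12345678".
    """
    if not (order_id.isdigit() and len(order_id) == 8):
        # Return an informative message the agent can recover from,
        # never a raw stack trace.
        return "Error: order_id must be exactly 8 digits, e.g. '12345678'."
    status = {"12345678": "shipped"}.get(order_id)  # stand-in for a DB call
    return status or f"Error: no order found with id {order_id}."
```

The docstring doubles as the tool description the model sees, which is why it states both when to use the tool and the exact input format.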
'Describe how you would implement human-in-the-loop for an AI agent.' Use LangGraph's interrupt_before mechanism to pause execution at designated nodes, persist the state to a database, present the pending action to a human through a UI, collect their approval or modification, and resume the graph from the checkpoint. Discuss the importance of timeout handling (what happens if the human does not respond) and the UX considerations of showing the agent's reasoning to the approver.
'How do you handle errors in multi-agent systems?' Implement error handling at three levels: tool-level (try-catch with informative error returns), agent-level (retry logic and fallback to alternative tools), and system-level (circuit breakers, graceful degradation, and escalation to human operators). Log errors with full context for debugging. Design the supervisor agent to detect when a worker is stuck and intervene.
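The system-level circuit breaker is the piece interviewers most often ask candidates to sketch. A minimal version (no half-open recovery state, which a real implementation would add):

```python
class CircuitBreaker:
    """Open after `max_failures` consecutive failures; while open,
    callers skip the failing dependency and go straight to a fallback
    instead of waiting on another doomed request."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, primary, fallback):
        if self.open:
            return fallback()
        try:
            result = primary()
            self.failures = 0  # a success resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback()

def always_fails():
    raise RuntimeError("worker stuck")

breaker = CircuitBreaker(max_failures=2)
results = [breaker.call(always_fails, lambda: "fallback") for _ in range(3)]
```

After two failures the breaker opens, so the third call never touches the broken dependency.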
System Design and Production Engineering Questions
System design questions are where senior candidates differentiate themselves. Interviewers want to see that you can design complete AI systems that work at scale.
'Design a customer support bot that handles 10,000 queries per day with 95% accuracy.' Walk through the architecture: API layer with FastAPI behind a load balancer, query classification that routes to specialized agents (billing, technical, general), RAG with domain-specific knowledge bases per agent type, human escalation for complex cases, response caching for common queries. Discuss evaluation strategy, monitoring, and how you would measure and maintain the 95% accuracy target. Address scaling: stateless agent instances behind a load balancer with shared state in PostgreSQL.
'How would you reduce the cost of an AI system spending $5,000/month on LLM APIs?' Profile spending by model, endpoint, and feature. Implement model tier routing (70% of queries to a cheaper model). Add semantic caching for repeated queries. Optimize prompts to reduce token usage. Batch similar requests. Evaluate whether open-source models can handle lower-complexity tasks. Set per-user and per-feature budget limits. Each technique typically saves 20-40%, and combining them can reduce costs by 60-80%.
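The tier router can be sketched as a heuristic. The signal words, length cutoff, and tier names below are all illustrative; production routers often replace this with a small classifier trained on real traffic:

```python
def route_model(query: str) -> str:
    """Route simple queries to a cheap model tier, complex ones up.

    Heuristic sketch: long queries or queries containing reasoning
    signals go to the expensive tier; everything else stays cheap.
    """
    complex_signals = ["analyze", "compare", "multi-step", "explain why"]
    is_long = len(query.split()) > 50
    if is_long or any(s in query.lower() for s in complex_signals):
        return "large-model"   # expensive tier for hard queries
    return "small-model"       # cheap tier for the bulk of easy queries
```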
'How do you handle data privacy in an AI application?' Data minimization: only send necessary context to the LLM, strip PII before including documents in prompts. Use data processing agreements with API providers. Implement audit logging for all data access. Consider self-hosted models for the most sensitive data. Discuss GDPR considerations for user conversations and the right to deletion.
'Explain your approach to CI/CD for AI applications.' Version control for code, prompts, and evaluation datasets. CI pipeline runs unit tests, integration tests with mocked LLMs, and evaluation against a golden test set. Gate deployments on evaluation metrics (reject if quality drops below thresholds). Canary deployments that route 5% of traffic to the new version and compare quality metrics. Automated rollback if quality degrades. Monthly re-evaluation of the full test set to catch gradual drift.
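The deployment gate itself is a small function. Metric names and thresholds here are illustrative; in CI this would run after evaluating the candidate version against the golden test set:

```python
def gate_deployment(metrics: dict, thresholds: dict) -> bool:
    """Allow a deploy only if every evaluation metric meets its
    threshold; a missing metric counts as a failure."""
    return all(
        metrics.get(name, 0.0) >= minimum
        for name, minimum in thresholds.items()
    )

thresholds = {"faithfulness": 0.90, "answer_correctness": 0.85}
ok = gate_deployment({"faithfulness": 0.93, "answer_correctness": 0.88}, thresholds)
blocked = gate_deployment({"faithfulness": 0.93, "answer_correctness": 0.80}, thresholds)
```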
'How do you handle model API outages in production?' Implement cascading fallbacks: primary model (GPT-4o) to backup model (Claude) to cached responses to graceful error messages. Use circuit breakers that switch to fallback after detecting repeated failures, avoiding the latency of failed requests. Pre-warm the fallback model connection so switching is instant. Alert the team when fallback is activated.
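Stripped of provider SDKs, the cascading fallback is an ordered list of callables. A sketch with stub providers standing in for the real model clients:

```python
def call_with_fallbacks(providers: list, query: str) -> str:
    """Try each provider in order; if all fail, return a graceful
    error message. In production, log and alert on every failover."""
    for provider in providers:
        try:
            return provider(query)
        except Exception:
            continue  # fall through to the next provider
    return "Sorry, we're having trouble right now. Please try again shortly."

def primary(q):  # stand-in for the primary model API during an outage
    raise TimeoutError("primary model unavailable")

answer = call_with_fallbacks([primary, lambda q: f"backup answered: {q}"], "hi")
```

Wrapping each provider in the circuit breaker described above keeps repeated timeouts from adding latency to every request.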
Behavioral and Portfolio Questions: Showcasing Your Experience
Beyond technical knowledge, interviewers assess your practical experience and problem-solving approach through behavioral questions about past projects.
'Tell me about an AI project you built from scratch.' Structure your answer using the STAR format: Situation (the problem and context), Task (your specific responsibility), Action (technical decisions and implementation), Result (outcomes and metrics). Emphasize architectural decisions: why you chose LangGraph over LangChain, why you used hybrid search instead of pure vector search, how you handled evaluation. Quantify results wherever possible: latency, accuracy, user satisfaction.
'Describe a challenging debugging situation with an AI system.' Interviewers want to see your systematic approach. Describe how you identified the problem (monitoring alert, user report), how you traced it (LangSmith trace, logs), what the root cause was (retrieval returning irrelevant documents due to a chunking issue), and how you fixed and prevented recurrence (better chunking, added evaluation tests). Show that you think in terms of systems, not just code.
'How do you stay current with the rapidly evolving AI engineering landscape?' Mention specific practices: following key researchers and practitioners on Twitter/LinkedIn, reading arXiv papers and engineering blogs (the LangChain team's blog, Anthropic's research), participating in community forums (LangChain Discord, AI engineering Slack groups), experimenting with new tools in side projects, and contributing to open source. Demonstrate that you are proactive about learning.
'How do you decide when to use AI versus traditional software for a feature?' AI is appropriate when the task requires understanding natural language, dealing with ambiguous inputs, or generating creative outputs. Traditional software is better for deterministic logic, simple lookups, and well-defined business rules. Discuss the evaluation framework: does the task require AI's flexibility, can traditional software handle it more reliably, and what is the cost-benefit tradeoff? Show that you do not default to AI for everything.
Prepare a 2-minute walkthrough of your best project that covers the problem, your architectural decisions, key implementation challenges, and measurable outcomes. Practice delivering it clearly and concisely. This walkthrough often sets the tone for the entire interview.
Code Example
# Common interview question: implement a simple RAG chain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    """Join retrieved Documents into a plain-text context block."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_rag_chain(documents: list[str]):
    """Build a production-ready RAG chain — common interview task."""
    vectorstore = Chroma.from_texts(documents, OpenAIEmbeddings())
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    prompt = ChatPromptTemplate.from_template(
        "Answer based only on this context:\n{context}\n\n"
        "Question: {question}\n"
        "If unsure, say 'I don't have enough information.'"
    )
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model="gpt-4o", temperature=0)
        | StrOutputParser()
    )
    return chain