Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with external contextual data sources so that responses are both fluent and grounded in up-to-date information. This lesson uses a simple travel-assistant example to show the end-to-end flow and architecture.
Example scenario: A user asks, “What are the top 10 places to visit in New York?” The request arrives at an AI application that orchestrates the process:
  1. The application queries a contextual data source — typically a vectorized knowledge base of travel guides, web-scraped pages, or documents — to retrieve the most relevant documents.
  2. Retrieved documents (or document excerpts) are combined with the user’s prompt to form an augmented prompt.
  3. The augmented prompt is sent to an LLM (for example, GPT-4 via the Azure OpenAI Service), which uses both its parametric knowledge and the retrieved, non-parametric context to generate a grounded answer.
  4. The application can surface citations or source links from the retrieved documents to improve traceability and factuality.
A diagram titled "Retrieval-Augmented Generation (RAG)" showing how an AI app interacts with a vectorized contextual data store and a language model, plus training data, to generate responses. A sample prompt/response on the right illustrates the system giving travel recommendations (top NYC attractions) and citing sources.
Key concepts shown in the diagram:
  • LLM (parametric knowledge): general knowledge learned during pretraining, useful for fluency, reasoning, and broad knowledge.
  • Vectorized contextual store (non-parametric knowledge): embeddings-backed index that retrieves up-to-date or domain-specific facts at query time.
  • Orchestration layer: handles embedding queries, retrieval, ranking, prompt assembly (prompt + retrieved context), and invoking the LLM.
  • Grounding and citations: the final LLM output can include explicit citations from the retrieved documents, increasing trustworthiness.
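The prompt-assembly step in the orchestration layer can be sketched as a simple template. This is a minimal illustration: the `assemble_prompt` name and the `[n]`-style citation markers are assumptions, not a fixed API.

```python
def assemble_prompt(user_query, documents):
    # Number the retrieved excerpts so the model can cite them as [1], [2], ...
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

prompt = assemble_prompt(
    "What are the top 10 places to visit in New York?",
    ["Central Park guide excerpt...", "Times Square guide excerpt..."],
)
```

Numbering the excerpts gives the model a stable handle for citations, which the application can later map back to source links.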
RAG separates a model’s static, pretrained knowledge from dynamic external knowledge stored in a vector database. This modularity lets you update or extend the system’s knowledge by re-indexing or refreshing external documents without retraining the model.
Why use RAG?
  • Keeps answers current with external sources.
  • Improves factual accuracy by grounding model outputs.
  • Enables domain specialization with curated corpora (legal, medical, product manuals).
  • Allows scaling: smaller models plus targeted retrieval can match or beat larger models on certain tasks.
Comparison: parametric vs non-parametric knowledge
Resource type                 | Role in RAG                                                                  | Example
Parametric (LLM)              | Stores general language patterns and world knowledge learned during training | GPT-4 provides fluent summarization and reasoning
Non-parametric (vector store) | Stores and returns up-to-date, domain-specific documents at query time       | Travel guides, product docs, knowledge base articles
Typical RAG orchestration (high-level pseudocode)
# Pseudocode for RAG-style request handling
def handle_query(user_query):
    # 1) Create embedding for query
    query_vector = embed(user_query)

    # 2) Retrieve top-k relevant docs from vector store
    docs = vector_store.search(query_vector, top_k=5)

    # 3) Rank or filter retrieved docs (optional)
    ranked_docs = rank_documents(docs, user_query)

    # 4) Build augmented prompt (user query + doc excerpts)
    augmented_prompt = assemble_prompt(user_query, ranked_docs)

    # 5) Call LLM with augmented prompt
    response = llm.generate(augmented_prompt)

    # 6) Return response + citations
    return format_with_citations(response, ranked_docs)
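The pseudocode above can be made concrete with a minimal, runnable sketch. Everything here is an illustrative stand-in: the bag-of-words embedding, the `ToyVectorStore` class, and the prompt string replace a real embedding model, vector database, and LLM call.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    # Toy embedding: term-frequency vector over a fixed vocabulary.
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    def __init__(self, docs):
        self.docs = docs
        self.vocab = sorted({w for d in docs for w in tokenize(d)})
        self.vectors = [embed(d, self.vocab) for d in docs]

    def search(self, query, top_k=2):
        # Rank documents by cosine similarity to the query embedding.
        qv = embed(query, self.vocab)
        scored = sorted(
            zip(self.docs, self.vectors),
            key=lambda pair: cosine(qv, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in scored[:top_k]]

docs = [
    "Central Park is a large public park in Manhattan, New York.",
    "The Louvre is an art museum in Paris, France.",
    "Times Square is a major commercial hub in New York.",
]
store = ToyVectorStore(docs)
hits = store.search("What should I visit in New York?", top_k=2)

# Assemble the augmented prompt; a real system would now call the LLM.
augmented_prompt = (
    "Context:\n" + "\n".join(hits)
    + "\n\nQuestion: What should I visit in New York?"
)
```

The retrieval step returns the two New York documents and drops the Paris one, showing how the augmented prompt carries only relevant, non-parametric context into the LLM call.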
Practical considerations
  • Embeddings: Use a consistent embedding model for both documents and queries to ensure meaningful similarity search.
  • Chunking & context windows: Break long documents into chunks sized for the LLM’s context window; include overlap to preserve continuity.
  • Relevance and hallucination mitigation: Rank and filter retrieved passages; include explicit citations so users can verify answers.
  • Latency and cost: Retrieval adds a network/compute step — caching and efficient indexing help reduce latency and cost.
  • Security & privacy: Be cautious about sensitive data in external knowledge bases; apply appropriate access controls and data redaction.
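The chunking-with-overlap advice above can be sketched as a small helper. The character-based sizes here are illustrative assumptions; production systems often chunk by tokens or sentences instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Fixed-size character chunks; each chunk repeats the last `overlap`
    # characters of the previous one so boundary content is not lost.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "".join("abcdefghij"[i % 10] for i in range(500))
chunks = chunk_text(sample, chunk_size=200, overlap=50)
```

With a 500-character input, this yields four chunks, and the start of each chunk repeats the tail of the previous one, preserving continuity across boundaries.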
Further reading and references
  • Azure OpenAI Service documentation
  • Vector databases and embeddings: consider providers like Pinecone, Milvus, or open-source options
  • RAG pattern overview and research: search for Retrieval-Augmented Generation and hybrid retrieval + LLM systems
This architecture is widely used for search-augmented assistants, enterprise knowledge helpers, and any application that needs current, verifiable answers while still leveraging LLM reasoning and natural language capabilities.
