Day 7: RAG, LangChain, LangGraph, and the LLM tooling landscape — Overview of Modern Nets

Final day. Want to cover the ecosystem that sits around LLMs. RAG, orchestration frameworks, agents. These are the things you actually use when building products
Starting with RAG because it connects directly to what I learned about embeddings and attention

Give the model knowledge it wasn't trained on

The core idea is simple. LLMs have a knowledge cutoff and can hallucinate. RAG fixes this by retrieving relevant documents at inference time and stuffing them into the context window
Three steps: 1) chunk your documents into pieces, 2) embed each chunk into a vector and store in a vector DB, 3) at query time embed the user's question, find the closest chunks, and pass them to the LLM as context
The retriever is the important part. Bad retrieval means the model gets irrelevant context and produces garbage. Good retrieval means the model has exactly what it needs
Chunking strategy matters a lot. Too small and you lose context. Too big and you dilute the relevant information. Most people use overlapping chunks of 500-1000 tokens

These store embeddings and do fast similarity search. Pinecone, Weaviate, Qdrant, ChromaDB, FAISS. They all do roughly the same thing but with different tradeoffs on scale, speed, and filtering
The similarity search is usually cosine similarity or dot product. Same maths as attention scores. The embedding model matters more than the DB choice in most cases
I already wrote a blog post on vector databases so I won't go deep here. The key thing is they're the backbone of RAG

Naive RAG is just the starting point

Hybrid search: combine dense embeddings with sparse keyword search (BM25). Dense search is good at semantic matching. Sparse search is good at exact keyword matching. Together they cover more ground
Reranking: after retrieval, use a cross-encoder to rerank results. The retriever is fast but rough. The reranker is slow but precise. Cohere Rerank and similar models do this
Query decomposition: break complex questions into sub-questions, retrieve for each, then combine. This helps when the user asks something that spans multiple documents
Agentic RAG: let the LLM decide when to retrieve, what to search for, and whether the results are good enough. If not, it reformulates and tries again. This is where RAG meets agents

The glue layer for LLM applications

LangChain is a framework for chaining LLM calls together. You have prompts, models, output parsers, and chains. A chain is just: take input, format prompt, call LLM, parse output
It also provides abstractions for document loaders, text splitters, vector stores, retrievers. Basically everything you need for RAG in one place
People have mixed feelings about it. The abstraction is heavy and changes fast. But it's the most popular framework and has the biggest ecosystem. Good for prototyping, debatable for production
LCEL (LangChain Expression Language) is their newer API. It's a pipe syntax for chaining: prompt | model | parser. Cleaner than the old chain classes

State machines for LLM workflows

LangGraph builds on LangChain but adds proper state management. Instead of a linear chain, you define a graph with nodes and edges. Each node is an LLM call or tool use. Edges define the flow
The big thing is conditional edges. The LLM output decides which node to go to next. This is how you build agents that can loop, retry, and make decisions
It supports human-in-the-loop workflows where the graph pauses and waits for human input before continuing. Useful for approval steps
Think of it as: LangChain = sequential pipelines, LangGraph = complex workflows with branching and state

Function calling is how LLMs interact with the outside world. The model doesn't execute functions. It outputs structured JSON saying which function to call and with what arguments. Your code executes it and feeds the result back
An agent is an LLM in a loop. It gets a task, decides what tool to use, observes the result, decides the next step. ReAct pattern: Reason, Act, Observe, repeat until done
OpenAI, Anthropic, Google all have their own function calling formats. The concept is the same. Give the model a list of available tools with schemas, let it choose

Standardising how models talk to tools

Anthropic's open protocol for connecting LLMs to external tools and data sources. Instead of every app building its own integration, MCP provides a standard interface
Think of it like USB for LLMs. One protocol, many tools. The model connects to MCP servers that expose tools, resources, and prompts through a standard API
This is pretty new but it's gaining traction. It means you build a tool once and any MCP-compatible model can use it

Fine-tuning vs RAG: fine-tuning bakes knowledge into the weights. RAG keeps it external. RAG is cheaper, easier to update, and doesn't require retraining. Fine-tuning is better for style and format changes
LoRA and QLoRA: parameter-efficient fine-tuning. Instead of updating all weights, you add small trainable matrices. QLoRA quantises the base model to 4-bit and fine-tunes the LoRA adapters. This is how people fine-tune 70B models on consumer GPUs
Prompt engineering is still underrated. Good prompts with few-shot examples often beat fine-tuning for most tasks. Chain of thought prompting connects back to the reasoning models I covered on day 6
Guardrails and safety: output filtering, content moderation, structured output validation. These sit between the model and the user. Important for production but not the fun part

Seven days. Started from BPE tokenization and ended at the full LLM application stack. The core insight is that everything builds on the same transformer attention mechanism. The ecosystem on top is just plumbing to make it useful
What I still want to go deeper on: RoPE and SwiGLU internals, the chain of thought paper, diffusion models, and building something with agentic RAG. But that's for another notebook