LlamaIndex: Connect AI to Your Data
LLMs are powerful, but they don't know about your data -- your company documents, your codebase, your research papers. LlamaIndex is a data framework designed to bridge this gap. It makes it straightforward to ingest, structure, index, and query your private data using LLMs.
If LangChain is a general-purpose toolkit for building LLM applications, LlamaIndex is a specialized framework laser-focused on one thing: connecting AI to your data.
LlamaIndex Definition: An open-source data framework for building LLM applications over custom data. It provides tools for ingesting data from any source, structuring it into optimized indices, and querying it using natural language -- making it the go-to framework for RAG (Retrieval-Augmented Generation) pipelines.
Why LlamaIndex?
Building a RAG system from scratch involves many steps: loading documents in different formats, splitting them into chunks, computing embeddings, storing them in a vector database, retrieving relevant chunks at query time, and synthesizing a final answer. LlamaIndex provides clean abstractions for every one of these steps.
Key reasons developers choose LlamaIndex:
- Data-first design -- built specifically for connecting LLMs to data
- 160+ data connectors -- load from PDFs, databases, APIs, Notion, Slack, Google Drive, and more via LlamaHub
- Multiple index types -- vector, keyword, tree, summary, and knowledge graph indices
- Advanced retrieval -- hybrid search, reranking, recursive retrieval, sub-question decomposition
- Simple to start, powerful to scale -- a basic pipeline is 5 lines of code, while production pipelines remain fully customizable
Installation and Setup
# Install LlamaIndex core
pip install llama-index
# Install OpenAI integration (default LLM and embedding provider)
pip install llama-index-llms-openai llama-index-embeddings-openai
# Optional: readers for PDF, DOCX, and other file types
pip install llama-index-readers-file
Set your API key:
export OPENAI_API_KEY="your_openai_api_key_here"
Never hardcode API keys in your source files. Use environment variables or a .env file loaded with python-dotenv.
Core Concepts
LlamaIndex has five key abstractions that form the backbone of every application.
1. Documents
A Document is a container for your source data. It could be a PDF, a webpage, a database row, or an API response. Documents hold the raw text and associated metadata.
from llama_index.core import Document
# Create documents manually
doc = Document(
text="LlamaIndex is a data framework for LLM applications.",
metadata={"source": "docs", "category": "overview"}
)
# Or load documents from files (the most common approach)
from llama_index.core import SimpleDirectoryReader
# Load all supported files from a directory
documents = SimpleDirectoryReader("./my_data").load_data()
print(f"Loaded {len(documents)} documents")
# Load specific file types with recursive directory scanning
documents = SimpleDirectoryReader(
input_dir="./knowledge_base",
recursive=True,
required_exts=[".pdf", ".txt", ".md", ".docx"]
).load_data()
2. Nodes
Nodes are chunks of Documents. When you build an index, LlamaIndex splits your Documents into smaller Nodes that are suitable for embedding and retrieval. Each Node maintains a reference back to its source Document and preserves metadata.
from llama_index.core.node_parser import SentenceSplitter
# Split documents into nodes (chunks)
parser = SentenceSplitter(
    chunk_size=1024,   # Max tokens per chunk (not characters)
    chunk_overlap=200  # Token overlap between chunks for context continuity
)
nodes = parser.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes from {len(documents)} documents")
# Inspect a node
print(f"Node text: {nodes[0].text[:100]}...")
print(f"Source doc: {nodes[0].metadata.get('file_name', 'N/A')}")
Chunk size matters. Smaller chunks (256-512 tokens) give more precise retrieval but may lose context. Larger chunks (1024-2048 tokens) preserve more context but may include irrelevant information. Start with 1024 and adjust based on your retrieval quality.
3. Indices
An Index organizes your Nodes for efficient retrieval. The most common type is the VectorStoreIndex, which embeds each Node and stores the vectors for similarity search.
from llama_index.core import VectorStoreIndex
# Build a vector index from documents
# (handles chunking and embedding automatically)
index = VectorStoreIndex.from_documents(documents)
# Or build from pre-computed nodes for more control
index = VectorStoreIndex(nodes)
LlamaIndex supports several index types for different use cases:
| Index Type | How It Works | Best For |
|---|---|---|
| VectorStoreIndex | Embeds nodes, retrieves by semantic similarity | General-purpose semantic search |
| SummaryIndex | Stores all nodes, iterates through them | Small datasets, comprehensive answers |
| TreeIndex | Builds a tree of summaries from leaf nodes | Hierarchical documents, summarization |
| KeywordTableIndex | Extracts keywords for keyword-based lookup | Exact term matching, structured data |
| KnowledgeGraphIndex | Builds a knowledge graph from text entities | Relationship-heavy data, entity queries |
4. Query Engines
A Query Engine wraps an Index and provides a natural language interface. You ask a question in plain English, and it retrieves relevant context, sends it to the LLM, and returns a synthesized answer.
# Create a query engine from the index
query_engine = index.as_query_engine()
# Ask a question
response = query_engine.query("What is LlamaIndex used for?")
print(response)
# Access source nodes that were used to generate the answer
for node in response.source_nodes:
    score = f"{node.score:.3f}" if node.score is not None else "N/A"
    print(f"Source: {node.metadata.get('file_name')} | Score: {score}")
You can customize retrieval and synthesis behavior:
query_engine = index.as_query_engine(
similarity_top_k=5, # Retrieve top 5 most relevant chunks
response_mode="compact", # Compact context before sending to LLM
streaming=True # Stream the response token by token
)
# With streaming
streaming_response = query_engine.query("Explain the architecture.")
for text in streaming_response.response_gen:
print(text, end="", flush=True)
Response modes control how the LLM synthesizes answers from retrieved chunks:
| Response Mode | Behavior |
|---|---|
| compact | Stuffs as many chunks as possible into one LLM call |
| refine | Iterates through chunks, refining the answer with each one |
| tree_summarize | Builds a tree of summaries for long contexts |
| simple_summarize | Truncates context to fit in a single LLM call |
5. Chat Engines
For conversational applications, Chat Engines maintain conversation history and context across multiple turns.
# Create a chat engine with context awareness
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
verbose=True
)
# Multi-turn conversation
response1 = chat_engine.chat("What are the main features of LlamaIndex?")
print(response1)
# Follow-up question (engine remembers context)
response2 = chat_engine.chat("How does it compare to LangChain?")
print(response2)
# Reset conversation history
chat_engine.reset()
condense_plus_context mode: The chat engine first condenses the conversation history and current question into a standalone query, then retrieves relevant context, and finally generates a response. This prevents retrieval quality from degrading as conversations grow longer.
Building a Complete RAG Pipeline
Here is a complete, working RAG system that you can run over your own documents.
# rag_app.py
import os
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
StorageContext,
load_index_from_storage
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
PERSIST_DIR = "./storage"
def build_index():
"""Load documents and build the vector index."""
documents = SimpleDirectoryReader(
input_dir="./data",
recursive=True,
required_exts=[".pdf", ".txt", ".md", ".docx"]
).load_data()
print(f"Loaded {len(documents)} documents")
index = VectorStoreIndex.from_documents(documents)
# Persist to disk so we don't rebuild every time
index.storage_context.persist(persist_dir=PERSIST_DIR)
print(f"Index persisted to {PERSIST_DIR}")
return index
def load_index():
"""Load a previously built index from disk."""
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
return load_index_from_storage(storage_context)
def get_index():
"""Get or build the index."""
if os.path.exists(PERSIST_DIR):
print("Loading existing index...")
return load_index()
print("Building new index...")
return build_index()
def main():
index = get_index()
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="compact"
)
print("\nRAG System Ready. Type 'quit' to exit.\n")
while True:
question = input("Question: ").strip()
if question.lower() in ("quit", "exit", "q"):
break
response = query_engine.query(question)
print(f"\nAnswer: {response}\n")
# Show source documents
if response.source_nodes:
print("Sources:")
for i, node in enumerate(response.source_nodes, 1):
source = node.metadata.get("file_name", "Unknown")
                score = f"{node.score:.3f}" if node.score is not None else "N/A"
print(f" {i}. {source} (relevance: {score})")
print()
if __name__ == "__main__":
main()
Persisting the Index: The persist() call saves the index to the storage directory so that later runs can reload it with load_index_from_storage instead of re-embedding every document.
LlamaIndex vs LangChain
Both frameworks are powerful, but they solve different primary problems.
| Aspect | LlamaIndex | LangChain |
|---|---|---|
| Primary focus | Data indexing and retrieval | General LLM application building |
| Key strength | RAG pipelines, document Q&A | Chains, agents, complex workflows |
| Data connectors | 160+ built-in via LlamaHub | Fewer native connectors |
| Index types | Vector, tree, keyword, knowledge graph | Primarily vector-based |
| Query synthesis | Multiple modes (compact, refine, tree_summarize) | Basic context stuffing |
| Learning curve | Lower for data/RAG tasks | Lower for agent/chain tasks |
| Best for | "I want to chat with my documents" | "I want to build complex AI workflows" |
| Composability | Can be used inside LangChain agents | Can use LlamaIndex as a retrieval tool |
They work together. LlamaIndex and LangChain are complementary, not competitors. You can use LlamaIndex as a retrieval tool inside a LangChain agent, or use LangChain's chain abstractions to orchestrate LlamaIndex query engines. Many production systems use both.
Advanced Features
Sub-Question Query Engine
For complex questions that span multiple data sources, the sub-question query engine breaks a question into targeted sub-questions.
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
# Create tools from different indices
tools = [
QueryEngineTool(
query_engine=financial_index.as_query_engine(),
metadata=ToolMetadata(
name="financials",
description="Contains financial reports and earnings data"
)
),
QueryEngineTool(
query_engine=product_index.as_query_engine(),
metadata=ToolMetadata(
name="products",
description="Contains product documentation and roadmaps"
)
)
]
# Engine automatically decomposes complex questions
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = engine.query(
"How did Q3 product launches affect revenue growth?"
)
# The engine will ask the product index about Q3 launches
# and the financial index about revenue, then synthesize both
Using External Vector Stores
LlamaIndex integrates with all major vector databases for production deployments.
# ChromaDB example
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context
)
Key Takeaways
What You Have Learned:
- LlamaIndex is a data framework for building LLM applications over your private data
- The five core abstractions are Documents, Nodes, Indices, Query Engines, and Chat Engines
- A basic RAG pipeline can be built in under 10 lines of code
- Multiple index types serve different data patterns and retrieval needs
- Indices can be persisted to disk to avoid re-embedding on every restart
- LlamaIndex and LangChain are complementary -- use each for its strengths
Next Steps
Try building a RAG system over your own documents. Start with a small set of PDFs or markdown files and experiment with different chunk sizes, index types, and response modes to see how they affect answer quality.