This is the module where everything comes together. You build the complete RAG pipeline: take a user question, embed it, retrieve relevant chunks, inject them into the prompt, and generate a grounded answer with citations.
The RAG Pipeline
import anthropic
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
qdrant = QdrantClient(url="http://localhost:6333")
claude = anthropic.Anthropic()


def rag_answer(question: str) -> dict:
    # 1. Embed the query
    query_vector = embedder.encode(question).tolist()

    # 2. Retrieve relevant chunks
    results = qdrant.search(collection_name="docs", query_vector=query_vector, limit=5)

    # 3. Build context from retrieved chunks
    context_chunks = []
    for r in results:
        context_chunks.append(f"[Source: {r.payload['title']}]\n{r.payload['content']}")
    context = "\n\n---\n\n".join(context_chunks)

    # 4. Augment prompt with context
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="Answer using ONLY the provided context. Cite sources. If the context does not contain the answer, say so.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )

    return {
        "answer": response.content[0].text,
        "sources": [r.payload['title'] for r in results],
    }
Source Attribution
Production RAG must cite its sources. This builds user trust and enables verification. Include source titles, page numbers, and relevance scores in the response.
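One way to assemble those citation fields is sketched below. It is a minimal illustration, not part of the pipeline above: the Hit class is a hypothetical stand-in for a retrieved result, and the "page" payload field is an assumption, since the pipeline's payloads only show "title" and "content".

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for one retrieved result (score plus payload)."""
    score: float
    payload: dict


def build_citations(hits: list[Hit]) -> list[dict]:
    """Turn retrieved hits into citation records, one per (title, page) pair."""
    citations = {}
    for hit in hits:
        # "page" is an assumed payload field; it is None when absent.
        key = (hit.payload["title"], hit.payload.get("page"))
        # Keep the highest score seen for each distinct source.
        if key not in citations or hit.score > citations[key]["score"]:
            citations[key] = {
                "title": key[0],
                "page": key[1],
                "score": round(hit.score, 3),
            }
    # Most relevant sources first.
    return sorted(citations.values(), key=lambda c: c["score"], reverse=True)


hits = [
    Hit(0.82, {"title": "Install Guide", "page": 4, "content": "..."}),
    Hit(0.77, {"title": "Install Guide", "page": 4, "content": "..."}),
    Hit(0.64, {"title": "FAQ", "content": "..."}),
]
citations = build_citations(hits)
```

Deduplicating on (title, page) keeps the source list short when several chunks come from the same page, and the records could replace the plain list of titles returned under "sources".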
Edge Cases
- No relevant results: When no retrieved chunk scores above the similarity threshold, reply "I don't have information about this" instead of letting the model hallucinate an answer
- Conflicting sources: When retrieved documents disagree, present both positions, each with its citation
- Context overflow: When the retrieved chunks exceed the context window, keep the highest-scoring chunks and drop the rest
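The first and third cases can be handled before the model is ever called. The sketch below is one possible approach under stated assumptions: hits are plain (score, text) pairs, the min_score and max_chars defaults are illustrative rather than recommended values, and a character count stands in for a real token count. The names select_context and answer_or_refuse are hypothetical helpers, not part of any library.

```python
NO_ANSWER = "I don't have information about this."


def select_context(hits, min_score=0.5, max_chars=6000):
    """Filter hits by score, then pack the best ones into a character budget.

    hits: list of (score, text) pairs; a higher score means more relevant.
    min_score and max_chars are illustrative defaults (assumptions).
    """
    relevant = [h for h in hits if h[0] >= min_score]
    relevant.sort(key=lambda h: h[0], reverse=True)

    selected, used = [], 0
    for score, text in relevant:
        if used + len(text) > max_chars:
            continue  # this chunk does not fit; a smaller one still might
        selected.append(text)
        used += len(text)
    return selected


def answer_or_refuse(hits, generate):
    """Call generate(context) only when some chunk clears the threshold."""
    chunks = select_context(hits)
    if not chunks:
        return NO_ANSWER
    return generate("\n\n---\n\n".join(chunks))


hits = [(0.9, "a" * 4000), (0.8, "b" * 3000), (0.7, "c" * 1500), (0.2, "d" * 10)]
selected = select_context(hits)
```

Here generate is any callable that takes the joined context and returns an answer, so the refusal path never reaches the model. Conflicting sources are left to the prompt itself, since detecting disagreement needs the model's reading of the retrieved text.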