You want an AI system that knows about your company’s products, follows your coding standards, or answers questions from your internal documentation. The model does not know any of this out of the box. You have three options for adding domain knowledge, and picking the wrong one can waste months and thousands of dollars.
The Three Approaches
- Prompt Engineering: Craft instructions and examples in the prompt itself. No model changes, no infrastructure.
- RAG (Retrieval-Augmented Generation): Retrieve relevant documents at query time and inject them into the prompt. The model reads your data on the fly.
- Fine-Tuning: Train the model on your data to permanently alter its behavior and knowledge. Requires compute, data preparation, and ongoing maintenance.
Decision Matrix
| Factor | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup Cost | Zero | Medium (vector DB, embedding pipeline) | High (compute, data prep, evaluation) |
| Per-Query Cost | Low-Medium (longer prompts) | Medium (retrieval + generation) | Low (no retrieval step) |
| Accuracy | Good for simple tasks | High (grounded in real documents) | High on the trained task (learned patterns) |
| Hallucination Risk | Medium-High | Lower (answers grounded in retrieved documents) | Medium (can still hallucinate) |
| Data Freshness | Static (in the prompt) | Real-time (retrieves latest docs) | Stale (frozen at training time) |
| Maintenance | Update prompts manually | Keep document index updated | Re-train periodically |
| Latency | Lowest | Medium (retrieval adds 100-500ms) | Lowest |
| Knowledge Volume | Small (fits in context window) | Unlimited (retrieve as needed) | Large (trained into weights) |
When to Use Prompt Engineering
- Your domain knowledge fits in the system prompt (as a rule of thumb, under 5,000 tokens)
- You need to change behavior, not add knowledge (tone, format, constraints)
- You are prototyping and need results today
- Your data changes frequently and is small enough to include directly
# Prompt engineering: all knowledge in the system prompt
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = """You are a customer support agent for AcmeCorp.

Product Information:
- Basic Plan: $9/month, 10 projects, 5GB storage
- Pro Plan: $29/month, unlimited projects, 50GB storage
- Enterprise: Custom pricing, SSO, dedicated support

Policies:
- Refunds: Full refund within 14 days, pro-rated after
- Downgrade: Takes effect at end of billing cycle
- Data export: Available in Settings > Export > Download All

Always be helpful and concise. If unsure, say you'll escalate
to a human agent."""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=system_prompt,
    messages=[{"role": "user", "content": "Can I get a refund?"}],
)
print(response.content[0].text)
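The 5,000-token guideline above can be checked before committing to a prompt-only design. The following is a minimal sketch, assuming a rough heuristic of about four characters per token for English text; real tokenizers vary by model, so treat the result as an estimate. The helper names `estimate_tokens` and `fits_in_prompt` are hypothetical and not part of any library.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (a heuristic)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_prompt(knowledge: str, budget_tokens: int = 5_000) -> bool:
    """True if the knowledge text is estimated to fit within the token budget."""
    return estimate_tokens(knowledge) <= budget_tokens


if __name__ == "__main__":
    product_info = "- Basic Plan: $9/month, 10 projects, 5GB storage\n" * 3
    print(fits_in_prompt(product_info))
```

If the estimate lands near the budget, count tokens with the tokenizer or token-counting endpoint of the model you actually use before deciding.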
When to Use RAG
- You have large amounts of domain data (docs, knowledge base, code repos)
- Data changes frequently and must always be current
- You need the model to cite sources and ground responses in facts
- The cost of a hallucinated answer is high (medical, legal, financial use cases)
# RAG pipeline: retrieve relevant docs, then generate
import anthropic
import chromadb
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Step 1: Index your documents (one-time setup)
embedder = SentenceTransformer('all-MiniLM-L6-v2')
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection("docs")

def index_documents(documents: list[dict]):
    # Each document is a dict with "id", "content", "source", and "title" keys
    for doc in documents:
        embedding = embedder.encode(doc["content"]).tolist()
        collection.add(
            ids=[doc["id"]],
            embeddings=[embedding],
            documents=[doc["content"]],
            metadatas=[{"source": doc["source"], "title": doc["title"]}],
        )

# Step 2: Retrieve relevant documents at query time
def retrieve(query: str, top_k: int = 5) -> list[str]:
    query_embedding = embedder.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    # One result list per query embedding; we sent a single query
    return results["documents"][0]

# Step 3: Generate with retrieved context
def rag_answer(question: str) -> str:
    relevant_docs = retrieve(question)
    context = "\n\n---\n\n".join(relevant_docs)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=(
            "Answer using ONLY the provided documentation. "
            "If the docs don't contain the answer, say so. Cite the relevant section."
        ),
        messages=[{
            "role": "user",
            "content": f"Documentation:\n{context}\n\nQuestion: {question}"
        }],
    )
    return response.content[0].text
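The pipeline above asks the model to cite its sources, but `retrieve` passes along only the document text, so the model has nothing to cite. One way to close that gap is to label each retrieved chunk with its stored title and source. This is a sketch under assumptions: `format_context` is a hypothetical helper, and it assumes the query result also exposes the stored metadata (for Chroma, typically `results["metadatas"][0]`; verify this against your installed version).

```python
def format_context(docs: list[str], metadatas: list[dict]) -> str:
    """Label each retrieved chunk with a numbered source tag the model can cite."""
    blocks = []
    for i, (doc, meta) in enumerate(zip(docs, metadatas), start=1):
        title = meta.get("title", "untitled")
        source = meta.get("source", "unknown")
        blocks.append(f"[{i}] {title} ({source})\n{doc}")
    return "\n\n---\n\n".join(blocks)


if __name__ == "__main__":
    docs = ["Full refund within 14 days, pro-rated after."]
    metas = [{"title": "Refund policy", "source": "policies.md"}]
    print(format_context(docs, metas))
```

With numbered tags in the context, the system prompt can ask for citations such as "[1]", which are easy to check against the retrieved chunks afterwards.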
When to Use Fine-Tuning
- You need the model to adopt a specific style, format, or personality consistently
- You have hundreds to thousands of high-quality input-output examples
- Latency is critical and you cannot afford retrieval overhead
- The knowledge is stable and does not change frequently
- Prompt engineering cannot reliably achieve the quality you need
# Fine-tuning: prepare training data in JSONL format (one complete JSON object per line)
# training_data.jsonl
{"messages": [{"role": "system", "content": "You generate SQL from natural language."}, {"role": "user", "content": "Show me all orders from last month over $100"}, {"role": "assistant", "content": "SELECT * FROM orders WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month') AND created_at < DATE_TRUNC('month', CURRENT_DATE) AND total_amount > 100 ORDER BY created_at DESC;"}]}
{"messages": [{"role": "system", "content": "You generate SQL from natural language."}, {"role": "user", "content": "Count users by country"}, {"role": "assistant", "content": "SELECT country, COUNT(*) as user_count FROM users GROUP BY country ORDER BY user_count DESC;"}]}
# You need 100-1000+ high-quality examples
# Quality matters more than quantity
# Include edge cases and error handling examples
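Hand-written JSONL breaks easily, because every record must be a complete JSON object on its own line. A small script that writes and checks the file avoids that. The sketch below assumes the `messages` schema used in the example above; fine-tuning providers differ in the exact format they accept, so check theirs. `to_jsonl_line` and `validate_jsonl` are hypothetical helper names.

```python
import json


def to_jsonl_line(system: str, user: str, assistant: str) -> str:
    """Serialize one chat example as a single JSON line."""
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }
    return json.dumps(record, ensure_ascii=False)


def validate_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {lineno}: missing 'messages' list")
            continue
        last = messages[-1]
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append(f"line {lineno}: last message must be from the assistant")
    return problems


if __name__ == "__main__":
    line = to_jsonl_line(
        "You generate SQL from natural language.",
        "Count users by country",
        "SELECT country, COUNT(*) AS user_count FROM users GROUP BY country;",
    )
    print(validate_jsonl(line))
```

Running the validator over the whole file before uploading catches records that were split across lines or that end on a user turn.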
The Hybrid Approach: Best of All Worlds
In practice, most production systems combine multiple approaches:
# Hybrid: Fine-tuned model + RAG + Prompt Engineering
# 1. Fine-tune for consistent output format and domain vocabulary
# 2. RAG for real-time data retrieval
# 3. Prompt engineering for behavioral constraints
# Assumes the `client` and `retrieve()` used in the RAG example
def hybrid_pipeline(question: str) -> str:
    # RAG: retrieve relevant context
    context = "\n\n---\n\n".join(retrieve(question))

    # Prompt engineering: behavioral instructions
    system = (
        "You are a technical support specialist.\n"
        "Format responses as:\n"
        "1. Diagnosis (one sentence)\n"
        "2. Solution (step-by-step)\n"
        "3. Prevention (one tip)\n"
        "Always cite the documentation section you referenced."
    )

    # Fine-tuned model: trained on your support ticket history.
    # The model ID below is a placeholder; use the identifier your
    # provider assigns to your fine-tuned model.
    response = client.messages.create(
        model="ft:claude-sonnet-4-6:acmecorp:support-v3",
        max_tokens=1024,
        system=system,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }],
    )
    return response.content[0].text
Decision Flowchart
- Can you fit all necessary context in the prompt? → Start with prompt engineering
- Do you need access to large or changing data? → Add RAG
- Is the model not following your format/style consistently? → Consider fine-tuning
- Is latency critical and retrieval adds too much overhead? → Consider fine-tuning stable knowledge into the model
- Do you need all three? → A fine-tuned model with RAG and prompt guardrails is a common setup for complex systems
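The questions above can be written down as a small function, which makes the decision explicit and easy to review with a team. This is a simplified sketch of the flowchart, not a complete decision procedure; the `Requirements` fields and the `recommend` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    fits_in_prompt: bool          # all needed context fits in the prompt
    large_or_changing_data: bool  # needs large or frequently updated data
    inconsistent_format: bool     # prompts alone do not hold format/style
    latency_critical: bool        # retrieval overhead is unacceptable


def recommend(req: Requirements) -> list[str]:
    """Walk the questions in order and collect the suggested approaches."""
    approaches = ["prompt engineering"]  # always the starting point
    if req.large_or_changing_data or not req.fits_in_prompt:
        approaches.append("RAG")
    if req.inconsistent_format or req.latency_critical:
        approaches.append("fine-tuning")
    return approaches


if __name__ == "__main__":
    print(recommend(Requirements(True, False, False, False)))
    print(recommend(Requirements(False, True, True, False)))
```

The function only lists candidates. Whether fine-tuning should replace or complement retrieval in a latency-critical system still depends on how stable the knowledge is.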
Cost Comparison
| Phase | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup | $0 | $100-500 (vector DB, embeddings) | $500-5,000 (compute, data prep) |
| Monthly (1M queries) | $200-500 | $300-800 | $150-400 + hosting |
| Time to first result | Hours | Days | Weeks |
| Iteration speed | Minutes | Hours | Days-Weeks |
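The table's figures can be turned into a break-even estimate: how many months of lower per-query cost it takes to pay back a higher setup cost. The sketch below assumes fixed monthly costs and uses the midpoints of the ranges in the table as illustrative inputs; hosting for the fine-tuned model is excluded because the table gives no figure for it. The function names are hypothetical.

```python
from typing import Optional


def cumulative_cost(setup: float, monthly: float, months: int) -> float:
    """Total spend after a number of months at a fixed monthly cost."""
    return setup + monthly * months


def break_even_months(
    setup_a: float, monthly_a: float, setup_b: float, monthly_b: float
) -> Optional[float]:
    """Months until option B (lower monthly cost) has paid back its extra setup cost.

    Returns None when B is not cheaper per month and so never catches up.
    """
    saving_per_month = monthly_a - monthly_b
    if saving_per_month <= 0:
        return None
    return max(0.0, (setup_b - setup_a) / saving_per_month)


if __name__ == "__main__":
    # Midpoints of the table's ranges (illustrative, hosting excluded)
    prompt_setup, prompt_monthly = 0, 350
    ft_setup, ft_monthly = 2750, 275
    print(break_even_months(prompt_setup, prompt_monthly, ft_setup, ft_monthly))
```

With these midpoints, fine-tuning takes roughly 37 months to pay back its setup cost relative to prompt engineering, and adding hosting costs pushes that further out.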
Key Takeaways
- Always start with prompt engineering — it is free, fast, and often sufficient
- Add RAG when you need large or dynamic knowledge — documents, knowledge bases, code repos
- Fine-tune when you need consistent style or format — not for adding factual knowledge (use RAG for that)
- RAG reduces hallucination better than fine-tuning — the model cites retrieved documents, not memorized patterns
- Fine-tuning freezes knowledge at training time — your data from January is stale by March
- The hybrid approach wins in production — fine-tuned format + RAG for facts + prompt guardrails
- Measure before you optimize — if prompt engineering gives 95% accuracy, the extra 3% from fine-tuning may not justify the cost
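Measuring can start as a fixed set of questions with expected answers and a function that scores any approach against them. The following is a minimal sketch, assuming a substring match is a good enough correctness check for your task; real evaluations often need stricter or model-graded scoring. `accuracy` and `fake_answer` are hypothetical names, and `fake_answer` stands in for a real model call.

```python
from typing import Callable


def accuracy(answer_fn: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the expected text appears in the answer."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    hits = sum(
        1
        for question, expected in cases
        if expected.lower() in answer_fn(question).lower()
    )
    return hits / len(cases)


if __name__ == "__main__":
    # Stand-in for a real model call, used only to make the sketch runnable.
    def fake_answer(question: str) -> str:
        if "refund" in question.lower():
            return "Full refund within 14 days."
        return "Not sure."

    cases = [
        ("Can I get a refund?", "14 days"),
        ("How do I export data?", "Settings > Export"),
    ]
    print(accuracy(fake_answer, cases))
```

Running the same cases against the prompt-only, RAG, and fine-tuned variants gives a like-for-like number to weigh against each approach's cost.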
The biggest mistake in AI engineering is reaching for fine-tuning first. It is the most expensive approach, the slowest to iterate on, and the hardest to maintain. Start with prompts, add RAG when you outgrow the context window, and fine-tune only when you have proven that the other approaches cannot achieve the quality you need. Many production systems never need fine-tuning at all.