Embeddings Explained: The Secret to Cheap RAG Applications
Why your RAG app costs $500/month when it should cost $5 (and how to fix it)
Why your RAG app costs $500/month when it should cost $5 (and how to fix it)
I built a RAG app that cost $500/month.
Then I learned about embeddings. Now it costs $5/month.
Same quality. Same users. Same results.
Here's what I wish someone had told me about embeddings from day one.
Imagine you have 10,000 documents. You need to find the right one for each question.
The dumb way: Send all 10,000 documents to GPT-4 every time.
The smart way: Use embeddings.
What are embeddings? They're like a GPS for your documents.
Instead of reading everything, embeddings turn each document into a set of coordinates. When someone asks a question, you just find the documents closest to that question's coordinates.
Let me show you the math that made me want to cry:
Your setup:
The expensive way (No RAG):
Every question: Send all 5,000 articles to GPT-4
Daily cost: 500 × 5,000 × 500 tokens = 1.25B tokens
Monthly cost: 1.25B × 30 × $5/1M = $187,500
The RAG way:
Step 1: Embed all articles (one-time)
Cost: 5,000 × 500 × $0.10/1M = $0.25
Step 2: Each question
- Embed question: 100 tokens × $0.10/1M = $0.00001
- Find 3 relevant articles: Free
- Send only 3 articles to GPT-4: 1,500 tokens × $5/1M = $0.0075
- Total per question: $0.00751
Step 3: Monthly
500 questions/day × 30 days × $0.00751 = $112.65
Wait, that's still expensive! Here's the real optimization:
Instead of: OpenAI ada-002 ($0.10/1M tokens) Use: Voyage AI or Cohere ($0.02/1M tokens)
Savings: 80% on embedding costs
Bad: Embed entire 500-token articles Good: Chunk into 150-token pieces
function smartChunk(text, targetTokens = 150) {
const sentences = text.split(/[.!?]/);
const chunks = [];
let currentChunk = [];
let currentLength = 0;
for (const sentence of sentences) {
const sentenceTokens = sentence.split(' ').length * 1.3;
if (currentLength + sentenceTokens > targetTokens) {
chunks.push(currentChunk.join('. '));
currentChunk = [sentence];
currentLength = sentenceTokens;
} else {
currentChunk.push(sentence);
currentLength += sentenceTokens;
}
}
if (currentChunk.length > 0) {
chunks.push(currentChunk.join('. '));
}
return chunks;
}
Why this matters: Better search results = fewer documents needed = lower LLM costs
Don't just use embeddings. Combine them with keyword search:
async function hybridSearch(query, topK = 3) {
// 1. Get embedding results
const embeddingResults = await vectorSearch(query, topK * 2);
// 2. Get keyword results
const keywordResults = await keywordSearch(query, topK * 2);
// 3. Combine and deduplicate
const allResults = [...embeddingResults, ...keywordResults];
const uniqueResults = deduplicate(allResults);
// 4. Re-rank top 5
return rerank(query, uniqueResults.slice(0, 5)).slice(0, topK);
}
Result: 40% better accuracy = can use smaller/cheaper LLM
Users ask similar questions all the time:
const embeddingCache = new Map();
async function getCachedEmbedding(text) {
const key = hash(text);
if (embeddingCache.has(key)) {
console.log("Cache hit! 🎉");
return embeddingCache.get(key);
}
const embedding = await embedText(text);
embeddingCache.set(key, embedding);
return embedding;
}
Real numbers: 30-50% cache hit rate in production
Don't use: ada-003 (3072 dimensions, $0.13/1M) Use: ada-002 (1536 dimensions, $0.10/1M)
Accuracy difference: <2% Cost difference: 30%
Here's my actual app's monthly costs:
| Component | Before | After | Savings |
|---|---|---|---|
| Embeddings | $50 | $10 | $40 |
| Vector DB | $50 | $0 (self-hosted) | $50 |
| LLM calls | $400 | $5 | $395 |
| Total | $500 | $15 | $485 |
97% cost reduction. Same quality.
Here's the exact setup I use now:
User Query
↓
[Cache Check] ← 30% hit rate = FREE
↓
[Embed Query] ← Voyage AI ($0.02/1M)
↓
[Hybrid Search] ← Vector + Keyword
↓
[Re-rank] ← GPT-4o Mini ($0.30/1M)
↓
[Top 3 Chunks] ← 150 tokens each
↓
[LLM Answer] ← GPT-4o Mini (not GPT-4!)
↓
User Response
Cost per query: $0.0008 (vs $0.015 before)
import { OpenAI } from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';
const pc = new Pinecone();
const index = pc.Index('my-docs');
const openai = new OpenAI();
// 1. Embedding with caching
const embeddingCache = new Map();
async function getEmbedding(text) {
const key = text.slice(0, 100); // Simple hash
if (embeddingCache.has(key)) {
return embeddingCache.get(key);
}
const response = await openai.embeddings.create({
model: 'text-embedding-ada-002',
input: text
});
const embedding = response.data[0].embedding;
embeddingCache.set(key, embedding);
return embedding;
}
// 2. Smart chunking
function chunkDocument(text, targetSize = 150) {
const sentences = text.split(/[.!?]/);
const chunks = [];
let current = [];
let size = 0;
for (const sentence of sentences) {
const tokens = sentence.split(' ').length * 1.3;
if (size + tokens > targetSize && current.length > 0) {
chunks.push(current.join('. ') + '.');
current = [sentence];
size = tokens;
} else {
current.push(sentence);
size += tokens;
}
}
if (current.length > 0) {
chunks.push(current.join('. ') + '.');
}
return chunks;
}
// 3. Hybrid search
async function hybridSearch(query, topK = 3) {
const embedding = await getEmbedding(query);
// Vector search
const vectorResults = await index.query({
vector: embedding,
topK: topK * 2,
includeMetadata: true
});
// Keyword search (simplified)
const keywordResults = await keywordSearch(query, topK * 2);
// Combine and deduplicate
const all = [...vectorResults.matches, ...keywordResults];
const unique = deduplicateByMetadata(all, 'docId');
// Re-rank with cheap LLM
return await rerank(query, unique.slice(0, 5), topK);
}
// 4. Re-ranking
async function rerank(query, candidates, topK) {
const prompt = `
Query: "${query}"
Candidates:
${candidates.map((c, i) => `${i+1}. ${c.metadata.text}`).join('\n')}
Return top ${topK} numbers in order of relevance:
`;
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: prompt }],
max_tokens: 50
});
const ranks = response.choices[0].message.content
.split(',')
.map(n => parseInt(n.trim()) - 1);
return ranks.map(i => candidates[i]).filter(c => c);
}
// 5. Complete RAG query
async function ragQuery(userQuestion) {
// Check cache first
const cached = await checkResponseCache(userQuestion);
if (cached) return cached;
// Get relevant chunks
const relevantChunks = await hybridSearch(userQuestion, 3);
if (relevantChunks.length === 0) {
return "I don't have enough information to answer that.";
}
// Build context
const context = relevantChunks
.map(c => c.metadata.text)
.join('\n\n---\n\n');
// Generate answer with cheap model
const prompt = `
Context:
${context}
Question: ${userQuestion}
Answer based ONLY on the context above. Be concise.
`;
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'You are a helpful assistant. Answer concisely based on context.' },
{ role: 'user', content: prompt }
],
max_tokens: 200
});
const answer = response.choices[0].message.content;
// Cache the response
await cacheResponse(userQuestion, answer);
return answer;
}
Bad: Embed 10,000 documents without filtering Good: Pre-filter with keywords, then embed
Bad: GPT-4 for every query ($0.03/query) Good: GPT-4o Mini ($0.0008/query)
Bad: Only vector search Good: Hybrid (vector + keyword)
Bad: Re-embed the same question 100 times Good: Cache embeddings for 30 days
Bad: 1000-token chunks Good: 150-200 token chunks
| Approach | Monthly Cost | Quality |
|---|---|---|
| Send all to GPT-4 | $187,500 | 10/10 |
| Basic RAG | $112 | 9/10 |
| Optimized RAG | $5 | 9/10 |
The optimized RAG uses:
RAG isn't always the answer:
❌ <100 documents - Just send them all ❌ Exact matching - Use traditional search ❌ Real-time data - Embeddings get stale ❌ Simple Q&A - Fine-tuning might be better
If you're building RAG today:
Cost: ~$20/month for 10K docs
Cost: ~$8/month
Cost: ~$2/month
| Service | Cost/1M tokens | Use For |
|---|---|---|
| OpenAI ada-002 | $0.10 | Standard embeddings |
| Voyage AI | $0.02 | Cheaper embeddings |
| GPT-4 | $5.00 | Complex queries only |
| GPT-4o Mini | $0.30 | Most RAG answers |
| Pinecone | $0.096/hr | Managed DB |
| Qdrant | Free | Self-hosted |
Embeddings are the difference between a $500/month RAG app and a $5/month RAG app.
The trick isn't using fancy models. It's:
Your RAG app should be cheap. If it's not, you're doing it wrong.
Ready to optimize your RAG costs? Use our LLM Cost Calculator to see exactly what your setup should cost.
After reading this article, you now understand:
Which model should you actually use? Here's the real answer based on 50+ use cases
The real story behind tokens, how they work, and simple tricks to cut your AI costs by 50%
Join 500+ developers saving money on AI costs. One practical tip every week.