RAG Systems
Build Retrieval-Augmented Generation (RAG) pipelines using IndoxHub embeddings and chat APIs.
Architecture
- Index — Embed documents with the embeddings API
- Retrieve — Find relevant documents using vector similarity
- Generate — Pass context + query to chat completions
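The three stages can be traced end to end without any network calls. The sketch below is only an illustration of the data flow: `VOCAB`, `toy_embed`, `build_index`, `retrieve`, and `build_messages` are hypothetical names, and the word-count embedder is a stand-in for the embeddings API, not how IndoxHub computes vectors.

```python
import numpy as np

# Hypothetical stand-in vocabulary so the sketch runs offline.
VOCAB = ["byok", "keys", "video", "async", "embeddings", "cached"]

def toy_embed(text):
    """Count vocabulary words in the text (stand-in for the embeddings API)."""
    words = text.lower().replace(".", " ").replace("?", " ").split()
    return np.array([words.count(term) for term in VOCAB], dtype=float)

def build_index(docs):
    """Index: embed every document once and stack the vectors row-wise."""
    return np.stack([toy_embed(doc) for doc in docs])

def retrieve(index, docs, query, top_k=1):
    """Retrieve: rank documents by cosine similarity to the query."""
    q = toy_embed(query)
    norms = np.linalg.norm(index, axis=1) * np.linalg.norm(q)
    # Replace zero norms with 1.0 so empty vectors score 0 instead of NaN.
    scores = (index @ q) / np.where(norms == 0, 1.0, norms)
    best = np.argsort(-scores)[:top_k]
    return [docs[i] for i in best]

def build_messages(context_docs, question):
    """Generate: assemble the chat messages that would be sent to the model."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return [
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": question},
    ]

docs = [
    "BYOK allows using your own provider API keys.",
    "Video generation uses async job processing.",
]
index = build_index(docs)
hits = retrieve(index, docs, "How does BYOK work?")
messages = build_messages(hits, "How does BYOK work?")
```

In a real pipeline, `toy_embed` would be replaced by a call to the embeddings endpoint and `messages` would be posted to chat completions; the shape of the flow stays the same.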
Step 1: Embed Documents
import requests
import numpy as np

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

documents = [
    "IndoxHub supports 200+ AI models from 14 providers.",
    "BYOK allows using your own provider API keys.",
    "Embeddings are cached for faster repeated lookups.",
    "Video generation uses async job processing.",
]

# Embed all documents in a single batched request
response = requests.post(
    f"{BASE_URL}/embeddings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "openai/text-embedding-3-small",
        "text": documents,
    },
    timeout=30,
)
response.raise_for_status()  # fail fast on HTTP errors
embeddings = response.json()["data"]  # document vectors, used in Step 2
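For a corpus too large for one request, the same call can be repeated over fixed-size batches. The sketch below is a rough illustration under stated assumptions: `embed_in_batches` and `embed_batch` are hypothetical names, and the default `batch_size` of 64 is an arbitrary choice, not a documented IndoxHub limit. `embed_batch` stands for any callable that wraps the request above and returns one vector per input text.

```python
def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed texts in fixed-size batches and return vectors in input order.

    embed_batch is a hypothetical callable: it takes a list of strings and
    returns one vector per string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = embed_batch(batch)
        # A length mismatch would silently misalign documents and vectors.
        if len(result) != len(batch):
            raise ValueError(f"expected {len(batch)} vectors, got {len(result)}")
        vectors.extend(result)
    return vectors

# Offline stand-in embedder: one 2-number vector per text.
def fake_embed_batch(batch):
    return [[float(len(text)), 1.0] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

Checking the vector count per batch keeps each vector aligned with its source document.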
Step 2: Retrieve Relevant Context
def search(query, top_k=3):
    """Return the top_k documents most similar to the query."""
    # Embed the query
    resp = requests.post(
        f"{BASE_URL}/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "openai/text-embedding-3-small", "text": query},
        timeout=30,
    )
    resp.raise_for_status()
    query_vec = np.array(resp.json()["data"][0])

    # Compute the cosine similarity between the query and each document
    scores = []
    for i, emb in enumerate(embeddings):
        doc_vec = np.array(emb)
        score = np.dot(query_vec, doc_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
        )
        scores.append((score, i))

    # Highest similarity first
    scores.sort(reverse=True)
    return [documents[i] for _, i in scores[:top_k]]
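The per-document loop works for a handful of vectors. For larger collections the same cosine ranking can be computed in one matrix product. The sketch below assumes the document vectors are stacked into a 2-D array; `top_k_cosine` is a hypothetical helper, not part of any IndoxHub SDK.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, top_k=3):
    """Return (index, score) pairs for the top_k rows most similar to query_vec."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Normalise rows and query so a dot product equals cosine similarity.
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = np.linalg.norm(query_vec)
    # Guard against zero vectors to avoid dividing by zero.
    unit_docs = doc_matrix / np.where(doc_norms == 0, 1.0, doc_norms)
    unit_query = query_vec / (query_norm if query_norm else 1.0)
    scores = unit_docs @ unit_query
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

doc_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_cosine([1.0, 0.0], doc_matrix, top_k=2)
```

Returning the scores alongside the indices makes it easy to drop weak matches with a similarity threshold before building the prompt.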
Step 3: Generate Answer
def rag_answer(question):
    """Answer a question using the most relevant documents as context."""
    # Retrieve the two best-matching documents and format them as a list
    context_docs = search(question, top_k=2)
    context = "\n".join(f"- {doc}" for doc in context_docs)

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "openai/gpt-4o-mini",
            "messages": [
                {
                    "role": "system",
                    "content": f"Answer based on this context:\n{context}",
                },
                {"role": "user", "content": question},
            ],
            "temperature": 0.3,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]

print(rag_answer("How does BYOK work?"))
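Prompt assembly can be separated from the HTTP call so it is easy to inspect and test. The sketch below is one possible variation, not the documented approach: `build_rag_messages`, the `max_context_chars` budget of 2000, the numbered source labels, and the wording of the system prompt are all assumptions introduced for illustration.

```python
def build_rag_messages(question, context_docs, max_context_chars=2000):
    """Build chat messages from retrieved documents, within a character budget."""
    lines = []
    used = 0
    for number, doc in enumerate(context_docs, start=1):
        line = f"[{number}] {doc}"
        if used + len(line) > max_context_chars:
            break  # stop before the context exceeds the budget
        lines.append(line)
        used += len(line) + 1  # +1 for the newline that joins the lines
    context = "\n".join(lines) if lines else "(no relevant context found)"
    system = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_rag_messages(
    "How does BYOK work?",
    ["BYOK allows using your own provider API keys."],
)
```

The resulting list can be passed as the `messages` field of the chat completions request. Telling the model to admit when the context lacks an answer may reduce fabricated replies.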
Tips
- Batch embed — Send multiple texts in one request for efficiency
- Cache embeddings — Store vectors in a database (Pinecone, Qdrant, pgvector)
- Low temperature — Use 0.1–0.3 for factual RAG answers
- Chunk documents — Split long docs into ~500 token chunks before embedding
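The chunking and caching tips can be sketched together. This is a minimal illustration under stated assumptions: `chunk_words` and `EmbeddingCache` are hypothetical names, whitespace-separated words are used as a rough proxy for tokens, the 50-word overlap is an arbitrary choice, and the in-memory dictionary stands in for a real vector database.

```python
import hashlib

def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into chunks of about chunk_size words with some overlap.

    Words are a rough proxy for tokens; a real tokenizer would count differently.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

class EmbeddingCache:
    """In-memory cache keyed by a hash of the text (stand-in for a vector store)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical callable: str -> vector
        self.store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1  # only unseen text triggers an embedding call
            self.store[key] = self.embed_fn(text)
        return self.store[key]

chunks = chunk_words("one two three four five six seven", chunk_size=3, overlap=1)
cache = EmbeddingCache(lambda text: [float(len(text))])
vectors = [cache.get(chunk) for chunk in chunks + chunks]
```

Here each chunk is requested twice but embedded only once, so `cache.misses` ends at 3. Keying on a content hash means unchanged chunks are not re-embedded when a document is re-indexed.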