
RAG Systems

Build Retrieval-Augmented Generation (RAG) pipelines using IndoxHub embeddings and chat APIs.

Architecture

  1. Index — Embed documents with the embeddings API
  2. Retrieve — Find relevant documents using vector similarity
  3. Generate — Pass context + query to chat completions
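
The three stages can be sketched end to end without any network calls. In the sketch below, `toy_embed` is a hypothetical stand-in for the embeddings API (it just counts words), and the generate stage stops at assembling the chat messages; none of these helper names come from IndoxHub.

```python
import math
from collections import Counter

DOCS = [
    "BYOK allows using your own provider API keys.",
    "Video generation uses async job processing.",
]

def toy_embed(text):
    """Hypothetical stand-in for the embeddings API: sparse word counts."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# 1. Index: embed every document once, up front.
INDEX = [(doc, toy_embed(doc)) for doc in DOCS]

# 2. Retrieve: rank the indexed documents by similarity to the query.
def retrieve(query, top_k=1):
    q = toy_embed(query)
    ranked = sorted(INDEX, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# 3. Generate: pass the retrieved context plus the query to a chat model.
def build_messages(query):
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return [
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query},
    ]
```

The steps that follow replace `toy_embed` with real embedding requests and send the messages to the chat completions endpoint.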

Step 1: Embed Documents

import requests
import numpy as np

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

documents = [
    "IndoxHub supports 200+ AI models from 14 providers.",
    "BYOK allows using your own provider API keys.",
    "Embeddings are cached for faster repeated lookups.",
    "Video generation uses async job processing.",
]

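response = requests.post(
    f"{BASE_URL}/embeddings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "openai/text-embedding-3-small",
        "text": documents,  # the whole list is embedded in one request
    },
    timeout=30,
)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
# Step 2 assumes embeddings[i] is the vector for documents[i]
embeddings = response.json()["data"]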
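
Because the document vectors are reused for every query, it can help to persist them rather than re-embed on each run. The following is a minimal sketch of a file-backed cache, assuming a hypothetical `embed_fn` callable that takes a list of strings and returns one vector (a list of floats) per string; the function name and the cache file path are illustrative, not part of the IndoxHub API.

```python
import hashlib
import json
from pathlib import Path

def cached_embed(texts, embed_fn, cache_path="embeddings_cache.json"):
    """Return one vector per text, calling embed_fn only for uncached texts.

    embed_fn is a hypothetical callable: list[str] -> list[list[float]].
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    # Key each text by a hash of its content so the cache survives reordering.
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]

    # Collect texts that are not cached yet, dropping duplicates.
    missing = {k: t for k, t in zip(keys, texts) if k not in cache}
    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vec in zip(missing.keys(), vectors):
            cache[key] = list(vec)
        path.write_text(json.dumps(cache))

    return [cache[k] for k in keys]
```

In practice `embed_fn` would wrap the embeddings request from this step; a JSON file is only suitable for small collections.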

Step 2: Retrieve Relevant Context

def search(query, top_k=3):
    """Return the top_k documents most similar to the query."""
    # Embed the query
    resp = requests.post(
        f"{BASE_URL}/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "openai/text-embedding-3-small", "text": query},
        timeout=30,
    )
    resp.raise_for_status()
    query_vec = np.array(resp.json()["data"][0])

    # Compute the cosine similarity between the query and each document
    scores = []
    for i, emb in enumerate(embeddings):
        doc_vec = np.array(emb)
        score = np.dot(query_vec, doc_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
        )
        scores.append((score, i))

    # Highest similarity first
    scores.sort(reverse=True)
    return [documents[i] for _, i in scores[:top_k]]
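
For larger collections, the per-document loop can be replaced with a single matrix operation. The sketch below computes the same cosine similarity with NumPy, assuming each embedding is a plain list of floats; `top_k_cosine` is a hypothetical helper name, and the zero-norm guard is an addition that the loop above does not have.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    docs = np.asarray(doc_vecs, dtype=float)    # shape: (n_docs, dim)
    query = np.asarray(query_vec, dtype=float)  # shape: (dim,)

    # Product of norms for every (query, document) pair.
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)

    # Dot products divided by the norms; zero vectors get a score of 0
    # instead of triggering a division by zero.
    scores = np.divide(
        docs @ query, denom, out=np.zeros(len(docs)), where=denom > 0
    )

    # Sort by descending score; a stable sort keeps input order on ties.
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

The returned indices can be mapped back to text with `documents[i]`, as `search` does.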

Step 3: Generate Answer

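def rag_answer(question):
    """Retrieve context for the question and ask the chat model to answer from it."""
    context_docs = search(question, top_k=2)
    context = "\n".join(f"- {doc}" for doc in context_docs)

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "openai/gpt-4o-mini",
            "messages": [
                {
                    "role": "system",
                    "content": f"Answer based on this context:\n{context}"
                },
                {"role": "user", "content": question}
            ],
            "temperature": 0.3  # low temperature keeps answers close to the context
        },
        timeout=60,
    )
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    return response.json()["data"]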

print(rag_answer("How does BYOK work?"))
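
The prompt assembly inside `rag_answer` can also be pulled out into a pure function, which makes it testable without network calls. This is a sketch only: the name `build_rag_messages` and the wording of the empty-context fallback are assumptions, not part of the IndoxHub API.

```python
def build_rag_messages(question, context_docs):
    """Build the chat messages for a RAG request from retrieved documents."""
    if context_docs:
        context = "\n".join(f"- {doc}" for doc in context_docs)
        system = f"Answer based on this context:\n{context}"
    else:
        # Hypothetical fallback: nothing was retrieved, so ask the model
        # not to guess.
        system = "No context was retrieved. Say that you do not know."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_rag_messages(
    "How does BYOK work?",
    ["BYOK allows using your own provider API keys."],
)
```

The resulting list has the same shape as the `messages` field sent in Step 3.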

Tips

  • Batch embed — Send multiple texts in one request for efficiency
  • Cache embeddings — Store vectors in a database (Pinecone, Qdrant, pgvector)
  • Low temperature — Use 0.1–0.3 for factual RAG answers
  • Chunk documents — Split long docs into ~500 token chunks before embedding
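
The chunking tip can be sketched as follows. This version approximates tokens by whitespace-separated words, which is an assumption: real tokenizers count differently, so treat the limit as a rough budget. The optional `overlap` parameter is also an addition for illustration and defaults to no overlap.

```python
def chunk_text(text, max_tokens=500, overlap=0):
    """Split text into chunks of at most max_tokens whitespace-separated words.

    overlap is the number of words repeated at the start of the next chunk.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once this chunk has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```

Each chunk can then be embedded as its own entry in the `documents` list from Step 1.
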
Documentation last built on May 23, 2026