
Voice & Audio

IndoxHub exposes Resemble AI's full voice stack under /api/v1/resemble/*. Auth, billing, rate limiting, and persistence are handled for you: you call IndoxHub, IndoxHub forwards the request to Resemble, and usage is metered against your account per second of audio, per image, or per search, depending on the endpoint.

For the complete endpoint reference, see Resemble AI API reference.

What you can build

Use case | Primary endpoint | Pricing basis
Read-aloud / IVR / narration | POST /resemble/tts/synthesize | per second of generated audio
Voicemail / call transcription | POST /resemble/stt (async) | per second of input audio
Podcast / recording cleanup | POST /resemble/enhance | per second of input audio
Audio editing (crop, splice, normalize) | POST /resemble/edit | per second of input audio
Deepfake detection for uploads | POST /resemble/detect (audio/video/image) | per second or per image
Audio/video intelligence | POST /resemble/intelligence | per second or per image
Provenance / anti-spoof watermarking | POST /resemble/watermark/apply + /detect | per second
Voice identity lookup | POST /resemble/identity/search | per search
Voice cloning (Business plan) | POST /resemble/voices/build | per-voice monthly subscription

Text-to-Speech

The fastest path to working audio. First list voices, then synthesize.

import base64
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

# 1. Pick a voice
voices = requests.get(
    f"{BASE_URL}/resemble/tts/voices?page=1&page_size=10",
    headers={"Authorization": f"Bearer {API_KEY}"},
).json()
voice_uuid = voices["items"][0]["uuid"]

# 2. Synthesize
resp = requests.post(
    f"{BASE_URL}/resemble/tts/synthesize",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "voice_uuid": voice_uuid,
        "text": "Hello from IndoxHub.",
        "output_format": "mp3",
    },
).json()

# 3. Save the audio
with open("hello.mp3", "wb") as f:
    f.write(base64.b64decode(resp["audio_content"]))

print(f"billed: {resp['billing']['charged']} USD for {resp['billing']['quantity']}s")

Billing: $0.0005 / audio_second + your configured markup. The billing block on the response is authoritative — log it.
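As a rough worked example of the rate above, the sketch below estimates the charge for a clip before you synthesize it. The `markup_pct` parameter, and the idea that markup is a percentage added on top of the provider cost, are assumptions made for illustration; the billing block on the response is still the number to trust.

```python
from decimal import Decimal

# Provider rate quoted above: USD per second of generated audio.
TTS_RATE_PER_SECOND = Decimal("0.0005")


def estimate_tts_cost(audio_seconds: float, markup_pct: float = 0.0) -> Decimal:
    """Estimate the TTS charge in USD.

    Assumption: markup is a percentage added on top of the provider cost.
    """
    provider_cost = TTS_RATE_PER_SECOND * Decimal(str(audio_seconds))
    multiplier = Decimal("1") + Decimal(str(markup_pct)) / Decimal("100")
    return provider_cost * multiplier


# A 90-second narration with a hypothetical 20% markup.
print(estimate_tts_cost(90, markup_pct=20))
```

Decimal is used instead of float so that small per-second rates do not pick up rounding noise when summed over many requests.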


Speech-to-Text (async)

STT is async. Submit a job, then poll or register a webhook.

import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

# 1. Submit the job
job = requests.post(
    f"{BASE_URL}/resemble/stt",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "audio_url": "https://example.com/recording.mp3",
        "lang": "en",
    },
).json()
job_id = job["job_id"]

# 2. Poll until complete (or subscribe to webhooks — see below)
while True:
    status = requests.get(
        f"{BASE_URL}/resemble/stt/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

print(status.get("transcript"))

Prefer webhooks over polling in production — see Webhooks below.
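If you do poll, it helps to bound the loop so a stuck job cannot hang your worker forever. The helper below is a minimal sketch, not part of any IndoxHub SDK: `wait_for_job`, `fetch_status`, `timeout_s`, and `interval_s` are hypothetical names, and the terminal states are the two used in the polling loop above.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job is terminal or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(
                f"job still {status.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

In practice `fetch_status` would wrap the GET request from the example above, for instance `lambda: requests.get(url, headers=headers).json()`. Passing it in as a callable keeps the helper independent of the HTTP client and easy to test.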


Audio enhancement

Clean up recorded audio (denoise, de-reverb, level). Also async.

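# Reuses requests, API_KEY, and BASE_URL from the examples above.
job = requests.post(
    f"{BASE_URL}/resemble/enhance",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"audio_url": "https://example.com/raw-podcast.wav"},
).json()
job_id = job["job_id"]  # async endpoints return a job_id, as with STT

# Poll /resemble/enhance/{job_id} or wait for the webhook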

The same pattern applies to /resemble/edit, /resemble/detect, /resemble/intelligence, /resemble/watermark/apply, and /resemble/watermark/detect: submit the job, then fetch the result by job ID. Each accepts an optional callback_uri.


Webhook-driven completion

Polling works for prototypes but wastes request budget in production. Register a webhook URL in the Resemble dashboard and point it at IndoxHub's public handler:

POST https://api.indoxhub.com/api/v1/resemble/webhooks/resemble

IndoxHub verifies the HMAC-SHA256 signature (X-Resemble-Signature), updates the job row, meters billing, and emits an internal completion event. See Resemble Webhooks for the full contract.
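To make the signature step concrete, here is roughly what an HMAC-SHA256 check looks like. IndoxHub performs this for you on its handler; the sketch is only useful if you want to understand the check or apply the same idea to your own endpoint. The exact scheme (which bytes are signed, hex versus base64 encoding, where the secret comes from) is defined by the Resemble webhook contract and is not shown here, so the hex digest over the raw body below is an assumption, and `signature_is_valid` is a hypothetical name.

```python
import hashlib
import hmac


def signature_is_valid(raw_body: bytes, signature_header: str, secret: bytes) -> bool:
    """Check an HMAC-SHA256 signature over the raw request body.

    Assumptions: the header carries a hex digest of the unmodified body,
    and the secret is shared out of band.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Always verify against the raw bytes as received. Re-serializing parsed JSON can change whitespace or key order and invalidate an otherwise correct signature.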

If you want your own app to get notified instead of polling IndoxHub, pass callback_uri when you submit the job and point it at your server — IndoxHub will fan out the completion to you after it finishes its own bookkeeping.


Deepfake detection

Check an uploaded audio, video, or image for deepfake artifacts before letting it into your pipeline.

resp = requests.post(
    f"{BASE_URL}/resemble/detect",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "modality": "audio",
        "url": "https://example.com/suspicious.mp3",
    },
).json()
# Poll /resemble/detect/{job_id}; final payload carries detection score + verdict.

Use cases: user-generated content moderation, fraud prevention in voice auth, news-media verification.
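Once the job finishes, you still need a policy for acting on the result. The function below is a sketch of one such gate. The payload field names (`status`, `score`, `verdict`), the `"fake"` verdict value, and the convention that a higher score means more likely synthetic are all assumptions for illustration; check the actual response shape in the endpoint reference before relying on them.

```python
def should_block(result: dict, score_threshold: float = 0.5) -> bool:
    """Decide whether to keep an upload out of the pipeline.

    Hypothetical payload shape: {"status": str, "score": float, "verdict": str}.
    """
    if result.get("status") != "completed":
        # Fail closed: an unfinished or failed check is not a clean result.
        return True
    score = result.get("score")
    if score is None:
        return True
    verdict = str(result.get("verdict", "")).lower()
    return verdict == "fake" or float(score) >= score_threshold
```

Failing closed suits moderation and fraud-prevention flows, where letting an unchecked file through is worse than asking the user to retry.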


Voice cloning (Business plan)

Voice cloning and voice design require a Resemble Business-plan account. Until RESEMBLE_BUSINESS_PLAN_ACTIVE=true is set, these routes short-circuit with a 503 so you don't burn quota on calls that would fail upstream.

# 1. Create the voice shell
voice = requests.post(
    f"{BASE_URL}/resemble/voices",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"name": "My Custom Voice", "consent_text": "I consent…"},
).json()

# 2. Upload training recordings (one or more)
requests.post(
    f"{BASE_URL}/resemble/voices/{voice['uuid']}/recordings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"fields": {"name": "sample-1", "audio_url": "https://…/sample.wav"}},
)

# 3. Kick off training — bills a voice_subscriptions unit
requests.post(
    f"{BASE_URL}/resemble/voices/{voice['uuid']}/build",
    headers={"Authorization": f"Bearer {API_KEY}"},
)

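Once the voice's status is "built", use its UUID in POST /resemble/tts/synthesize like any other voice. The voice dict returned in step 1 is a snapshot from creation time, so re-fetch the voice to see the updated status.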


Building voice chatbots

Pair TTS with chat completions for a two-way voice assistant:

import base64
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

def speak(text: str, voice_uuid: str, out_path: str) -> None:
    resp = requests.post(
        f"{BASE_URL}/resemble/tts/synthesize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"voice_uuid": voice_uuid, "text": text, "output_format": "mp3"},
    ).json()
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(resp["audio_content"]))

def chat(messages: list) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "openai/gpt-4o-mini", "messages": messages},
    ).json()
    return resp["data"]

voice_uuid = "9fd7430d"  # from GET /resemble/tts/voices
history = [{"role": "system", "content": "You are a concise voice assistant."}]

while True:
    user = input("You: ")
    if user.lower() in ("quit", "exit"):
        break
    history.append({"role": "user", "content": user})
    reply = chat(history)
    history.append({"role": "assistant", "content": reply})
    speak(reply, voice_uuid, "reply.mp3")
    print(f"Bot: {reply}  (audio → reply.mp3)")

Same pattern with STT on the input side gives you a full voice-to-voice loop.


Billing & observability

Every metered Resemble call writes a row to provider_resemble_usage with the unit, quantity, provider cost, and your charged amount. Pull aggregates via /analytics or query the table directly. Nothing is hidden — the billing block on each response matches the DB row exactly.

Resemble's own usage is reconciled nightly against GET /account/billing_usage with a ±1 % drift alert. See decisions.md for the full policy (markup, storage, BYOK, reconciliation).
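The drift check amounts to simple arithmetic. The sketch below shows one way to compute it, assuming drift is measured relative to the provider's reported total and that both totals are in the same unit (for example, seconds). `drift_pct` and `needs_alert` are hypothetical names, not IndoxHub internals.

```python
def drift_pct(local_total: float, provider_total: float) -> float:
    """Relative gap between local metering and the provider's figure, in percent."""
    if provider_total == 0:
        return 0.0 if local_total == 0 else float("inf")
    return abs(local_total - provider_total) / provider_total * 100.0


def needs_alert(local_total: float, provider_total: float, tolerance_pct: float = 1.0) -> bool:
    """True when the gap exceeds the tolerance (1% by default, as stated above)."""
    return drift_pct(local_total, provider_total) > tolerance_pct


# 10,150 locally metered seconds against 10,000 reported by the provider.
print(needs_alert(10_150, 10_000))
```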


Limits & gotchas

  • Async jobs: STT, audio enhance/edit, detection, intelligence, watermarking, identity enrollment all return a job_id and complete out-of-band. Don't block an HTTP request on them.
  • R2 mirror + per-asset retention: IndoxHub mirrors every audio asset to Cloudflare R2 and returns a presigned URL in the audio_url response field along with expires_at. The original Resemble URL stays in resemble_url as fallback. Retention varies by asset class:
    • TTS / audio enhance / edit / watermark output → 7 days
    • STT input / generic uploads → 30 days
    • Voice design candidates → 14 days
    • Voice-clone source recordings & built voice models → PERMANENT (identity assets)
  • Marking voice-clone uploads as permanent: when uploading via POST /resemble/uploads, pass purpose=voice_clone in the multipart form to land the file under voice-recordings/ (no expiry). Other purposes: stt_input, watermark_input, audio_job_input. Default: 30-day generic uploads/ prefix.
  • Rate limits: IndoxHub enforces per-user sliding-window caps per capability (Redis), so a single user can't starve the shared Resemble key. If you get a 429 response with a Retry-After header, wait at least that long before retrying.
  • Business-plan gating: voice cloning and voice design return 503 until RESEMBLE_BUSINESS_PLAN_ACTIVE=true is set; plan accordingly.
  • No WebSocket streaming TTS on Flex plan. HTTP-stream TTS is the lowest-latency option available today.
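For the rate-limit case in the list above, a client-side retry delay might be computed as follows. This is a sketch under assumptions: it handles only a numeric Retry-After value (the header can also be an HTTP date), and the exponential fallback with jitter is a common convention rather than something IndoxHub prescribes. `retry_delay` and `cap_s` are hypothetical names.

```python
import random
from typing import Mapping


def retry_delay(headers: Mapping[str, str], attempt: int, cap_s: float = 60.0) -> float:
    """Seconds to wait before retrying a 429 response.

    Honors a numeric Retry-After header when present; otherwise falls back
    to capped exponential backoff with jitter.
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return min(max(float(value), 0.0), cap_s)
        except ValueError:
            pass  # Non-numeric (HTTP date) values are not handled in this sketch.
    backoff = min(cap_s, 2.0 ** attempt)
    # Jitter spreads retries out so many clients do not retry in lockstep.
    return random.uniform(0.5 * backoff, backoff)
```

With the requests library, `response.headers` is case-insensitive, so it can be passed in directly. A plain dict needs the exact `Retry-After` key.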


Documentation last built on May 23, 2026