Voice & Audio
IndoxHub exposes Resemble AI's full voice stack under /api/v1/resemble/*. Auth, billing, rate limiting, and persistence are handled transparently: you call IndoxHub, IndoxHub forwards the request to Resemble, and usage is metered against your account per second of audio, per image, or per search, depending on the endpoint.
For the complete endpoint reference, see Resemble AI API reference.
What you can build
| Use case | Primary endpoint | Pricing basis |
|---|---|---|
| Read-aloud / IVR / narration | POST /resemble/tts/synthesize | per second of generated audio |
| Voicemail / call transcription | POST /resemble/stt (async) | per second of input audio |
| Podcast / recording cleanup | POST /resemble/enhance | per second of input audio |
| Audio editing (crop, splice, normalize) | POST /resemble/edit | per second of input audio |
| Deepfake detection for uploads | POST /resemble/detect (audio/video/image) | per second or per image |
| Audio/video intelligence | POST /resemble/intelligence | per second or per image |
| Provenance / anti-spoof watermarking | POST /resemble/watermark/apply + /detect | per second |
| Voice identity lookup | POST /resemble/identity/search | per search |
| Voice cloning (Business plan) | POST /resemble/voices → /build | per-voice monthly subscription |
Text-to-Speech
The fastest path to working audio. First list voices, then synthesize.
import base64
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

# 1. Pick a voice
voices = requests.get(
    f"{BASE_URL}/resemble/tts/voices?page=1&page_size=10",
    headers={"Authorization": f"Bearer {API_KEY}"},
).json()
voice_uuid = voices["items"][0]["uuid"]

# 2. Synthesize
resp = requests.post(
    f"{BASE_URL}/resemble/tts/synthesize",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "voice_uuid": voice_uuid,
        "text": "Hello from IndoxHub.",
        "output_format": "mp3",
    },
).json()

# 3. Save the audio (the response carries it base64-encoded)
with open("hello.mp3", "wb") as f:
    f.write(base64.b64decode(resp["audio_content"]))

print(f"billed: {resp['billing']['charged']} USD for {resp['billing']['quantity']}s")
Billing: $0.0005 / audio_second + your configured markup. The billing block on the response is authoritative — log it.
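As a rough worked example of the rate above, the sketch below estimates the charge for a clip before you synthesize it. The base rate is the one quoted on this page; treating the markup as a percentage applied on top of the base cost is an assumption, and `estimate_tts_cost` and `markup_pct` are hypothetical names. The billing block on the response remains the real figure.

```python
from decimal import Decimal

BASE_RATE_USD_PER_SECOND = Decimal("0.0005")  # TTS rate quoted on this page


def estimate_tts_cost(audio_seconds: float, markup_pct: float = 0.0) -> Decimal:
    """Estimate the charge for `audio_seconds` of generated audio.

    Assumption: the markup is a percentage added on top of the base cost.
    """
    base = BASE_RATE_USD_PER_SECOND * Decimal(str(audio_seconds))
    return base * (Decimal(1) + Decimal(str(markup_pct)) / Decimal(100))


# 60 s with no markup:  60 * 0.0005       = 0.03 USD
# 60 s with 20% markup: 60 * 0.0005 * 1.2 = 0.036 USD
one_minute = estimate_tts_cost(60)
one_minute_marked_up = estimate_tts_cost(60, markup_pct=20)
```

Decimal is used instead of float so that small per-second amounts do not pick up binary rounding noise when summed.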
Speech-to-Text (async)
STT is async. Submit a job, then poll or register a webhook.
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"

# 1. Submit the job
job = requests.post(
    f"{BASE_URL}/resemble/stt",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "audio_url": "https://example.com/recording.mp3",
        "lang": "en",
    },
).json()
job_id = job["job_id"]

# 2. Poll until the job reaches a terminal state (or subscribe to webhooks; see below)
while True:
    status = requests.get(
        f"{BASE_URL}/resemble/stt/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

print(status.get("transcript"))
Prefer webhooks over polling in production — see Webhooks below.
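Where polling is unavoidable, the loop above can be bounded so a stuck job does not spin forever. The following is a minimal sketch, not part of the IndoxHub API: `wait_for_job` is a hypothetical helper, the timeout and delay defaults are arbitrary, and the only thing taken from this page is the pair of terminal states (`completed`, `failed`).

```python
import time
from typing import Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict],
    timeout_s: float = 300.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call `fetch_status` until the job is terminal or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        # Give up if the next sleep would take us past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

With the STT job above, `fetch_status` would be a small lambda wrapping the `requests.get(...)` call on `/resemble/stt/{job_id}`. The `sleep` parameter exists only so the helper can be exercised in tests without waiting.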
Audio enhancement
Clean up recorded audio (denoise, de-reverb, level). Also async.
# Reuses API_KEY and BASE_URL from the examples above
job = requests.post(
    f"{BASE_URL}/resemble/enhance",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"audio_url": "https://example.com/raw-podcast.wav"},
).json()
# Poll /resemble/enhance/{job_id} or wait for webhook
The same pattern applies to /resemble/edit, /resemble/detect, /resemble/intelligence, /resemble/watermark/apply, /resemble/watermark/detect — submit, then get by job id, with an optional callback_uri.
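Because these endpoints share one submit-then-fetch shape, the request construction can be factored out. The sketch below is illustrative only: `build_job_request` and `job_status_url` are hypothetical helpers, sending `callback_uri` as a top-level JSON field is an assumption, and so is the status path for the watermark routes (this page only shows the `/{job_id}` form for stt, enhance, and detect).

```python
from typing import Any, Dict, Optional, Tuple

BASE_URL = "https://api.indoxhub.com/api/v1"

# Async endpoints named on this page.
ASYNC_ENDPOINTS = frozenset({
    "stt", "enhance", "edit", "detect", "intelligence",
    "watermark/apply", "watermark/detect",
})


def build_job_request(
    endpoint: str,
    payload: Dict[str, Any],
    callback_uri: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return the (url, json_body) pair for submitting an async job."""
    if endpoint not in ASYNC_ENDPOINTS:
        raise ValueError(f"unknown async endpoint: {endpoint}")
    body = dict(payload)  # copy so the caller's dict is not mutated
    if callback_uri is not None:
        # Assumption: callback_uri travels as a top-level JSON field.
        body["callback_uri"] = callback_uri
    return f"{BASE_URL}/resemble/{endpoint}", body


def job_status_url(endpoint: str, job_id: str) -> str:
    # Assumption: the status route is the submit route plus the job id.
    return f"{BASE_URL}/resemble/{endpoint}/{job_id}"
```

The returned pair can be passed straight to `requests.post(url, json=body, headers=...)`.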
Webhook-driven completion
Polling works for prototypes but wastes request budget in production. Register a webhook URL in the Resemble dashboard and point it at IndoxHub's public handler.
IndoxHub verifies the HMAC‑SHA256 signature (X-Resemble-Signature), updates the job row, meters billing, and emits an internal completion event. See Resemble Webhooks for the full contract.
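For readers unfamiliar with the check, an HMAC-SHA256 signature verification generally looks like the sketch below. This is not IndoxHub's implementation: it assumes the header carries a hex-encoded digest computed over the raw request body, and `signature_is_valid` and its parameters are hypothetical names.

```python
import hashlib
import hmac


def signature_is_valid(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Recompute the HMAC-SHA256 of the body and compare it to the header.

    Assumption: the header value is the lowercase or uppercase hex digest.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()
    # compare_digest runs in constant time to avoid leaking how many
    # leading characters matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The digest must be computed over the exact bytes received; re-serializing parsed JSON can change whitespace or key order and break the comparison.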
If you want your own app to get notified instead of polling IndoxHub, pass callback_uri when you submit the job and point it at your server — IndoxHub will fan out the completion to you after it finishes its own bookkeeping.
Deepfake detection
Check an uploaded audio, video, or image for deepfake artifacts before letting it into your pipeline.
resp = requests.post(
    f"{BASE_URL}/resemble/detect",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "modality": "audio",
        "url": "https://example.com/suspicious.mp3",
    },
).json()
# Poll /resemble/detect/{job_id}; final payload carries detection score + verdict.
Use cases: user-generated content moderation, fraud prevention in voice auth, news-media verification.
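One way to gate uploads on the result is a small fail-closed check, sketched below. The field names (`status`, `score`), the assumption that a higher score means "more likely fake", and the 0.5 threshold are all illustrative guesses rather than the documented payload; check the endpoint reference for the real shape before relying on it.

```python
from typing import Any, Dict


def should_accept(result: Dict[str, Any], max_score: float = 0.5) -> bool:
    """Return True only when detection finished and the score is below the cap.

    Assumptions: `result["status"]` and `result["score"]` exist, and a
    higher score means the media is more likely synthetic.
    """
    if result.get("status") != "completed":
        return False  # fail closed: unfinished or failed jobs are rejected
    score = result.get("score")
    if score is None:
        return False  # fail closed: no score, no admission
    return float(score) < max_score
```

Failing closed means a timeout or upstream error blocks the upload instead of silently letting it through, which is usually the safer default for moderation and fraud checks.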
Voice cloning (Business plan)
Voice cloning and voice design require a Resemble Business-plan account. Until RESEMBLE_BUSINESS_PLAN_ACTIVE=true is set, these routes short-circuit with a 503 so you don't burn quota on calls that would fail upstream.
# 1. Create the voice shell
voice = requests.post(
    f"{BASE_URL}/resemble/voices",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"name": "My Custom Voice", "consent_text": "I consent…"},
).json()

# 2. Upload training recordings (one or more)
requests.post(
    f"{BASE_URL}/resemble/voices/{voice['uuid']}/recordings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"fields": {"name": "sample-1", "audio_url": "https://…/sample.wav"}},
)

# 3. Kick off training (bills a voice_subscriptions unit)
requests.post(
    f"{BASE_URL}/resemble/voices/{voice['uuid']}/build",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
Once voice["status"] == "built", use its UUID in POST /resemble/tts/synthesize like any other voice.
Building voice chatbots
Pair TTS with chat completions for a two-way voice assistant:
import base64
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"


def speak(text: str, voice_uuid: str, out_path: str) -> None:
    """Synthesize `text` and write the decoded audio to `out_path`."""
    resp = requests.post(
        f"{BASE_URL}/resemble/tts/synthesize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"voice_uuid": voice_uuid, "text": text, "output_format": "mp3"},
    ).json()
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(resp["audio_content"]))


def chat(messages: list) -> str:
    """Send the conversation so far and return the assistant's reply."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "openai/gpt-4o-mini", "messages": messages},
    ).json()
    return resp["data"]


voice_uuid = "9fd7430d"  # from GET /resemble/tts/voices
history = [{"role": "system", "content": "You are a concise voice assistant."}]

while True:
    user = input("You: ")
    if user.lower() in ("quit", "exit"):
        break
    history.append({"role": "user", "content": user})
    reply = chat(history)
    history.append({"role": "assistant", "content": reply})
    speak(reply, voice_uuid, "reply.mp3")
    print(f"Bot: {reply} (audio → reply.mp3)")
Same pattern with STT on the input side gives you a full voice-to-voice loop.
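A single turn of that loop can be sketched as a function that wires the three steps together. `voice_turn` is a hypothetical helper: the `transcribe`, `chat`, and `speak` callables are injected so the sketch stays independent of any particular endpoint shape, and in practice `transcribe` would wrap the async STT submit-and-wait flow shown earlier.

```python
from typing import Callable, Dict, List


def voice_turn(
    audio_url: str,
    history: List[Dict[str, str]],
    transcribe: Callable[[str], str],
    chat: Callable[[List[Dict[str, str]]], str],
    speak: Callable[[str], None],
) -> str:
    """Run one voice-to-voice turn: STT, then chat, then TTS."""
    user_text = transcribe(audio_url)  # speech in
    history.append({"role": "user", "content": user_text})
    reply = chat(history)              # text reasoning
    history.append({"role": "assistant", "content": reply})
    speak(reply)                       # speech out
    return reply
```

Keeping `history` as a caller-owned list means the conversation state survives across turns without the helper holding any state of its own.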
Billing & observability
Every metered Resemble call writes a row to provider_resemble_usage with the unit, quantity, provider cost, and your charged amount. Pull aggregates via /analytics or query the table directly. Nothing is hidden — the billing block on each response matches the DB row exactly.
Resemble's own usage is reconciled nightly against GET /account/billing_usage with a ±1 % drift alert. See decisions.md for the full policy (markup, storage, BYOK, reconciliation).
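The ±1 % drift check amounts to a relative-difference comparison between two totals. The sketch below shows the arithmetic only; measuring drift relative to the provider-reported figure is an assumption, and `drift_ratio` and `drift_alert` are hypothetical names rather than anything in IndoxHub's codebase.

```python
def drift_ratio(local_total: float, provider_total: float) -> float:
    """Relative difference between local metering and the provider's figure."""
    if provider_total == 0:
        # No provider usage: any local usage is infinite relative drift.
        return 0.0 if local_total == 0 else float("inf")
    return abs(local_total - provider_total) / provider_total


def drift_alert(local_total: float, provider_total: float, tolerance: float = 0.01) -> bool:
    """True when the drift exceeds the tolerance (1 % by default)."""
    return drift_ratio(local_total, provider_total) > tolerance


# 100.5 s metered locally vs 100 s reported upstream: 0.5 % drift, no alert.
# 102 s metered locally vs 100 s reported upstream:   2 % drift, alert.
within = drift_alert(100.5, 100.0)
outside = drift_alert(102.0, 100.0)
```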
Limits & gotchas
- Async jobs: STT, audio enhance/edit, detection, intelligence, watermarking, identity enrollment all return a job_id and complete out-of-band. Don't block an HTTP request on them.
- R2 mirror + per-asset retention: IndoxHub mirrors every audio asset to Cloudflare R2 and returns a presigned URL in the audio_url response field along with expires_at. The original Resemble URL stays in resemble_url as fallback. Retention varies by asset class:
    - TTS / audio enhance / edit / watermark output → 7 days
    - STT input / generic uploads → 30 days
    - Voice design candidates → 14 days
    - Voice-clone source recordings & built voice models → PERMANENT (identity assets)
- Marking voice-clone uploads as permanent: when uploading via POST /resemble/uploads, pass purpose=voice_clone in the multipart form to land the file under voice-recordings/ (no expiry). Other purposes: stt_input, watermark_input, audio_job_input. Default: 30-day generic uploads/ prefix.
- Rate limits: IndoxHub enforces per-user sliding-window caps per capability (Redis). If you see 429 Retry-After, back off; the caps exist so a single user can't starve the shared Resemble key.
- Business-plan gating: voice cloning / voice design require the flag; plan accordingly.
- No WebSocket streaming TTS on Flex plan. HTTP-stream TTS is the lowest-latency option available today.
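For the rate-limit case in the list above, a client can honor the Retry-After header before retrying. HTTP allows the header to be either a number of seconds or an HTTP date; which form IndoxHub sends is not stated on this page, so the sketch below handles both. `retry_after_seconds` is a hypothetical helper and the 1-second default is arbitrary.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str], default: float = 1.0) -> float:
    """Convert a Retry-After header into a number of seconds to wait."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "5"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # Dates in the past mean "retry now", never a negative wait.
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
```

A caller would typically do `time.sleep(retry_after_seconds(resp.headers.get("Retry-After")))` after a 429 and then retry, ideally with a cap on the number of attempts.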
See also
- Resemble AI API reference — complete endpoint contract
- Chatbots — pair voice with chat completions
- Document Processing — OCR → summarize → narrate pipeline