Sam Koelle
May 9, 2023

Using the Uberduck API

In this post, you'll learn to use the Uberduck API to generate speech and song and convert between voices.

We'll cover text-to-speech, voice-to-voice conversion, and rap generation. You can access the API docs here or interactively explore here and here.

Covered in this guide

  1. Getting started with accessing the API
  2. Viewing the list of available voices for text-to-speech and voice-to-voice
  3. Generating speech from text
  4. Converting speech from one voice to another
  5. Generating rap

The examples in this guide are given in Python, but you can access the API using a language of your choosing.

The endpoints accessible via our API

screenshot of api docs

🦆 Let's get quacking!

1. Getting started

Creating an Uberduck account

First, sign up for an Uberduck account: go to the signup page and create one.

Creating an API key

After you've created an account, go to our account management page.

screenshot of account management page

Create an API key and store the secret key somewhere safe. You can create as many API keys as you'd like, but you can only view a secret key at the moment you create it.

uberduck_auth = ("YOUR-API-KEY", "YOUR-API-SECRET")

You'll now be able to access the API. For unlimited access and access to commercial use voices, purchase a plan at our pricing page.
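Since the secret can only be viewed once, you may prefer not to paste it into scripts or notebooks. Here is a minimal sketch that reads the credentials from environment variables instead. The variable names UBERDUCK_API_KEY and UBERDUCK_API_SECRET and the helper load_uberduck_auth are hypothetical, chosen for this example; the API does not require them.

```python
import os


def load_uberduck_auth(env=None):
    """Return a (key, secret) tuple read from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    try:
        return env["UBERDUCK_API_KEY"], env["UBERDUCK_API_SECRET"]
    except KeyError as missing:
        # Report which variable is absent without echoing any secret value.
        raise RuntimeError(
            f"missing environment variable: {missing.args[0]}"
        ) from None
```

With both variables set, uberduck_auth = load_uberduck_auth() gives the same kind of tuple as the line above.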

Checking the API is functional

import requests

2. Viewing available voices

You may then view the list of available voices for text-to-speech and voice-to-voice by making an HTTP GET request to the /voices endpoint.

screenshot of get voices documentation


requests.get("", params=dict(mode="tts-basic")).json()

print(requests.get("", params=dict(mode="v2v")).json())
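Once you have the list, picking a voice is a matter of filtering it. The sketch below searches by name. It assumes the decoded response is a list of dictionaries with display_name and voicemodel_uuid fields; those key names are assumptions, so inspect the actual response and adjust them if they differ.

```python
def find_voices(voices, query, name_key="display_name", uuid_key="voicemodel_uuid"):
    """Return (name, uuid) pairs whose name contains `query`, ignoring case.

    `name_key` and `uuid_key` are assumed field names, not confirmed ones.
    """
    query = query.lower()
    return [
        (voice.get(name_key), voice.get(uuid_key))
        for voice in voices
        if query in str(voice.get(name_key, "")).lower()
    ]


# Made-up entries standing in for the decoded JSON response.
sample_voices = [
    {"display_name": "LJ Speech", "voicemodel_uuid": "uuid-1"},
    {"display_name": "Example Rapper", "voicemodel_uuid": "uuid-2"},
]
matches = find_voices(sample_voices, "lj")
```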

3. Generating speech from text

You can then select a voice and generate speech from text by making a POST request to the /speak endpoint. This request is asynchronous; for synchronous support, use /speak-synchronous.

screenshot of speak documentation

Choosing a voice and text

The best way to specify a voice is by using a voicemodel_uuid returned from /voices. Let's use a Tacotron2 text-to-speech model trained on LJ Speech. For historical reasons not covered here, this voice will be reading a quote from General Sherman.

voicemodel_uuid = "778e27be-877f-4b61-aefc-4eb2ff88ec11"
text = "War is cruelty, and you cannot refine it; and those who brought war into our country deserve all the curses and maledictions a people can pour out. I know I had no hand in making this war, and I know I will make more sacrifices to-day than any of you to secure peace."

Making a speak request

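# POST to the /speak endpoint; the response is assumed to carry the request's uuid.
audio_uuid = requests.post(
    "", json=dict(speech=text, voicemodel_uuid=voicemodel_uuid), auth=uberduck_auth
).json()["uuid"]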

Checking the status of your speak request

Since the /speak endpoint is asynchronous, we need to check its status.

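screenshot of speak status docs

from time import sleep

for t in range(10):
    sleep(1)  # check status every second for 10 seconds.
    # Assumes the status endpoint takes the uuid from /speak as a query parameter.
    output = requests.get("", params=dict(uuid=audio_uuid), auth=uberduck_auth).json()
    if output.get("path"):  # stop as soon as the audio path is available
        audio_url = output["path"]
        break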
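If you poll in more than one place, it can help to wrap the loop in a function. The following is a sketch, not part of the API: wait_for_audio_url is a hypothetical helper that takes any callable returning the decoded status response, and it assumes the response only has a non-empty "path" once the audio is ready.

```python
import time


def wait_for_audio_url(fetch_status, attempts=10, delay=1.0, sleep=time.sleep):
    """Call `fetch_status()` until its result has a non-empty "path".

    Returns the path, or None if it never shows up within `attempts` polls.
    `fetch_status` is any zero-argument callable returning a dict.
    """
    for _ in range(attempts):
        status = fetch_status()
        path = status.get("path")
        if path:
            return path
        sleep(delay)  # wait before asking again
    return None
```

In the tutorial's setting, fetch_status would be a small function that performs the status request shown above and returns its decoded JSON. Passing the fetcher in keeps the loop easy to test without touching the network.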

Checking the output of your speak request

If your application can work with a URL directly, the URL of the audio output may be all you need.

from IPython.display import Audio
Audio(url=audio_url)  # renders an inline player in a Jupyter notebook


Downloading the output of your speak request

For other applications, it may be preferable to download the output.

import tempfile

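# Assumes the status endpoint takes the uuid from /speak as a query parameter.
output = requests.get("", params=dict(uuid=audio_uuid), auth=uberduck_auth).json()
r = requests.get(output["path"], allow_redirects=True)
tf = tempfile.NamedTemporaryFile(suffix=".wav")
with open(tf.name, "wb") as f:
    f.write(r.content)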
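One detail worth knowing: a NamedTemporaryFile is deleted as soon as it is closed unless you ask otherwise, and on some platforms it cannot be reopened by name while it is still open. Here is a sketch of a variant that keeps the file around. The helper name save_audio_bytes is our own for this example, and the bytes in the demo are placeholders rather than real audio.

```python
import os
import tempfile


def save_audio_bytes(data, suffix=".wav"):
    """Write raw audio bytes to a new temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so it can be reopened by name later (for example, to upload it).
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return tmp.name


# Demo with placeholder bytes standing in for a downloaded response body.
demo_path = save_audio_bytes(b"not real audio")
demo_size = os.path.getsize(demo_path)
os.remove(demo_path)  # with delete=False, cleanup is up to the caller
```

In the download step above, the data argument would be the body of the response (r.content).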

4. Converting speech from voice to voice

In addition to text-to-speech models, Uberduck supports user-contributed voice-to-voice models. The requisite steps are to specify a voice model for conversion, upload the audio to be converted, and convert the audio. We can get a list of available models using the method in Section 2 and so won't cover that again here.

Uploading your audio

You will need to upload audio to our servers to convert. This tutorial uses the output we downloaded earlier, but you may use any audio file shorter than 5 minutes.
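If you are uploading your own files, you can check the length locally before sending anything. This sketch uses Python's standard wave module, so it only handles PCM WAV files; other formats would need a different library. The helper names are hypothetical, and the 5-minute limit is the one stated above.

```python
import os
import tempfile
import wave

MAX_SECONDS = 5 * 60  # the 5-minute limit mentioned above


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path, limit=MAX_SECONDS):
    """True if the file is strictly shorter than `limit` seconds."""
    return wav_duration_seconds(path) < limit


# Demo: write one second of 16-bit mono silence at 8 kHz, then measure it.
demo_wav = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_wav, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)
```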

screenshot of reference audio docs

file_path = tf.name  # the temporary file we wrote to in the previous code block.
output = requests.post(
    "",
    data=dict(is_private=True, save=True),
    files={"reference_audio": open(file_path, "rb")},
    auth=uberduck_auth,
).json()

Converting audio

Once your audio is uploaded, you can convert it from voice to voice. Let's convert the audio to Grimes' voice.

screenshot of convert docs

grimes_rvc_uuid = "d4b039b4-5bd2-46ba-8558-568f0029ebad"
reference_audio_uuid = output["uuid"]
convert_output = requests.post(
    "",
    json=dict(reference_audio_uuid=reference_audio_uuid, voicemodel_uuid=grimes_rvc_uuid),
    auth=uberduck_auth,
)

The output audio keeps the prosody and pitch of the original speaker (although the pitch can also be changed with a parameter).

5. Generating Rap

The rap generator ensures that each line is distributed over its own measure at the specified beats per minute (bpm). You may supply your own lyrics, or generate them with the Uberduck lyrics generator. You may also specify an Uberduck-provided backing track, or access the a cappella directly and mix it with a backing track yourself.
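Because each line takes one measure, you can estimate how long a verse will run before generating it. The arithmetic below is a sketch that assumes four beats per measure; the time signature is our assumption, since it is not stated here.

```python
def measure_seconds(bpm, beats_per_measure=4):
    """Length of one measure in seconds (4 beats per measure is an assumption)."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return beats_per_measure * 60.0 / bpm


def verse_seconds(lines, bpm, beats_per_measure=4):
    """Estimated length of a verse when each line fills exactly one measure."""
    return lines * measure_seconds(bpm, beats_per_measure)


# At 144 bpm a measure lasts 240 / 144 seconds, roughly 1.67 s per line.
per_line = measure_seconds(144)
four_lines = verse_seconds(4, 144)
```

Under that assumption, a four-line verse at 144 bpm comes to a little under seven seconds.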

User-specified lyrics

Note that the lyrics parameter is a list of lists of strings. Each verse is passed as a list of strings, so the example below specifies a rap with a single verse.

screenshot of lyrics docs

import json

lyrics = [
    [
        "rap cat, hell make you clap",
        "hes got the hottest beats and the softest fur",
        "nothing to laugh at",
        "riding with the rap cat",
    ]
]

bpm = 144
lines = len(lyrics[0])
output = requests.post(
    "", json=dict(lyrics=lyrics, lines=lines, bpm=bpm), auth=uberduck_auth
)
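If your lyrics live in a plain text file, you can build the list-of-lists structure rather than typing it by hand. A small sketch, assuming verses are separated by blank lines; parse_verses is a hypothetical helper for this example, not part of the API.

```python
import re


def parse_verses(text):
    """Split text into verses (blank-line separated), each a list of lines."""
    verses = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            verses.append(lines)
    return verses


raw_lyrics = """
rap cat, hell make you clap
nothing to laugh at

riding with the rap cat
"""
parsed = parse_verses(raw_lyrics)  # two verses: one with 2 lines, one with 1
```

The result has the same shape as the lyrics variable above, so it can be passed as the lyrics parameter directly.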

Adding a backing track

Uberduck provides a library of backing tracks with preset alignment to the rap and its multi-verse structure. It is important that the bpm of the backing track matches the bpm of the audio. You may view the list of available backing tracks by making a GET request to the reference-audio/backing-tracks endpoint.

screenshot of backing track docs

output = requests.get(
    "", auth=uberduck_auth
).json()

Once you have the uuid of the backing track, you can specify it in your POST request to tts/freestyle.

screenshot of freestyle docs

backing_track_uuid = "84a34767-12c0-4dc0-aa64-c292ac7d13c9"
# POST to the tts/freestyle endpoint
output = requests.post(
    "", json=dict(lyrics=lyrics, backing_track=backing_track_uuid, bpm=bpm), auth=uberduck_auth
)

Lyrics generation

Uberduck can also generate lyrics programmatically. To do this, Uberduck uses a wrapper around the OpenAI API with some prompt engineering. We can use it to generate lyrics about a given subject.

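screenshot of freestyle docs

# subject is a string naming the topic the lyrics should be about
output = requests.post(
    "", json=dict(subject=subject, lines=12), auth=uberduck_auth
)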

Automated rap generation

We can also generate lyrics automatically to fit a given backing track. This is the end-to-end song generator used on Uberduck. This uses the default voice, but other voices may be specified as well.

output = requests.post(
    "", auth=uberduck_auth,
    json=dict(subject=subject, bpm=bpm, backing_track=backing_track_uuid, voice="zwf"),
)

Thank you!

If you enjoyed this article, let us know in our Discord or email me at
