Zach Ocean

How to build a text-to-speech Discord voice bot in Python

In this post, you'll learn to use the Uberduck API to build a Discord bot that can send text-to-speech messages in voice channels.

In this post I'm going to show you how to build your very own text-to-speech Discord bot using Python, the nextcord library, and the Uberduck API. Once we're finished you'll have your own Discord bot that can join Discord voice channels and send text-to-speech messages in those channels at your command. You can even clone your own voice and use your own text-to-speech in Discord with our private voice clone plan.

Here's a demo of what we'll have at the end:

You can find the full implementation of the bot on our GitHub.

Create a bot in the Discord Developer Portal

Before we dive into coding, we need to do some Discord setup. Head to the Discord Developer Portal and create a new application.

I'll call my application "Uberduck TTS Demo".

Once you create the Discord application, create a Discord bot user.

To let other users add your bot to their Discord servers, go to the OAuth2 section and select "In-app Authorization" as the Authorization Method. Then select the bot and applications.commands scopes, and check the Send Messages, Connect, and Speak permissions.

Then, to add the bot to your server, go to the OAuth2 URL Generator, select the same scopes and permissions, and paste the generated URL into your browser.

You should see something like this:

Go ahead and select one of your servers and click Continue to add the bot. Now the bot should be in your server and it's time to strap in and write some code!

Implement the bot code

Remember, you can find the full implementation in this GitHub repository.

The recommended way to implement Discord interactions is with slash commands—commands that begin with the / character and pop up a completion menu inside Discord. Our bot will implement three slash commands:

/vc-join invites the bot to join a voice channel.
/vc-kick kicks the bot out of a voice channel.
/vc-quack generates text-to-speech audio from a specific voice and plays it in the voice channel.

Install dependencies

You'll need ffmpeg and the Opus audio codecs.

You'll also need a Python environment with nextcord and a few other Python packages installed. Follow the instructions in the README to get set up.
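
If you're not sure whether ffmpeg and Opus are installed, a quick check like the one below can save some debugging later. This is only a sketch using the Python standard library: the function name check_voice_dependencies is my own, and it only looks in your system's standard locations, so a False result is a hint rather than proof that something is missing.

```python
import ctypes.util
import shutil


def check_voice_dependencies():
    """Report whether ffmpeg and a system Opus library can be located."""
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_library returns a library name or path, or None if nothing is found.
        "opus": ctypes.util.find_library("opus") is not None,
    }
```

Call check_voice_dependencies() before starting the bot and print the result to see which dependency, if any, could not be found.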

Create the bot and set up commands

We'll create the Discord bot and implement slash commands using the nextcord library.

import asyncio

import nextcord
from nextcord import SlashOption
from nextcord.ext import commands

# It's important to pass guild_ids explicitly while developing your bot because
# commands can take up to an hour to roll out when guild_ids is not passed. Once
# you deploy your bot to production, you can remove guild_ids to add your
# commands globally.
#
# You can find your guild ID by right clicking on the name of your server inside
# Discord and clicking "Copy ID".
DEV_GUILD_ID = 0 # Replace with your guild ID
guild_ids = [DEV_GUILD_ID]

bot = commands.Bot()

@bot.slash_command(
    name="vc-join",
    guild_ids=guild_ids,
)
async def join_vc(ctx: nextcord.Interaction):
    """Join the voice channel."""
    await ctx.response.send_message("I'm not implemented yet!")

# Do the same thing for /vc-kick and the rest of the commands...

# Run the bot
DISCORD_TOKEN = "replace-me-with-your-bot-token"

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(bot.start(DISCORD_TOKEN))
    except KeyboardInterrupt:
        loop.run_until_complete(bot.close())
    finally:
        loop.close()

You can run your code now and try out the /vc-join command to check that the bot replies with the message "I'm not implemented yet!"
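
Hardcoding the token is fine for a first test, but you probably don't want it to end up in version control. One option, sketched here, is to read it from an environment variable. The variable name DISCORD_TOKEN and the helper load_discord_token are assumptions for illustration, not something the bot or nextcord requires.

```python
import os


def load_discord_token(env_var="DISCORD_TOKEN"):
    """Return the bot token from the environment, failing early if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your bot token."
        )
    return token
```

With this in place, the line that assigns DISCORD_TOKEN in the script above could become DISCORD_TOKEN = load_discord_token().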

Handle leaving and joining voice channels

Let's implement the bodies of the /vc-join and /vc-kick commands. We'll use a Python dict to keep track of voice channel clients and the time that each client was last used. (We'll use that last-used time later to clean up idle voice clients.)

guild_to_voice_client = dict()

def _context_to_voice_channel(ctx):
    return ctx.user.voice.channel if ctx.user.voice else None


async def _get_or_create_voice_client(ctx):
    joined = False
    if ctx.guild.id in guild_to_voice_client:
        voice_client, last_used = guild_to_voice_client[ctx.guild.id]
    else:
        voice_channel = _context_to_voice_channel(ctx)
        if voice_channel is None:
            voice_client = None
        else:
            voice_client = await voice_channel.connect()
            joined = True
    return (voice_client, joined)
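
To see what the helper returns in each case without connecting to Discord, here is a stripped-down sketch of the same lookup logic. FakeChannel and get_or_create are hypothetical stand-ins for illustration only; the real code works with nextcord's channel and voice client objects.

```python
import asyncio

voice_clients = {}  # guild ID -> (voice client, last-used timestamp)


class FakeChannel:
    """Hypothetical stand-in for a Discord voice channel."""

    async def connect(self):
        return "fake-voice-client"


async def get_or_create(guild_id, channel):
    """Reuse a stored client if there is one, otherwise try to connect."""
    if guild_id in voice_clients:
        client, _last_used = voice_clients[guild_id]
        return client, False  # already connected, so nothing was joined
    if channel is None:
        return None, False  # the user is not in a voice channel
    return await channel.connect(), True  # freshly joined
```

Note that, as in the helper above, a freshly created client is not stored in the dict here. The calling command is responsible for recording it along with the current time.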


Now we can implement /vc-join and /vc-kick, including some edge case handling around switching from one voice channel to another and printing error messages if there are no voice channels to join or kick from.

from datetime import datetime


@bot.slash_command(
    name="vc-join",
    guild_ids=guild_ids,
)
async def join_vc(ctx: nextcord.Interaction):
    voice_client, joined = await _get_or_create_voice_client(ctx)
    if voice_client is None:
        await ctx.response.send_message(
            "You're not in a voice channel. Join a voice channel to invite the bot!",
            ephemeral=True,
        )
    elif ctx.user.voice and voice_client.channel.id != ctx.user.voice.channel.id:
        old_channel_name = voice_client.channel.name
        await voice_client.disconnect()
        voice_client = await ctx.user.voice.channel.connect()
        new_channel_name = voice_client.channel.name
        guild_to_voice_client[ctx.guild.id] = (voice_client, datetime.utcnow())
        await ctx.response.send_message(
            f"Switched from #{old_channel_name} to #{new_channel_name}!"
        )
    else:
        await ctx.response.send_message("Connected to voice channel!")
        guild_to_voice_client[ctx.guild.id] = (voice_client, datetime.utcnow())


@bot.slash_command(name="vc-kick", guild_ids=guild_ids)
async def kick_vc(ctx: nextcord.Interaction):
    if ctx.guild.id in guild_to_voice_client:
        voice_client, _ = guild_to_voice_client.pop(ctx.guild.id)
        await voice_client.disconnect()
        await ctx.response.send_message("Disconnected from voice channel")
    else:
        await ctx.response.send_message(
            "Bot is not connected to a voice channel. Nothing to kick.", ephemeral=True
        )


Generate text-to-speech and play audio over the channel

Now that we have /vc-kick and /vc-join, let's implement /vc-quack. First, write code to query the Uberduck API. You'll need to generate an API key and secret, which you can do on your Uberduck account page.

The Uberduck API is free to use, but free requests share a queue with all other free users of our site. If you want API requests to be faster, you can upgrade to the Creator plan or the Clone plan (which also includes a custom clone of your own voice).

from io import BytesIO
import asyncio
import json
import time

import aiohttp

# Your Uberduck API key and API secret.
# You can create a new key and secret at https://app.uberduck.ai/account/manage
API_KEY = "replace-me"
API_SECRET = "replace-me"
API_ROOT = "https://api.uberduck.ai"


async def query_uberduck(text, voice="zwf"):
    max_time = 60
    async with aiohttp.ClientSession() as session:
        url = f"{API_ROOT}/speak"
        data = json.dumps(
            {
                "speech": text,
                "voice": voice,
            }
        )
        start = time.time()
        async with session.post(
            url,
            data=data,
            auth=aiohttp.BasicAuth(API_KEY, API_SECRET),
        ) as r:
            if r.status != 200:
                raise Exception("Error synthesizing speech", await r.json())
            uuid = (await r.json())["uuid"]
        while True:
            if time.time() - start > max_time:
                raise Exception("Request timed out!")
            await asyncio.sleep(1)
            status_url = f"{API_ROOT}/speak-status"
            async with session.get(status_url, params={"uuid": uuid}) as r:
                if r.status != 200:
                    continue
                response = await r.json()
                if response["path"]:
                    async with session.get(response["path"]) as r:
                        return BytesIO(await r.read())
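
The second half of query_uberduck is a poll-with-timeout loop: ask for the status, sleep, and give up after a deadline. Pulled out on its own, the pattern might look like the sketch below. The name poll_until and its parameters are assumptions for illustration, and the check callback stands in for the status request.

```python
import asyncio
import time


async def poll_until(check, timeout=60.0, interval=1.0):
    """Await `check` repeatedly until it returns a non-None value or time runs out."""
    # time.monotonic is not affected by system clock changes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = await check()
        if result is not None:
            return result
        await asyncio.sleep(interval)
    raise TimeoutError(f"No result after {timeout} seconds")
```

In the bot, check would be a coroutine function that requests the status endpoint and returns the audio path once it is set, or None while the audio is still being generated.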


With the Uberduck API call in place, now we can implement /vc-quack.

import subprocess
import tempfile
from datetime import datetime

@bot.slash_command(
    name="vc-quack",
    guild_ids=guild_ids,
)
async def speak_vc(
    ctx: nextcord.Interaction,
    voice: str = SlashOption(
        name="voice", description="Voice to use for synthetic speech", required=True
    ),
    speech: str = SlashOption(
        name="speech", description="Speech to synthesize", required=True
    ),
):
    voice_client, _ = await _get_or_create_voice_client(ctx)
    if voice_client:
        guild_to_voice_client[ctx.guild.id] = (voice_client, datetime.utcnow())
        await ctx.response.defer(ephemeral=True, with_message=True)
        audio_data = await query_uberduck(speech, voice)
        with tempfile.NamedTemporaryFile(suffix=".wav") as wav_f, tempfile.NamedTemporaryFile(suffix=".opus") as opus_f:
            wav_f.write(audio_data.getvalue())
            wav_f.flush()
            subprocess.check_call(["ffmpeg", "-y", "-i", wav_f.name, opus_f.name])
            source = nextcord.FFmpegOpusAudio(opus_f.name)
            voice_client.play(source, after=None)
            while voice_client.is_playing():
                await asyncio.sleep(0.5)
            await ctx.send("Sent an Uberduck message in voice chat.")
    else:
        await ctx.response.send_message(
            "You're not in a voice channel. Join a voice channel to invite the bot!",
            ephemeral=True,
        )
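
One thing to be aware of: subprocess.check_call blocks the event loop while ffmpeg runs, so the bot can't respond to anything else during the conversion. For short clips that is usually fine. If it becomes a problem, one alternative is asyncio's own subprocess support, sketched below. The helper name run_command is an assumption for illustration.

```python
import asyncio


async def run_command(*args):
    """Run an external command without blocking the event loop."""
    process = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() waits for the process to exit and collects its stderr.
    _, stderr = await process.communicate()
    if process.returncode != 0:
        message = stderr.decode(errors="replace").strip()
        raise RuntimeError(f"{args[0]} exited with {process.returncode}: {message}")
    return process.returncode
```

Inside the command, the conversion step could then be written as await run_command("ffmpeg", "-y", "-i", wav_f.name, opus_f.name), assuming ffmpeg is on your PATH.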


Alright, now we have a working Discord bot! Run your bot script, and you should be able to join a voice channel, run /vc-join to invite the bot, generate speech by running "/vc-quack voice:zwf speech:I like working on Uberduck", and then kick out the bot with /vc-kick.

Clean up idle voice clients

We just have one more step before our bot is ready for prime time. We don't want an idle bot to hang out unused in voice chat forever, so we'll build a mechanism to disconnect bots from chat when they're not being used. This is where the last_used timestamp stored alongside each voice client comes into play. Every five seconds, we'll loop over all the voice clients and disconnect any that haven't been used in 10 minutes.

from datetime import datetime, timedelta


async def terminate_stale_voice_connections():
    while True:
        await asyncio.sleep(5)
        for k in list(guild_to_voice_client.keys()):
            v = guild_to_voice_client[k]
            voice_client, last_used = v
            if datetime.utcnow() - last_used > timedelta(minutes=10):
                await voice_client.disconnect()
                guild_to_voice_client.pop(k)
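
If you'd like to test the staleness rule without waiting ten minutes, it can help to separate the decision from the disconnecting. Here is a sketch of that idea. The function find_stale_guilds is hypothetical, and it takes the current time as an argument so a test can pass in any value.

```python
from datetime import datetime, timedelta


def find_stale_guilds(clients, now, max_idle=timedelta(minutes=10)):
    """Return the guild IDs whose voice client has been idle longer than max_idle."""
    return [
        guild_id
        for guild_id, (_client, last_used) in clients.items()
        if now - last_used > max_idle
    ]
```

The cleanup loop could then call find_stale_guilds(guild_to_voice_client, datetime.utcnow()) and disconnect only the clients for the IDs it returns.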


Now we can modify our script to run the cleanup task concurrently with the bot using asyncio.gather.

import asyncio

from .client import bot, terminate_stale_voice_connections

DISCORD_TOKEN = "replace-me-with-your-bot-token"


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(
            asyncio.gather(
                terminate_stale_voice_connections(), bot.start(DISCORD_TOKEN)
            )
        )
    except KeyboardInterrupt:
        loop.run_until_complete(bot.close())
    finally:
        loop.close()

That's it! We now have a working Discord bot.

If you enjoyed this article or want to build a bot of your own, let us know in our Discord (where you can use the Uberduck Discord bot, a version of the bot we just built with a few more features) or email me at z@uberduck.ai.


July 6, 2022