Text to Speech

Give your robot a voice with a text-to-speech service.

By the end of this tutorial you will be able to:

  1. Incorporate third-party modules into your applications
  2. Transfer your understanding of motor messages to construct different forms of action
  3. Request a spoken report of your robot's current action by pressing a key

Robotics today is a bit like personal computing was in the early 1990s, and the mobile phone industry was in the early 2000s.

The companies that built the hardware competed for customers by producing new programs to run on it, and the decision to buy a particular model of computer or phone was largely driven by which model shipped with the programs we wanted. Today, impressive demonstrations of new robotics capabilities are being showcased to the public on a weekly basis, and the competition between the biggest manufacturers for the future of consumer robots is fierce, but it’s yet to be catalysed by the equivalent of the smartphone for robotics.

What changed with personal computing and mobile phones was a decoupling of the hardware and software. An operating system like Windows now makes it possible for any personal computer to run a word processor, internet browser, or music player, irrespective of the particular motherboard, CPU, and sound card inside the computer’s case. And you no longer need to buy a Nokia 3210 if you want to play Snake; instead you can download it for any smartphone that is running Android.

The BOW SDK similarly decouples robotics hardware and software, and it achieves that decoupling through message types that abstract the underlying data exchange to the level of sensor/motor modalities. As long as the output of a piece of software can be expressed in terms of one of the SDK message types, it can be sent to any robot to act upon it. We hope that this will have just as transformative an effect on robotics, and on the way in which robot manufacturers compete for our custom.

As well as ensuring that the applications we develop are portable between robots, this decoupling also means that we can easily try out different pieces of software – perhaps different methods for computer vision, spatial navigation, or skill learning – to discover which will be the most suitable for our robotics applications. In this tutorial we’ll see how easy it is to test out different programs that each produce the same kind of output (speech), within the same application.

About the Application

The application combines keyboard control of the robot's movement with a text-to-speech service: key presses send motor commands, and spoken phrases are generated by a TTS backend, queued, and streamed to the robot's voice channel by a background thread.

Before we get stuck in:

If you are just browsing to get a sense of what's possible, take a look at the code online.

Running the Application

Make sure you have followed the setup steps above. If you are using local text-to-speech (TTS) inference, you will first need to start the TTS server (if it's not already running). Navigate to the willow-inference-server folder that you set up and run the server using the utils.sh script.

cd willow-inference-server
./utils.sh run

In a different Terminal, navigate to the Applications/TextToSpeech/Python folder in the SDK Tutorials repository:

cd SDK-Tutorials/Applications/TextToSpeech/Python

Execute the example program:

python main.py

Investigation

You can move the robot around the scene using the keyboard controls listed below, and you will hear the robot vocalise periodically as it moves around. You can also press the V key to make the robot vocalise the action it is currently performing.

Keyboard Controls

Key    Action
W      Move Forward
A      Turn Left
S      Move Backward
D      Turn Right
E      Strafe Right
Q      Strafe Left
V      Vocalise (declare current action)
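
As a rough sketch of how these controls could be wired up, the handler below tracks the most recent movement action and, when V is pressed, pushes a spoken report onto the speech queue described later in this tutorial. The pynput listener and the self.move() motor helper are illustrative assumptions; the actual example may read the keyboard and drive the motors differently.

from pynput import keyboard

# Map movement keys to human-readable action names (illustrative only)
KEY_ACTIONS = {
    'w': "moving forward",
    's': "moving backward",
    'a': "turning left",
    'd': "turning right",
    'q': "strafing left",
    'e': "strafing right",
}

def on_press(self, key):
    try:
        char = key.char.lower()
    except AttributeError:
        return  # ignore special keys such as Shift or Esc
    if char in KEY_ACTIONS:
        self.current_action = KEY_ACTIONS[char]
        self.move(char)  # hypothetical helper that sends the motor message
    elif char == 'v':
        # Queue a spoken report of whatever the robot is currently doing
        self.speech_queue.put(f"I am {self.current_action}")

# The handler would typically be registered with something like:
#   keyboard.Listener(on_press=self.on_press).start()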

Making Robots Talk

In this section, we'll explore how to make our robot talk using text-to-speech (TTS) capabilities. The provided code uses a combination of the TTS class and a speech queue system to generate and stream audio to the robot.

Understanding the TTS System

The TTS system in this tutorial consists of several key components:

  1. TTS Class: Handles the conversion of text to speech using either OpenAI or Willow services.
  2. Speech Queue: Manages speech commands to ensure smooth processing and prevent overlaps.
  3. Speech Processing Thread: Continuously processes the speech queue in the background.

Key Components:

  • Service Selection: The system can use either OpenAI or Willow for TTS generation, controlled by the USE_OPENAI environment variable.
  • Speech Queue: A thread-safe queue (self.speech_queue) that holds pending speech commands.
  • Speech Processing Thread: A dedicated thread (self.speech_thread) that continuously processes the speech queue.
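
Putting these components together, the setup inside the application's constructor might look something like the sketch below, assuming USE_OPENAI is read from the environment and the TTS class is the one provided with this tutorial; the constructor arguments mirror those shown in the next section.

import os
import queue
import threading

USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

# Select the TTS backend based on the USE_OPENAI environment variable
if USE_OPENAI:
    self.tts_service = TTS(service='openai', model='tts-1', voice='alloy')
else:
    self.tts_service = TTS(service='willow', format='wav', speaker='CLB')

# Thread-safe queue that holds pending speech commands
self.speech_queue = queue.Queue()

# Dedicated daemon thread that drains the queue in the background
self.speech_thread = threading.Thread(target=self.process_speech_queue, daemon=True)
self.speech_thread.start()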

How Audio Chunking Works:

  1. Text-to-Speech Conversion: The text is first converted into an audio segment using the selected TTS service.
if USE_OPENAI:
    tts_service = TTS(service='openai', model='tts-1', voice='alloy')
else:
    tts_service = TTS(service='willow', format='wav', speaker='CLB')
 
audio_segment = tts_service.stream_text_to_speech(text, playback=False)
  2. Preparing Raw Audio Data: The audio segment is converted to raw audio data.
raw_audio = audio_segment.raw_data
  3. Chunking and Streaming: The raw audio data is split into chunks and streamed to the robot. The chunk size is determined by the CHUNK_SIZE constant, which is calculated based on the sample rate and number of channels.
CHUNK_SIZE = NUM_CHANNELS * NUM_SAMPLES
 
for i in range(0, len(raw_audio), CHUNK_SIZE):
    chunk = raw_audio[i:i + CHUNK_SIZE]
    audio_sample = bow_data.AudioSample(
        Source="Client",
        Data=chunk,
        Channels=NUM_CHANNELS,
        SampleRate=SAMPLE_RATE,
        NumSamples=CHUNK_SIZE // NUM_CHANNELS,
        Compression=COMPRESSION_FORMAT
    )
    result = self.robot.voice.set(audio_sample)
    if not result.Success:
        self.log.error(f"Failed to send audio sample chunk {i // CHUNK_SIZE} to the robot.")
        break
    time.sleep(1 / (AUDIO_TRANSMIT_RATE * 2))
  4. Timing Control: A small delay is added between sending each chunk to control the streaming rate.
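
To make the timing concrete, here is the arithmetic implied by the parameter values defined in the next section (assuming the send itself takes negligible time): each chunk carries 40 ms of audio but the loop only pauses for 20 ms between chunks, so streaming keeps comfortably ahead of playback.

NUM_SAMPLES = 24_000 // 25             # 960 samples per chunk
chunk_duration = NUM_SAMPLES / 24_000  # 0.04 s of audio per chunk (40 ms)
send_interval = 1 / (25 * 2)           # 0.02 s pause between chunks (20 ms)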

Audio Parameters

The audio parameters are crucial for ensuring compatibility between the TTS system and the robot's audio processing capabilities. These parameters are set before connecting to the robot:

SAMPLE_RATE = 24_000
NUM_CHANNELS = 1
COMPRESSION_FORMAT = bow_data.AudioSample.CompressionFormatEnum.RAW
AUDIO_BACKENDS = ["notinternal"]
AUDIO_TRANSMIT_RATE = 25
NUM_SAMPLES = SAMPLE_RATE // AUDIO_TRANSMIT_RATE
 
self.audio_params = bow_data.AudioParams(
    Backends=AUDIO_BACKENDS,
    SampleRate=SAMPLE_RATE,
    Channels=NUM_CHANNELS,
    SizeInFrames=True,
    TransmitRate=AUDIO_TRANSMIT_RATE
)
 
self.robot, error = bow_api.quick_connect(
    pylog=self.log,
    channels=["voice", "vision", "motor"],
    verbose=True,
    audio_params=self.audio_params
)

These parameters define the audio format (sample rate, number of channels), compression format, and transmission rate. They ensure that the audio chunks are correctly formatted for the robot's audio system.

Speech Queue Processing

To manage multiple speech commands efficiently, a queue system is implemented:

def process_speech_queue(self):
    while True:
        text = self.speech_queue.get()
        if text is None:  # Allows for graceful shutdown
            break
        self.send_speech_command(text)
        self.speech_queue.task_done()
        time.sleep(0.2)  # 200 ms delay between processing queue items

This queue system allows for smooth handling of multiple speech commands, preventing overlaps and ensuring each command is processed in order.
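
As a usage sketch, queuing speech from anywhere else in the application is just a put() on the queue, and pushing None lets the worker thread exit cleanly; the join() call here is an assumption about how the application shuts down.

# Queue a phrase; it will be spoken once earlier items have finished
self.speech_queue.put("Turning left")

# On shutdown: unblock the worker thread and wait for it to finish
self.speech_queue.put(None)
self.speech_thread.join()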
