Creating rich audio for voice-first experiences

By using a pop shield to avoid plosives and absorbent material to reduce reverb, good quality audio can be captured affordably. Photograph: David O'Donnell/The Guardian

From speaking with the voice app community, there appear to be two camps: software developers and content producers. Through meeting in the middle, both are realigning their skills and learning new ways of working.

Yet, even without a background in audio, it’s possible to affordably record your own rich audio, surpassing the quality of experience achievable through the default text-to-speech synthetic voice approach to app builds. Getting the recording stage right is crucial, and will pay dividends later on.

If you’re already familiar with audio, we’ll also explore some of the unique challenges we’ve faced whilst working with human voices on voice apps.

Mic-check

First up, the equipment. When your final files will ultimately be exported and encoded to ensure speedy playback, a state-of-the-art recording studio isn’t necessary. By searching online for USB mics, you’ll see there are lots of affordable options more than capable of producing sufficiently good quality sound for voice apps. Here are some pointers to ensure you use your equipment in the right way and in a suitable environment.

Once purchased, plug your mic into your laptop and use headphones to monitor how the mic responds to your voice. Move from left to right of the mic, and from far to near, to get familiar with where that microphone’s sweet spot is. Some mics have adjustable pick-up patterns, so make sure “cardioid” is selected so your mic is mainly picking up sound coming from your direction.

In terms of “level”, or how much signal is coming through, you want your voice to be loud without ever clipping or hitting 0 dB on the meter. If the level is too low, look at whether you can adjust the gain on your microphone.
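As a rough illustration of checking level after a test take, here is a minimal sketch that measures the peak of a recording relative to digital full scale and flags likely clipping. It assumes a mono, 16-bit PCM WAV file; the function names are hypothetical helpers, not part of any particular tool.

```python
import array
import math
import sys
import wave


def read_mono_16bit(path):
    """Load samples from a mono 16-bit PCM WAV file (an assumed format)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles mono 16-bit audio")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples


def peak_dbfs(samples, full_scale=32768):
    """Return the peak level in dB relative to full scale (0 dB is the maximum)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(peak / full_scale)


def is_clipping(samples, full_scale=32768):
    """Treat any sample at the edge of the 16-bit range as clipped."""
    return any(s >= full_scale - 1 or s <= -full_scale for s in samples)
```

A loud but safe take would report a peak a few dB below zero with `is_clipping` returning `False`. A very low peak suggests turning up the gain.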

If you notice the presence of reverb in the room, think about where you’ve positioned yourself. Are you facing a wall, so that your voice’s sound waves pass over the mic, hit the wall in front and bounce straight back into the mic, causing a delayed reflection of sound? Turn to face into the room. If this isn’t possible, avoid standing at direct right angles to walls, so your sound waves reflect away. You can even stick a duvet and pillows up in front of the wall to absorb sound. All these techniques can help reduce the presence of reverb in your recording, for that more “professional studio” dry sound.

Beyond reverb, background noise can be another characteristic of poor audio quality. If possible, find a space with little traffic outside and few people speaking nearby. Outside of professional studios, this can’t be avoided entirely, but you can often get away with more than you think by using gates and expanders to fix it in post-production.
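To give a feel for what a gate does, here is a deliberately simplified sketch. It looks at the recording in short blocks and turns down any block whose peak falls below a threshold. Real gates and expanders in audio editors also use attack, hold and release times to avoid abrupt cuts; the function name, parameters and default values here are assumptions chosen for illustration.

```python
def noise_gate(samples, threshold=500, window=4, floor=0.0):
    """Attenuate quiet blocks of 16-bit samples.

    threshold: peak value below which a block counts as background noise
    window:    block length in samples (real tools use far longer windows)
    floor:     gain applied to quiet blocks; 0.0 mutes them (a gate),
               a value such as 0.25 only reduces them (closer to an expander)
    """
    gated = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        peak = max(abs(s) for s in block)
        gain = 1.0 if peak >= threshold else floor
        gated.extend(int(s * gain) for s in block)
    return gated


# A quiet block of room noise followed by a block containing speech.
take = [10, -20, 15, 5, 4000, -3000, 2000, 100]
cleaned = noise_gate(take)
```

Here the first block is silenced and the second passes through untouched, including its quiet final sample, because the decision is made per block rather than per sample.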

Plosives are another source of difficulty for audio novices. Plosives are percussive sounds caused by the use of Ds, Gs, Bs, Ts, Ks and Ps in speech, and these sudden releases of air from the mouth can cause the microphone to clip the audio. Thankfully, a pop shield placed between the actor and the mic is a very affordable way to prevent this.

Recording the script

Now, even for content producers familiar with all of the above, capturing a good vocal performance for a voice app, whether you’re doing it yourself or working with voice talent, requires some focus.

If your talent hasn’t been involved in the development of your app, it’s crucial that you take the time to bring them up to speed on what it is, what it will do, and what you’re wanting from their performance... And it is just that, a performance, and to get a good one, the voice actor needs as much context as possible.

We’ve found that if there are audio assets already produced that can provide additional context during the recording, they can be used effectively to capture a more natural performance. That said, we’ve not always been able to do this ourselves, because the assets were still in production, or were dynamic in nature (automatically generated text-to-speech), or simply because we only had limited access to the talent.

Rich audio-focused voice apps differ from linear radio or podcast productions in that apps often involve various complex branches that map to wireframes. These complex structures can’t be documented easily in a linear script, so instead they tend to live in spreadsheets during development. This particular meeting point of software developer and content producer presents some challenges because currently no standard workflows exist, and instead everyone uses a different set of tools. Software developers think in spreadsheets, whereas actors tend not to, and we found handing actors printed spreadsheets at the start of a record isn’t conducive to natural vocal performances.

Our solution has been to use the “truth” spreadsheet that the entire team works from as a starting point, then to create from it various “performance scripts” that can be presented in a more traditional script form, allowing the actor to work in a clear manner that suits them. This helps the voice actor understand the journey and buy into the narrative, rather than working through each spreadsheet cell, which may or may not have any relation to the previous cell.

Of course, not all information in the spreadsheet can be distilled into these linear scripts, so additional recordings are often required to capture the remaining content, but we found by avoiding the use of spreadsheets for as long as possible, we were able to help elevate the talent’s performance.
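The idea of deriving a linear performance script from the shared spreadsheet can be sketched in a few lines. This is only an illustration: the column names (`id`, `text`, `next`) and the sample rows are hypothetical, and a real sheet would carry more columns and more than one branch per row.

```python
import csv
import io

# A made-up export of the "truth" spreadsheet. Rows are in sheet order,
# which is not the order a listener would hear them in.
SHEET = (
    "id,text,next\n"
    "goodbye,Thanks for playing.,\n"
    "welcome,Welcome back.,question\n"
    "help,You can say repeat at any time.,question\n"
    "question,Ready for your first question?,goodbye\n"
)


def performance_script(rows, start_id):
    """Follow one journey through the sheet and return its lines in order."""
    by_id = {row["id"]: row for row in rows}
    lines = []
    seen = set()  # guards against loops in the branching structure
    current = start_id
    while current and current not in seen:
        row = by_id.get(current)
        if row is None:
            break  # the sheet points at a row that does not exist
        seen.add(current)
        lines.append(row["text"])
        current = row["next"]
    return lines


rows = list(csv.DictReader(io.StringIO(SHEET)))
script = performance_script(rows, "welcome")
```

Each starting point produces one readable journey for the actor. Rows that sit outside that journey, such as the help prompt above, are the kind of remaining content that still needs a separate recording pass.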

If you’re giving the vocal performance yourself, read through your script first, and decide what the melody of the sentences will be. Where do you want emphasis? Where will you breathe? Is the sentence a question or definitive statement? What’s the essence of what you’re communicating? Once you’ve answered those questions, you’ll be in a good position to give a performance that feels natural and human.

Finally, don’t overdo a recording session. We’ve found that going beyond an hour without breaks is difficult, and when you listen back to the recording, the start of the performance has very different energy levels to the end. This dynamic range from high to low energy is not helpful when you’re looking to edit from across the session, while still creating a coherent and consistent experience for users. So keep recording sessions short.

Before you move on to post-production, these tips will help ensure you’ve captured a consistent and natural vocal performance that’s recorded competently. With this higher quality source material, you can achieve a final output that will greatly improve your voice app, going beyond what could be achieved with text-to-speech.

Find out more about the Voice Lab’s mission or get in touch at voicelab@theguardian.com.

Richard Hartley

