As announced in our previous blogpost, the Voice Lab launched the Guardian Briefing last week. At the Guardian, we have many journalists doing phenomenal work, so a question we posed at the start of the project was: how can we leverage the work being done in the building to create an audio news briefing that works on smart speakers?
To explore this, we were drawn to the many newsletters the Guardian regularly produces. While showcasing the journalism, many of them, including the popular Morning Briefing, also feature a personal editorial voice and a curation of stories, lending themselves well to a news briefing on smart speakers.
Current news briefings
Many developer-led briefings rely solely on text-to-speech synthesised robot voices, which can make for a less pleasing listening experience, while more established broadcasters employ humans to step into a booth and record updates throughout the day, achieving a much more pleasing experience but at considerable expense. As the Voice Lab is about experimentation, we instead decided to explore a third approach: a briefing that combines human-voice “rich audio” with synthesised voice, providing both a human touch and the flexibility and replicability of TTS. To do this, we turned to SSML.
SSML
SSML is a markup language for speech synthesis applications that allows you to make a text-to-speech voice more lifelike: adding breaks and pauses, tweaking vocal pitch, adjusting word rate and so on. It also allows rich audio files to be played, and, uniquely on Google Assistant, items can be programmed to run in parallel thanks to the parallel (<par>) element, which is currently unavailable on Alexa. This parallel feature broadens the creative potential of SSML by allowing the construction of multilayered audio experiences combining rich audio and TTS.
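To give a flavour of what that looks like in practice, here’s a minimal sketch of <par> layering a music bed beneath a line of TTS, using the media elements from Google’s SSML extensions (the audio URL, levels and timings are placeholders rather than values from our template):

```xml
<speak>
  <par>
    <!-- Music bed: plays from the start, ducked beneath the voice -->
    <media xml:id="bed" soundLevel="-10dB" fadeOutDur="2s">
      <audio src="https://example.com/music-bed.ogg"/>
    </media>
    <!-- Synthesised voice enters two seconds in, over the music -->
    <media begin="2s">
      <speak>Good morning, here is today’s briefing.</speak>
    </media>
  </par>
</speak>
```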
As an audio producer with little to no previous coding experience beyond tweaking a Wordpress template, I found SSML daunting at first. Yet thanks to the encouragement of colleagues and a couple of helpful blogposts (this one and this one), I was able to explore SSML’s functionality using the Google Assistant simulator. I used Visual Studio Code as my primary environment to write and save the various versions of my SSML, copying and pasting it into the simulator to play it back in audio form and learn through trial and error, tweaking here and there in the simulator before pasting the SSML back into VS Code to save the updated version. Once I was happy with the outcome, I’d download an MP3 from the simulator to share with the team.
Having spent the past 10 years working in digital audio workstations (DAWs) such as Cubase, Logic and Ableton, where endless numbers of files can be effortlessly manipulated, duplicated, layered and tweaked to the millisecond, I was initially concerned by how limiting SSML is. But as I became more familiar with its limitations, I soon realised enormous potential remains. After all, no DAW I’ve ever used could build infinite versions of a feature or radio show from a template I’d built. SSML has the potential to do just that.
The payload
Research from NPR and Reuters shows there’s an appetite for short news formats, so we initially proposed delivering five headlines in less than two minutes. Exploring the Morning Briefing as our template source, we were struck by how much value its editorial curation of headlines, alongside previews of other content offerings from podcasts to long reads, added beyond simply scraping the website for headlines. To reflect this in our briefing, the five headlines became three headlines, a podcast plug and, in the final slot, a promo for a recent deep dive/long read article, all pulled from the Morning Briefing.
In user testing, of the five slots, the headlines performed best. We believe the later slots performed less well partly because of the copywriting: it failed to state explicitly that Today In Focus is our daily podcast, and our “deep dive” feature, in trying to be broad enough in its wording to cover many different types of longer-form article, ended up too vague for users to easily grasp what it was promoting.
This confirms that with voice apps, clarity of intention and concision of wording are key to a successful action: unlike with a written article, the user can’t rewind or re-read a sentence for clarification.
In response to user testing, we tweaked the scripting to make it explicit that Today In Focus is indeed our flagship podcast. The “deep dive” couldn’t be fixed so easily, so while remaining heavily reliant on the Morning Briefing, we decided to replace the final slot with content pulled from elsewhere, namely the daily email feed, creating a new “trending” feature that pulls a top story from the “Most read in last 24 hours” section. This gives the listener yet another headline, but with a slightly longer-term perspective.
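The resulting five-slot structure can be sketched as a simple SSML sequence; the copy and ids below are placeholders rather than our production scripting:

```xml
<speak>
  <seq>
    <!-- Slots 1 to 3: headlines pulled from the Morning Briefing -->
    <media xml:id="headline1"><speak>First headline and summary.</speak></media>
    <media xml:id="headline2"><speak>Second headline and summary.</speak></media>
    <media xml:id="headline3"><speak>Third headline and summary.</speak></media>
    <!-- Slot 4: plug for Today In Focus, our daily podcast -->
    <media xml:id="podcast"><speak>On today’s episode of Today In Focus…</speak></media>
    <!-- Slot 5: the “trending” story from the most-read feed -->
    <media xml:id="trending"><speak>And trending in the last 24 hours…</speak></media>
  </seq>
</speak>
```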
The human voice
The Guardian doesn’t have a building full of presenters to regularly record the briefing, so instead we decided our human voice would act as static “furniture” around the dynamically changing synthesised content. We wanted that voice to once again reflect the Guardian, so we returned to Leah, who voiced our initial project, Year In Review.
The varied nature of news journalism, together with the dynamic synthesised content, meant that during Leah’s recording we didn’t know what would follow her lines on any given day, so the tone of delivery was paramount. One moment in an earlier script recording produced a very pleasant, somewhat upbeat phrasing, which we liked in the edit, but in the context of the briefing it proved inappropriate when followed by serious or tragic headlines. A follow-up re-recording aimed for a slightly more restrained delivery, which makes up the current version of the action.
Looking to build daily habits, we were conscious not to overdo human-voice scripting that audiences will hear every day, so it was kept concise, clear and unobtrusive.
The synthesised voice
Advancements in synthesised voice continue apace, so to achieve the most engaging listening experience currently possible, we experimented with the various controls SSML allows, creating test templates with adjusted “prosody” and listening back to them in the simulator. This involved playing with faster and slower speech rates at higher and lower pitches, as well as exploring the different voices available.
Later iterations treated headlines and body texts differently, to see whether their distinct functions could be emphasised through tweaked prosody. Newscasters tend towards a slightly slower delivery than natural speech, and indeed the best results were achieved by slowing the voice down slightly. Slightly pitching up the headlines also appeared to help communicate their importance.
Adding the break (<break>) element between headlines and body texts helped create natural pauses in speech, and on a programmatic level we ensured full stops were appended to every headline (even when missing from the source material), so the intonation of each sentence completed naturally.
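Combined, those tweaks look something like the following; the rate, pitch and break values here are illustrative rather than our exact production settings:

```xml
<speak>
  <!-- Headline: slightly slower than default, pitched up a touch -->
  <prosody rate="95%" pitch="+1st">Headline text goes here.</prosody>
  <!-- A beat of silence before the body text begins -->
  <break time="750ms"/>
  <!-- Body text: the same newsreader pace, at default pitch -->
  <prosody rate="95%">The body of the story follows at a measured pace.</prosody>
</speak>
```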
More can be done with TTS on a “sentence by sentence” micro level when the copy is fixed, but due to the dynamic nature of our synthesised voice, our tweaks remained macro.
WaveNet voices, developed by DeepMind, were introduced as standard on the Google Assistant platform in the week of our launch, bringing a remarkable step up in the quality of synthesised voice. Many of the previous prosody tweaks became unnecessary as the standard of voice suddenly improved; however, a slightly slower than default delivery still felt best for a news journalism-style read.
Interestingly, inconsistencies in performance remain between the various WaveNet voices, with “en-GB WaveNet B” producing a remarkably natural rhythm and intonation that the other voices fail to match, making it the obvious choice. Give them a try here.
As we iterate on the Guardian Briefing over coming weeks, we’re keen to explore the WaveNet voices for other territories, such as the US and Australia.
Further rich audio opportunities
Having discovered SSML’s ability to layer sounds, I was keen to introduce music into the briefing to produce a richer, more polished sonic experience. Music also plays a helpful signposting role: using different pieces breaks the payload into smaller subsections, making an already short piece of content feel even more manageable.
Picking the music involved a couple of exercises to articulate what we were looking to achieve aesthetically. Once we had decided, and a shortlist of suitable licensed tracks had been selected, I began chopping the tracks up to create intro sections, looping beds and outros, and tweaking the mix levels of the stems to ensure, for example, that the drums weren’t overbearing. By leveraging the existing musical structures of the songs, I explored what could be done with SSML to give the impression of a fully produced package that begins, establishes itself and ends satisfyingly.
SSML deals well with seams, playing one clip immediately after another, but it also includes the repeatCount attribute, which allows audio to loop. Initially, after the intro music, I used the attribute on a four-bar loop to act as a bed under the voices. Although it worked pretty well, there was always a slight momentary delay as the loop retriggered. The ear is good at spotting delays where rhythms are involved, so the loop was ultimately replaced with a longer music clip that didn’t require repeatCount.
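For reference, the original looped bed was set up roughly like this (the URLs are placeholders, and four repeats is illustrative):

```xml
<speak>
  <seq>
    <!-- Intro sting plays once -->
    <media xml:id="intro">
      <audio src="https://example.com/intro.ogg"/>
    </media>
    <!-- Four-bar loop repeated to form a bed; each retrigger
         carried the slight delay described above, which is why
         we eventually swapped it for one longer clip -->
    <media repeatCount="4">
      <audio src="https://example.com/four-bar-loop.ogg"/>
    </media>
  </seq>
</speak>
```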
Other musical seams, between the intros, main body and outros, were managed effectively using fade-in and fade-out times, helped by the fact that the intros and outros were mainly ambient, avoiding clashes of percussive elements as they transitioned. Such musically dissonant clashes are entirely possible because these transitions are triggered by the end of speech content of unknown length, rather than by any musical consideration, such as the start of a bar or landing “on the beat”.
Regarding the overall arrangement, thanks to the parallel element, elements can overlap or stop when another clip is triggered. This can be used creatively: voice elements can mask musical transitions, or music can begin during a vocal pause to ensure it has maximal impact. With only the sequence (<seq>) element available, none of these more nuanced overlapping techniques could be achieved.
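As a sketch of how those overlaps can be expressed, the begin and end attributes of Google’s media element can reference another clip by its xml:id, so music can be timed against speech of unknown length (all values and URLs below are illustrative, not taken from our template):

```xml
<speak>
  <par>
    <!-- The dynamic TTS content, whose length varies day to day -->
    <media xml:id="voice" begin="2s">
      <speak>Today’s synthesised headlines play here.</speak>
    </media>
    <!-- A bed long enough to sit under the voice, ending shortly after it -->
    <media xml:id="bed" soundLevel="-8dB" end="voice.end+2s" fadeOutDur="2s">
      <audio src="https://example.com/bed.ogg"/>
    </media>
    <!-- Outro begins just before the voice finishes, masking the seam -->
    <media begin="voice.end-1s" fadeInDur="1s">
      <audio src="https://example.com/outro.ogg"/>
    </media>
  </par>
</speak>
```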
Tweaking the volumes of the vocal and musical elements was done mostly on in-ear headphones. Once the alpha release was live, I was able to test the mixdown on the relevant devices themselves and further tweak the SSML to produce an optimal sound on each.
The output
We’ve heard some feedback from voice developers about SSML “being flaky”.
The only issue we encountered was that payload delivery became slow on devices once the project was pushing the upper limit of SSML’s 5,000-character code limit. This is understandable, as the various audio files have to be fetched and the package built on demand prior to playback. To remedy this, we established a scheduled cloud service that intermittently renders the SSML into an OGG audio file, which is then delivered upon request from the user, achieving much more rapid delivery (more on this in a later blog).
Beyond this time-delay issue, we’ve seen no similar evidence of instability within the action. On the contrary, SSML has proved very resilient to work with, and for me, as someone with a background in audio production rather than technology, it’s been thrilling to see how SSML can take a template and generate a new audio file every day, or every hour, producing a brand new briefing that sounds polished and well produced regardless of the variable lengths of the five content slots delivered by the synthesised voice.
It’s exciting to imagine what further editorial potential can be unpacked using this tool.
Find out more about the Voice Lab’s mission or get in touch at voicelab@theguardian.com