Video content is everywhere, and it keeps growing in volume as one of the most effective ways to connect with an audience in storytelling, gaming, education and marketing. It is no surprise, then, that Nimdzi reports 60.5% of the language service providers it surveyed in 2021 offer “dubbing, voiceover and audio services”. This clearly reflects the growth in video localization work.
Media localization specifically is witnessing even larger growth when it comes to dubbing and revoicing in general. A few years ago, a large work order (other than animation) would customarily involve over thirty subtitle streams plus a dozen dubbing ones. Today dubbing is performed in twenty or more languages for a given project, while subtitling streams are approaching forty. Netflix alone announced that it dubbed five million minutes of content in 2021, compared with seven million minutes subtitled in the same year; in the past, this ratio was much smaller for any content provider.
The volume of work in media localization post-pandemic has been such that there is reportedly not enough talent available to handle it all. The talent crunch the market has been experiencing began with a shortage of all types of talent involved in a dubbing production: adaptors, actors, directors, sound engineers. Never before had we seen dubbing studios booked up so many months in advance as in Central and Eastern Europe during the Disney+ and HBO Max launches. The situation is bound to repeat itself as platforms continue to launch in other parts of the world.
Dubbing studios have also been in the limelight due to ongoing M&A activity and large consolidation in the sector, as well as new entrants making their appearance. Similarly, start-ups involving technology for the revoicing process have been enjoying a funding spree.
Lifelike synthetic voices
Text-to-speech technology reached a new quality threshold in 2016 when deep learning was implemented. Subsequent progress has been dramatic, and the best part is that the technology still has a long way to go. Today, synthetic voices not only sound human-like, but developers have also worked out ways to control their pronunciation, pace, intonation and even emotion. Earlier this year, Sonantic, the start-up behind Val Kilmer’s voice in Top Gun: Maverick, released a demo of an AI that flirts.
Studios like Disney (The Mandalorian and Obi-Wan Kenobi series) and Paramount (Top Gun: Maverick) have already begun implementing creative solutions that involve synthetic speech, such as the use of voice conversion to de-age and reconstruct voices. The same technology could also be used to make any voiceover actor sound like a specific voice. Voice conversion has many interesting applications in the recording workflow, as well as important ethical issues to address in the form of deepfakes.
Custom and adaptive voices are other hot offerings in the synthetic speech arena. Adaptive speech synthesis used in an automatic dubbing pipeline can automatically produce a target voice that closely matches the source and takes on the characteristics of the source speaker. The ability to create bespoke voices allows customers to use these as brand voices with no issues over royalties or IP rights.
Applications for synthetic speech
Since synthetic speech became good enough to fool listeners on a first hearing, we have seen a steep increase in the number of companies using it to automate some part of the dubbing or revoicing process. The obvious media workflow in which to implement synthetic speech first is audio description. Voice synthesis has been used in several languages since the pre-deep-learning era to record audio description scripts. Gaming and informational videos are two more sectors well familiar with synthetic voices.
For practical purposes, synthetic voices need to be integrated into the timed-text editors where the scripts used for their generation are prepared. Such tools need to let script writers generate their preferred voice for a specific chunk of text and edit the generated audio through a set of controls covering all aspects of the synthetic voice, along with a pronunciation dictionary for user input.
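As a rough illustration of that last point, the sketch below shows how a timed-text editor might turn a script cue into a synthesis request. It uses the W3C SSML standard, which most TTS engines accept, for the pace and pronunciation controls; the `Cue` class, its field names and the IPA entry for a brand name are invented for this example, and the resulting SSML string would be handed to whichever TTS engine the tool integrates.

```python
# Sketch: preparing per-cue synthetic audio requests from a timed-text script.
# SSML (W3C Speech Synthesis Markup Language) is the common input format for
# controlling rate, pitch and pronunciation in TTS engines; the vendor call
# that would consume this output is deliberately left out.

from dataclasses import dataclass

@dataclass
class Cue:
    start: str             # timecode, e.g. "00:01:12:05" (illustrative format)
    text: str
    rate: str = "medium"   # SSML prosody rate: x-slow .. x-fast
    pitch: str = "medium"  # SSML prosody pitch: x-low .. x-high

def cue_to_ssml(cue: Cue, lexicon: dict) -> str:
    """Wrap a script cue in SSML, applying a user pronunciation dictionary
    via <phoneme> tags (IPA) for any words the engine might mispronounce."""
    words = []
    for word in cue.text.split():
        bare = word.strip(".,!?")
        if bare in lexicon:
            # Pronunciation dictionary entry wins over the engine default.
            words.append(
                f'<phoneme alphabet="ipa" ph="{lexicon[bare]}">{word}</phoneme>'
            )
        else:
            words.append(word)
    body = " ".join(words)
    return (f'<speak><prosody rate="{cue.rate}" pitch="{cue.pitch}">'
            f'{body}</prosody></speak>')

# Hypothetical pronunciation entry, purely for illustration.
lexicon = {"OOONA": "oʊˈoʊnə"}
cue = Cue(start="00:01:12:05", text="OOONA releases a new editor.", rate="slow")
print(cue_to_ssml(cue, lexicon))
```

A real editor would expose the `rate`, `pitch` and lexicon entries as per-cue UI controls and send the generated SSML to the configured voice, but the shape of the data flow is the same.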
One of the companies addressing this market need is OOONA whose latest addition in its suite of tools is an audio description editor. “We always try to stay ahead of developments by listening to our customers,” says Wayne Garb, OOONA co-founder and CEO. “Given the focus on solutions for audio, our latest product offering could not be anything other than a tool catering to the needs of a modern audio description workflow.”
OOONA’s audio description editor is a timed-text tool tailored to the functionality required in a scripting workflow, which includes the ability to record in the cloud as well as use synthetic voices. Following a recent partnership between the two companies, Veritone’s award-winning voices have been integrated into OOONA’s audio description editor with intuitive controls to tweak them as needed. “If you are looking to implement the audio description workflow of the future, make sure you future-proof it with the right tool!” adds Garb.
When listening to some of the great synthetic voices showcased at IBC 2022 and looking at the development of tools like OOONA’s, one inevitably thinks about how scripting and recording workflows might evolve in the future, not just for audio description purposes. As always with new technology, there are many ways to use it, to simplify or improve existing tasks or workflows, or come up with completely new applications. The noughties were the decade when subtitling was centralized and streamlined with the use of templates and multi-language workflows. The 2010s were the decade when everything moved online. The 2020s sure look like the decade when audio gets disrupted.