As social networking pivots to audio-only, David Keene of Speechmatics examines the challenges tech companies face to ensure we can all be understood
No one can deny that the Covid-19 pandemic has had a transformative impact on both our professional and personal lives. Nothing illustrates this more than the boom in audio-only communication as we tried new ways of staying in touch with friends, family and work in a distanced way.
In the last year, Clubhouse, a social network based on voice, has gone from an unknown startup to a tech giant. As of April 2021, the company was valued at over $4bn and had hosted appearances from major tech-industry figures such as Elon Musk and Mark Zuckerberg.
This exponential growth in audio services is just the latest instalment of a trend that has been building for some years. Take podcasts, for instance: according to OFCOM, one in eight people in the UK listens to a podcast every week, an increase of 24% over the last year. Much as the likes of Netflix ushered in a surge in on-demand video, podcasts are having a similar moment in the audio space. Voice notes have also seen a major uptick in usage in recent years. As a younger generation becomes accustomed to interacting with media at its own discretion, it is no surprise that it does the same with personal communication.
The pandemic has inevitably accelerated these trends, at times when in-person communication has been impossible. For many, Zoom fatigue means that when the chance to speak to others off camera presents itself, they take it. People want to talk, but not necessarily show their faces, and audio-only communication avoids the presenteeism and overstimulation of video calls.
This pivot to audio hasn’t gone unnoticed by the big industry players: Twitter has rolled out Spaces and Facebook has created Live Audio Rooms to capture this growing demand. These companies are not just jumping on the bandwagon – they will be acutely aware of the value of voice data in the long term. They thrive on the insights they glean from their data, and audio is no different. However, these technologies are not without issues.
Does not compute!
For the most part, speech recognition, the technology that converts speech to text using machine learning, is trained on limited datasets that often don’t account for a variety of pitches, accents, languages and background noise. As a result, huge swathes of the population around the world will be misunderstood because the technology was never trained on voices like theirs. This presents a major challenge to the industry: for this technology to be truly ethical and inclusive, significant changes are required.
Part of the challenge is that speech data is incredibly rich. Unlike other forms of data, speech conveys enormous amounts of information through what we say, how we say it, and the pitch and tone we use. Take, for example, distinguishing between sarcasm and sincerity. This is a learned human skill that seems intuitive to most of us, but for a machine to make the distinction is another story entirely. It is just one of the considerable technical challenges presented by speech data, and for the businesses that work out these intricacies, speech could be a goldmine. Social media platforms have revenue models defined by their ability to know their users intimately, and speech can provide a wealth of insight unrivalled by other forms of data. To monetise it, these companies will need to figure out how to compute this information, which is easier said than done considering how limited the available training data in this area is.
Also, no two people have the same voice: each is a unique biometric, just like a fingerprint. Moreover, not all speech comes with perfect diction or in an accent that is immediately intelligible. Many people have difficulties with speech, whether they are hard of hearing or have a speech impediment. For example, 1% of the world’s population (70 million people) have a stutter. Speech recognition technology needs to accommodate everyone and provide workable solutions to a wide spectrum of users, not just the majority. So, what can we do about it?
Making our voices heard
Total inclusivity in speech recognition is an ongoing challenge, and it may be years before we achieve a truly holistic understanding of speech. Publicly available speech data is predominantly white, male and English-speaking, which is, of course, nowhere close to what is required to truly understand everyone. As anyone working in technology knows, one of the dangers of scale is poor product performance under strain. Including every one of those voices in machine learning algorithms will require diverse datasets.
The tech giants have the resources for this, but lack the processing power. They have a real opportunity to solve the bias problem in machine learning models through democratising access to the vast amounts of public data they hold. Of course, for companies that monetise data this would seem an illogical move; but there is a middle way, where they could use a version of a data clean room to give the insights a level of anonymity, ensuring privacy but still diversifying the outputs of models.
However, data sovereignty is a murky area: very few consumers know whether their data is being kept privately by a company, is in the public domain or is even being sold to third parties. Voice is no different, and as it becomes an increasingly coveted source of data, we need to ensure privacy is maintained. Rules around privacy should be the building blocks of any machine learning algorithm operating in this space, and by balancing these with open-source data sharing, we can safely achieve a compromise where speech recognition becomes more inclusive.
Among the many advantages of increased inclusivity is better moderation of online activity. By understanding more people, we increase our ability to assess and remove harmful content – it is impossible to accurately moderate communities unless everything being communicated can be recognised. The consequences of not doing this are considerable – just last month, it was announced TikTok was being sued for billions over its use of children’s data. By failing to differentiate between children and adults, it is alleged, the social media platform has failed in its moral duty to protect them.
Ultimately, the pandemic brought on a major boom in audio. The technology platforms have proved able to host and support the increased user base, but the data used to process and understand the many voices has in many ways failed to meet the challenge of inclusivity.
There are solutions to this problem, such as expanding the publicly available datasets, but only if these are appropriately anonymised. The advantages are enormous, whether in content moderation or simply in the improvement of products and services. The pivot to audio is an exciting chapter in tech and society, but we need to ensure every voice is of equal value. By guaranteeing that everyone’s voice can be heard, society as a whole will benefit and we will see speech and audio technology flourish.