While Automated Speech Recognition (ASR) technology has been present in various forms for decades, advances in statistical modelling, artificial intelligence (AI) and automation have collectively opened a new frontier for speech-based interaction between humans and computer systems. In this article Dr Peter Chapman, Director in the KPMG Forensic Technology team and InfoGovANZ advisory board member, details some of the current applications of ASR technology and offers guidance on a number of emerging governance issues associated with these technologies.
As a concept, computerised Automated Speech Recognition (ASR) has been around almost as long as the computer itself. However, only in the last decade have the capabilities of ASR technology reached the point where wide-scale commercial adoption is viable1. Natural human speech contains slang terms, dialect peculiarities, abbreviations and other “non-standardised” content. While humans are very adept at managing these issues, the enormous variability of human speech makes ASR a very complex and difficult task for a computer. Being able to accurately identify and interpret natural human speech requires a combination of complex statistical methodologies, such as hidden Markov models which divide detected speech content into small sections for analysis, and artificial intelligence/machine learning techniques, such as neural networks which are processing environments that simulate human learning by comparing newly observed data with known examples2.
In particular, advances in AI capabilities have allowed ASR technologies to rapidly improve the speed of speech analysis and to reduce the word error rate (WER) of transcription, that is, the ratio of word errors to total words processed. Despite these advances, many external variables that are difficult to control still reduce ASR accuracy in real-world environments below that of human transcription3. Factors such as speaker age, background noise, accent, volume, pitch and tempo, as well as variance in language structures, can cause unacceptably high rates of false positive and false negative errors, and all can affect the viability of ASR applications.
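The WER described above can be sketched in a few lines of code. The following is a minimal illustration, assuming the common convention that word errors are the substitutions, deletions and insertions needed to turn the ASR output into a human reference transcript, and that the total is the number of words in that reference. The function name and the sample sentences are hypothetical and are not drawn from any particular ASR product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the ASR output
            insertion = curr[j - 1] + 1  # extra word present in the ASR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two of the five reference words are wrong, so the WER is 2 / 5.
print(word_error_rate("turn on the kitchen lights", "turn off the kitchen light"))  # 0.4
```

Note that even a modest WER can hide errors that matter: in this example a single substituted word reverses the meaning of the command.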
It is also worth noting that the advent of ASR technologies means that we are now entering a new era of heightened surveillance, sometimes by our own choice. While there may be a certain level of acceptance regarding certain kinds of surveillance in public spaces and the workplace, it is an entirely different thing for individuals to install Internet-connected “listening devices” in their homes. While some may have actively weighed up the convenience and functionality provided by the devices over their privacy or security concerns, many others may have little appreciation of the potential risks involved.
ASR technology is also likely to have a significant impact on privacy outside of the home. While the capacity to record conversations in workplaces, retail spaces and other public places has existed for many years, we now have the means to quickly and accurately transcribe the content of recorded conversations, and then apply advanced analytics and identification technologies to that transcribed content. To safeguard privacy and minimise ethical abuses of these technologies, it will be vital for governments to implement carefully considered legislation and for appropriate industry bodies to provide sound ethical guidance.
Leveraging AI and Automation
While many of the earliest implementations of ASR technology were simple transcription applications, modern ASR applications often focus on using voice as the method for direct interface with an application or system. Such functionality still effectively requires the same processing and conversion of speech to text; however, it is the automation aspects and the capacity for two-way voice interaction with technology that have enabled modern applications of ASR technology to leap forward.
Probably the most widely known example of device-voice interaction is Siri, the virtual assistant application associated with Apple devices. Initially developed as a third-party application in 2010, Siri was purchased by Apple and integrated as a core iOS feature to help users activate and interface with various phone functions and search the Internet4. An early leader in the “virtual assistant” field, Siri was joined over time by a range of competing consumer applications such as Amazon’s Alexa, Google Assistant, Microsoft Cortana and Samsung Bixby, as well as a raft of lesser-known virtual assistants.
While the core interface functionality offered by virtual assistants is undoubtedly useful, a further fascinating aspect of these assistants is the voice feedback provided to the user following the initial command, which establishes an interactive conversation between human and machine. The capacity to interact with a machine through such a natural communication method (as opposed to pressing keys) creates a scenario with potentially significant psychological, cultural and ethical impacts. These issues have attracted interest in both academia5 and popular culture, one of the more notable fictional explorations being the interactions between Dr David Bowman and the AI HAL-9000 in the science fiction movie “2001: A Space Odyssey”.
In addition to virtual assistants, smart home devices and automation hubs link ASR technology with various electronically controlled home and workplace systems. Some of the more common systems linked to home and workplace automation hubs include lighting, audio-visual equipment, heating, air conditioning, cameras and security systems. As mentioned above, society is still in the process of weighing the efficiency benefits and privacy impacts of this technology. However, there are a number of other novel applications of ASR technology that demonstrate both the benefits and risks of voice interaction with systems and devices.
Conversational AI for Customer Service
For some time now, organisations have applied ASR technology to assist with service call direction and providing basic information to customers over the phone. Organisations have also been able to deploy AI agents to provide interactive service conversations when potential customers visit their webpage. The combination of these two technologies will allow customers to interact with “smart” AI agents who will be able to manage increasingly complex service and assistance conversations over the phone or an audio conference call.
AI agents are invariably cheaper, more consistent and constantly available, although there is still a certain stigma associated with relying on automated systems rather than human service agents. Conversational AI still has a long way to go if the Turing test (being able to hold a conversation that is indistinguishable from one with a human) is the standard that must be reached. However, as ASR and AI technology improves, customers may find interacting with an AI service agent quicker and easier than a human one regardless of whether the customer can determine whether they are dealing with an AI agent or a human one.
Conversation analytics and monitoring
We have now reached the point where processing power and AI optimisation mean that recorded conversation content can be transcribed and analysed at almost real-time speeds. This gives service agents a live feedback loop on the conversation, with the AI analysing conversation content to provide insights on how best to assist the customer as well as guidance on customer tone and mood6.
Real-time conversation analytics are also very useful in a monitoring and compliance capacity. “Red flag” and atypical words and phrases are analysed automatically by the AI, raising the alert with the organisation’s compliance function when necessary. The compliance capacity of ASR and AI technology is of particular interest in the regulatory technology (“RegTech”) space, with ASIC and other regulators strongly backing the increased use of real-time ASR and AI driven technologies for monitoring trading room conversations as well as other relevant financial advice/transaction conversations7.
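The “red flag” monitoring described above can be illustrated with a deliberately simple sketch: scanning transcribed conversation segments for phrases on a watch list. The phrases, the `Alert` record and the `scan_transcript` function are all hypothetical, and commercial products would typically add AI-based context analysis on top of plain phrase matching.

```python
import re
from dataclasses import dataclass

# Hypothetical watch list; in practice this would be maintained by the
# organisation's compliance function.
RED_FLAG_PHRASES = ("guaranteed return", "off the record", "delete this message")


@dataclass
class Alert:
    segment_index: int  # position of the segment within the conversation
    phrase: str         # watch-list phrase that was matched
    text: str           # full transcribed segment, kept for reviewer context


def scan_transcript(segments, phrases=RED_FLAG_PHRASES):
    """Return one alert for every watch-list phrase found in each segment."""
    patterns = [
        (phrase, re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE))
        for phrase in phrases
    ]
    alerts = []
    for index, text in enumerate(segments):
        for phrase, pattern in patterns:
            if pattern.search(text):
                alerts.append(Alert(index, phrase, text))
    return alerts


conversation = [
    "Thanks for calling, how can I help?",
    "This product is a Guaranteed return, trust me.",
    "Let's keep this off the record.",
]
for alert in scan_transcript(conversation):
    print(alert.segment_index, alert.phrase)
```

Because the matching runs on transcribed text, any transcription error upstream flows directly into missed or spurious alerts, which is one reason a human reviewer would normally sit behind such a system.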
A further innovative use of “real-time” ASR and AI technology is live translation between different languages. Similar to the “universal translator device” popular in science fiction series such as Star Trek, real time translation applications theoretically provide the capacity to speak in one language and, within seconds, have this speech converted and played back in one or more different languages8. While this technology still has a way to go before it becomes flexible and accurate enough to replace human translators, the potential social and commercial benefits of this application of ASR technology are boundless.
Medical and expert system applications
ASR technology used for speech-to-text transcription is already in common use in the medical field due to the advantages of being able to convert voice to text on the fly while the practitioner’s hands are otherwise occupied. In recent years, more advanced applications of ASR technology have been explored, including in speech pathology, where patients are able to use a tablet computer as a supplementary treatment in speech rehabilitation programs. Early studies examining the use of this technology have found significant positive patient outcomes9.
While AI technology has not yet reached the point where it can reliably outperform human experts in most tasks, it has become powerful enough to help human experts carry out certain tasks more efficiently. Much like home automation systems taking care of domestic tasks, a combination of ASR and AI technology can assist a pilot in charge of a complex aircraft by managing less critical functions through the use of voice commands and automation10.
Biometric identification & security applications
Voice print biometrics, also referred to as specific speaker recognition, is a related but distinct process to ASR. Rather than focusing on converting speech content to text form, voice print biometrics seeks to identify and/or verify the identity of a particular speaker. This is achieved by breaking the recorded voice into segments made up of dominant frequencies, referred to as formants, which undergo tonal analysis in order to generate a unique voice print11.
Voice print biometrics lend themselves to security applications in a similar way to fingerprint or iris biometrics. Voice pattern recognition is becoming increasingly common for access control, both in commercial and residential applications, although the technology implementation has not been without difficulties.
ASR and voice print biometric technology can also assist in criminal investigations. Much like fingerprint patterns, voice print patterns can be used on a comparative basis to identify persons suspected to be involved in criminal activity. A notable example of this approach is the voice print identification database Interpol deployed for use in 2018 to assist with the tracking of criminals12.
Risks when Adopting ASR Technology
While these applications provide the potential to improve process efficiencies, compliance, oversight, and functional performance in general, there are also increased risks and possible downsides associated with the implementation of these new technologies.
Even when operating normally, ASR home automation devices and virtual assistants collect an enormous amount of personal data about their users13. Considered alongside reports of somewhat “cavalier” attitudes towards the usage of collected data by “big tech”14 and the numerous reported incidents of personal privacy and data security breaches associated with the use of ASR virtual personal assistants15, the potential privacy and security consequences of having an Internet-connected listening device within the home cannot be overstated.
As the uptake of ASR technologies in the workplace increases, questions of how to secure recorded voice data and the inherent privacy issues associated with the process are also of increasing relevance to organisations. Data collected by ASR technologies, especially during customer interactions, has a high chance of containing personally identifiable information (PII) or other confidential information that requires protection. Data breaches may be caused by accidental user activation, automation following incorrect ASR transcription, software glitches, the device being compromised by a malicious actor, or even a simple case of the product/application transmitting confidential data back to the manufacturer as part of an inbuilt “product enhancement” function. Any of these may result in a notifiable breach of international or national regulations16.
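One way to reduce the amount of PII held in stored transcripts is to redact obvious identifiers before the text is retained. The sketch below is illustrative only: the two patterns and the `redact` function are assumptions made for this example, they catch only emails and long digit runs that appear in written form, and they would miss names, addresses and identifiers that are spoken out in words (for example “jo at example dot com”).

```python
import re

# Illustrative patterns only; not a complete PII detector.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Eight or more digits, optionally separated by single spaces or hyphens,
# which covers many phone, account and card numbers.
DIGIT_RUN = re.compile(r"\b\d(?:[ -]?\d){7,}\b")


def redact(transcript: str) -> str:
    """Replace email addresses and long digit runs with placeholders."""
    transcript = EMAIL.sub("[EMAIL]", transcript)
    return DIGIT_RUN.sub("[NUMBER]", transcript)


print(redact("My number is 0412 345 678 and my email is jo@example.com"))
# My number is [NUMBER] and my email is [EMAIL]
```

Redaction of this kind complements, rather than replaces, access controls and retention limits on the underlying audio recordings.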
While the stated purpose of workplace monitoring is generally to prevent unethical or illegal behaviour, implementing a monitoring solution may in practice have the opposite effect. Implementing ASR-based surveillance applications in the workplace, particularly if it is not expressly required by regulation, may indicate that the organisation distrusts its employees. This in turn may provoke negative reactions and result in reduced morale. Employees may become disengaged or find “work-arounds” to avoid the surveillance, introducing additional ethical and risk issues. Constant surveillance in the workplace may also lead to a compliance-focused culture, restricting innovation and flexibility in the workplace17.
Finally, when ASR results are used only for transcription purposes, the worst-case scenario is a set of unusable text. Conversely, when poor ASR results are used to automate systems, the consequences can range from trivial (e.g. a misdirected phone call or inaccurate Internet search) to critical (e.g. an inaccurate voice pattern analysis on a biometric security control allowing access to a non-cleared person, or poor-quality ASR in law enforcement identifying an innocent conversation as containing red flags for terrorist activity). It follows that appropriate quality control and verification procedures should be applied to any critical processes that rely upon ASR data.
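One simple form of verification is to require more certainty from the ASR engine as the consequences of an error increase. The following sketch assumes the engine reports a confidence score between 0 and 1 for each recognised utterance; the `Recognition` record, the risk tiers and the threshold values are all hypothetical and would need to be set through an organisation's own risk assessment.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    text: str
    confidence: float  # assumed 0.0 to 1.0 score reported by the ASR engine


# Hypothetical minimum confidence required before acting automatically.
THRESHOLDS = {"low": 0.60, "critical": 0.95}


def decide(result: Recognition, risk: str) -> str:
    """Return 'execute' when confidence meets the tier's threshold, else 'verify'."""
    if result.confidence >= THRESHOLDS[risk]:
        return "execute"
    # Fall back to a second check, such as a confirmation prompt or human review.
    return "verify"


print(decide(Recognition("play some music", 0.70), "low"))           # execute
print(decide(Recognition("unlock the front door", 0.70), "critical"))  # verify
```

The same recognition result is acted on for a trivial request but held for verification when it would trigger a critical action.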
Takeaways for Information Governance Professionals
- Be alert to proposed or actual implementations of ASR technology in your organisation. Ensure that a comprehensive due diligence assessment of the technology, including capabilities, benefits and risks, is undertaken prior to implementation. Where ASR technology may be used (either directly or indirectly) as part of workplace surveillance, it is highly recommended that organisations obtain and act on employee feedback as part of this due diligence process.
- Once implemented, ensure there is clear knowledge of what ASR data is being collected, stored, processed and disseminated by your organisation. As with other forms of data owned by the organisation, ASR data needs to be classified and secured with appropriate controls. There should also be a clear data lifecycle for this information that identifies when and how the ASR data is destroyed. This is particularly important if there is the possibility for personally identifiable information (PII) to be collected and transcribed during customer calls or other ASR capture activities.
- Organisations should ensure that any surveillance and/or recording of employee activity complies with necessary state, federal and international regulations and requirements (e.g. the Workplace Surveillance Act in New South Wales). Additionally, organisations should ensure that both employees and customers are aware of any ASR recording that takes place and are provided with avenues to seek further information and provide feedback.
- In all cases, Australian organisations should be encouraged to assess their use of ASR technologies with reference to the Australian Privacy Principles to ensure that the organisation approaches the use of ASR data with a conscious focus on the security and ethical issues that may arise from the adoption of this technology.
Dr Peter Chapman is a Director in the KPMG Forensic Technology team and advisory board member of Information Governance ANZ.
1. https://en.wikipedia.org/wiki/Speech_recognition – viewed 27 January 2020
2. https://www.rev.com/blog/artificial-intelligence-machine-learning-speech-recognition – viewed 27 January 2020
3. https://towardsdatascience.com/speech-recognition-is-hard-part-1-258e813b6eb7 – viewed 4 February 2020
4. https://en.wikipedia.org/wiki/Siri – viewed 4 February 2020
5. Computers that care: investigating the effects of orientation of emotion exhibited by an embodied computer agent, Brave et al. 2005, International Journal of Human-Computer Studies
6. https://www.nice.com/engage/real-time-technology/real-time-speech-analytics/ – viewed 3 February 2020
7. https://asic.gov.au/about-asic/news-centre/speeches/asic-regtech-voice-analytics-symposium/ – viewed 4 February 2020
8. https://medium.com/syncedreview/google-ai-translatotron-can-make-anyone-a-real-time-polyglot-e7b6d616f5d2 – viewed 5 February 2020
9. Feasibility of Automatic Speech Recognition for Providing Feedback During Tablet-Based Treatment for Apraxia of Speech Plus Aphasia, Ballard et al. 2019, American Journal of Speech-Language Pathology
10. https://aerospaceamerica.aiaa.org/departments/fly-by-voice/ – viewed 5 February 2020
11. https://www.globalsecurity.org/security/systems/biometrics-voice.htm – viewed 27 January 2020
12. https://theintercept.com/2018/06/25/interpol-voice-identification-database/ – viewed 29 January 2020
13. https://www.theguardian.com/technology/2019/oct/09/alexa-are-you-invading-my-privacy-the-dark-side-of-our-voice-assistants – viewed 3 February 2020
14. https://www.washingtonpost.com/technology/2019/05/06/alexa-has-been-eavesdropping-you-this-whole-time/ – viewed 3 February 2020
15. https://www.bloomberg.com/news/articles/2019-04-10/is-anyone-listening-to-you-on-alexa-a-global-team-reviews-audio – viewed 3 February 2020
16. https://www.oaic.gov.au/privacy/notifiable-data-breaches/when-to-report-a-data-breach/ – viewed 4 February 2020
17. https://behavioralscientist.org/the-paradox-of-employee-surveillance/ – viewed 4 February 2020