Speech-to-text: a full guide for contact centers

Ever wondered what speech-to-text is, how it works, how contact centers can use it, and what it will mean in the long run? This blog post covers all of it.


Buying habits and the way we interact with technology are changing. Contact centers need to adapt, both by delivering services in the channels where their customers are and by embracing technology to provide even better service and improve the agent experience. This means going omni-channel and leveraging their data even further. And the data they have the most of? Speech.

What is speech-to-text?

Speech-to-text software (also called voice or speech recognition software) transcribes audio into text, usually in a word processor so that it can be edited and searched. It’s the same technology that works in the background of voice-enabled assistants such as Siri, Alexa, or Google Assistant.

Voice recognition works by breaking speech down into small samples (e.g. at a rate of 16 kHz, or 16,000 samples per second) and then using complex algorithms to combine the samples, match them to the pronunciation of the language in question, and form words. If you want to go even deeper into the technical details, check out this post from Adam Geitgey.
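To make the sampling figure concrete, here is a minimal Python sketch of what 16 kHz means in practice and how a recognizer might cut audio into short overlapping frames before matching them to sounds. The 25 ms frame length, the 10 ms step, and the function name frame_signal are illustrative assumptions, not details taken from any specific speech engine.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz: 16,000 samples for every second of audio


def frame_signal(samples, frame_ms=25, hop_ms=10, sample_rate=SAMPLE_RATE_HZ):
    """Split a list of samples into overlapping, fixed-length frames.

    frame_ms and hop_ms are assumed, commonly used values; real engines
    may choose differently.
    """
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 160 samples at 16 kHz
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


# One second of silence stands in for a real recording.
one_second = [0] * SAMPLE_RATE_HZ
frames = frame_signal(one_second)
print(len(one_second), "samples ->", len(frames), "frames of", len(frames[0]), "samples")
```

With these assumed settings, one second of audio is 16,000 numbers that become 98 frames of 400 samples each. The pattern matching described above would then run on those frames.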

Why should Contact Centers care? 

As contact centers live and breathe voice, it is no surprise that they have tons of data available. The bigger the operation gets, the more recordings are stored, which makes them impossible to process without machine help.

As a simple illustration, consider a contact center with 25 seats working in 2 shifts. With 8-hour shifts and an average talk time of 45 minutes per hour, that adds up to about 300 hours of recordings per day. For a single person listening full time, this one day's worth of material would take over a month to get through, even at an accelerated 1.2x speed.
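The arithmetic behind that estimate can be checked with a few lines of Python. The seat, shift, and talk-time figures come from the example above; the 8 listening hours per working day and the function names are assumptions added for this sketch.

```python
def daily_recording_hours(seats, shifts, hours_per_shift, talk_minutes_per_hour):
    """Hours of recorded speech the contact center produces per day."""
    return seats * shifts * hours_per_shift * talk_minutes_per_hour / 60


def review_days(recorded_hours, playback_speed=1.2, listening_hours_per_day=8):
    """Working days one person needs to listen to all of the recordings.

    listening_hours_per_day is an assumption: a full 8-hour working day
    spent doing nothing but listening.
    """
    return recorded_hours / playback_speed / listening_hours_per_day


recorded = daily_recording_hours(
    seats=25, shifts=2, hours_per_shift=8, talk_minutes_per_hour=45
)
days = review_days(recorded)
print(f"{recorded:.0f} hours recorded per day")
print(f"about {days:.0f} working days to review them at 1.2x speed")
```

Under these assumptions, 300 hours of recordings take roughly 31 working days to review, which is more than a calendar month for one person.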

The question isn’t about getting data, but rather how to turn it into business opportunities. Consider the possible implications from the following perspectives: QA, business development, and analysis.

Of course, the industry is still some way from being able to use this technology fully. However, we are on the right track, and the simpler solutions (which can cope with higher error rates) can already be implemented in various languages. More on the different opportunities later in this blog post.

Where are we now?

While the industry is, as discussed, still some way from getting the full benefits of the technology, big players such as Microsoft Azure, Google, IBM Watson, Amazon and Baidu (in China) invest billions in voice recognition software and deep learning. Furthermore, while the bigger players are mostly focusing on perfecting the accuracy of larger languages, English being at the forefront, there are several smaller local companies crafting voice recognition for their markets, as also seen in this year’s CCW in Berlin. For the Nordic region, companies like Inscripta (Finland), Gamgi (Sweden), and Capturi (Denmark) are working towards enhancing local language recognition. 

At LeadDesk, we have followed the development of speech-to-text for a long time and have built endpoints that are compatible with the most widely used services. This means that customers can already use software from several vendors and hook it up to LeadDesk quite easily. It also means that we already have the capabilities to start working on larger speech-to-text projects with relatively fast deployment.

The biggest challenge for speech-recognition? – Perfecting accuracy

The single biggest challenge is still enhancing the accuracy of transcriptions. As speech is far from monotone and languages vary across regions, accents, dialects, and grammar rules, perfecting the algorithms takes time and data. And a lot of both.

While some use cases already work with the current error rates, deeper analysis requires voice recognition to become even more accurate. The best voice recognition software can reportedly achieve as high as 97% accuracy in US English, while Google, Microsoft, and IBM Watson have all disclosed numbers around 95% (figures from 2018; they are presumably higher today). Accuracy is, however, still lower for less widely spoken languages. While we are talking about small differences in percentages, the impact is rather large.

The higher the accuracy rate becomes, the more possibilities it opens. As Andrew Ng, former chief scientist at Baidu and one of the most recognized experts in speech recognition, put it back in 2016: “As speech-recognition accuracy goes from 95% to 99%, we’ll go from barely using it to using all the time!”

If you are interested, the standard measure behind these accuracy figures is the word error rate; an accuracy of 95% corresponds to a word error rate of about 5%:

Word Error Rate = (Substitutions + Deletions + Insertions) / Number of Words Spoken
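As a hedged illustration of how this formula is typically evaluated, the sketch below counts the minimum number of substitutions, deletions, and insertions with a word-level edit distance and divides by the number of words actually spoken. The function name word_error_rate and the example sentences are made up for this sketch; production scoring tools also normalize punctuation and numbers, which is skipped here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate of a transcript (hypothesis) against what was said (reference)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # a spoken word is missing from the transcript
            insertion = curr[j - 1] + 1  # the transcript contains an extra word
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


spoken = "please cancel my subscription today"
transcribed = "please cancel my prescription"
print(word_error_rate(spoken, transcribed))
```

In this example one word is substituted and one is dropped out of five spoken words, giving a word error rate of 2 / 5 = 0.4, or 60% accuracy.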

In a contact center context, accuracy is further complicated by background noise and by the low audio quality of narrowband phone calls. However, improvements can already be seen as mobile phone manufacturers and operators continue to roll out VoLTE and wideband audio (also known as “HD Voice”) technology.

Where are we going, and what are the use cases?

As discussed in the previous section, the long-term vision for how speech-to-text will affect contact centers is broad. Most of the implementations are already available in text-based applications. However, adding voice as an extra layer makes the text output itself error-prone, so taking full advantage of all solutions is still some way off.

To get a better grasp of what can be achieved, we asked our Solution Architect, Samuli Pihlaja, to tell us more about how he sees the use cases for speech-to-text. He presents a four-level framework to explain them.

Level 1 – Simple transcription: Reading is faster than listening

At its most basic level, speech-to-text means automatic transcription. The benefit is simple: on average, reading is faster than listening, which allows you to process more data.

On average, adults read somewhere between 250 and 300 words per minute, while conversational speech averages 120 to 150 words per minute. Reading through transcripts instead of listening to calls can therefore save you time and let you go through roughly double the number of calls.
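A quick back-of-the-envelope check of that "double" claim, as a Python sketch. It assumes the midpoints of the ranges above (135 words per minute spoken, 275 read); the function name review_minutes is made up for illustration.

```python
def review_minutes(call_minutes, speech_wpm=135, reading_wpm=275):
    """Minutes needed to read the transcript of a call of the given length.

    The default rates are assumed midpoints of the ranges quoted in the text.
    """
    words_spoken = call_minutes * speech_wpm
    return words_spoken / reading_wpm


call_length = 10  # minutes
reading_time = review_minutes(call_length)
print(f"{call_length} min call -> {reading_time:.1f} min to read")
print(f"speed-up: {call_length / reading_time:.1f}x")
```

With these assumed rates, a 10-minute call takes a little under 5 minutes to read, which is where the roughly twofold gain comes from.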

Transcripts meant for human reading are also less affected by errors in the transcription process, as our brains can quite easily fill in the gaps where something has gone wrong.

Just look at this sentence: Yuo cna porbalby raed tihs esaliy desptie teh msispeillgns.


How can this be used in contact centers?

  • QA and coaching: By being able to go through more material than now
  • Customer complaints: Easier and faster to get an overview of the call

Level 2 – Search: Keyword recognition

The second level of implications is applying simple search to text transcriptions. Think of it as scanning through a text document, finding keywords, and if needed jumping directly to the right spot in the call. 

How can this be used in contact centers?

  • Improve QA further – By finding relevant keywords (both positive and negative). Did the agent mention all the USPs in the script? Or did the agent use language that isn’t acceptable?
  • Search call transcripts for conflict resolution – What was agreed upon in the call, and did the customer get all the information needed?
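Keyword search over a transcript can be sketched in a few lines of Python. The example assumes the speech-to-text service returns segments with a start time, a speaker label, and text; the Segment class, its field names, and the sample call are hypothetical and not a LeadDesk or vendor format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance (hypothetical structure)."""
    start_seconds: float
    speaker: str
    text: str


def find_keyword(segments, keyword):
    """Return every segment whose text contains the keyword, ignoring case.

    This is a plain substring match, so "cancel" also matches "cancellation".
    """
    needle = keyword.lower()
    return [seg for seg in segments if needle in seg.text.lower()]


call = [
    Segment(4.0, "agent", "Thank you for calling, how can I help?"),
    Segment(9.5, "customer", "I would like to cancel my subscription."),
    Segment(15.0, "agent", "I can offer you a discount before you cancel."),
    Segment(22.0, "customer", "Okay, tell me more about the discount."),
]

for hit in find_keyword(call, "cancel"):
    print(f"{hit.start_seconds:>5.1f}s  {hit.speaker}: {hit.text}")
```

Because every hit carries its start time, a reviewer could jump straight to that spot in the recording instead of listening to the whole call.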

Level 3 – Analysis: To understand patterns, we need help from machines

The third level in our framework is the analysis component. This not only requires higher accuracy but also the help of machines to understand patterns. By leveraging AI and machine learning, we can start analyzing, for example, share of talk, topic distribution, and tonality, and do so across large volumes of calls.
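Of the metrics mentioned, share of talk is the simplest to illustrate. The Python sketch below assumes the transcript already comes with speaker labels and start and end times for each segment; the tuple layout and the function name share_of_talk are assumptions for illustration.

```python
from collections import defaultdict


def share_of_talk(segments):
    """Fraction of total speaking time per speaker.

    segments: iterable of (speaker, start_seconds, end_seconds) tuples,
    assumed to come from a transcript with speaker labels.
    """
    seconds_by_speaker = defaultdict(float)
    for speaker, start, end in segments:
        seconds_by_speaker[speaker] += end - start
    total = sum(seconds_by_speaker.values())
    if total == 0:
        return {}
    return {speaker: secs / total for speaker, secs in seconds_by_speaker.items()}


call_segments = [
    ("agent", 0.0, 12.0),
    ("customer", 12.0, 20.0),
    ("agent", 20.0, 50.0),
    ("customer", 50.0, 60.0),
]

for speaker, share in share_of_talk(call_segments).items():
    print(f"{speaker}: {share:.0%}")
```

In this made-up call the agent speaks for 42 of 60 seconds (70%) and the customer for 18 (30%). Computing the same figure across thousands of calls is where the machine help becomes necessary.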

Another example is interaction analytics, which uses AI to efficiently collect and analyze large volumes of data and interactions from the channels in use. The data can then be categorized into specific areas for further analysis.

These areas include:

  • Compliance: verifying that the necessary regulation aspects have been covered
  • Adherence: analyzing how sticking to or departing from the script affects the interaction
  • Problems: identifying and handling problematic interactions
  • Training requirements: identifying the areas that require training.

If you want to know more about QA in contact centers, be sure to check out our ultimate guide for contact center QA.

A final, and the most radical, example of how contact centers can use speech recognition and machine learning is the search for the “perfect call”: finding and combining patterns from good and bad calls in order to analyze what makes or breaks your pitches.

Note that harnessing AI requires the machines to be fed data so they can start “learning”, so starting sooner will yield better results later. To get more insights on how AI can already help contact centers be more efficient, check out this blogpost.

Level 4 – Real-time implementation: Help customers and agents during calls

The fourth level is leveraging speech in real-time implementations. The opportunities here are immense. Some simple implementations, such as voice-enabled call routing, are already possible. But this is only the beginning. 

Real-time implementations will further drive the trend of self-service (one of the megatrends in contact centers) and allow customers to, for example, handle simple interactions with a voice-bot over the phone whenever they need to, letting agents focus on the harder cases.

Furthermore, we will see implementations where agents get real-time tips and tricks during the call based on factors such as keywords, topics, sentiment, and share of talk. They could also get a popup on their screen with the right material from the help center when a specific problem is discussed. One company is even developing software that can listen to the pitch and wording of the customer and, based on that, tell when they are ready to buy.
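The help-center popup idea can be sketched as follows. This Python example assumes the live transcript arrives as a stream of short text chunks; the HELP_ARTICLES mapping, the article titles, and the function name suggest_articles are all hypothetical, and a plain list stands in for a real streaming speech-to-text feed.

```python
import re

# Hypothetical mapping from trigger keywords to help-center material.
HELP_ARTICLES = {
    "invoice": "Help center: How to read your invoice",
    "password": "Help center: Resetting a password",
}


def suggest_articles(transcript_chunks):
    """Yield (keyword, article) the first time each keyword is heard in a call."""
    already_shown = set()
    for chunk in transcript_chunks:
        # Lowercase and split into words so punctuation does not hide a keyword.
        words = set(re.findall(r"[a-z']+", chunk.lower()))
        for keyword, article in HELP_ARTICLES.items():
            if keyword in words and keyword not in already_shown:
                already_shown.add(keyword)
                yield keyword, article


# A list stands in for chunks arriving from a live call.
live_call = [
    "Hi, I have a question about my latest invoice.",
    "Also I forgot my password again.",
    "The invoice seems too high.",
]

for keyword, article in suggest_articles(live_call):
    print(f"[{keyword}] suggest: {article}")
```

Because the function is a generator, each suggestion is produced as soon as its chunk arrives, and a repeated keyword (the second mention of the invoice) does not trigger the same popup twice.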


In conclusion, speech recognition will bring a big change to the way contact centers are run. While the technology is still at an early stage, we will continue to see novel applications in the coming years.

What harnessing speech recognition can bring to your business can be divided into the following topics:

  • Coaching
  • QA
  • Business development
  • Customer experience

Remember: getting the most out of machines will require “learning”, meaning the sooner you start, the better the results you will get later on.

At LeadDesk we are already compatible with several providers and can thus deliver the same level of accuracy as those providers do.

Our custom work team, Professional Services, is happy to discuss projects and possibilities for your business needs.

Contact us
