Fine-tuning and deployment of custom ASR models

Mar 28, 2024

Intuitive user experiences are a key part of modern software products. At Unlikely AI, we recently undertook an ambitious project to integrate automatic speech recognition (ASR) capabilities into our platform. This technical blog post describes our journey, highlighting the challenges we faced with niche domain-specific terminology, and our pivot to develop a custom model built on state-of-the-art open-source solutions.

Our initial investigation of ASR technology was driven by the desire to allow users to interact with our platform through voice as well as text. To achieve this, we integrated Deepgram’s speech-to-text API into our platform. Deepgram stood out for its streaming functionality and seamless integration with front-end frameworks, particularly React, making it a promising initial choice for our requirements. Its performance seemed impressive, providing accurate transcriptions for the majority of inputs right out of the box.

Despite Deepgram’s overall effectiveness, we encountered a significant hurdle when dealing with specialised terminology and uncommon proper nouns. Transcription quality noticeably degraded in these scenarios, impacting the clarity and accuracy of some key pieces of functionality in our platform. Deepgram’s solution to this problem is keyword boosting — a feature designed to enhance the recognition of new vocabulary that the model had not previously encountered. While this approach helped to some extent, it fell short of our expectations for providing a reliable transcription experience.

As we scaled up our experiments, we discovered the limitations of keyword boosting. It did improve model performance by increasing the likelihood of recognising specific words or phrases, but its effectiveness diminished as the complexity and volume of specialised vocabulary increased. This limitation prompted us to reconsider our approach.

Realising that keyword boosting would be insufficient led us to explore alternative solutions. While most current ASR providers, including Deepgram, offer some degree of customisation, they often do not support the development of fully custom models without significant financial investment.

Faced with these challenges, and based on our understanding that fine-tuning a pre-trained ASR model on bespoke data would give us the best performance, we decided to take control of our destiny by training a custom model. This decision was fuelled by the availability of state-of-the-art open-source models in the ASR domain. These open-source solutions offer the flexibility, power, and cost-effectiveness required to develop a model tailored to our specific needs, including the ability to accurately transcribe specialised terminology and uncommon proper nouns.

Fine-tuning Process

Fine-tuning a model like Whisper involves adjusting the pre-trained model weights slightly so that it performs better on a specific task, which in our case means improving accuracy for the specialised terminology used in our industry. This process requires a dataset annotated with the correct transcriptions, including the specialised terms that we want the model to learn.

To begin with, we collected a small dataset of audio recordings and their corresponding transcripts, which included paragraphs of text containing the specialised terminology and proper nouns relevant to our domain. To our surprise, the model performed well in field testing even with this limited dataset, showcasing the generalisation ability and data efficiency of fine-tuning a large pre-trained ASR model.

Example Code for Fine-tuning Whisper

Below is a simplified example of how one might approach fine-tuning the Whisper model with a custom dataset. This example assumes you have installed the necessary libraries and prepared your dataset in the required format.

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import torchaudio

# Load the pre-trained Whisper model and processor
model_name = "openai/whisper-large"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Prepare your custom dataset
# This should be a list of dictionaries with "audio" (path to your audio file) and "text" (your transcription) keys
custom_dataset = [{"audio": "path/to/audio1.wav", "text": "Your transcription here."}, ...]

# Fine-tuning settings
learning_rate = 2e-5
epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Loop over your dataset to fine-tune the model
model.train()
for epoch in range(epochs):
    for example in custom_dataset:
        # Load the audio (assumed mono) and resample it to the 16 kHz expected by Whisper
        waveform, sample_rate = torchaudio.load(example["audio"])
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

        # Convert the waveform into log-Mel input features
        input_features = processor(
            waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
        ).input_features

        # Tokenise the reference transcription to use as training labels
        labels = processor.tokenizer(example["text"], return_tensors="pt").input_ids

        # The model returns the cross-entropy loss when labels are provided
        loss = model(input_features=input_features, labels=labels).loss

        # Backpropagate and update the weights
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

This code snippet provides a basic framework for fine-tuning. In practice, you’ll need to manage data loading and batching more efficiently, possibly using tools like torch.utils.data.DataLoader. Additionally, handling longer audio files may require segmenting them into smaller chunks that fit the model's maximum input size.
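
For instance, building on the snippet above, a DataLoader with a custom collate function might look something like the sketch below. The collate_batch helper is our own invention for this illustration, not part of any library.

from torch.utils.data import DataLoader

def collate_batch(examples):
    # Load and resample every audio file in the batch
    waveforms = []
    for example in examples:
        waveform, sample_rate = torchaudio.load(example["audio"])
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
        waveforms.append(waveform.squeeze().numpy())

    # The processor pads the log-Mel features to a fixed length for us
    input_features = processor(
        waveforms, sampling_rate=16000, return_tensors="pt"
    ).input_features

    # Pad the tokenised transcriptions to the longest label in the batch
    # (in practice you would also replace label padding tokens with -100
    # so they are ignored by the loss)
    labels = processor.tokenizer(
        [example["text"] for example in examples], return_tensors="pt", padding=True
    ).input_ids

    return {"input_features": input_features, "labels": labels}

loader = DataLoader(custom_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)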

Challenges and Considerations

Fine-tuning a large model like Whisper requires significant computational resources, typically needing GPUs or TPUs to complete in a reasonable timeframe. Moreover, the quality and quantity of the fine-tuning dataset are vital; insufficient or poorly annotated data can lead to suboptimal outcomes or even exacerbate the model’s existing biases.

After training, it’s essential to evaluate the fine-tuned model thoroughly, comparing its performance on a separate test set to ensure that it has indeed learned to handle the specialised terminology better without losing its general transcription capabilities. The latter failure mode is known as catastrophic forgetting: the model begins to perform poorly on the kind of data it was originally trained on but no longer sees during fine-tuning.
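
As a rough sketch of such an evaluation, word error rate (WER) can be computed on a held-out test set, for example with the Hugging Face evaluate library. The test_dataset variable below is assumed to have the same format as custom_dataset.

import evaluate

wer_metric = evaluate.load("wer")

model.eval()
predictions, references = [], []
with torch.no_grad():
    for example in test_dataset:  # assumed: same {"audio", "text"} format as custom_dataset
        waveform, sample_rate = torchaudio.load(example["audio"])
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
        input_features = processor(
            waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
        ).input_features

        # Generate a transcription and decode it back to text
        predicted_ids = model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        predictions.append(transcription)
        references.append(example["text"])

# Lower WER is better; comparing against the base model on both in-domain
# and general speech helps reveal catastrophic forgetting
print("WER:", wer_metric.compute(predictions=predictions, references=references))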

Algorithm for Sentence Separation

Now that we had a well-performing speech recognition model, we wanted to focus on user experience. Running the ASR in a streaming fashion on short segments of speech allows user commands to be processed with a shorter delay, leading to a smoother experience. The problem then becomes choosing where to split the audio: ASR accuracy can drop drastically if speech is cut mid-sentence, and the full meaning cannot always be reconstructed from fragments of transcribed text. Our investigations in this domain led us through a series of experiments, ranging from basic signal processing techniques to advanced deep learning methods, in pursuit of a reliable solution.

Initial Approaches to Sentence Detection

We began by exploring simple techniques that leverage the natural pauses in speech as markers for sentence boundaries. Using the volume of the audio signal and low-pass filters, we attempted to identify these pauses, but had limited success: the simplicity of these methods is appealing, but we found that they fall short in live audio environments with background noise.
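
To illustrate the general idea, a naive pause detector of this kind might look like the sketch below. This is not our production code, and the cutoff, frame size, and threshold values are arbitrary placeholders.

import numpy as np
from scipy.signal import butter, filtfilt

def detect_pauses(audio, sample_rate=16000, cutoff_hz=50.0, frame_ms=30, threshold=0.01):
    """Naive pause detection on a 1-D float waveform: low-pass filter the
    rectified signal to get a smooth amplitude envelope, then flag frames
    whose mean level falls below a threshold."""
    # Second-order Butterworth low-pass filter applied to the rectified signal
    b, a = butter(2, cutoff_hz / (sample_rate / 2), btype="low")
    envelope = filtfilt(b, a, np.abs(audio))

    # Split the envelope into fixed-length frames and mark quiet ones as pauses
    frame_len = int(sample_rate * frame_ms / 1000)
    pauses = []
    for start in range(0, len(envelope) - frame_len, frame_len):
        frame = envelope[start:start + frame_len]
        if frame.mean() < threshold:
            pauses.append((start / sample_rate, (start + frame_len) / sample_rate))
    return pauses  # list of (start_s, end_s) candidate pause intervals

On clean recordings this works reasonably well, but background noise lifts the envelope above any fixed threshold, which is exactly the failure mode we ran into.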

Our next attempt was to use Voice Activity Detection (VAD), a deep learning approach that detects the presence of speech and can therefore be used to locate sentence boundaries. Despite its more advanced nature, we encountered inherent limitations with VAD, particularly its inability to facilitate word-by-word live transcription and its vulnerability to ambient noise. These obstacles made it unsuitable for our use case.
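
For context, an experiment of this kind looks roughly like the sketch below; the open-source Silero VAD model is used here purely as an illustration of a deep-learning-based detector.

import torch

# Load the open-source Silero VAD model and its helper functions from torch.hub
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

# Read a 16 kHz mono recording and find the regions that contain speech
wav = read_audio("path/to/audio1.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Each entry gives the start and end sample of a speech segment; the gaps
# between segments are candidate sentence boundaries
for segment in speech_timestamps:
    print(segment["start"], segment["end"])

The detector returns speech segments and the silences between them, but it says nothing about the words inside a segment, which is why it cannot drive word-by-word live transcription.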

Leveraging Whisper for Sentence Separation

Our adoption of OpenAI’s Whisper model brought a new dimension to our efforts. Whisper’s capability to output transcriptions at the word level introduced a potential pathway to reliable sentence separation. The challenge then shifted to managing the continuous stream of audio bytes via WebSocket and ensuring the accuracy and coherence of transcriptions in real-time.

While the word- and segment-level timestamps that Whisper derives from attention weights and forced alignment appeared to be a solution, they lacked the reliability needed for seamless sentence detection. Because these timestamps are interpolated rather than learned directly during training, they proved prone to inconsistencies, leading to disruptions in the transcription process.

The Majority Vote Algorithm: A Novel Solution

Our breakthrough came with the conceptualisation and implementation of a majority vote algorithm, designed to handle scenarios where the model outputs different transcriptions as new audio bytes come in. The algorithm follows these steps (a simplified code sketch appears after the list):

  1. Candidate Word Tracking: for every position within a sentence, maintain a list of potential words predicted by the model, effectively capturing the dynamic nature of live transcription.

  2. Vote Aggregation: as new inferences emerge with the latest audio bytes appended, update the votes for candidate words, gradually building towards a consensus.

  3. Word Finalisation: establish a word at a specific index as finalised once its vote count surpasses a predetermined threshold (set to 4 in our implementation, which gave the highest end-to-end accuracy).

  4. Sentence Punctuation Detection: analyse the punctuation of each word to determine sentence boundaries, using punctuation as a reliable indicator for the end of a sentence.

  5. Sentence Finalisation: consider a sentence complete when all its constituent words have been finalised, thus ensuring a consistent and accurate transcription.
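
To make these steps concrete, the sketch below implements the core voting logic. The class and helper names are invented for this example, and it simplifies the bookkeeping of our production implementation, for instance by resetting state rather than re-indexing the remaining positions after each finalised sentence.

from collections import defaultdict

FINALISE_THRESHOLD = 4          # votes a word needs before it is locked in
SENTENCE_END = (".", "!", "?")  # punctuation treated as a sentence boundary

class MajorityVoteTranscriber:
    """Aggregates successive streaming hypotheses into finalised sentences."""

    def __init__(self):
        self.votes = defaultdict(lambda: defaultdict(int))  # position -> word -> vote count
        self.finalised = {}                                  # position -> finalised word

    def add_hypothesis(self, words):
        """Record one transcription hypothesis (a list of words) produced after
        the latest audio bytes were appended."""
        for position, word in enumerate(words):
            if position in self.finalised:
                continue  # this position has already been decided
            self.votes[position][word] += 1
            if self.votes[position][word] >= FINALISE_THRESHOLD:
                self.finalised[position] = word

    def pop_sentence(self):
        """Return a sentence once every word up to a sentence-ending punctuation
        mark has been finalised; otherwise return None."""
        sentence = []
        position = 0
        while position in self.finalised:
            word = self.finalised[position]
            sentence.append(word)
            if word.endswith(SENTENCE_END):
                # Simplification: reset all state instead of re-indexing the
                # remaining positions for the next sentence
                self.votes.clear()
                self.finalised.clear()
                return " ".join(sentence)
            position += 1
        return None

In use, each new streaming inference is split into words and fed to add_hypothesis, and pop_sentence is polled after every update; whenever it returns a sentence, that sentence is passed downstream for processing.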

This algorithm, distinguished by its accuracy and low latency, led to a significant improvement in user experience. This methodology not only enhanced the robustness of our transcription process against environmental noise but also aligned with our goal of delivering real-time, word-by-word transcription in live-streaming contexts.

Conclusion

The path to seamlessly incorporating ASR technology into our platform has been both a formidable challenge and a source of valuable insights. We initially leaned on Deepgram’s speech-to-text API, drawn by its streaming functionality and ease of integration, especially with React. However, the limitations of keyword boosting prompted us to rethink our strategy. Switching to a custom model, in spite of the substantial engineering challenges, underscored our dedication to delivering a solution that not only meets general transcription needs but excels in handling the nuances of domain-specific language.

The challenges of integrating ASR technology, particularly in live streaming environments, highlighted the necessity for a robust method to ensure sentence separation and transcription accuracy amidst varying audio conditions. Deploying our own method led to significant improvements over off-the-shelf solutions. This exploration and eventual pivot to a custom-developed ASR model is a testament to our team’s culture of innovation and resilience. We are proud of the ASR system we have deployed, and while we expect to find areas that need work down the line, we are excited to be on the quest to keep improving it.