Building a Speech Recognition System vs. Buying or Using an API
If your company operates in a domain that requires frequent speech-to-text transcription, you’re probably wondering whether there’s a long-term payoff in building your own automatic speech recognition (ASR) system vs. buying on-demand access via a service such as Rev.ai.
It’s a tricky question, and the decision may skew very clearly one way or the other if you’re at either extreme of usage. However, we’ve found that for almost all businesses with typical usage, paying for an on-demand speech recognition service is much better value than building your own. Here’s why.
Development Team Costs
The first thing you’ll need to realize about building an ASR system is that it’s not a simple task that you can offshore or hand off to inexperienced workers. The technology behind ASR is machine learning, which involves a huge amount of math, data, and software domain expertise.
You’ll typically have to hire more than one person: finding all of this expertise in a single individual is rare, and even if you do, building an ASR system is not a one-person job. A minimal team usually includes at least one each of a machine learning scientist or researcher (PhD level), a software engineer to build APIs and handle deployment, and a data engineer to warehouse and manage the text, audio, and other training data.
Salaries for these positions, at least if you want to attract decent engineers, will easily run $100k-200k per person, and more for experienced people. That puts annual expenditure at $300k-600k or more for a lean, bare-bones development team.
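The arithmetic behind that range can be sketched in a few lines of Python. This is only an illustration using the salary figures quoted above; the function name is hypothetical, and the numbers cover salaries alone, not benefits, recruiting, or tooling.

```python
# Salary range per person quoted above, in USD per year.
SALARY_LOW = 100_000
SALARY_HIGH = 200_000

# The minimal team described above: one person per role.
ROLES = ["machine learning scientist", "software engineer", "data engineer"]


def annual_team_cost(headcount, salary_low=SALARY_LOW, salary_high=SALARY_HIGH):
    """Return the (low, high) yearly salary bill for a team of `headcount` people."""
    if headcount < 0:
        raise ValueError("headcount must not be negative")
    return headcount * salary_low, headcount * salary_high


if __name__ == "__main__":
    low, high = annual_team_cost(len(ROLES))
    print(f"Yearly salaries for {len(ROLES)} people: ${low:,} to ${high:,}")
```

Adding even one more engineer to the team moves the whole range up by another $100k-200k per year.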
“Well,” you might be thinking, “that’s not so bad. I’ll just pay them for 6 months to 1 year and then I’ll have my ASR system finished for use in perpetuity.”
Not so fast. While it’s a nice dream, the world of software development is inherently messy. Even if such a small team were able to build a high-quality ASR system in such a short time frame (highly unlikely), it’s not something you can build once and then expect to run smoothly forever. Machine learning models like this operate in feedback loops.
There will always be small bugs that surface in production, issues that appear as your service scales, and new data that has to be fed into your model. This last one is crucial: the need to update the model on new data, along with phenomena such as model drift, almost certainly means that a live production ASR system will need to be retrained regularly.
All this is to say that the development team is not a one-time cost. It’s a crucial, ongoing cost that will most likely exist for as long as your ASR system is in use. So count on an expenditure of at least $300k-600k per year.
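To make the retraining point concrete, here is a minimal sketch of how a team might watch for model drift. It assumes you periodically collect human-corrected transcripts for a sample of recent audio and compare them with the model’s output using word error rate (WER), a common ASR metric. The function names, the baseline, and the tolerance are all illustrative assumptions, not part of any particular product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def needs_retraining(samples, baseline_wer, tolerance=0.02):
    """Flag drift when the mean WER on recent samples exceeds the baseline.

    `samples` is a list of (human_transcript, model_transcript) pairs.
    `baseline_wer` and `tolerance` are hypothetical thresholds you would
    choose from your own evaluation data.
    """
    if not samples:
        return False
    recent = sum(word_error_rate(ref, hyp) for ref, hyp in samples) / len(samples)
    return recent > baseline_wer + tolerance
```

Someone has to gather those reference transcripts, run the check, and act on the result, which is part of why the team cost never goes away.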
Unfortunately, the costs of building an ASR system don’t stop at the development team. Most current state-of-the-art ASR models are deep learning models, meaning they’re large neural networks that take enormous amounts of data to train properly. Think millions or billions of data points.
That means you’ll need a huge library of audio files (and corresponding text transcripts) to effectively train your model. Unless you’re a search engine giant such as Google, or a dedicated ASR company such as Rev.ai, which has access to years of transcription data logged by its team of 50,000 human transcriptionists, you most likely won’t have access to data at the scale needed to train one of these systems.
Of course, you could go out and gather that data, either by paying for it (think tens of thousands of dollars to license certain datasets), scraping it from the web (thousands of hours of running background scripts), or curating it from your customers (years of back-and-forth human interaction).
Clearly, none of these are ideal solutions for a small to medium-sized business. Even for a large business, the hassle is often much more than it’s worth. That’s why corporate juggernauts such as Bloomberg, VICE, Loom, and others use Rev to generate their transcriptions.
Let’s assume for the sake of argument that you do have the data necessary to train a high-quality ASR model. You’ll still need the infrastructure to train it. Consider one published state-of-the-art ASR setup. Here’s an excerpt about training the model, taken from a summary of current state-of-the-art ASR models:
The network has 12 residual blocks, 30 weight layers, and 67.1M parameters. Training was done using the Nesterov accelerated gradient with learning rate 0.03 and momentum 0.99. The CNN was also implemented on Torch using the cuDNN v5.0 backend. The cross-entropy training took 80 days for 1.5 billion samples using a Nvidia K80 GPU with a 64 batch size per GPU.
Those 80 days of training would be a huge ask for any company, and note that this was only the successful training run. They likely had multiple false starts as well. Note also the dataset size: 1.5 billion samples! Finally, they needed an Nvidia K80 GPU to do the training.
This sort of hardware isn’t cheap, although they likely could have sped up the training process by parallelizing across multiple, more powerful GPUs. That brings down the training time but also significantly raises the infrastructure spend. Like all things, it’s a tradeoff, and probably one you don’t want to concern yourself with.
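The tradeoff can be sketched with some back-of-the-envelope arithmetic. The snippet below takes the 80 days and 1.5 billion samples from the excerpt as a single-GPU baseline (an assumption, since the excerpt does not say how many GPUs were used) and estimates training time on more GPUs. The `speedup_per_gpu` and `scaling_efficiency` parameters are hypothetical knobs; real multi-GPU scaling is rarely perfectly linear.

```python
SECONDS_PER_DAY = 86_400

# Figures taken from the excerpt above.
TRAINING_SAMPLES = 1.5e9
TRAINING_DAYS = 80


def samples_per_second(samples=TRAINING_SAMPLES, days=TRAINING_DAYS):
    """Average training throughput implied by the quoted run."""
    return samples / (days * SECONDS_PER_DAY)


def estimated_days(num_gpus, speedup_per_gpu=1.0, scaling_efficiency=1.0,
                   baseline_days=TRAINING_DAYS):
    """Estimate wall-clock days when spreading the same run over more GPUs.

    speedup_per_gpu: how much faster each new GPU is than the baseline GPU.
    scaling_efficiency: fraction of ideal linear scaling actually achieved.
    """
    effective = num_gpus * speedup_per_gpu * scaling_efficiency
    if effective <= 0:
        raise ValueError("effective GPU capacity must be positive")
    return baseline_days / effective


def gpu_hours(num_gpus, days):
    """Total GPU-hours consumed, the quantity cloud providers bill for."""
    return num_gpus * days * 24


if __name__ == "__main__":
    print(f"Baseline throughput: {samples_per_second():.0f} samples/s")
    days = estimated_days(num_gpus=8, scaling_efficiency=0.8)
    print(f"8 GPUs at 80% efficiency: {days:.1f} days, "
          f"{gpu_hours(8, days):,.0f} GPU-hours "
          f"(baseline: {gpu_hours(1, TRAINING_DAYS):,.0f})")
```

Under these assumptions, wall-clock time drops sharply while total GPU-hours go up, because imperfect scaling means you pay for more hardware time than the single-GPU run used.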
Cost of Buying from a Pre-Built Service or API
Now that we’ve outlined some of the major costs associated with building an ASR system, it’s only fair to compare them to the cost of the alternative: using a service that a dedicated team has already built. Rev.ai operates according to a very simple pricing model. You can either pay as you go for just over 3 cents per minute of audio or video transcribed (a little over $1.80 per hour), or, if you are an enterprise client, get that same service for $1.20 per hour. That’s a discount of at least a third.
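For a rough sense of what those rates mean per year, here is a small sketch. It uses 3 cents per minute as a lower bound for the pay-as-you-go rate (the text says "just over") and $1.20 per hour for the enterprise rate; the yearly volume in the example is a hypothetical input, and the function names are illustrative.

```python
MINUTES_PER_HOUR = 60

# Rates quoted above. The pay-as-you-go figure is a lower bound.
PAYG_RATE_PER_MINUTE = 0.03
ENTERPRISE_RATE_PER_HOUR = 1.20


def per_hour(rate_per_minute):
    """Convert a per-minute price into a per-hour price."""
    return rate_per_minute * MINUTES_PER_HOUR


def annual_cost(hours_per_year, rate_per_hour):
    """Yearly spend for a given transcription volume at a given hourly rate."""
    if hours_per_year < 0:
        raise ValueError("hours_per_year must not be negative")
    return hours_per_year * rate_per_hour


if __name__ == "__main__":
    hours = 3_000  # hypothetical yearly volume
    payg = annual_cost(hours, per_hour(PAYG_RATE_PER_MINUTE))
    enterprise = annual_cost(hours, ENTERPRISE_RATE_PER_HOUR)
    print(f"Pay as you go (at least): ${payg:,.2f} per year")
    print(f"Enterprise: ${enterprise:,.2f} per year")
```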
If you’re like most businesses, you’ll probably be transcribing at most a few thousand hours of audio per year. At the $1.20 per hour rate, that’s only a few thousand dollars per year, less than you’d spend to hire a single developer for a single month! However, even if your usage is extremely high, you’re unlikely to cross the threshold where building your own speech recognition system becomes the cheaper option. Recall the cost of building your own system: a development team at $300k-600k per year, plus data and training infrastructure on top of that.
And that’s for a fairly minimal system without all the bells and whistles. Remember, most of the state-of-the-art systems had 4 to 10 or more authors on their research papers alone, and that probably doesn’t count the software and other engineers who supported the team without working directly on the algorithm itself.
So at the upper end of that cost range, you would need to be transcribing 591,666 hours of audio per year, which at $1.20 per hour comes to roughly $710,000, for the balance to tip in favor of building your own system. And even at that point, all the headaches of managing a dedicated software team may not be worth it.
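The break-even point is a single division: the yearly cost of building divided by the hourly API rate. The sketch below uses the $1.20 per hour enterprise rate from above; the function names are illustrative, and the build costs passed in are whatever estimate you arrive at for your own situation.

```python
ENTERPRISE_RATE_PER_HOUR = 1.20  # USD per hour of audio, as quoted above


def break_even_hours(annual_build_cost, rate_per_hour=ENTERPRISE_RATE_PER_HOUR):
    """Hours of audio per year at which API spend equals the cost of building."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return annual_build_cost / rate_per_hour


def implied_build_cost(hours_per_year, rate_per_hour=ENTERPRISE_RATE_PER_HOUR):
    """Yearly build cost implied by a given break-even volume."""
    return hours_per_year * rate_per_hour


if __name__ == "__main__":
    # Team salaries alone, from the range given earlier in the article.
    for cost in (300_000, 600_000):
        print(f"${cost:,} per year -> {break_even_hours(cost):,.0f} hours")
    # The 591,666-hour figure corresponds to this yearly build cost.
    print(f"591,666 hours -> ${implied_build_cost(591_666):,.0f} per year")
```

Even counting team salaries alone, break-even sits between 250,000 and 500,000 hours of audio per year, far beyond the few thousand hours a typical business transcribes.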