At its core, automatic speech recognition (ASR) – also called speech-to-text or automated transcription – is simply the conversion of spoken language into text. A lot of methodology and technology goes into it, but the end result is a textual record of an audio or video file.

Without ASR, if you wanted to transcribe an audio file, you (or someone else) would need to manually listen to a file and write out all the words. That might be fine for a quick minute or two of audio, but if you have an hour of someone talking? It’s a time-intensive and laborious process.

Fortunately, there are several ASR services out there. Before you decide which one to use, it’s good to know what to look for.

Of course, your decision will ultimately come down to what’s most important to you. Typically, people weigh three traits most heavily: accuracy, turnaround time, and speaker identification and diarization.

Accuracy

You’ve likely seen a comical error in closed captioning during a television show or a live sports broadcast. While this is funny to witness, you definitely don’t want to see it in your own work.

Besides, if you’re paying for an automatic speech recognition service, you expect it to be accurate. You want high quality and a low word error rate (WER), meaning the share of words the service gets wrong when compared with a correct transcript. And there’s no better ASR service for that than ours. Check it out below:

Accuracy comparison (1 – WER)

Service                 Accuracy
Google (video model)    81.9%
Google (audio model)    74.4%

At 82.7 percent accuracy, our service is the most accurate of the ones compared. Google’s video model is a little behind at 81.9 percent, while Google’s audio model (74.4 percent) and Amazon Transcribe are much less accurate. These numbers come from more than 30 hours of audio across 100+ files, so the sample of words is large. The files covered a variety of topics (business, technical, education, etc.), ambient noises, and accents. Whether you’re looking for transcription for your website, closed captioning for your video, or simply notes for your own use, you’ll know you’re getting the most accurate results from us.
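
If you’d like to sanity-check accuracy figures like these on your own files, here is a minimal sketch of how WER is commonly computed: count the word-level substitutions, deletions, and insertions needed to turn the ASR output into a reference transcript, then divide by the number of reference words. The function names and the simple lowercase-and-split normalization are assumptions made for illustration; this is not the exact procedure behind the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy expressed as 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over a lazy dog"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
    print(f"Accuracy: {accuracy(ref, hyp):.1%}")
```

In the sample at the bottom, two of the nine reference words are substituted, so the WER is 2/9 (about 22 percent) and the accuracy is about 78 percent. Keep in mind that WER can exceed 100 percent when the output contains many extra words, and that punctuation and casing rules change the result, so compare services using the same normalization.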

Turnaround Time

Think about when you order food online. How would you feel if you had a cumbersome ordering process, with a ton of different screens to enter data and slow loading times? Then, when you finally make your purchase, they tell you it’ll be a wait of 24 hours – or maybe even longer.

You’d probably be pretty furious, right? Why should something that can be done in minutes take longer than that?

We don’t like waiting around either, which is why we return transcripts in less than a minute for short files, and in up to five minutes for longer files. In the same amount of time it takes to listen to your favorite song (unless your favorite song is something excessively long, like “American Pie”), you’ll have your file fully transcribed.

Note: we are working on our real-time or streaming ASR service and will launch it this year.

Speaker Identification and Diarization

This one is particularly critical for audio and video files that contain multiple speakers. You may be transcribing an interview with a panel of five people, or have a video with a group all chatting at once, trying to make their voices heard among the others.

Speaker diarization takes an audio stream and partitions it into segments by speaker identity. Essentially, it tells you who spoke and when. It has two parts: the first, speaker segmentation, finds the points where the speaker changes within an audio file; the second, speaker clustering, groups the resulting segments together based on the characteristics of each speaker.
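
To make the two steps concrete, here is a toy sketch of the clustering half. It assumes segmentation has already happened and that each segment comes with a small numeric "voice fingerprint" (an embedding). The `Segment` class, the two-number embeddings, the greedy clustering rule, and the 0.8 similarity threshold are all illustrative assumptions; production systems use learned embeddings and more robust clustering.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Segment:
    start: float            # seconds
    end: float              # seconds
    embedding: list[float]  # hypothetical voice fingerprint for this segment


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_speakers(segments: list[Segment], threshold: float = 0.8) -> list[int]:
    """Greedy clustering: join the most similar known speaker, or start a new one."""
    centroids: list[list[float]] = []  # running mean embedding per speaker
    counts: list[int] = []             # number of segments per speaker
    labels: list[int] = []
    for seg in segments:
        scores = [cosine(seg.embedding, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            # No known speaker is similar enough: treat this as a new voice.
            centroids.append(list(seg.embedding))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Fold the segment into the matching speaker's running mean.
            n = counts[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], seg.embedding)
            ]
            counts[best] = n + 1
            labels.append(best)
    return labels


def speaker_turns(segments: list[Segment], labels: list[int]) -> list[tuple]:
    """Merge consecutive segments with the same label into (speaker, start, end)."""
    turns: list[tuple] = []
    for seg, label in zip(segments, labels):
        if turns and turns[-1][0] == label:
            turns[-1] = (label, turns[-1][1], seg.end)
        else:
            turns.append((label, seg.start, seg.end))
    return turns


if __name__ == "__main__":
    segments = [
        Segment(0.0, 4.2, [0.90, 0.10]),
        Segment(4.2, 7.5, [0.88, 0.15]),
        Segment(7.5, 12.0, [0.10, 0.95]),
        Segment(12.0, 15.3, [0.92, 0.08]),
    ]
    for speaker, start, end in speaker_turns(segments, cluster_speakers(segments)):
        print(f"Speaker {speaker}: {start:.1f}s to {end:.1f}s")
```

With the sample data, the first two segments and the last one land with the same speaker and the third gets a speaker of its own, which yields three turns. Note that the labels are only "speaker 0" and "speaker 1"; attaching real names to them is the separate job of speaker identification.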

There’s a lot that goes into speaker identification, but the important thing to note is that not all services offer it. Imagine receiving a transcript back with no idea who’s talking. If you only need bare-bones notes, that might be fine, but in most cases, you’ll want to know who’s talking at various points of your file. Our service is more accurate with speaker identification and diarization and includes both as standard. Most other services either don’t feature them at all or only have a beta offering that results in clunky, inaccurate identification. That can lead to incorrectly quoting or misidentifying a speaker – which could spell big trouble if the content could be viewed as controversial.

Feature comparison

Google (video model)    Beta    Beta
Google (audio model)    Beta    Beta

Other Factors to Consider

While accuracy, turnaround time, and speaker identification and diarization are the traits people ask about most often, there are a number of other considerations when determining which ASR service to use.

Here are a few that people often overlook – even though they can be just as important:

  • Punctuation and sentence structure. This seems like a feature that should be included with every ASR service, but that’s not the case. Sometimes, you’ll just receive text with no capitalization, punctuation, or paragraph breaks. That means a lot of additional labor on your end to turn it into something more legible. We offer punctuation and sentence structure so that your transcription is far easier to read.
  • Custom dictionaries. Do you work in an industry with a lot of technical terms? Or maybe you interact with people and companies that have less common names or unusual spellings. How irritating would it be to have those words constantly underlined as misspelled or autocorrected to something different, simply because your tool’s dictionary didn’t know them? Having the ability to customize your dictionary makes your life a lot simpler. You can submit 10,000 custom words with every file, meaning you can get proper nouns and technical terms right the first time.
  • Training custom models. For larger clients, we offer the ability to create custom models. This allows for more customization and tighter integration with your work.
  • Real-time vs. offline. Real-time, or streaming, automatic speech recognition transcribes audio as it arrives, so you see text while the speaker is still talking. That allows for immediate feedback, which is what live captioning needs. Offline, or asynchronous, automatic speech recognition takes a complete file, processes it in the background, and returns the transcript when it’s finished, leaving you free to work on other tasks in the meantime. Today, our service is asynchronous only. We have our streaming service in alpha, with plans to launch a beta soon.
  • Ease of use. You’ve likely heard the rule popularized by Malcolm Gladwell that you need 10,000 hours of deliberate practice to become world-class in any field. Sometimes, it feels like we need 10,000 hours of working with technology to fully understand how to use it. Luckily, that’s not the case with our service. Our users constantly tell us we’re one of the easiest services to get started on. In fact, they go from signup to submitting orders in a couple of hours. That’s great if you don’t have the time to learn a whole new toolset or set of commands – and chances are, if you’re using an ASR service, it’s because you value your time.
  • Output formats. ASR services can provide a variety of output formats for their users. Ideally, you want multiple formats that are widely accepted and easy to work with. We offer JavaScript Object Notation (JSON) and plain-text files, making it simple to transfer transcriptions to a blog or website. Coming soon, you’ll also have the choice of SubRip Subtitle (SRT) and Web Video Text Tracks (WebVTT) files for captions.
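
To illustrate the output formats point above: a JSON transcript with timestamps is easy to turn into captions yourself. The sketch below converts a JSON transcript into SRT cues. The JSON layout (a top-level `segments` list with `start`, `end`, and `text` fields) is a hypothetical shape chosen for the example, not the actual schema of any particular service, so adjust the field names to whatever your provider returns.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def json_to_srt(transcript_json: str) -> str:
    """Convert a JSON transcript (hypothetical schema) into SRT text."""
    # Assumed layout: {"segments": [{"start": 0.0, "end": 2.5, "text": "..."}]}
    segments = json.loads(transcript_json)["segments"]
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    # Each cue ends with a newline; joining on a newline leaves the blank
    # line that SRT expects between consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = json.dumps(
        {
            "segments": [
                {"start": 0.0, "end": 2.5, "text": "Hello there."},
                {"start": 2.5, "end": 5.0, "text": "Hi, thanks for joining."},
            ]
        }
    )
    print(json_to_srt(sample))
```

WebVTT is close to the same layout: the file begins with a `WEBVTT` line, and timestamps use a period rather than a comma before the milliseconds.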

When you’re thinking of how to choose an ASR service, you need to ask yourself what’s most important to you. Is it accuracy? Turnaround time? Ease of use? The ability to customize your vocabulary?

Answering those questions is a good first step. When you’ve determined the characteristics you want in an ASR tool, you can make your decision. To give yourself a head start, you can get five hours of credit with us absolutely free.
