Rev’s Streaming ASR is Now Over 30% More Accurate…and Faster Too!
At Rev, we’ve been working hard to bring the accuracy benefits of our v2 asynchronous ASR model (announced in Q1 2022) to streaming ASR. Now, we’re happy to announce the general availability of our v2 model for streaming ASR, which brings significant improvements in accuracy and latency.
Our v2 ASR model was trained on millions of hours of transcribed speech and extensively tested in beta by our enterprise customers. Its introduction to our streaming API gives customers over 30% better accuracy and improved partials latency compared with our previous model, for live captions and other speech-to-text needs. With this, our streaming ASR accuracy is now nearly identical to our asynchronous ASR accuracy.
Technical Approach
At Rev, we believe that user satisfaction for a streaming (real-time) speech recognition system depends on four factors:
– Accuracy
– Latency (or Real-Time Factor)
– Partial hypotheses emission frequency
– Final hypotheses emission frequency
What makes streaming ASR particularly challenging is that all parameters that impact these metrics also have an impact on each other. For example, if we emit partials too frequently, then accuracy starts to degrade.
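To make the latency factor concrete: real-time factor (RTF) is commonly defined as processing time divided by audio duration, so a value below 1 means the recognizer keeps up with the incoming audio. The snippet below is a generic Python sketch of that definition. The function name and the timings are hypothetical and purely illustrative; this is not Rev's measurement code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the usual RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings for illustration only: 10 s of audio decoded in 2.5 s.
rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")  # a value below 1.0 means faster than real time
```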
Our v2 ASR model is an end-to-end (E2E) model built on a single neural network. Under this approach, the system is trained as a single unit that ingests audio directly. By applying this model to our streaming ASR system, we have been able to improve both accuracy (with a relative gain of over 30%!) and partials latency at the same time. Read a technical overview of our v2 model.
Benchmarks
Our key metric for streaming ASR continues to be Word Error Rate (WER). In order to keep our results as reproducible as possible, the table below reports WER results using the same publicly accessible test suites that we use for asynchronous ASR testing.
Test suite | v2 asynchronous | v1 streaming | v2 streaming | Relative gain | Gap |
US finance | 8.94% | 15.39% | 9.99% | 35% | 10% |
International finance | 13.67% | 26.18% | 14.96% | 43% | 9% |
This shows that our v2 streaming model yields a ~38% (on average) improvement in WER overall. What is also interesting is the decreasing gap between asynchronous and streaming models (~10%), indicating that our streaming ASR accuracy is now almost as good as our asynchronous ASR accuracy.
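For readers who want to check the arithmetic, here is a minimal Python sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and takes "relative gain" to mean the relative WER reduction from v1 streaming to v2 streaming. Both function names are hypothetical, and this is not Rev's scoring tool. Under that reading, the figures from the table above work out to 35% and 43%.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_gain(old_wer: float, new_wer: float) -> float:
    """Relative WER reduction when moving from old_wer to new_wer."""
    return (old_wer - new_wer) / old_wer


# WER values (in percent) copied from the table: (v1 streaming, v2 streaming).
results = {
    "US finance": (15.39, 9.99),
    "International finance": (26.18, 14.96),
}
for suite, (v1, v2) in results.items():
    print(f"{suite}: {relative_gain(v1, v2):.0%} relative gain")
```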
Get Started with v2 Streaming ASR
The v2 streaming ASR model described above is our default production model for new users as of July 25, 2022. When no transcriber option is provided, or if the transcriber option is explicitly set to machine_v2, the audio stream will be transcribed by the v2 ASR model.
Here’s an example of what the URL for the streaming connection would look like:
wss://api.rev.ai/speechtotext/v1/stream?access_token=YOUR-ACCESS-TOKEN-HERE&content_type=audio/x-wav&transcriber=machine_v
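If you assemble this URL in code, a small helper can keep the query parameters in one place. The Python sketch below is only an illustration: build_stream_url is a hypothetical helper, not part of any Rev SDK. The endpoint and parameter names are copied from the example URL, and the machine_v2 value comes from the transcriber option described above. Passing transcriber=None omits the parameter and relies on the account default.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint copied from the example URL in this post.
STREAM_ENDPOINT = "wss://api.rev.ai/speechtotext/v1/stream"


def build_stream_url(access_token: str,
                     content_type: str = "audio/x-wav",
                     transcriber: Optional[str] = "machine_v2") -> str:
    """Assemble the streaming URL (hypothetical helper, for illustration)."""
    params = {"access_token": access_token, "content_type": content_type}
    if transcriber is not None:
        params["transcriber"] = transcriber
    # safe="/" keeps the slash in audio/x-wav readable instead of encoding it as %2F.
    return f"{STREAM_ENDPOINT}?{urlencode(params, safe='/')}"


print(build_stream_url("YOUR-ACCESS-TOKEN-HERE"))
```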
For existing pay-as-you-go (PAYG) and enterprise users, the v2 model automatically becomes the default from August 25, 2022 (for PAYG users) and January 25, 2023 (for enterprise users). Once your account defaults to the v2 ASR model, it will no longer be necessary to specify transcriber: machine_v2 in API and SDK operations.
Two further points to note:
- The v2 streaming ASR model only supports English language input (for now).
- Transcription pricing for the v2 model is the same as under the previous model. For more information on pricing, please contact sales@rev.ai.
Learn more about our Streaming Speech-to-Text API and transcription options (including a summary of the v1 to v2 migration roadmap).
Our new streaming model is an outcome of thousands of hours of effort, and we would love to hear your feedback on it. Please write in and let us know at support@rev.ai.