At Rev, we’ve been working hard to bring the accuracy benefits of our v2 asynchronous ASR model (announced in Q1 2022) to streaming. Now, we’re happy to announce the general availability of our v2 model for streaming ASR, delivering significant improvements in accuracy and latency.

Trained on millions of hours of transcribed speech and extensively tested in beta by our enterprise customers, our v2 ASR model is now available through our streaming API, giving customers 30% better accuracy and lower partials latency than our previous model for live captions and other speech-to-text needs. With this, our streaming ASR accuracy is now nearly identical to our asynchronous ASR accuracy.

Technical Approach

At Rev, we believe that user satisfaction for a streaming (real-time) speech recognition system depends on four factors:

– Accuracy

– Latency (or Real-Time Factor)

– Partial hypotheses emission frequency

– Final hypotheses emission frequency
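
To make the last two factors concrete, the sketch below shows how a captioning client might treat the two kinds of hypotheses. The message shape used here (a dict with "type" and "text" keys) is a simplified stand-in for illustration, not the exact payload returned by our API:

```python
def handle_hypothesis(message: dict) -> None:
    """Illustrative handler for streaming ASR hypotheses.

    `message` is a simplified stand-in of the form
    {"type": "partial" | "final", "text": "..."}.
    """
    if message["type"] == "partial":
        # Partials arrive while speech is still in flight and may be
        # revised, so overwrite the current caption line in place.
        print(f"\r{message['text']}", end="", flush=True)
    else:
        # Finals are stable: the engine commits the segment and won't
        # change it, so finish the line and move on.
        print(f"\r{message['text']}")

# Example sequence a client might see for a single utterance:
for msg in [
    {"type": "partial", "text": "the"},
    {"type": "partial", "text": "the quick"},
    {"type": "partial", "text": "the quick brown"},
    {"type": "final", "text": "The quick brown fox."},
]:
    handle_hypothesis(msg)
```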

What makes streaming ASR particularly challenging is that the parameters influencing these metrics also affect one another. For example, if we emit partials too frequently, accuracy starts to degrade.

Our v2 ASR model uses a single neural network in an end-to-end (E2E) architecture. Under this approach, the system is trained as a single unit that maps audio directly to text, rather than as a pipeline of separately trained components. By applying this model to our streaming ASR system, we have been able to improve both accuracy (with a relative gain of over 30%!) and partials latency at the same time. Read a technical overview of our v2 model.

Benchmarks

Our key metric for streaming ASR continues to be Word Error Rate (WER). In order to keep our results as reproducible as possible, the table below reports WER results using the same publicly accessible test suites that we use for asynchronous ASR testing.
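
As a refresher, WER counts the word-level substitutions, deletions, and insertions needed to turn a hypothesis into the reference transcript, divided by the number of words in the reference. Here is a minimal Python sketch of that computation (real evaluation pipelines typically also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown socks"))  # 0.25
```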

| Test suite            | v2 asynchronous | v1 streaming | v2 streaming | Relative gain | Gap |
|-----------------------|-----------------|--------------|--------------|---------------|-----|
| US finance            | 8.94%           | 15.39%       | 9.99%        | 35%           | 10% |
| International finance | 13.67%          | 26.18%       | 14.96%       | 43%           | 9%  |

This shows that our v2 streaming model yields an average improvement of roughly 38% in WER overall. Also noteworthy is the narrowing gap between the asynchronous and streaming models (~10%), indicating that our streaming ASR accuracy is now almost as good as our asynchronous ASR accuracy.
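
For concreteness, here is how the two derived columns can be reproduced from the WER columns, using the International finance row as an example:

```python
# WER values from the International finance row of the table above.
v2_async, v1_stream, v2_stream = 13.67, 26.18, 14.96

# "Relative gain": WER reduction of v2 streaming relative to v1 streaming.
relative_gain = (v1_stream - v2_stream) / v1_stream
print(f"{relative_gain:.0%}")  # 43%

# "Gap": how far v2 streaming WER still trails v2 asynchronous WER.
gap = (v2_stream - v2_async) / v2_stream
print(f"{gap:.0%}")  # 9%
```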

Get Started with v2 Streaming ASR

The v2 streaming ASR model described above is our default production model for new users as of July 25, 2022. When no transcriber option is provided, or if the transcriber option is explicitly set to machine_v2, the audio stream will be transcribed by the v2 ASR model.

Here’s an example of what the URL for the streaming connection would look like:

wss://api.rev.ai/speechtotext/v1/stream?access_token=YOUR-ACCESS-TOKEN-HERE&content_type=audio/x-wav&transcriber=machine_v2
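
As an illustrative sketch (not official client code), here is how you might open that connection with the third-party Python websockets package and stream a local WAV file; the file name and chunk size are arbitrary choices for the example:

```python
import asyncio
import websockets  # pip install websockets

ACCESS_TOKEN = "YOUR-ACCESS-TOKEN-HERE"
URL = (
    "wss://api.rev.ai/speechtotext/v1/stream"
    f"?access_token={ACCESS_TOKEN}"
    "&content_type=audio/x-wav"
    "&transcriber=machine_v2"
)

async def stream_file(path: str) -> None:
    async with websockets.connect(URL) as ws:
        # Send the audio in small chunks, as a live source would.
        with open(path, "rb") as audio:
            while chunk := audio.read(8000):
                await ws.send(chunk)
        # "EOS" signals end of stream; check the Streaming API docs
        # for the current end-of-stream protocol.
        await ws.send("EOS")
        # Print hypotheses as they arrive, until the server closes.
        async for message in ws:
            print(message)

asyncio.run(stream_file("example.wav"))
```

Each received message is a hypothesis from the service; see the Streaming Speech-to-Text API documentation for the exact response format.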

For existing pay-as-you-go (PAYG) and enterprise users, the v2 model automatically becomes the default on August 25, 2022 and January 25, 2023, respectively. Once the v2 ASR model is the default, it will no longer be necessary to specify transcriber: machine_v2 in API and SDK operations.

Two further points to note:

  • The v2 streaming ASR model only supports English language input (for now).
  • Transcription pricing for the v2 model is the same as under the previous model. For more information on pricing, please contact sales@rev.ai.

Learn more about our Streaming Speech-to-Text API and transcription options (including a summary of the v1 to v2 migration roadmap).

Our new streaming model is an outcome of thousands of hours of effort, and we would love to hear your feedback on it. Please write in and let us know at support@rev.ai.