Which transcription model should you use?

We looked at real Meeting Bot usage across production teams to see which models developers actually choose, and why.

June 18, 2026 (9 min read)
By Anita Pijffers, Marketeer

Choosing a transcription model for a meeting bot is not a theoretical benchmark problem. It is a production trade-off between cost, accuracy, speaker labels, latency, and how messy your users' meetings are.

We looked at the last three months of Skribby model usage across production meeting bots and recordings. The pattern is clear: developers are not all optimizing for the same thing.

Some want the cheapest possible transcript. Some need realtime output. Some only want raw audio. And increasingly, teams are picking models that handle real meeting audio better: accents, interruptions, speaker changes, phone audio, names, numbers, and domain-specific terms.

Usage distribution over the last three months

This data covers recent Skribby usage across EU and Japan production regions.

Model                          Usage share   Notes
AssemblyAI Universal-2         34.7%         Most used overall
Soniox STT Async v4            25.4%         Previous Soniox async version
No transcription               13.8%         Audio/recording only
Deepgram Nova-2                6.9%          Common async option
Groq Whisper Large v3 Turbo    5.5%          Low-cost transcript option
Soniox STT Real-Time v4        4.2%          Previous Soniox realtime version
ElevenLabs Scribe v2           3.4%          Strong newer async usage
Deepgram Nova-3 Multilingual   2.5%          Newer Deepgram async option
Soniox STT Async v5            1.5%          New Soniox model, already supported
Deepgram Nova-2 Realtime       1.3%          Realtime alternative
Other models                   1.0%          Smaller long-tail usage

The biggest change from earlier usage data is AssemblyAI moving into the top spot. Soniox is still very strong overall: when you combine Soniox async, realtime, v4, and v5 usage, Soniox accounts for 31.1% of model usage.
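
As a quick check on the combined figure, the listed rows can be summed per provider. This is only a sketch: the percentages are copied from the table above, while the dictionary and the `combined_share` helper are illustrative names, not part of any Skribby tooling. `Decimal` is used so the one-decimal percentages add up exactly.

```python
from decimal import Decimal

# Usage shares in percent, copied from the table above.
USAGE_SHARE = {
    "AssemblyAI Universal-2": Decimal("34.7"),
    "Soniox STT Async v4": Decimal("25.4"),
    "No transcription": Decimal("13.8"),
    "Deepgram Nova-2": Decimal("6.9"),
    "Groq Whisper Large v3 Turbo": Decimal("5.5"),
    "Soniox STT Real-Time v4": Decimal("4.2"),
    "ElevenLabs Scribe v2": Decimal("3.4"),
    "Deepgram Nova-3 Multilingual": Decimal("2.5"),
    "Soniox STT Async v5": Decimal("1.5"),
    "Deepgram Nova-2 Realtime": Decimal("1.3"),
    "Other models": Decimal("1.0"),
}


def combined_share(prefix: str) -> Decimal:
    """Sum the share of every listed row whose name starts with `prefix`."""
    return sum(
        (share for name, share in USAGE_SHARE.items() if name.startswith(prefix)),
        Decimal("0"),
    )


print(combined_share("Soniox"))    # 31.1
print(combined_share("Deepgram"))  # 10.7
```

The rows are rounded to one decimal, so summing all of them gives 100.2 rather than exactly 100.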

The short version

If you want the safest default, start with Soniox or AssemblyAI.

If you care mostly about cost, Whisper is still attractive.

If you need realtime transcription, Soniox Real-Time is the most-used realtime option in our data.

If you want no vendor transcription at all, Skribby lets you run bots without transcription and handle the audio yourself.

That last point matters more than people expect. 13.8% of usage runs with no transcription model. These teams are usually doing one of three things:

  • recording meetings and processing the audio later
  • sending audio into their own transcription stack
  • using Skribby for the meeting bot layer, not the transcription layer

That is a valid architecture. A Meeting Bot API should not force transcription if all you need is reliable meeting capture.

Why AssemblyAI is now the most used model

AssemblyAI Universal-2 made up 34.7% of the last three months of usage.

That does not automatically mean it is the best model for every use case. Usage is shaped by customer mix, defaults, pricing, and product requirements. But it does show that many production teams are willing to pay for a more feature-rich transcription model when the meeting output matters.

AssemblyAI is a good fit when you need:

  • speaker labels
  • stable general-purpose transcription
  • profanity filtering
  • custom vocabulary
  • a mature async transcription path

For customer calls, recruiting notes, compliance review, and internal meeting intelligence, that feature set can be worth the extra cost.

Soniox is still the strongest all-rounder

Soniox remains one of the most important models in Skribby.

Combined Soniox usage across v4 async, v4 realtime, v5 async, and v5 realtime was 31.1% over the last three months.

The appeal is simple: Soniox gives you a strong balance of price, quality, speaker separation, multilingual support, and realtime availability.

For meeting bots, that mix is practical. Real meetings are noisy. People interrupt each other. They switch languages. They mention names, numbers, products, and weird internal acronyms. A model that looks good on clean audio can still struggle when it hits a normal sales call or team meeting.

Soniox is built closer to that reality than many generic transcription options.

Soniox v5 is now supported in Skribby

Soniox recently introduced new v5 models:

  • soniox/stt-async-v5
  • soniox/stt-rt-v5

Skribby supports both.
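
In an integration, the choice between the two usually comes down to whether you need transcript output during the meeting. The sketch below shows one way to encode that. Only the two model identifiers come from this article; the function names and the `meeting_url` and `transcription_model` field names are hypothetical placeholders, so check the API reference for the real request shape.

```python
# The two identifiers are taken from this article.
SONIOX_V5_ASYNC = "soniox/stt-async-v5"
SONIOX_V5_REALTIME = "soniox/stt-rt-v5"


def soniox_v5_model(needs_realtime: bool) -> str:
    """Return the realtime identifier only when live output is required."""
    return SONIOX_V5_REALTIME if needs_realtime else SONIOX_V5_ASYNC


def build_bot_config(meeting_url: str, needs_realtime: bool) -> dict:
    """Assemble a bot configuration.

    The field names here are hypothetical and used for illustration only.
    """
    return {
        "meeting_url": meeting_url,
        "transcription_model": soniox_v5_model(needs_realtime),
    }


config = build_bot_config("https://example.com/meeting", needs_realtime=True)
print(config["transcription_model"])  # soniox/stt-rt-v5
```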

We are already seeing early usage in production. Soniox v5 Async is appearing in the model mix, and Soniox v5 Real-Time is starting to show up as well.

Soniox describes v5 as a major upgrade for real-world speech: better accuracy, stronger speaker separation, improved language identification, better handling of names and alphanumeric strings, and improved realtime endpointing.

Those upgrades matter specifically for meeting bots. The hardest parts of meeting transcription are rarely the easy sentences. They are the moments where someone says a customer name, a product code, a date, an account number, or an action item while two people are talking over each other.

That is where better speech models become better product experiences.

Realtime is still a minority, but it matters

Realtime models made up 5.8% of usage over the last three months.

That is still a minority. Most teams can wait until the meeting ends, because async transcription is simpler and often cheaper.

But realtime usage is strategically important. It powers products that need to act during the meeting:

  • live captions
  • AI meeting copilots
  • sales coaching
  • customer support escalation
  • compliance monitoring
  • live CRM enrichment
  • agent workflows that need immediate context

Soniox Real-Time v4 alone accounts for 4.2% of usage, which makes it the largest realtime model in the data. Soniox v5 Real-Time is now supported and starting to show up in usage as well.

If your product needs live transcript events, realtime model quality is not just a transcription decision. It affects latency, interruption handling, endpointing, and how natural the product feels.

Whisper is still useful, but know the trade-off

Groq Whisper Large v3 Turbo accounted for 5.5% of usage.

Whisper remains attractive because it is cheap and broadly capable. It is a good fit when you need a plain transcript and do not care much about speaker labels or realtime output.

That makes it useful for:

  • content repurposing
  • searchable archives
  • simple summaries
  • internal analysis pipelines
  • teams that want the lowest transcription cost

The trade-off is speaker diarization. If your product needs to know who said what, Whisper is usually not the right default.

Deepgram and ElevenLabs have meaningful niches

Deepgram Nova-2, Nova-3 Multilingual, and Nova-2 Realtime together account for 10.7% of usage.

Deepgram is often picked by teams that care about speed, developer experience, or realtime use cases. Nova-3 Multilingual is also starting to show up in production usage.

ElevenLabs Scribe v2 is also showing real adoption at 3.4% of usage. It is becoming part of the practical model set teams evaluate for meeting transcription.

The takeaway: there is no single universal winner. The best model depends on the product you are building.

How to choose a model for your meeting bot

Start with the product requirement, not the model name.

If you need the best default balance:

Use Soniox or AssemblyAI. Both are strong production choices, and both handle more than just plain text.

If you need realtime:

Use Soniox Real-Time or Deepgram Realtime. If you are building a live AI agent or meeting copilot, do not pick an async model and hope to work around it.

If you need the lowest cost:

Use Whisper, or run with no transcription and process the audio yourself.

If you need multilingual meetings:

Evaluate Soniox v5, Deepgram Nova-3 Multilingual, and Whisper on your actual audio. Do not rely only on provider language lists.

If you need speaker-aware meeting intelligence:

Avoid models without diarization. Speaker labels are not a nice-to-have if your product extracts action items, objections, decisions, or commitments.
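
The guidance above can be sketched as a small decision helper. Treat it as an illustration under assumptions: the model names are the ones used in this article, but the `Requirements` type, the `shortlist` function, and the order in which requirements take priority are our own simplification, not a Skribby feature.

```python
from dataclasses import dataclass

# Options the article flags as a poor fit when you need to know who said what.
NO_DIARIZATION = {"Whisper", "No transcription"}
DEFAULTS = ["Soniox", "AssemblyAI"]


@dataclass(frozen=True)
class Requirements:
    realtime: bool = False
    multilingual: bool = False
    lowest_cost: bool = False
    speaker_labels: bool = False


def shortlist(req: Requirements) -> list[str]:
    """Return models worth evaluating first for the given requirements."""
    if req.realtime:
        # Realtime is a hard constraint: async models cannot be worked around.
        return ["Soniox Real-Time", "Deepgram Realtime"]
    if req.multilingual:
        candidates = ["Soniox v5", "Deepgram Nova-3 Multilingual", "Whisper"]
    elif req.lowest_cost:
        candidates = ["Whisper", "No transcription"]
    else:
        candidates = list(DEFAULTS)
    if req.speaker_labels:
        candidates = [m for m in candidates if m not in NO_DIARIZATION]
    # If the filters removed everything, fall back to the general defaults.
    return candidates or list(DEFAULTS)


print(shortlist(Requirements(multilingual=True, speaker_labels=True)))
# ['Soniox v5', 'Deepgram Nova-3 Multilingual']
```

A shortlist is only a starting point. The models on it still need to be tested on your own meeting audio.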

Why model choice matters less when the API normalizes output

One of the reasons Skribby supports multiple transcription providers is that customers should not have to rebuild their meeting bot integration every time they test a model.

Skribby handles the meeting bot layer for Google Meet, Microsoft Teams, and Zoom. It also normalizes transcription output across providers so you can switch models without rewriting your whole integration.

That matters because the right model can change as your product changes.

You might start with Whisper for cost. Move to Soniox for speaker labels. Add realtime transcription later for live agents. Test AssemblyAI for a customer segment that needs specific features. Or bring your own transcription key and run your own provider relationship.

A good Meeting Bot API should make those choices easy.
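
To illustrate why normalized output makes switching cheap, here is a minimal sketch of downstream code that depends only on a provider-independent segment shape. The `Segment` fields below are a hypothetical schema chosen for the example, not Skribby's actual response format. The point is that code written against one shape does not care which model produced the transcript.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical normalized transcript segment (illustrative schema)."""
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float    # seconds from the start of the meeting
    text: str


def talk_time_by_speaker(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


segments = [
    Segment("Speaker A", 0.0, 2.0, "Can you confirm the order number?"),
    Segment("Speaker B", 2.0, 5.0, "Sure, give me a second."),
    Segment("Speaker A", 5.0, 6.0, "Thanks."),
]
print(talk_time_by_speaker(segments))
# {'Speaker A': 3.0, 'Speaker B': 3.0}
```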

Our recommendation

For most teams building with meeting bots, start with Soniox v5 or AssemblyAI Universal-2.

Pick Soniox v5 if you care about the balance between cost, multilingual speech, speaker-aware output, and realtime support.

Pick AssemblyAI if you want a feature-rich async model that is already heavily used in production across our customer base.

Pick Whisper if price matters more than speaker labels.

Pick no transcription if Skribby is only your meeting capture layer and you want to process audio somewhere else.

The important part is not picking the model with the best marketing page. It is picking the model that matches your product's actual meeting audio.

FAQ

What is the most used transcription model in Skribby right now?

AssemblyAI Universal-2 is the most used model over the last three months, with 34.7% of model usage.

Is Soniox still popular?

Yes. Combined Soniox usage across v4 and v5, async and realtime, represents 31.1% of model usage in the last three months.

Does Skribby support Soniox v5?

Yes. Skribby supports soniox/stt-async-v5 and soniox/stt-rt-v5.

How much usage is realtime?

Realtime models account for 5.8% of usage over the last three months.

Why do some bots use no transcription model?

Some teams only need the meeting recording or realtime audio stream. Others process the audio later with their own transcription stack or bring their own provider setup.

Can I switch transcription models without changing my integration?

Yes. Skribby normalizes transcription output across providers, so you can test or switch models without rebuilding your meeting bot integration.

What model should I use for live AI agents?

Use a realtime model. Soniox Real-Time is the most-used realtime option in our recent data, and Soniox v5 Real-Time is now supported.
