Why Cloud Transcription Beats Local Transcription

A question we get a lot: why does WhisperTyping run in the cloud and not on your PC? Wouldn't local be cheaper and more private?

Local has two real advantages: no network hop, and your audio never leaves your machine. On everything else, cloud wins, and the reason is hardware economics.

The Hardware Problem

Good speech recognition needs a powerful GPU with a lot of memory. Running the latest models with low latency takes a fast consumer GPU (think $1,000 and up) or a data-center GPU costing tens of thousands of dollars.

That GPU sits idle almost all the time. Even heavy dictation users speak maybe 20 minutes a day. The rest of the day, expensive hardware does nothing. In the cloud, the same GPU serves many users in parallel, so utilization is high. That's where the cost saving comes from.
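To put a rough number on "idle almost all the time", here is a back-of-the-envelope sketch. It assumes the 20 minutes of speech per day mentioned above and the roughly 10x real-time speed of a consumer GPU from the Speed section; the function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def gpu_busy_fraction(speech_minutes_per_day: float, realtime_factor: float) -> float:
    """Fraction of a day a GPU spends transcribing one user's speech.

    realtime_factor is how many minutes of audio the GPU transcribes
    per minute of wall-clock time (an approximate figure, not a benchmark).
    """
    compute_minutes = speech_minutes_per_day / realtime_factor
    return compute_minutes / MINUTES_PER_DAY


# Assumed inputs: 20 minutes of dictation per day, ~10x real-time GPU.
busy = gpu_busy_fraction(speech_minutes_per_day=20, realtime_factor=10)
print(f"{busy:.4%}")  # prints 0.1389%
```

Under these assumptions a local GPU is busy for about two minutes out of 1,440, which is why sharing the hardware across many users changes the economics.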

Speed

Even with a good consumer GPU on your own PC, you're at maybe 10x real-time. A data center GPU gets you around 50x. The custom ASICs we use in production push that to around 300x.

Local consumer GPU (e.g. RTX 5080, ~$1,000): ~10x real-time
Data center GPU (e.g. NVIDIA H100, ~$30,000): ~50x real-time
WhisperTyping (custom ASICs): ~300x real-time

How many minutes of audio each can transcribe in one minute of wall-clock time.

So a minute of audio takes around six seconds on a decent home GPU, about a second on a data center GPU, and a fifth of a second on our hardware. The gap matters because dictation latency is what makes voice feel as natural as typing.
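The arithmetic is simply audio length divided by the real-time factor. A minimal sketch using the approximate factors quoted in this section (the function name is ours, for illustration only):

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe a clip at a given real-time factor."""
    return audio_seconds / realtime_factor


# Approximate real-time factors quoted in this article.
factors = {
    "Local consumer GPU": 10,
    "Data center GPU": 50,
    "WhisperTyping": 300,
}

for name, factor in factors.items():
    print(f"{name}: {transcription_seconds(60, factor):.1f} s")
# Local consumer GPU: 6.0 s
# Data center GPU: 1.2 s
# WhisperTyping: 0.2 s
```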

We've set up dedicated private transcription servers for some of our most demanding customers in law and medicine, with consumer GPUs costing thousands of dollars. Even those don't reach the latency our free users get on the shared cloud.

Hidden Costs of Running Locally

Even if you own a powerful GPU, running a transcription model locally costs you something every time you use it:

Several GB of RAM and VRAM your other apps no longer have
Battery drain on laptops
Fans, heat, system slowdown while you dictate

With cloud transcription, your laptop streams a small audio signal and receives text back. That's it.

Open Source Local Alternatives

A few open source tools run Whisper on your own machine. Most of them use Whisper Small or Medium, not Large, because the big models are too slow on consumer hardware. Small and Medium make noticeably more errors, especially on accents, names, and technical terms. And even with the smaller model, local transcription is usually slower than our cloud.

Privacy

This is where local has a real advantage. With cloud, you have to trust your provider. Our position is simple: we do zero logging of your dictation. No audio stored, no transcripts stored, nothing used for training. See our privacy policy.

You have to take our word for that, which is a fair concern. For most people the tradeoff is acceptable. For users with strict compliance needs, we offer dedicated private servers and signed BAAs. See the medical page.

Who Should Still Run Locally

Air-gapped environments
Tinkerers who enjoy running models locally

For everyone else, cloud is cheaper, faster, and more accurate.

FAQ

Doesn't the network round trip make cloud slower?

In theory yes, in practice no. WhisperTyping streams audio while you speak, so the network time overlaps with your dictation. By the time you finish speaking, the text is already coming back.
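A toy model shows why the overlap matters. This is a simplified sketch with made-up numbers, not a measurement and not a description of the actual protocol: it assumes audio is sent in small chunks, that the upload link is faster than real time, and that the server transcribes chunks as they arrive. All parameter names and values are hypothetical.

```python
def wait_after_speaking_batch(audio_s: float, upload_ratio: float,
                              realtime_factor: float, rtt_s: float) -> float:
    """Upload starts only after the user stops speaking.

    upload_ratio is upload time divided by audio duration
    (e.g. 0.05 means 1 s of audio uploads in 0.05 s).
    """
    upload_s = audio_s * upload_ratio
    transcribe_s = audio_s / realtime_factor
    return upload_s + transcribe_s + rtt_s


def wait_after_speaking_streaming(chunk_s: float, upload_ratio: float,
                                  realtime_factor: float, rtt_s: float) -> float:
    """Chunks are sent while the user speaks, so only the last chunk is
    still in flight when they stop (assuming the server keeps up)."""
    upload_s = chunk_s * upload_ratio
    transcribe_s = chunk_s / realtime_factor
    return upload_s + transcribe_s + rtt_s


# Hypothetical inputs: 60 s of speech, 0.5 s chunks, 50 ms round trip.
batch = wait_after_speaking_batch(60, upload_ratio=0.05, realtime_factor=300, rtt_s=0.05)
streaming = wait_after_speaking_streaming(0.5, upload_ratio=0.05, realtime_factor=300, rtt_s=0.05)
print(f"batch: {batch:.2f} s, streaming: {streaming:.2f} s")
```

In this model the wait after you stop speaking depends on the size of the last chunk, not on how long you talked.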

Why don't you bundle the model and run it locally as an option?

We've looked at this. To match our accuracy and speed, a user's machine would need a high-end GPU. Most users don't have one, and the model would slow down their other work and drain their battery. A smaller model would solve those problems, but at a real cost in accuracy.

Is my audio recorded?

No. Zero logging of audio or transcripts. See our privacy policy.

Does WhisperTyping work offline?

No. It needs an internet connection. The bandwidth is modest (a few hundred KB per minute), so it works fine on weak Wi-Fi and cellular tethering.
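For a sense of scale, data volume per minute follows directly from the audio bitrate. The sketch below is illustrative only: the 32 kbps figure is an assumed example of a compressed speech stream, not a documented WhisperTyping setting, and the function name is hypothetical. Uncompressed 16 kHz, 16-bit mono audio (256 kbps) is shown for comparison.

```python
def kilobytes_per_minute(bitrate_kbps: float) -> float:
    """Convert an audio bitrate in kilobits per second to kilobytes per minute."""
    return bitrate_kbps * 60 / 8


# Assumed example bitrate for compressed speech (hypothetical value).
print(kilobytes_per_minute(32))   # 240.0

# Uncompressed PCM: 16,000 samples/s * 16 bits * 1 channel = 256 kbps.
print(kilobytes_per_minute(256))  # 1920.0
```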
