Parakeet (extract text from audio)

Parakeet is a family of Automatic Speech Recognition (ASR) models developed by NVIDIA in collaboration with Suno.ai. The models are built on the Fast Conformer architecture, which is designed to balance high transcription accuracy with fast inference. They are known for running more efficiently than much larger models (such as OpenAI's Whisper) while maintaining competitive or better Word Error Rates (WER). Quality metric: WER of 6.03 on the Hugging Face Open ASR Leaderboard.
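For readers unfamiliar with the WER metric, it is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words; leaderboards usually report it as a percentage. The sketch below is illustrative only: the function name `word_error_rate` is hypothetical, and the only text normalization applied is lowercasing and whitespace splitting, which is an assumption and is simpler than what benchmark evaluations typically do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one row earlier) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is missing from the hypothesis.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2%}")
```

A lower value is better; a WER of 0 means the output matches the reference word for word.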

MVSep provides two versions of the model (v2 and v3):
Model page v2: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
Model page v3: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3

Parakeet v2 (Parakeet TDT 0.6B v2)

Released as a highly efficient English-focused model, v2 established Parakeet as a leader in speed-to-accuracy ratio.

Language: English (en-US) only.
Size: 0.6 billion parameters (600M), which makes it lightweight compared to the 1.1B parameters of previous versions.
Performance: It achieves industry-leading accuracy (approx. 6% WER on standard benchmarks) and is noted for being up to 50x faster than real-time.
Capabilities:
- Supports highly accurate word-level timestamps.
- Includes automatic punctuation and capitalization.
- Effective at transcribing difficult content such as song lyrics and spoken numbers.
- Can handle long-form audio (up to 11 hours in some configurations) using local attention mechanisms.
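Word-level timestamps are what make subtitle export possible. As a rough sketch, the snippet below groups timestamped words into SRT cues. It assumes the transcription result has already been converted to a list of dictionaries with `word`, `start`, and `end` keys (times in seconds); that structure, the function names `srt_time` and `words_to_srt`, and the `max_words` parameter are all illustrative assumptions, not the format MVSep actually returns.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group word timestamps into numbered SRT cues of at most max_words words.

    Assumed input shape: [{"word": str, "start": float, "end": float}, ...]
    """
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        text = " ".join(w["word"] for w in chunk)
        # A cue spans from the start of its first word to the end of its last.
        start, end = chunk[0]["start"], chunk[-1]["end"]
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.5},
        {"word": "world.", "start": 0.5, "end": 1.25},
    ]
    print(words_to_srt(sample))
```

Splitting on a fixed word count is the simplest choice; because the model also emits punctuation, a real exporter might prefer to break cues at sentence boundaries instead.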

Parakeet v3 (Parakeet TDT 0.6B v3)

The v3 release marked the expansion of the efficient Parakeet architecture from English-only to a multilingual domain without increasing the model size.

Language: Multilingual (supports 25 European languages, including English, Spanish, French, German, Russian, and others).
Size: Retains the compact 0.6 Billion parameter size.
Key Upgrade: It is trained on the massive Granary multilingual corpus (approx. 1 million hours of audio).
New Features:
- Automatic Language Detection: The model can identify the spoken language from the audio signal and transcribe it without manual prompting.
- High Throughput: Despite the added multilingual capabilities, it retains the ultra-fast inference speeds of the v2 TDT architecture.
- Versatility: It serves as a drop-in replacement for v2 for users requiring support for European languages while maintaining low latency and compute costs.
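The choice between the two versions can be reduced to a simple rule of thumb, sketched below. The model identifiers come from the model page links above; the function name `choose_parakeet` and the policy itself (v2 for audio known to be English, v3 otherwise) are assumptions for illustration, not an MVSep recommendation. The sketch does not check whether a language is among the 25 that v3 supports, so that still has to be verified against the v3 model page.

```python
from typing import Optional

PARAKEET_V2 = "nvidia/parakeet-tdt-0.6b-v2"  # English only
PARAKEET_V3 = "nvidia/parakeet-tdt-0.6b-v3"  # multilingual, auto language detection


def choose_parakeet(language: Optional[str] = None) -> str:
    """Pick a model ID from a language tag such as "en", "en-US", or "de".

    Assumed policy: use v2 when the audio is known to be English,
    and v3 when the language is different or unknown (None).
    """
    if language is None:
        # Unknown language: rely on v3's automatic language detection.
        return PARAKEET_V3
    # Keep only the primary subtag, so "en-US" and "en_GB" both become "en".
    primary = language.replace("_", "-").split("-")[0].lower()
    return PARAKEET_V2 if primary == "en" else PARAKEET_V3


if __name__ == "__main__":
    for tag in ("en-US", "de", None):
        print(tag, "->", choose_parakeet(tag))
```

Since v3 also covers English, always selecting v3 is an equally valid policy when a single model for all inputs is preferred.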
