The MVSep Wind model produces high-quality separation of music into a wind part and everything else. It is available in 2 variants based on the following architectures: MelBand Roformer and SCNet Large. Wind covers 2 categories of instruments: brass and woodwind. More specifically, the wind stem includes: flute, saxophone, trumpet, trombone, horn, clarinet, oboe, harmonica, bagpipes, bassoon, tuba, kazoo, piccolo, flugelhorn, ocarina, shakuhachi, melodica, reeds, didgeridoo, musette, gaida.
Quality metrics (Wind dataset)

| Algorithm name | SDR Wind | SDR Other |
| --- | --- | --- |
| MelBand Roformer | 6.73 | 16.10 |
| SCNet Large | 6.76 | 16.13 |
| MelBand + SCNet Ensemble | 7.22 | 16.59 |
| MelBand + SCNet Ensemble (+extract from Instrumental) | | |
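For reference, the SDR (signal-to-distortion ratio) numbers in the table measure the log ratio between the energy of the reference stem and the energy of the separation error. A minimal sketch of this definition (the exact evaluation script used to produce these figures may differ):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Global SDR in dB: higher values mean the estimate is closer to the reference stem."""
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
```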
The MVSep Brass is a high-quality model for separating music into brass wind instruments and everything else. List of instruments: trumpet, trombone, horn, tuba, flugelhorn, untagged brass.
The MVSep Woodwind is a high-quality model for separating music into woodwind instruments and everything else. List of instruments: oboe, saxophone, flute, bassoon, clarinet, piccolo, English horn, untagged woodwind.
The MVSep Percussion is a high-quality model for separating music into percussion instruments and everything else. List of instruments: bells, tubular bell, cow bell, congas, celeste, marimba, glockenspiel, tambourine, timpani, triangle, wind chimes, bongos, clap, xylophone, mallets, metal bars, wooden bars.
MVSep DnR v3 is a cinematic model for splitting tracks into 3 stems: music, sfx and speech. It is trained on the huge multilingual DnR v3 dataset. The quality metrics on the test data turned out to be better than those of the similar multilingual model Bandit v2. The model is available in 3 variants: based on the SCNet architecture, the MelBand Roformer architecture, and an ensemble of these two models. See the table below:
The algorithm restores audio quality. The model was proposed in this paper and published on GitHub.
There are 3 models available: 1) MP3 Enhancer (by JusperLee) restores MP3 files compressed at a bitrate of 32 kbps up to 128 kbps; it will not work for files with a higher bitrate. 2) Universal Super Resolution (by Lew) restores higher frequencies for any music. 3) Vocals Super Resolution (by Lew) restores higher frequencies and overall quality for any vocals.
The AudioSR algorithm (Versatile Audio Super-resolution at Scale) restores high frequencies. It works on all types of audio (e.g., music, speech, dog barking, rain, ...). It was initially trained on mono audio, so results on stereo files can be less stable.
Audio generation based on a given text prompt. The generation uses the Stable Audio Open 1.0 model. Audio is generated in stereo at a sample rate of 44.1 kHz, with a duration of up to 47 seconds. The quality is quite high. Prompts work best in English.
Example prompts: 1) Sound effects generation: cats meow, lion roar, dog bark 2) Sample generation: 128 BPM tech house drum loop 3) Specific instrument generation: A Coltrane-style jazz solo: fast, chaotic passages (200 BPM), with piercing saxophone screams and sharp dynamic changes
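If you want to run the same model locally, here is a minimal sketch using the StableAudioPipeline from the diffusers library; the filename, seed, step count and clip duration are illustrative, not the exact settings used on MVSep:

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load Stable Audio Open 1.0 (requires accepting the model license on Hugging Face)
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(
    "128 BPM tech house drum loop",
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=30.0,  # duration of the generated clip in seconds (model limit is about 47 s)
    generator=generator,
).audios

# The pipeline returns waveforms at the VAE sampling rate (44.1 kHz for this model)
sf.write("drum_loop.wav", audio[0].T.float().cpu().numpy(), pipe.vae.sampling_rate)
```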
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has several versions. On MVSep we use the largest and most precise one: "Whisper large-v3". The Whisper large-v3 model was trained on several million hours of audio. It is a multilingual model and it detects the language automatically. To apply the model to your audio you have 2 options: 1) "Apply to original file" - the Whisper model is applied directly to the file you submit. 2) "Extract vocals first" - in this case, before running Whisper, the BS Roformer model is applied to extract the vocals first. This can remove unnecessary noise and improve Whisper's output.
The original model has some problems with transcription timings. This was fixed by @linto-ai, whose transcription is used by default (option "New timestamped"). You can return to the original timings by choosing the option "Old by whisper".
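For local experiments, the same checkpoint can be run with the openai-whisper Python package. A minimal sketch (the filename is a placeholder; the improved timestamps by @linto-ai come from the separate whisper-timestamped project rather than this package):

```python
import whisper

# Load the same checkpoint used on MVSep
model = whisper.load_model("large-v3")

# Language is detected automatically; pass an already-extracted vocal stem
# if you want the equivalent of the "Extract vocals first" option.
result = model.transcribe("vocals.wav")

print(result["language"])
print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text']}")
```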
Parakeet is a family of state-of-the-art Automatic Speech Recognition (ASR) models developed by NVIDIA in collaboration with Suno.ai. These models are built on the Fast Conformer architecture, designed to deliver a balance of high transcription accuracy and exceptional inference speed. They are widely recognized for outperforming much larger models (like OpenAI's Whisper) in efficiency while maintaining competitive or superior Word Error Rates (WER). Quality metric: 6.03 WER on the Hugging Face Open ASR Leaderboard.
Parakeet v2 (Parakeet TDT 0.6B v2)
Released as a highly efficient English-focused model, v2 established Parakeet as a leader in speed-to-accuracy ratio.
Language: English (en-US) only.
Size: 0.6 Billion parameters (600M), making it lightweight compared to the 1.1B parameters of previous versions.
Performance: It achieves industry-leading accuracy (approx. 6% WER on standard benchmarks) and is noted for being up to 50x faster than real-time.
Capabilities:
Supports highly accurate word-level timestamps.
Includes automatic punctuation and capitalization.
Effective at transcribing challenging content such as song lyrics and spoken numbers.
Can handle long-form audio (up to 11 hours in some configurations) using local attention mechanisms.
Parakeet v3 (Parakeet TDT 0.6B v3)
The v3 release marked the expansion of the efficient Parakeet architecture from English-only to a multilingual domain without increasing the model size.
Language: Multilingual (supports 25 European languages, including English, Spanish, French, German, Russian, and others).
Size: Retains the compact 0.6 Billion parameter size.
Key Upgrade: It is trained on the massive Granary multilingual corpus (approx. 1 million hours of audio).
New Features:
Automatic Language Detection: The model can identify the spoken language from the audio signal and transcribe it without manual prompting.
High Throughput: Despite the added multilingual capabilities, it retains the ultra-fast inference speeds of the v2 TDT architecture.
Versatility: It serves as a drop-in replacement for v2 for users requiring support for European languages while maintaining low latency and compute costs.
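A minimal sketch of running Parakeet locally through NVIDIA NeMo. The checkpoint names follow the Hugging Face model cards (the v3 name is our assumption for the multilingual drop-in replacement), and the exact timestamp output structure may vary between NeMo versions:

```python
import nemo.collections.asr as nemo_asr

# English-only v2; for the multilingual v3, swap the name to "nvidia/parakeet-tdt-0.6b-v3"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a mono WAV file and request word-level timestamps
output = asr_model.transcribe(["speech.wav"], timestamps=True)

print(output[0].text)
for word in output[0].timestamp["word"]:
    print(f"{word['start']:.2f}s - {word['end']:.2f}s : {word['word']}")
```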
VibeVoice is a model for generating natural conversational dialogues from text, with the ability to use a reference voice for cloning.
Key features:
Two models: small and large
Up to 90 minutes of generated audio
Language support: 2 languages are supported: English (default) and Chinese
Voice cloning: ability to upload a reference audio recording
How to use the model
The text must be only in English or Chinese; quality is not guaranteed for other languages. Maximum text length is 5000 characters. Avoid special characters.
The reference voice recording should be 5 to 15 seconds long. If your track is longer, it will be automatically trimmed at the 15-second mark.
The reference track should contain only voice and nothing else. If you have background sounds or music, use the "Extract vocals first" option.
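A minimal sketch for preparing a reference clip before uploading: trim it to the 15-second limit mentioned above (the filename is a placeholder; vocal extraction itself is handled by the "Extract vocals first" option on the site):

```python
import soundfile as sf

# Load the reference recording and keep at most the first 15 seconds,
# matching the automatic trimming described above.
audio, sample_rate = sf.read("reference_voice.wav")
max_samples = 15 * sample_rate
sf.write("reference_voice_15s.wav", audio[:max_samples], sample_rate)
```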
How to generate a reference track?
We need phonetic diversity (all the sounds of the language) and lively intonation. A text of about 35–40 words, read calmly, takes roughly 15 seconds.
Here are three options in English for different tasks:
Option 1: Universal (Balanced & Clear)
The best choice for general use. Contains complex sound combinations to tune clarity.
"To create a perfect voice clone, the AI needs to hear a full range of phonetic sounds. I am speaking clearly, taking small pauses, and asking: can you hear every detail? This short sample captures the unique texture and tone of my voice."
Option 2: Conversational (Vlog & Social Media)
For voiceovers in videos, YouTube, or blogs. Read vividly, with a smile, changing the pitch of your voice.
"Hey! I’m recording this clip to test how well the new technology works. The secret is to relax and speak exactly like I would to a friend. Do you think the AI can really copy my style and energy in just fifteen seconds?"
Option 3: Professional (Business & Narration)
For presentations, audiobooks, or official announcements. Read confidently, slightly slower, emphasizing word endings.
"Voice synthesis technology is rapidly changing how we communicate in the digital age. It is essential to speak with confidence and precision to ensure high-quality output. This brief recording provides all the necessary data for a professional and accurate digital clone."
Tips for recording:
Pronunciation: Try to articulate word endings clearly (especially t, d, s, ing). Models "love" clear articulation.
Flow: Don't read like a robot. In English, intonation is important: the voice should "float" up and down a bit rather than staying on a single note.
Breathing: If you pause at a comma or period, don't be afraid to take an audible breath. This will add realism to the clone.
VibeVoice (TTS) is a model for generating natural conversational dialogues from text, capable of creating dialogues with up to 4 speakers and durations of up to 90 minutes.
Key Features:
Two models: small and large
Up to 4 speakers in a single recording
Up to 90 minutes of generated audio
Language support: officially supports 2 languages: English (default) and Chinese, but it has been verified to work decently for other languages as well.
How to use the model
The text must be in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters. The text must be formatted specifically to indicate speakers:
Correct format:
Speaker 1: Hello! How are you today?
Speaker 2: I'm doing great, thanks for asking!
Speaker 1: That's wonderful to hear.
Speaker 3: Hey everyone, sorry I'm late!
Incorrect format:
Hello! How are you today?
I'm doing great!
Important:
Each line must start with Speaker N: (where N is a number from 1 to 4)
Case does not matter: Speaker 1: = speaker 1: = SPEAKER 1:
If you need a monologue, you do not need to specify a speaker.
Example scenarios:
Monologue (1 speaker):
Speaker 1: Today I want to talk about artificial intelligence.
Speaker 1: It's changing our world in incredible ways.
Speaker 1: From healthcare to entertainment, AI is everywhere.
Dialogue (2 speakers):
Speaker 1: Have you tried the new restaurant downtown?
Speaker 2: Not yet, but I've heard great things about it!
Speaker 1: We should go there this weekend.
Speaker 2: That sounds like a perfect plan!
Group conversation (3-4 speakers):
Speaker 1: Welcome to our podcast, everyone!
Speaker 2: Thanks for having us!
Speaker 3: It's great to be here.
Speaker 4: I'm excited to share our thoughts today.
Speaker 1: Let's start with introductions.
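A minimal sketch for checking that a multi-speaker script follows the Speaker N: format described above before submitting it. The regular expression mirrors the rules (speaker number from 1 to 4, case-insensitive); the sample text is illustrative:

```python
import re

# Matches lines like "Speaker 1: Hello!", "speaker 2: ...", "SPEAKER 4: ..."
SPEAKER_LINE = re.compile(r"^\s*speaker\s*([1-4])\s*:\s*\S", re.IGNORECASE)

def validate_script(text: str) -> list[str]:
    """Return a list of problems found in a multi-speaker script."""
    problems = []
    for i, line in enumerate(text.splitlines(), start=1):
        if line.strip() and not SPEAKER_LINE.match(line):
            problems.append(f"Line {i} does not start with 'Speaker N:': {line!r}")
    return problems

script = """Speaker 1: Have you tried the new restaurant downtown?
Speaker 2: Not yet, but I've heard great things about it!"""
print(validate_script(script) or "Script format looks OK")
```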
MVSep MultiSpeaker (MDX23C) - this model tries to isolate the loudest voice from all other voices. It uses the MDX23C architecture. Still under development.
The algorithm adds a "whispering" effect to vocals. The model was created by SUC-DriverOld. More details here.
The Aspiration model separates out: 1) audible breaths, 2) the hissing and buzzing of fricative consonants ('s' and 'f'), 3) plosives: the voiceless burst of air produced while singing a consonant (like /p/, /t/, /k/).
Matchering is a novel tool for audio matching and mastering. It follows a simple idea - you take TWO audio files and feed them into Matchering:
TARGET (the track you want to master, you want it to sound like the reference)
REFERENCE (another track, like some kind of "wet" popular song, that you want your target to sound like)
This algorithm matches both of these tracks and provides you with a mastered TARGET track that has the same RMS, frequency response (FR), peak amplitude and stereo width as the REFERENCE track.
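The same processing is available as the open-source matchering Python package. A minimal sketch (filenames are placeholders):

```python
import matchering as mg

mg.process(
    # TARGET: the track you want to master
    target="my_track.wav",
    # REFERENCE: the track whose RMS, frequency response, peak amplitude
    # and stereo width you want to match
    reference="reference_song.wav",
    # Save the mastered result as a 16-bit and a 24-bit WAV
    results=[mg.pcm16("my_track_master_16bit.wav"), mg.pcm24("my_track_master_24bit.wav")],
)
```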
SOME (Singing-Oriented MIDI Extractor) is a MIDI extractor that converts a singing voice into a MIDI sequence. The model was trained only on Chinese vocals, so it might not work well for other languages.
Transkun is a modern open-source model for automatic piano music transcription (Audio-to-MIDI). The official page of the model is here. It is considered one of the best (SOTA, state of the art) in its class. The model can recognize not only the notes themselves but also their duration, loudness (velocity), and pedal usage. Unlike many older models that analyze music "frame by frame" (frame-based), Transkun uses a Neural Semi-CRF (semi-Markov Conditional Random Field) approach. Instead of asking "is a note sounding at this millisecond?", the model treats events as whole intervals (from the start to the end of a note). The latest versions use a non-hierarchical Transformer that estimates the probability that a specific time segment is a note. Decoding: the Viterbi algorithm is used to find the most probable sequence of non-overlapping intervals. The model demonstrates excellent results on the MAESTRO dataset (the industry standard).
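If you want to run a similar transcription locally, the Transkun repository documents a pip package with a command-line entry point. A minimal sketch, assuming the `transkun` command takes an input audio file and an output MIDI path as described on its GitHub page (filenames are placeholders):

```python
import subprocess

# Assumption: the pip-installed package exposes a `transkun` command
# that converts an audio recording into a MIDI transcription.
subprocess.run(
    ["transkun", "piano_recording.mp3", "piano_recording.mid"],
    check=True,  # raise an error if the transcription command fails
)
```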
Basic Pitch is a modern neural network from Spotify's Audio Intelligence Lab that converts melodic audio recordings into notes (MIDI format). Unlike outdated converters, this model can "hear" not only individual notes but also chords, along with the finest nuances of a performance. Official page: https://github.com/spotify/basic-pitch
Key Features
Polyphonic recognition: Basic Pitch handles complexity with ease. You can upload recordings of piano, guitar, or ensembles — the model recognizes multiple notes sounding simultaneously.
Nuance preservation (Pitch Bend): Most converters "quantize" sound to the nearest note, stripping away expression. Basic Pitch preserves pitch changes (pitch bends). If you sing with vibrato or perform bends on a guitar, these details will remain in the MIDI file.
Versatility: The model is trained on a massive dataset and works with most melodic instruments.
Speed and efficiency: It is a lightweight model that processes audio quickly without requiring powerful servers.
What instruments does the model work with?
Basic Pitch is an "instrument-agnostic" model. This means it handles different timbres equally well: - Vocals: Hum a melody into a microphone, and the neural network will turn your voice into a synthesizer part. - Strings: Acoustic and electric guitar, violin, cello. - Keyboards: Pianos, organs, and synthesizers. - Winds: Flute, saxophone, trumpet, and others.
Important: The model is designed for melodic instruments. It is not suitable for drums or percussion, as it focuses on pitch rather than rhythmic noise.
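A minimal sketch of running Basic Pitch locally with Spotify's basic-pitch package (the filename is a placeholder; the bundled default model is used):

```python
from basic_pitch.inference import predict

# Run the bundled model on a melodic recording
model_output, midi_data, note_events = predict("melody.wav")

# midi_data is a PrettyMIDI object that can be written straight to disk
midi_data.write("melody.mid")

# note_events holds (start_s, end_s, midi_pitch, amplitude, pitch_bends) tuples
for start, end, pitch, amplitude, pitch_bends in note_events[:5]:
    print(f"{start:.2f}s -> {end:.2f}s  note {pitch}  amplitude {amplitude:.2f}")
```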
Note: For version A, only the MUSDB18 training data was used, so its quality is worse than Demucs3 Model B. Demucs3 Model A and Demucs3 Model B have the same architecture but different weights.
Experimental model VitLarge23 based on Vision Transformers. In terms of metrics, it is slightly inferior to MDX23C, but may work better in some cases.
Mel Band Roformer is a model proposed by employees of ByteDance for the Sound Demixing Challenge 2023, where they took first place on Leaderboard C. Unfortunately, the model was not made publicly available; it was reproduced from the scientific article by the developer @lucidrains on GitHub. The vocal model was trained from scratch on our internal dataset. Unfortunately, we have not yet been able to achieve the same metrics as the authors.
The LarsNet model divides the drums stem into 5 types: 'kick', 'snare', 'cymbals', 'toms', 'hihat'. The model is from this GitHub repository and was trained on the StemGMD dataset. The model has two operating modes. The first (default) applies the Demucs4 HT model at the first stage, which extracts only the drum part from the track. At the second stage, the LarsNet model is used. If your track consists only of drums, it makes sense to use the second mode, where LarsNet is applied directly to the uploaded audio. Unfortunately, the separation quality is subjectively inferior to the DrumSep model.