
MVSep Wind (wind, other)

The MVSep Wind model produces high-quality separation of music into a wind part and everything else. The model exists in 2 variants based on the following architectures: MelRoformer and SCNet Large. Wind covers 2 categories of instruments: brass and woodwind. More specifically, the wind stem includes: flute, saxophone, trumpet, trombone, horn, clarinet, oboe, harmonica, bagpipes, bassoon, tuba, kazoo, piccolo, flugelhorn, ocarina, shakuhachi, melodica, reeds, didgeridoo, musette, gaida.

Quality metrics

Wind dataset:

Algorithm name                                        | SDR Wind | SDR Other
------------------------------------------------------|----------|----------
MelBand Roformer                                      | 6.73     | 16.10
SCNet Large                                           | 6.76     | 16.13
MelBand + SCNet Ensemble                              | 7.22     | 16.59
MelBand + SCNet Ensemble (+extract from Instrumental) | ---      | ---
BS Roformer                                           | 9.82     | 19.19
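
All quality tables on this page report SDR (signal-to-distortion ratio) in decibels; higher is better. As a rough illustration of the metric (the exact evaluation protocol used on MVSep's leaderboards may differ), the basic definition can be computed with NumPy:

    import numpy as np

    def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
        """Basic signal-to-distortion ratio in dB (higher is better)."""
        signal_power = np.sum(reference ** 2)
        error_power = np.sum((reference - estimate) ** 2)
        return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

    # A nearly perfect estimate yields a high SDR; more distortion lowers it.
    reference = np.sin(np.linspace(0.0, 100.0, 44100))
    estimate = reference + 0.01 * np.random.randn(44100)
    print(sdr(reference, estimate))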


MVSep Brass (brass, other)

MVSep Brass is a high-quality model for separating music into brass wind instruments and everything else. List of instruments: trumpet, trombone, horn, tuba, flugelhorn, untagged brass.


MVSep Woodwind (woodwind, other)

MVSep Woodwind is a high-quality model for separating music into woodwind instruments and everything else. List of instruments: oboe, saxophone, flute, bassoon, clarinet, piccolo, english horn, untagged woodwind.


MVSep Bagpipes (bagpipes, other)

The bagpipe (Bagpipes) is a traditional wind musical instrument known for its characteristic piercing and continuous sound.

How it is constructed:

  • The bag (reservoir): Usually made of animal skin or modern synthetic materials. It serves to store air.

  • Blowpipe: Through this, the musician fills the bag with air using their mouth (in some variations, small bellows pumped by the elbow are used instead).

  • Melody pipe (chanter): A pipe with finger holes, on which the musician plays the main melody by moving their fingers.

  • Drone pipes (drones): One or more pipes that each produce a constant, sustained background tone on a single note.

The principle of playing is that the musician inflates the bag and then presses on it with their arm, evenly pushing air into the sound pipes. Thanks to this reservoir, the music does not stop, even when the performer takes a breath.

Although the bagpipe is most often associated with Scotland (Great Highland Bagpipe) and Celtic culture, various historical variations of it exist throughout Europe, North Africa, and the Middle East.


MVSep Percussion (percussion, other)

MVSep Percussion is a high-quality model for separating music into percussion instruments and everything else. List of instruments: bells, tubular bell, cow bell, congas, celeste, marimba, glockenspiel, tambourine, timpani, triangle, wind chimes, bongos, clap, xylophone, mallets, metal bars, wooden bars.


BandIt Plus (speech, music, effects)

The BandIt Plus model separates tracks into speech, music and effects, which makes it useful for television or film clips. The model was prepared by the authors of the paper "A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation" and published in their GitHub repository. It was trained on the Divide and Remaster (DnR) dataset and currently has the best quality metrics among similar models.

Quality table

DnR dataset (test):

Algorithm name | SDR Speech | SDR Music | SDR Effects
---------------|------------|-----------|------------
BandIt Plus    | 15.62      | 9.21      | 9.69

BandIt v2 (speech, music, effects)

Bandit v2 is a model for cinematic audio source separation into 3 stems: speech, music, and effects/sfx. It was trained on the DnR v3 dataset.

More information in official repository: https://github.com/kwatcharasupat/bandit-v2
Paper: https://arxiv.org/pdf/2407.07275


MVSep DnR v3 (speech, music, effects)

MVSep DnR v3 is a cinematic model for splitting tracks into 3 stems: music, sfx and speech. It is trained on the huge multilingual DnR v3 dataset. Its quality metrics on the test data turned out to be better than those of the similar multilingual model Bandit v2. The model is available in 3 variants: one based on the SCNet architecture, one based on MelBand Roformer, and an ensemble of these two models. See the table below:

SDR metrics on the DnR v3 leaderboard:

Algorithm name            | music (SDR) | sfx (SDR) | speech (SDR)
--------------------------|-------------|-----------|-------------
SCNet Large               | 9.94        | 11.35     | 12.59
Mel Band Roformer         | 9.45        | 11.24     | 12.27
Ensemble (Mel + SCNet)    | 10.15       | 11.67     | 12.81
Bandit v2 (for reference) | 9.06        | 10.82     | 12.29
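
The page does not document the exact ensembling procedure; the simplest common variant is averaging the two models' output waveforms sample by sample. A minimal sketch with hypothetical file names, assuming both models were already run on the same input:

    import soundfile as sf

    # Hypothetical stems produced by the two models for the same track.
    a, sr = sf.read("speech_scnet.wav")
    b, _ = sf.read("speech_melband.wav")

    n = min(len(a), len(b))            # guard against small length mismatches
    ensemble = (a[:n] + b[:n]) / 2.0   # plain waveform average
    sf.write("speech_ensemble.wav", ensemble, sr)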

MVSep Braam (braam, other)

Braam is not a traditional physical instrument, but a powerful cinematic sound effect (virtual instrument) that has become an absolute standard in modern film and trailer music.

Main features:

  • Sound: It is a massive, low-frequency, rumbling, and often aggressive sound. It resembles an apocalyptic blast of a huge ship's horn, heavy metallic scraping, or an alarm signal.

  • Origin: This sound gained massive popularity after the release of the movie "Inception" (2010) with music by Hans Zimmer, which is why it is often called the Inception Horn.

  • How it is created: As a rule, it is the result of complex sound design. The base is formed by powerful low brass instruments (trombones, tubas, French horns), which are then layered with heavy synthesizer basses and heavily processed with effects: distortion, saturation, and deep reverberation.

Today, Braam exists in the form of ready-made samples and libraries for virtual synthesizers (VST plugins), which composers use to instantly give a track scale, tension, or an epic feel.


Apollo Enhancers (by JusperLee, Lew, baicai1145)

The algorithm restores the quality of audio. The model was proposed in this paper and published on GitHub.

There are 3 models available:
1) MP3 Enhancer (by JusperLee) - it restores MP3 files compressed at bitrates from 32 kbps up to 128 kbps. It will not work for files with higher bitrates.
2) Universal Super Resolution (by Lew) - it restores higher frequencies for any music
3) Vocals Super Resolution (by Lew) - it restores higher frequencies and overall quality for any vocals


Reverb Removal (noreverb)

A set of different models to remove the reverberation effect from music/vocals.

Author    | Architecture | Works with         | SDR (no independent testing yet)
----------|--------------|--------------------|---------------------------------
FoxJoy    | MDX-B        | Full track         | ~6.50
anvuew    | MelRoformer  | Only vocals        | 7.56
anvuew    | BSRoformer   | Only vocals        | 8.07
anvuew v2 | MelRoformer  | Only vocals        | ---
Sucial    | MelRoformer  | Only vocals        | 10.01
anvuew    | BSRoformer   | Only vocals (Room) | ---

AudioSR (Super Resolution)

Algorithm AudioSR: Versatile Audio Super-resolution at Scale. The algorithm restores high frequencies and works on all types of audio (e.g., music, speech, dog barking, rain, ...). It was initially trained for mono audio, so results on stereo can be less stable.

Metric on Super Resolution Checker for Music Leaderboard (Restored): 25.3195
Authors' paper: https://arxiv.org/pdf/2309.07314
Original repository: https://github.com/haoheliu/versatile_audio_super_resolution
Original inference script prepared by @jarredou: https://github.com/jarredou/AudioSR-Colab-Fork


FlashSR (Super Resolution)

FlashSR is an audio super-resolution algorithm for restoring high frequencies. It is based on the paper "FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation".

Metric on Super Resolution Checker for Music Leaderboard (Restored): 22.1397
Original repository: https://github.com/jakeoneijk/FlashSR_Inference
Inference script by @jarredou: https://github.com/jarredou/FlashSR-Colab-Inference


Stable Audio Open Gen

Audio generation based on a given text prompt. The generation uses the Stable Audio Open 1.0 model. Audio is generated in stereo at a sample rate of 44.1 kHz, with a duration of up to 47 seconds. The quality is quite high. Prompts work best in English.

Example prompts:
1) Sound effects generation: cats meow, lion roar, dog bark
2) Sample generation: 128 BPM tech house drum loop
3) Specific instrument generation: A Coltrane-style jazz solo: fast, chaotic passages (200 BPM), with piercing saxophone screams and sharp dynamic changes
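
MVSep runs the generation server-side. For reference, the same model can be run locally through the Hugging Face diffusers library; a minimal sketch, assuming diffusers >= 0.29, a CUDA GPU, and access to the stabilityai/stable-audio-open-1.0 weights:

    import soundfile as sf
    import torch
    from diffusers import StableAudioPipeline

    pipe = StableAudioPipeline.from_pretrained(
        "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
    ).to("cuda")

    audio = pipe(
        "128 BPM tech house drum loop",  # prompt from the examples above
        num_inference_steps=100,
        audio_end_in_s=10.0,             # the model supports up to ~47 seconds
    ).audios[0]

    # The pipeline returns (channels, samples); soundfile expects (samples, channels).
    sf.write("drum_loop.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)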


Whisper (extract text from audio)

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has several versions; on MVSep we use the largest and most precise one: "Whisper large-v3". The Whisper large-v3 model was trained on several million hours of audio. It is a multilingual model and detects the language automatically. To apply the model to your audio you have 2 options:
1) "Apply to original file" - the Whisper model is applied directly to the file you submit
2) "Extract vocals first" - in this case, the BS Roformer model is applied to extract vocals before running Whisper. This can remove unnecessary noise and improve Whisper's output.
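
Both options are handled server-side. For reference, the same checkpoint can be run locally with the open-source whisper package (pip install openai-whisper; the file name below is hypothetical):

    import whisper

    model = whisper.load_model("large-v3")  # the variant used on MVSep
    result = model.transcribe("song.mp3")   # language is detected automatically

    print(result["language"])
    for seg in result["segments"]:
        print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text']}")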

The original model has some problems with transcription timings. This was fixed by @linto-ai, whose transcription is used by default (option: "New timestamped"). You can return to the original timings by choosing the "Old by whisper" option.

More info on model can be found here: https://huggingface.co/openai/whisper-large-v3 and here: https://github.com/openai/whisper


Parakeet (extract text from audio)

Parakeet is a family of state-of-the-art Automatic Speech Recognition (ASR) models developed by NVIDIA in collaboration with Suno.ai. These models are built on the Fast Conformer architecture, designed to deliver a balance of high transcription accuracy and exceptional inference speed. They are widely recognized for outperforming much larger models (like OpenAI's Whisper) in efficiency while maintaining competitive or superior Word Error Rates (WER). Quality metric WER: 6.03 on Huggingface Open ASR Leaderboard.

MVSep provides two versions of the model (v2 and v3):
Model page v2: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
Model page v3: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
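
Both checkpoints can also be run locally through NVIDIA NeMo; a minimal sketch following the model cards (pip install -U nemo_toolkit['asr']; the audio file name is hypothetical):

    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"  # or "nvidia/parakeet-tdt-0.6b-v3"
    )
    output = asr_model.transcribe(["speech.wav"], timestamps=True)

    print(output[0].text)
    for stamp in output[0].timestamp["segment"]:  # segment-level timestamps
        print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")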


Parakeet v2 (Parakeet TDT 0.6B v2)

Released as a highly efficient English-focused model, v2 established Parakeet as a leader in speed-to-accuracy ratio.

  • Language: English (en-US) only.
  • Size: 0.6 Billion parameters (600M), making it lightweight compared to the 1.1B parameters of previous versions.
  • Performance: It achieves industry-leading accuracy (approx. 6% WER on standard benchmarks) and is noted for being up to 50x faster than real-time.
  • Capabilities:
    • Supports highly accurate word-level timestamps.
    • Includes automatic punctuation and capitalization.
    • Effective at transcribing challenging content such as song lyrics and spoken numbers.
    • Can handle long-form audio (up to 11 hours in some configurations) using local attention mechanisms.

Parakeet v3 (Parakeet TDT 0.6B v3)

The v3 release marked the expansion of the efficient Parakeet architecture from English-only to a multilingual domain without increasing the model size.

  • Language: Multilingual (supports 25 European languages, including English, Spanish, French, German, Russian, and others).
  • Size: Retains the compact 0.6 Billion parameter size.
  • Key Upgrade: It is trained on the massive Granary multilingual corpus (approx. 1 million hours of audio).
  • New Features:
    • Automatic Language Detection: The model can identify the spoken language from the audio signal and transcribe it without manual prompting.
    • High Throughput: Despite the added multilingual capabilities, it retains the ultra-fast inference speeds of the v2 TDT architecture.
    • Versatility: It serves as a drop-in replacement for v2 for users requiring support for European languages while maintaining low latency and compute costs.


VibeVoice (Voice Cloning)

VibeVoice is a model for generating natural conversational dialogues from text, with the ability to use a reference voice for cloning purposes.

Key features:

  • Two models: small and large
  • Up to 90 minutes of generated audio
  • Language support: 2 languages are supported: English (default) and Chinese
  • Voice cloning: ability to upload a reference audio recording

How to use the model

  • The text must be in English or Chinese only; quality is not guaranteed for other languages. Maximum text length is 5000 characters. Avoid special characters.
  • The reference voice audio should be 5 to 15 seconds long. If your track is longer, it will be automatically trimmed at the 15-second mark (see the sketch after this list).
  • The reference track should contain only voice and nothing else. If there are background sounds or music, use the "Extract vocals first" option.
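
If you prefer to control the cut yourself instead of relying on the automatic trim, a minimal sketch with soundfile (file names are hypothetical):

    import soundfile as sf

    data, sr = sf.read("reference_voice.wav")
    sf.write("reference_voice_15s.wav", data[: 15 * sr], sr)  # keep the first 15 seconds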

How to generate a reference track?

We need phonetic diversity (all sounds of the language) and lively intonation. A text of about 35–40 words, read calmly, takes just ~15 seconds.

Here are three options in English for different tasks:

Option 1: Universal (Balanced & Clear)

The best choice for general use. Contains complex sound combinations to tune clarity.

"To create a perfect voice clone, the AI needs to hear a full range of phonetic sounds. I am speaking clearly, taking small pauses, and asking: can you hear every detail? This short sample captures the unique texture and tone of my voice."

Option 2: Conversational (Vlog & Social Media)

For voiceovers in videos, YouTube, or blogs. Read vividly, with a smile, changing the pitch of your voice.

"Hey! I’m recording this clip to test how well the new technology works. The secret is to relax and speak exactly like I would to a friend. Do you think the AI can really copy my style and energy in just fifteen seconds?"

Option 3: Professional (Business & Narration)

For presentations, audiobooks, or official announcements. Read confidently, slightly slower, emphasizing word endings.

"Voice synthesis technology is rapidly changing how we communicate in the digital age. It is essential to speak with confidence and precision to ensure high-quality output. This brief recording provides all the necessary data for a professional and accurate digital clone."


Tips for recording:

  1. Pronunciation: Try to articulate word endings clearly (especially t, d, s, ing). Models "love" clear articulation.

  2. Flow: Don't read like a robot. In English, intonation (voice melody) is important — the voice should "float" up and down a bit, rather than sounding on a single note.

  3. Breathing: If you pause at a comma or period, don't be afraid to take an audible breath. This will add realism to the clone.


VibeVoice (TTS)

VibeVoice (TTS) is a model for generating natural conversational dialogues from text, capable of creating dialogues with up to 4 speakers and durations of up to 90 minutes.

Key Features:

  • Two models: small and large
  • Up to 4 speakers in a single recording
  • Up to 90 minutes of generated audio
  • Language support: officially supports 2 languages: English (default) and Chinese, but it has been verified to work decently for other languages as well.

How to use the model

The text must be in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters. The text must be formatted specifically to indicate speakers:

Correct format:

Speaker 1: Hello! How are you today?
Speaker 2: I'm doing great, thanks for asking!
Speaker 1: That's wonderful to hear.
Speaker 3: Hey everyone, sorry I'm late!

Incorrect format:

Hello! How are you today?
I'm doing great!

Important:

  • Each line must start with Speaker N: (where N is a number from 1 to 4)
  • Speaker numbering: Speaker 1, Speaker 2, Speaker 3, Speaker 4
  • You can use from 1 to 4 speakers
  • Case does not matter: Speaker 1: = speaker 1: = SPEAKER 1:

If you need a monologue, you do not need to specify a speaker.
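
If you assemble scripts programmatically, a small validator can catch formatting mistakes before submission. A hypothetical sketch for multi-speaker scripts (not part of the site; plain monologue text without a prefix would need to bypass it):

    import re

    # Matches "Speaker N:" in any letter case, N from 1 to 4.
    LINE_RE = re.compile(r"^\s*speaker\s*([1-4])\s*:\s*(.+)$", re.IGNORECASE)

    def parse_script(text: str) -> list[tuple[int, str]]:
        replicas = []
        for raw in text.splitlines():
            if not raw.strip():
                continue  # blank lines between blocks are fine
            m = LINE_RE.match(raw)
            if m is None:
                raise ValueError(f"Line must start with 'Speaker N:': {raw!r}")
            replicas.append((int(m.group(1)), m.group(2).strip()))
        return replicas

    print(parse_script("Speaker 1: Hello!\nSPEAKER 2: Hi there."))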

Example scenarios:

Monologue (1 speaker):

Speaker 1: Today I want to talk about artificial intelligence.
Speaker 1: It's changing our world in incredible ways.
Speaker 1: From healthcare to entertainment, AI is everywhere.

Dialogue (2 speakers):

Speaker 1: Have you tried the new restaurant downtown?
Speaker 2: Not yet, but I've heard great things about it!
Speaker 1: We should go there this weekend.
Speaker 2: That sounds like a perfect plan!

Group conversation (3-4 speakers):

Speaker 1: Welcome to our podcast, everyone!
Speaker 2: Thanks for having us!
Speaker 3: It's great to be here.
Speaker 4: I'm excited to share our thoughts today.
Speaker 1: Let's start with introductions.

Qwen3-TTS (Custom Voice)

Qwen3-TTS is a powerful speech generation model offering comprehensive support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It provides developers and users with the most extensive set of speech generation features available. At MVSep, we use the largest 1.7 billion parameter model.

Original model page: https://github.com/QwenLM/Qwen3-TTS

Qwen3-TTS (Custom Voice) offers a set of 9 pre-defined speakers. Optionally, you can specify a "Voice description" to include emotions like "happy voice" or "sad voice". You can also choose the language for this model or leave it as "auto".


Qwen3-TTS (Voice Design)

Qwen3-TTS is a powerful speech generation model offering support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It provides developers and users with the most extensive set of speech generation features available. At MVSep, we use the largest 1.7 billion parameter model.

Original model page: https://github.com/QwenLM/Qwen3-TTS

Qwen3-TTS (Voice Design) allows you to generate speech with a custom voice that can be described in detail in the "Voice description" field. You can specify the speaker's gender and age, and add emotions, such as "happy voice" or "sad voice". You can also choose the language for this model or leave it as "auto".


Qwen3-TTS (Voice Cloning)

Qwen3-TTS is a powerful speech generation model offering support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It provides developers and users with the most extensive set of speech generation features available. At MVSep, we use the largest 1.7 billion parameter model.

Original model page: https://github.com/QwenLM/Qwen3-TTS

Qwen3-TTS (Voice Cloning) allows you to upload a reference audio file to generate the target text using the sample voice. To improve cloning quality, you can optionally provide the audio transcript in the "Reference text in audio" field. You can also choose the language for this model or leave it as "auto".


Bark (Speech Gen)

Bark is a transformer-based model created by Suno, representing not just a traditional text-to-speech tool, but a fully generative "text-to-audio" system. Its capabilities go far beyond ordinary voicing: besides creating highly realistic speech in multiple languages, Bark can generate music, background noises, and simple sound effects. A unique feature of the model is the ability to reproduce subtle non-verbal communications, such as laughter, sighs, and crying, making the resulting sound maximally alive and natural.

Striving to support the community, the developers have opened access to pre-trained checkpoints that are ready for work and allowed even for commercial use. However, it is important to consider that Bark was created primarily for research tasks. Being a fully generative model, it can behave unpredictably and sometimes deviate from the provided text prompts.

Official model repository: https://github.com/suno-ai/bark

Unlike classic TTS systems, Bark does not use SSML markup. Instead, it is trained to recognize specific text inserts (tags) as instructions for generating sounds.

Instructions for coding emotions and sounds in Bark

1. Basic Principle

All control commands are written in square brackets. Important: The tags themselves must be written in English, even if the main text you are generating is in Russian, Spanish, or any other language.

Syntax:

Text before effect [effect_tag] text after effect.

2. List of supported tags (Non-speech sounds)

Bark officially recognizes the following set of tokens for non-verbal sounds:

Tag             | Description                            | Usage Example
----------------|----------------------------------------|------------------------------------------------------
[laughter]      | Loud, distinct laugh                   | Hello! [laughter] That was so funny.
[laughs]        | Short chuckle, giggling                | Well yes, of course [laughs].
[sighs]         | Heavy sigh (fatigue, relief)           | [sighs] I'm so tired of this work.
[music]         | Instrumental music insertion           | [music] (background music playing)
[gasps]         | Sharp breath (fright, surprise)        | [gasps] I didn't expect to see you here!
[clears throat] | Throat clearing (attracting attention) | [clears throat] Gentlemen, may I have your attention.

Note: Variations like [man laughs] and [woman laughs] also exist, but they work most stably if the speaker's gender (Speaker History) matches the tag.
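
The tags are plain text, so they can be passed directly in the prompt when running Bark locally via its open-source package. A minimal sketch following the official repository (weights are downloaded on first run):

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # fetches the model checkpoints on first use

    text_prompt = "Well, that went better than expected [laughs]. [sighs] Let's move on."
    audio_array = generate_audio(text_prompt)

    write_wav("bark_sample.wav", SAMPLE_RATE, audio_array)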

3. Generating singing and music

To make the model "sing" the text rather than read it, musical notes are used.

  • Method: Wrap the text in musical note symbols ♪ (Shift + Alt + V on Mac or Alt+13 on Win, or just copy).

  • Example: ♪ In the jungle, the mighty jungle, the lion sleeps tonight ♪

  • Tip: This works best if you use English, as the training dataset contained many English songs, but results can be achieved in other languages too.

4. Pauses and Intonation (Prosody)

Although there are no special tags for pauses, Bark is sensitive to punctuation and special characters, as it perceives text as a structure.

  • Ellipsis and dash (..., —): Use an ellipsis or an em dash to create pauses, hesitations, or hitches in speech.

    • Example: I... I'm not sure that's right.

  • CAPS LOCK: Sometimes (not guaranteed) writing a word in CAPITAL LETTERS can add emphasis or increase volume.

5. Important nuances of operation (Disclaimer)

  1. Probabilistic nature: Bark is a GPT for audio. If you write [laughter], the model will with high probability generate laughter, but sometimes it may ignore the tag or generate a strange sound.

  2. Context matters: The tag [laughter] will work more naturally after a joke than in the middle of a tragic sentence. The model "understands" the semantics of the text.

  3. Whispering: There is no official [whisper] tag. However, the community has noticed that adding words like "quietly" or using specific speakers (Speaker Prompts) sometimes helps, but this is a trial and error method.

Site limitations: currently, all submitted texts are trimmed to 1000 characters.


MVSep MultiSpeaker (MDX23C)

MVSep MultiSpeaker (MDX23C) - this model tries to isolate the loudest voice from all other voices. It uses the MDX23C architecture. Still under development.


Aspiration (by Sucial)

The algorithm separates the "whispering" (aspiration) component from vocals. The model was created by SUC-DriverOld. More details here.

The Aspiration model separates out:
1) Audible breaths
2) Hissing and buzzing of fricative consonants ('s' and 'f')
3) Plosives: voiceless burst of air produced while singing a consonant (like /p/, /t/, /k/).


Matchering (by sergree)

Matchering is a novel tool for audio matching and mastering. It follows a simple idea - you take TWO audio files and feed them into Matchering:

  • TARGET (the track you want to master, you want it to sound like the reference)
  • REFERENCE (another track, like some kind of "wet" popular song, you want your target to sound like it)

This algorithm matches both of these tracks and provides you the mastered TARGET track with the same RMS, FR, peak amplitude and stereo width as the REFERENCE track has.

It is based on code by @sergree.
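
The underlying package exposes a small Python API. A minimal sketch following the @sergree repository (pip install matchering; file names are hypothetical):

    import matchering as mg

    mg.process(
        target="my_mix.wav",             # the track you want to master
        reference="reference_song.wav",  # the sound you want to match
        results=[mg.pcm16("my_mix_mastered.wav")],  # 16-bit WAV output
    )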


SOME (Singing-Oriented MIDI Extractor)

SOME (Singing-Oriented MIDI Extractor) is a MIDI extractor that converts singing voice to a MIDI sequence. The model was trained only on Chinese vocals, so it might not work well for other languages.

Original page: https://github.com/openvpi/SOME


Transkun (piano -> midi)

Transkun is a modern open-source model for automatic piano music transcription (Audio-to-MIDI). The official page of the model is here. It is considered one of the best (SOTA, State of the Art) in its class. The model can recognize not only the notes themselves but also their duration, loudness (velocity), and pedal usage.

Unlike many older models that analyze music "frame-by-frame" (frame-based), Transkun uses the Neural Semi-CRF (semi-Markov Conditional Random Field) approach. Instead of asking "is a note sounding at this millisecond?", the model treats events as whole intervals (from the start to the end of the note). The latest versions use a Transformer (Non-Hierarchical Transformer) which calculates the probability that a specific time segment is a note. Decoding: the Viterbi algorithm is used to find the most probable sequence of non-overlapping intervals.

The model demonstrates excellent results on the MAESTRO dataset (the industry standard).


Basic Pitch (MIDI Extraction)

Basic Pitch is a modern neural network from Spotify’s Audio Intelligence Lab that converts melodic audio recordings into notes (MIDI format). Unlike outdated converters, this model can "hear" not only individual notes but also chords, along with the finest nuances of a performance. Official page: https://github.com/spotify/basic-pitch

Key Features

  • Polyphonic recognition: Basic Pitch handles complexity with ease. You can upload recordings of piano, guitar, or ensembles — the model recognizes multiple notes sounding simultaneously.
  • Nuance preservation (Pitch Bend): Most converters "quantize" sound to the nearest note, stripping away expression. Basic Pitch preserves pitch changes (pitch bends). If you sing with vibrato or perform bends on a guitar, these details will remain in the MIDI file.
  • Versatility: The model is trained on a massive dataset and works with most melodic instruments.
  • Speed and efficiency: It is a lightweight model that processes audio quickly without requiring powerful servers.

What instruments does the model work with?

Basic Pitch is an "instrument-agnostic" model. This means it handles different timbres equally well:
- Vocals: Hum a melody into a microphone, and the neural network will turn your voice into a synthesizer part.
- Strings: Acoustic and electric guitar, violin, cello.
- Keyboards: Pianos, organs, and synthesizers.
- Winds: Flute, saxophone, trumpet, and others.

Important: The model is designed for melodic instruments. It is not suitable for drums or percussion, as it focuses on pitch rather than rhythmic noise.
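
For local use, the package exposes a one-call inference API. A minimal sketch (pip install basic-pitch; the file name is hypothetical):

    from basic_pitch.inference import predict

    # Returns raw model output, a pretty_midi.PrettyMIDI object, and note events.
    model_output, midi_data, note_events = predict("melody.wav")

    midi_data.write("melody.mid")  # save the transcription as a MIDI file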


HeartMuLa (Song Gen)

HeartMuLa is an advanced open-source family of multimodal foundation models (Apache 2.0 license) designed for high-quality music synthesis and audio processing. Unlike proprietary cloud services (such as Suno or Udio), HeartMuLa gives developers full control over the generation process and the ability to run it locally on their own hardware. The model operates on an LLM architecture and allows for the creation of complete compositions from text prompts.

Official repository: https://github.com/HeartMuLa/heartlib

Key Features

  • Multilingual vocal generation: Supports speech and singing synthesis in multiple languages, including English, Chinese, Japanese, Korean, and Spanish.

  • Fine-grained structural control: The use of special tags in the lyrics (for example, [Intro], [Verse], [Chorus], [Bridge]) allows for precise control over the arrangement and progression of the composition.

  • Musical attribute control: The model understands complex descriptions perfectly. You can specify the genre (rock, jazz, R&B, metal), timbre (dark, bright, soft), emotions, and specific instruments.

Architecture: How It Works Under the Hood

The system is not monolithic; it is a complex of several specialized neural networks combined into a single audio processing pipeline:

  1. HeartCodec (Neural Codec): The foundation of the system is a music tokenizer operating at an extremely low frequency (12.5 Hz). It provides the highest accuracy of signal reconstruction (high fidelity) with a minimal amount of data. This is critical for the language model to efficiently generate long audio fragments in an autoregressive mode.

  2. HeartCLAP: An audio-text alignment model. It creates a unified embedding space, thanks to which a textual description like "a sad melody on an acoustic guitar" is mathematically mapped accurately to the required acoustic characteristics.

  3. HeartTranscriptor: A module based on the Whisper architecture, fine-tuned specifically for transcribing lyrics and extracting phonetic features from vocals.

  4. HeartMuLa Generator: The main LLM model with a three-level architecture:

    • The global backbone processes text tokens and audio encodings.

    • The local decoder is responsible for the direct synthesis of music based on hidden states.

    • The detokenizer converts the generated tokens back into a continuous sound wave (waveform).

A multi-encoder strategy was used for training: the model extracts data from pre-trained Whisper, WavLM networks, and its own MuEncoder, which allows analyzing sound simultaneously at the phonetic, semantic, and acoustic levels.

Important note: For the sake of AI ethics, according to the developers' technical documentation, an invisible digital watermark is embedded in the generated tracks to identify the machine origin of the audio.

Guide: How to Properly Format Lyrics and Use Tags

When working with AI models for music generation, the text (lyrics) serves two functions: it tells the model what to sing, and meta-tags in square brackets [...] indicate how to sing it and how to build the structure of the track. The model treats the tags as directorial cues.

1. Basic Structural Tags (The Skeleton of the Song)

These tags break up solid text into logical musical blocks. They should be written on a new line before a block of text.

  • [Intro] — Introduction. Sets the mood and tempo before the vocals begin. Lyrics are usually not written under this tag, or short atmospheric phrases or vocalizations are added (for example, Ooh-ooh).

  • [Verse] (or [Verse 1], [Verse 2]) — The verse. The story unfolds here. The music in the verses is usually calmer, and the rhythm is steady. Using numbering helps the model understand that the melody should repeat, but the text will be new.

  • [Pre-Chorus] — The pre-chorus. A transitional part where the tension and density of the instruments build up before the main climax.

  • [Chorus] — The chorus. The main idea and the most memorable melody. Here, the model usually delivers maximum emotion, sound density, and vocal expression.

  • [Bridge] — The bridge. Inserted closer to the end of the song (usually after the second chorus). In this part, the melody, rhythm, or key changes dramatically so that the song does not feel monotonous.

  • [Outro] — The coda (ending). A smooth fading of the music (fade-out) or a beautiful final chord.

2. Instrumental and Stylistic Tags

You can control not only the structure but also the arrangement at specific moments in time.

  • Solos and interludes: Use tags like [Guitar Solo], [Piano Interlude], [Bass Drop], or [Drum Fill] between verses and choruses. You do not need to write lyrics under them.

  • Vocal directions: If the model supports it, you can specify the performance style before a line: [Whisper], [Scream], [Spoken], [Choir].

  • Backing vocals and echoes: To add backing vocals or choral responses, enclose words in parentheses. For example: Lead: Walking down this lonely road Backing: (lonely road)

3. Golden Rules for Writing the Lyrics

Even with perfect tags, the AI can get confused if the text itself is poorly structured.

  1. Symmetry and rhythm: AI models rely on the number of syllables. Try to ensure that lines in the same verse have approximately the same number of syllables and a clear meter. If one line consists of 5 words and the next of 15, the model will start "mumbling" words or break the rhythm.

  2. Punctuation is breath: Commas , and periods . act as pauses. If you need the vocalist to take a breath or pause before an important word, insert a comma. The lack of punctuation will force the AI to sing in a tongue-twister fashion.

  3. Language: As discussed earlier, write your lyrics strictly in English (or another officially supported language) to avoid phonetic "garbage" and accents.

  4. Separation: Be sure to leave a blank line (line break) between different blocks (between the verse and chorus).

4. Perfect Template (Example)

Here is what a properly formatted prompt for generation should look like:

[Intro]
[Verse]
You could, you could take me to a place that's new,
A wave or a cloud, holding hands with only you.
You could, you could tell me that you have it all,
Everything I wanted, catching me before I fall.

[Chorus]
Take me out for a ride,
Along the docks or in the wood,
You could make me sick inside,
In just a single word you could.

[Chorus]
The sky would think it's right,
It loves the simple things,
My heart against yours tight,
If you love the simple things.
No, let's not do what others do
No, let's not do what others do
La, lalala, lalala
If you love the simple things
La, lalala, lalala, lalala la la la

[Verse]
You could, you could take me out to eat somewhere,
The finest of tables, honestly I do not care.
You could, you could use your charms to make me yield,
Step right up to me and make me drop my shield.

[Chorus]
Take me out for a ride,
Along the docks or in the wood,
You could make me sick inside,
In just a single word you could.

[Chorus]
The sky would think it's right,
It loves the simple things,
My heart against yours tight,
If you love the simple things.
No, let's not do what others do
No, let's not do what others do

[Bridge]
Take me my boy, let me take a little bite,
Behind a deep feeling hides a man of great might.
Make me fly my boy, tell me if we are alright,
Behind a deep feeling hides a man of great might.

[Chorus]
Take me out for a ride,
Along the docks or in the wood,
You could make me sick inside,
In just a single word you could.

[Chorus] 
The sky would think it's right,
It loves the simple things,
My heart against yours tight,
If you love the simple things.
La, lalala, lalala
If you love the simple things
La, lalala, lalala
La la la la la la la

Tag Guide (Prompt Engineering)

This guide is based on the analysis of the HeartMuLa research paper (Sections 3.2 and 6.2). The model uses a natural language tokenizer (Llama 3) rather than a fixed dictionary. To achieve stable generation, select tags from the 8 primary categories used during training.

The 8 Pillars of Training

Each category has an importance percentage representing its "selection probability" during training.

  • Training frequency: Tags were "sampled" (selected) during the training process. Genre was included 95% of the time, while Instrument was included only 25% of the time.

  • Model expectations: For proper operation, the model expects the presence of a genre tag. Without it, the generation lacks a clear structural anchor.

  • Influence and stability: A higher percentage means greater stability. A tag with a 95% probability (Genre) is a "strong anchor," whereas a tag with 10% (Topic) is a "weak hint" that may be ignored if it conflicts with stronger tags.

  • Strategy: For maximum control, rely heavily on the top 4 categories (Genre, Timbre, Gender, Mood). Use low-percentage tags only as "seasoning" after the main structure is set.

Official Categories

GENRE (95% — MANDATORY)

  • Examples: Pop, Rock, Electronic, Hiphop, Jazz, Classical, Techno, Trance, Ambient.

TIMBRE (50% — Sound Texture)

  • Examples: Soft, Warm, Husky, Bright, Dark, Distorted.

GENDER (37% — Vocal Character)

  • Examples: Male, Female.

MOOD (32% — Emotional Vibe)

  • Examples: Happy, Sad, Energetic, Joyful, Melancholic, Relaxing, Dark.

INSTRUMENT (25% — Dominant Sounds)

  • Examples: Piano, Synthesizer, Acoustic Guitar, Electric Guitar, Bass, Drums, Strings, Violin.

SCENE (20% — Listening Context)

  • Examples: Dance, Workout, Dating, Study, Cinematic, Party.

REGION (12% — Cultural Influence)

  • Examples: K-pop, Latin, Western.

TOPIC (10% — Lyrical Theme)

  • Examples: Love, Summer, Heartbreak.

For convenience, all 8 categories and their possible tags are presented as separate selection options on the website. You can skip them and enter your own set of tags in the Tags (optional) field.

Prompting Strategy: "Less is More"

To maintain a strong anchor and avoid "probability interference," do not use conflicting tags.

  • Semantic conflict: The prompt "Rock, Jazz" scatters the model's attention, which often leads to "muddy" or unexpressive, generic arrangements.

  • Anchor stability: One strong anchor provides a clear roadmap. Multiple genres create conflicting maps, causing the AI to lose focus.

  • Recommendation: Choose only one tag for each category. Be precise and avoid overly broad concepts.

Recommended Format

Use a comma-separated list.

Examples:

  • Electronic, Techno, Synthesizer, Dark, High Energy, Club

  • Pop, Piano, Female, Sad, Soft, Love, Acoustic
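
Following the "one tag per category" rule, the tag string can be assembled mechanically. A hypothetical helper (the category names and example values come from the list above; nothing here is part of the MVSep API):

    # One tag per category, strongest anchors first.
    tags = {
        "genre": "Pop",  # mandatory anchor (95%)
        "timbre": "Soft",
        "gender": "Female",
        "mood": "Sad",
        "instrument": "Piano",
        "scene": "Study",
        "region": "Western",
        "topic": "Love",
    }
    print(", ".join(tags.values()))  # Pop, Soft, Female, Sad, Piano, Study, Western, Love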


Demucs3 Model (vocals, drums, bass, other)

Algorithm Demucs3 splits a track into 4 stems (bass, drums, vocals, other). The winner of the Music Demixing Challenge 2021.

Link: https://github.com/facebookresearch/demucs/tree/v3

Quality table

SDR values; the first five columns are on the Multisong dataset, the last two on the Synth dataset:

Algorithm name    | Bass  | Drums | Other | Vocals | Instrumental | Vocals (Synth) | Instrumental (Synth)
------------------|-------|-------|-------|--------|--------------|----------------|---------------------
Demucs3 (Model A) | 9.50  | 8.97  | 4.40  | 7.21   | 13.52        | ---            | ---
Demucs3 (Model B) | 10.69 | 10.27 | 5.35  | 8.13   | 14.44        | 9.78           | 9.48

Note: For Model A only MUSDB18 training data was used, so its quality is worse than that of Demucs3 Model B. Demucs3 Model A and Demucs3 Model B have the same architecture but different weights.

