HeartMuLa (Song Gen)

HeartMuLa is an open-source family of multimodal foundation models (Apache 2.0 license) designed for high-quality music synthesis and audio processing. Unlike proprietary cloud services such as Suno or Udio, HeartMuLa gives developers full control over the generation process and can be run locally on their own hardware. The generator is built on an LLM architecture and creates complete compositions from text prompts.

Official repository: https://github.com/HeartMuLa/heartlib

Key Features

Multilingual vocal generation: Supports speech and singing synthesis in multiple languages, including English, Chinese, Japanese, Korean, and Spanish.
Fine-grained structural control: The use of special tags in the lyrics (for example, [Intro], [Verse], [Chorus], [Bridge]) allows for precise control over the arrangement and progression of the composition.
Musical attribute control: The model accepts detailed descriptions. You can specify the genre (rock, jazz, R&B, metal), timbre (dark, bright, soft), emotions, and specific instruments.

Architecture: How It Works Under the Hood

The system is not monolithic; it combines several specialized neural networks into a single audio processing pipeline:

HeartCodec (Neural Codec): The foundation of the system is a music tokenizer operating at a low frame rate of 12.5 Hz. It is designed to reconstruct the signal with high fidelity from a small amount of data, which is critical for the language model to generate long audio fragments efficiently in autoregressive mode.
HeartCLAP: An audio-text alignment model. It creates a shared embedding space in which a textual description such as "a sad melody on an acoustic guitar" is mapped close to audio with the matching acoustic characteristics.
HeartTranscriptor: A module based on the Whisper architecture, fine-tuned specifically for transcribing lyrics and extracting phonetic features from vocals.
HeartMuLa Generator: The main LLM model with a three-level architecture:
- The global backbone processes text tokens and audio encodings.
- The local decoder is responsible for the direct synthesis of music based on hidden states.
- The detokenizer converts the generated tokens back into a continuous sound wave (waveform).
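
To make the benefit of the 12.5 Hz frame rate concrete, the following minimal sketch counts how many codec frames (time steps) the language model would have to generate for clips of different lengths. Only the 12.5 Hz figure comes from the text above; the 50 Hz comparison rate is a hypothetical value chosen for illustration, and the number of tokens per frame is not covered here.

```python
import math

# Frame rate of HeartCodec as stated in the text above.
HEARTCODEC_RATE_HZ = 12.5
# Hypothetical higher rate, used only for comparison (not from the source).
HYPOTHETICAL_RATE_HZ = 50.0


def frames_for_duration(seconds: float, frame_rate_hz: float = HEARTCODEC_RATE_HZ) -> int:
    """Return the number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


for label, seconds in [("30 s clip", 30), ("3 min song", 180), ("4 min song", 240)]:
    low = frames_for_duration(seconds)
    high = frames_for_duration(seconds, HYPOTHETICAL_RATE_HZ)
    print(f"{label}: {low} frames at 12.5 Hz, {high} frames at 50 Hz")
```

A three-minute song needs 2250 frames at 12.5 Hz, which is why a low frame rate keeps autoregressive generation of long tracks tractable.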

A multi-encoder strategy was used for training: the model draws features from the pre-trained Whisper and WavLM networks and from its own MuEncoder, which lets it analyze sound at the phonetic, semantic, and acoustic levels simultaneously.

Important note: According to the developers' technical documentation, an invisible digital watermark is embedded in the generated tracks for AI-ethics reasons, so that the machine origin of the audio can be identified.

Guide: How to Properly Format Lyrics and Use Tags

When working with AI models for music generation, the lyrics field serves two functions: the text tells the model what to sing, and meta-tags in square brackets [...] indicate how to sing it and how to structure the track. The model treats the tags as directorial cues.

1. Basic Structural Tags (The Skeleton of the Song)

These tags break up solid text into logical musical blocks. Each tag should be written on its own line, before the block of text it applies to.

[Intro] — Introduction. Sets the mood and tempo before the vocals begin. Lyrics are usually not written under this tag, or short atmospheric phrases or vocalizations are added (for example, Ooh-ooh).
[Verse] (or [Verse 1], [Verse 2]) — The verse. The story unfolds here. The music in the verses is usually calmer, and the rhythm is steady. Using numbering helps the model understand that the melody should repeat, but the text will be new.
[Pre-Chorus] — The pre-chorus. A transitional part where the tension and density of the instruments build up before the main climax.
[Chorus] — The chorus. The main idea and the most memorable melody. Here, the model usually delivers maximum emotion, sound density, and vocal expression.
[Bridge] — The bridge. Inserted closer to the end of the song (usually after the second chorus). In this part, the melody, rhythm, or key changes dramatically so that the song does not feel monotonous.
[Outro] — The coda (ending). A smooth fading of the music (fade-out) or a beautiful final chord.

2. Instrumental and Stylistic Tags

You can control not only the structure but also the arrangement at specific moments in time.

Solos and interludes: Use tags like [Guitar Solo], [Piano Interlude], [Bass Drop], or [Drum Fill] between verses and choruses. You do not need to write lyrics under them.
Vocal directions: If the model supports it, you can specify the performance style before a line: [Whisper], [Scream], [Spoken], [Choir].
Backing vocals and echoes: To add backing vocals or choral responses, enclose the words in parentheses. For example, the lead line "Walking down this lonely road" can be followed by the backing response "(lonely road)".
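
The tag conventions described above can be checked before submitting a prompt. The sketch below is an illustrative helper, not part of HeartMuLa or its tooling: the function names `parse_sections` and `backing_vocals` are hypothetical. It splits lyrics into tagged blocks and extracts parenthesized backing-vocal phrases.

```python
import re

# A tag line is a single [Tag] on its own line, e.g. [Verse 1] or [Guitar Solo].
TAG_LINE = re.compile(r"^\[([^\[\]]+)\]$")


def parse_sections(lyrics: str) -> list:
    """Split lyrics into (tag, lines) pairs; text before any tag is 'Untagged'."""
    sections = []
    current = None
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate blocks
        match = TAG_LINE.match(line)
        if match:
            current = (match.group(1), [])
            sections.append(current)
        elif current is None:
            current = ("Untagged", [line])
            sections.append(current)
        else:
            current[1].append(line)
    return sections


def backing_vocals(line: str) -> list:
    """Return the phrases enclosed in parentheses on one lyric line."""
    return re.findall(r"\(([^()]+)\)", line)


example = """[Intro]

[Verse 1]
Walking down this lonely road (lonely road)

[Guitar Solo]

[Chorus]
Take me out for a ride,
"""

for tag, lines in parse_sections(example):
    print(tag, lines)
```

Blocks with an empty line list, such as [Intro] and [Guitar Solo] here, correspond to instrumental passages that carry no lyrics.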

3. Golden Rules for Writing the Lyrics

Even with perfect tags, the AI can get confused if the text itself is poorly structured.

Symmetry and rhythm: AI models rely on the number of syllables. Try to ensure that lines in the same verse have approximately the same number of syllables and a clear meter. If one line consists of 5 words and the next of 15, the model will start "mumbling" words or break the rhythm.
Punctuation is breath: Commas , and periods . act as pauses. If you need the vocalist to take a breath or pause before an important word, insert a comma. Without punctuation, the AI tends to rush through the line like a tongue twister.
Language: Write your lyrics in one of the officially supported languages (see Key Features) to avoid phonetic "garbage" and unwanted accents.
Separation: Be sure to leave a blank line (line break) between different blocks (between the verse and chorus).
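
Two of these rules, syllable symmetry and block separation, lend themselves to an automatic check. The following is a rough sketch under stated assumptions: the syllable estimate is a crude English-only heuristic (it counts vowel groups), the `max_ratio` threshold of 2.0 is an arbitrary illustrative value, and `lint_lyrics` is a hypothetical helper rather than anything shipped with the model.

```python
import re


def estimate_syllables(line: str) -> int:
    """Crude English estimate: count groups of consecutive vowels in each word."""
    words = re.findall(r"[a-zA-Z']+", line)
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)


def lint_lyrics(lyrics: str, max_ratio: float = 2.0) -> list:
    """Return warnings about uneven line lengths and missing blank lines."""
    warnings = []
    lines = lyrics.splitlines()
    block = []  # (line_number, syllable_count) for the current tagged block

    def flush() -> None:
        if len(block) >= 2:
            counts = [count for _, count in block]
            if max(counts) > max_ratio * min(counts):
                warnings.append(
                    f"lines {block[0][0]}-{block[-1][0]}: syllable counts "
                    f"range from {min(counts)} to {max(counts)}"
                )
        block.clear()

    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            flush()  # a tag closes the previous block
            if number > 1 and lines[number - 2].strip():
                warnings.append(f"line {number}: no blank line before {line}")
        elif line:
            block.append((number, estimate_syllables(line)))
    flush()
    return warnings


good = "[Verse]\nTake me out for a ride,\nAlong the docks or in the wood,\n\n[Chorus]\nYou could make me sick inside,\n"
bad = "[Verse]\nGo now\nI was wandering endlessly through the quiet empty city streets tonight\n[Chorus]\nLa la la\n"

print(lint_lyrics(good))
for warning in lint_lyrics(bad):
    print(warning)
```

The heuristic will miscount many words, so its warnings are prompts for a manual read-through rather than hard errors.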

4. Perfect Template (Example)

Here is what a properly formatted prompt for generation should look like:

[Intro]

[Verse]
You could, you could take me to a place that's new,
A wave or a cloud, holding hands with only you.
You could, you could tell me that you have it all,
Everything I wanted, catching me before I fall.

[Chorus]
Take me out for a ride,
Along the docks or in the wood,
You could make me sick inside,
In just a single word you could.

[Chorus]
The sky would think it's right,
It loves the simple things,
My heart against yours tight,
If you love the simple things.
No, let's not do what others do
No, let's not do what others do
La, lalala, lalala
If you love the simple things
La, lalala, lalala, lalala la la la

[Verse]
You could, you could take me out to eat somewhere,
The finest of tables, honestly I do not care.
You could, you could use your charms to make me yield,
Step right up to me and make me drop my shield.

[Chorus]
Take me out for a ride,
Along the docks or in the wood,
You could make me sick inside,
In just a single word you could.

[Bridge]
Take me my boy, let me take a little bite,
Behind a deep feeling hides a man of great might.
Make me fly my boy, tell me if we are alright,
Behind a deep feeling hides a man of great might.

[Chorus]
Take me out for a ride,
Along the docks or in the wood,
You could make me sick inside,
In just a single word you could.

[Chorus]
The sky would think it's right,
It loves the simple things,
My heart against yours tight,
If you love the simple things.
La, lalala, lalala
If you love the simple things
La, lalala, lalala
La la la la la la la

Tag Guide (Prompt Engineering)

This guide is based on the analysis of the HeartMuLa research paper (Sections 3.2 and 6.2). The model uses a natural language tokenizer (Llama 3) rather than a fixed dictionary. To achieve stable generation, select tags from the 8 primary categories used during training.

The 8 Pillars of Training

Each category has an importance percentage representing its "selection probability" during training.

Training frequency: Tags were "sampled" (selected) during the training process. Genre was included 95% of the time, while Instrument was only included 25%.
Model expectations: For proper operation, the model expects the presence of a genre tag. Without it, the generation lacks a clear structural anchor.
Influence and stability: A higher percentage means greater stability. A tag with a 95% probability (Genre) is a "strong anchor," whereas a tag with 10% (Topic) is a "weak hint" that may be ignored if it conflicts with stronger tags.
Strategy: For maximum control, rely heavily on the top 4 categories (Genre, Timbre, Gender, Mood). Use low-percentage tags only as "seasoning" after the main structure is set.

Official Categories

GENRE (95% — MANDATORY)

Examples: Pop, Rock, Electronic, Hiphop, Jazz, Classical, Techno, Trance, Ambient.

TIMBRE (50% — Sound Texture)

Examples: Soft, Warm, Husky, Bright, Dark, Distorted.

GENDER (37% — Vocal Character)

Examples: Male, Female.

MOOD (32% — Emotional Vibe)

Examples: Happy, Sad, Energetic, Joyful, Melancholic, Relaxing, Dark.

INSTRUMENT (25% — Dominant Sounds)

Examples: Piano, Synthesizer, Acoustic Guitar, Electric Guitar, Bass, Drums, Strings, Violin.

SCENE (20% — Listening Context)

Examples: Dance, Workout, Dating, Study, Cinematic, Party.

REGION (12% — Cultural Influence)

Examples: K-pop, Latin, Western.

TOPIC (10% — Lyrical Theme)

Examples: Love, Summer, Heartbreak.
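
To get a feel for what these percentages imply, the sketch below simulates how often each category would appear in a training prompt. It assumes each category is included independently with the listed probability; the text above gives only the per-category percentages, so the independence assumption is ours and may not match the actual training procedure.

```python
import random

# Selection probabilities per category, taken from the list above.
CATEGORY_PROBABILITY = {
    "Genre": 0.95,
    "Timbre": 0.50,
    "Gender": 0.37,
    "Mood": 0.32,
    "Instrument": 0.25,
    "Scene": 0.20,
    "Region": 0.12,
    "Topic": 0.10,
}


def sample_categories(rng: random.Random) -> list:
    """Pick categories independently (an assumption, see the lead-in)."""
    return [name for name, p in CATEGORY_PROBABILITY.items() if rng.random() < p]


# Under independence, the expected number of categories per prompt is the sum.
expected_categories = sum(CATEGORY_PROBABILITY.values())

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
samples = [sample_categories(rng) for _ in range(trials)]
genre_share = sum("Genre" in s for s in samples) / trials

print(f"expected categories per prompt: {expected_categories:.2f}")
print(f"observed share of prompts with Genre: {genre_share:.3f}")
```

Under this assumption a typical training prompt carries fewer than three categories on average, which is consistent with the advice to keep prompts short and genre-anchored.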

For convenience, all 8 categories and their possible tags are presented as separate selection options on the website. You can skip them and enter your own set of tags in the Tags (optional) field.

Prompting Strategy: "Less is More"

To maintain a strong anchor and avoid "probability interference," do not use conflicting tags.

Semantic conflict: The prompt "Rock, Jazz" scatters the model's attention, which often leads to "muddy" or unexpressive, generic arrangements.
Anchor stability: One strong anchor provides a clear roadmap. Multiple genres create conflicting maps, causing the AI to lose focus.
Recommendation: Choose only one tag for each category. Be precise and avoid overly broad concepts.

Recommended Format

Use a comma-separated list.

Examples:

Electronic, Techno, Synthesizer, Dark, High Energy, Club
Pop, Piano, Female, Sad, Soft, Love, Acoustic
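
The "one tag per category" recommendation can be enforced with a small helper before the string is pasted into the tags field. This is a minimal sketch: `build_tag_string` is a hypothetical function, and ordering the tags by training frequency is our own convention, since the text above does not say that tag order matters.

```python
# Categories ordered by their training frequency, highest first.
PRIORITY = ["Genre", "Timbre", "Gender", "Mood", "Instrument", "Scene", "Region", "Topic"]


def build_tag_string(choices: dict) -> str:
    """Build a comma-separated tag list with at most one tag per category."""
    unknown = set(choices) - set(PRIORITY)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if "Genre" not in choices:
        raise ValueError("a Genre tag is required as the main anchor")
    for category, tag in choices.items():
        if "," in tag:
            raise ValueError(f"{category}: use exactly one tag, got {tag!r}")
    return ", ".join(choices[c].strip() for c in PRIORITY if c in choices)


tags = build_tag_string(
    {"Mood": "Sad", "Genre": "Pop", "Gender": "Female", "Instrument": "Piano"}
)
print(tags)
```

Because the input is a mapping from category to a single tag, a prompt such as "Rock, Jazz" cannot be expressed by accident.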

Demucs3 Model (vocals, drums, bass, other)

The Demucs3 algorithm splits a track into 4 stems (bass, drums, vocals, other). It won the Music Demixing Challenge 2021.

Link: https://github.com/facebookresearch/demucs/tree/v3

Quality table

Algorithm name	Multisong SDR Bass	Multisong SDR Drums	Multisong SDR Other	Multisong SDR Vocals	Multisong SDR Instrumental	Synth SDR Vocals	Synth SDR Instrumental
Demucs3 (Model A)	9.50	8.97	4.40	7.21	13.52	---	---
Demucs3 (Model B)	10.69	10.27	5.35	8.13	14.44	9.78	9.48

Note: Model A was trained only on MUSDB18 data, so its quality is lower than that of Demucs3 Model B. The two models have the same architecture but different weights.
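
For readers unfamiliar with the metric in the table, the sketch below computes SDR (signal-to-distortion ratio, in dB) using the common plain definition: ten times the base-10 logarithm of the reference energy divided by the energy of the error. The exact evaluation protocol behind the table is not described here, so treat this as an illustration of the metric rather than a reproduction of those numbers.

```python
import math


def sdr_db(reference: list, estimate: list) -> float:
    """Signal-to-distortion ratio in dB for two equal-length sample lists."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference is silent, SDR is undefined")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]  # every sample is off by 10%
print(f"SDR: {sdr_db(reference, estimate):.2f} dB")
```

Higher values are better, and because the scale is logarithmic, a gain of 1 dB corresponds to a noticeable reduction in error energy.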

Vit Large 23 (vocals, instrum)

VitLarge23 is an experimental model based on Vision Transformers. In terms of metrics it is slightly inferior to MDX23C, but it may work better in some cases.

Quality table

Algorithm name	Multisong SDR Vocals	Multisong SDR Instrumental	Synth SDR Vocals	Synth SDR Instrumental	MDX23 Leaderboard SDR Vocals
Vit Large 23 (512px) v1	9.78	16.09	12.33	12.03	10.47
Vit Large 23 (512px) v2	9.90	16.20	12.38	12.08	---

MVSep MelBand Roformer (vocals, instrum)

Mel Band Roformer is a model proposed by ByteDance employees for the Sound Demixing Challenge 2023, where it took first place on Leaderboard C. The original model was not made publicly available; it was reproduced from the scientific article by the developer @lucidrains on GitHub. The vocal model was trained from scratch on our internal dataset. We have not yet been able to match the metrics reported by the authors.

Quality table

Algorithm name	Multisong SDR Vocals	Multisong SDR Instrumental	Synth SDR Vocals	Synth SDR Instrumental	MDX23 Leaderboard SDR Vocals
Mel Band Roformer v1 (vocals)	9.07	---	11.76	---	---

LarsNet (kick, snare, cymbals, toms, hihat)

The LarsNet model divides the drum stem into 5 parts: 'kick', 'snare', 'cymbals', 'toms', and 'hihat'. The model comes from this GitHub repository and was trained on the StemGMD dataset. It has two operating modes. The first (default) mode applies the Demucs4 HT model in stage one to extract only the drum part from the track, and then applies LarsNet in stage two. If your track consists only of drums, the second mode makes more sense: LarsNet is applied directly to the uploaded audio. Subjectively, the separation quality is lower than that of the DrumSep model.
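
The two operating modes amount to an optional first stage in front of the same second stage. The sketch below shows only that control flow. The separator functions are passed in as parameters, and the `fake_*` stand-ins are hypothetical placeholders that do no real separation; they are not the Demucs or LarsNet APIs.

```python
from typing import Callable, Dict, List

Audio = List[float]  # placeholder type for an audio buffer

DRUM_PARTS = ("kick", "snare", "cymbals", "toms", "hihat")


def separate_drums(
    audio: Audio,
    extract_drums: Callable[[Audio], Audio],
    split_drum_stem: Callable[[Audio], Dict[str, Audio]],
    drums_only: bool = False,
) -> Dict[str, Audio]:
    """Run the optional drum-extraction stage, then the drum-splitting stage."""
    # Default mode: isolate the drum stem from the full mix first.
    # Drums-only mode: the input is already a drum track, so skip stage one.
    drum_stem = audio if drums_only else extract_drums(audio)
    parts = split_drum_stem(drum_stem)
    missing = set(DRUM_PARTS) - set(parts)
    if missing:
        raise ValueError(f"second stage did not return: {sorted(missing)}")
    return parts


# Hypothetical stand-ins so the sketch runs without any model weights.
def fake_extract_drums(audio: Audio) -> Audio:
    return [sample * 0.5 for sample in audio]


def fake_split(drum_stem: Audio) -> Dict[str, Audio]:
    return {part: list(drum_stem) for part in DRUM_PARTS}


full_mix = [1.0, 2.0]
print(separate_drums(full_mix, fake_extract_drums, fake_split)["kick"])
print(separate_drums(full_mix, fake_extract_drums, fake_split, drums_only=True)["kick"])
```

Skipping stage one for drum-only input avoids running an unnecessary separation pass over audio that contains nothing else to remove.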
