Four music datasets are shaping AI music training

OraCore Editors

June 20, 2026

Four music datasets with more than 21 million tracks are circulating among AI developers, and the legal fight is now moving toward licensing deals.

Four datasets with more than 21 million recordings are circulating among AI developers, and the split between research use and commercial use is now central to the fight over music AI.

Dataset	Tracks	Public origin	Notes
LAION-DISCO-12M	12 million+	Yes	Links to public YouTube tracks and metadata only
Large unnamed dataset	9 million	No public origin cited	One of the two biggest collections
Free Music Archive	100,000+	Yes	Used by Google and Stability AI, per The Atlantic
Unnamed small dataset	100,000+	No public origin cited	One of the two smaller collections

1. LAION-DISCO-12M

The biggest publicly documented collection in the report is LAION's LAION-DISCO-12M, a dataset of more than 12 million tracks released in November 2024. It was built by the German nonprofit for research, not for shipping a commercial music product.

LAION says the dataset is for academic settings and warns against using it commercially or in its original form for finished products. It does not distribute audio files; it provides links to publicly available YouTube tracks plus metadata.

12 million-plus tracks
Released in November 2024
Research-only framing
Metadata and links, not audio files

2. The 9 million-track collection

One of the two biggest datasets in the report holds roughly 9 million tracks, but The Atlantic did not identify a public origin for it in the article summary. That opacity is part of the problem for labels and artists trying to trace where training data comes from.

Its size matters because this is the scale where a dataset can influence model behavior across genres, eras, and artist catalogs. The report says the four datasets together include music by Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, and the Beatles.

About 9 million tracks
Publicly cited by The Atlantic, not fully sourced in the article summary
One of the four datasets that have been downloaded several thousand times
Contains copyrighted music, according to the report

3. Free Music Archive

The Free Music Archive is the clearest example of a dataset that began as a research resource and later became useful for AI training. It was published by academic researchers in 2017 for music-information-retrieval work, the kind of software research that focuses on searching, sorting, and analyzing music.

The archive draws on a catalog directed by WFMU, a freeform U.S. radio station whose artists had already released tracks under permissive Creative Commons licenses. That licensing history matters because the material was openly shared long before generative AI systems began training on it.

100,000-plus tracks
Academic release in 2017
Built from Creative Commons-licensed music
Used by Google and Stability AI, according to The Atlantic

4. The other 100,000-plus dataset

The fourth collection is another dataset with roughly 100,000 tracks, but the report does not name it in the excerpted text. Even so, it helps show the range of sources AI developers have been drawing from: some datasets are openly documented, while others are far harder to audit.

That gap between public documentation and actual usage is why the legal dispute keeps widening. The Atlantic report notes that all four datasets have been downloaded several thousand times, yet the industry still keeps much of its training data hidden.

100,000-plus tracks
Unnamed in the report excerpt
Downloaded several thousand times
Illustrates the audit problem in AI music training

5. What the lawsuits and licenses mean

The datasets matter because they sit inside a larger legal shift. Udio and Suno are facing at least 12 lawsuits, while major rightsholders have started moving from pure litigation to licensing. Universal Music Group settled with Udio in October 2025, and Warner Music Group followed with its own Udio deal and then a first-of-its-kind partnership with Suno.

Those agreements point to a future where some AI music tools may operate inside licensed systems rather than open training pipelines. At the same time, Sony Music remains in court, and independent artists and groups such as the American Federation of Musicians are still pressing claims over uncredited or uncompensated use.

UMG settled with Udio in October 2025
Warner settled with Udio in November 2025
Warner then settled with Suno
Sony Music remains in active litigation

How to decide

If you care most about scale, LAION-DISCO-12M is the headline dataset. If you care about provenance, the Free Music Archive is the clearest case of a research dataset with known licensing roots. If you care about where the market is heading, the Udio and Suno settlements matter more than any single dataset because they show the industry moving toward licensed AI music systems.

For readers tracking risk, the main signal is not just how many tracks sit in a dataset, but whether artists, labels, and platforms can see how that data was gathered and used. That transparency gap is now the core issue.
