Four music datasets are shaping AI music training
Four music datasets with more than 21 million tracks are circulating in AI training, and the legal fight is now moving toward licensing deals.

Four large music datasets are shaping how AI music models get trained.
Four datasets with more than 21 million recordings are circulating among AI developers, and the split between research use and commercial use is now central to the fight over music AI.
| Dataset | Tracks | Public origin | Notes |
|---|---|---|---|
| LAION-DISCO-12M | 12 million+ | Yes | Links to public YouTube tracks and metadata only |
| Large unnamed dataset | About 9 million | None cited | One of the two biggest collections |
| Free Music Archive | 100,000+ | Yes | Used by Google and Stability AI, per The Atlantic |
| Unnamed small dataset | 100,000+ | None cited | One of the two smaller collections |
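
As a quick check on the headline figure, the table can be tallied in a few lines of Python. This is a minimal sketch, not anything from the report: the `Dataset` class, its field names, and the helper functions are hypothetical, and each count is treated as a lower bound even though the article gives the second dataset only as "roughly 9 million".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    min_tracks: int      # approximate lower bound taken from the table (assumption)
    public_origin: bool  # whether the article cites a public origin


# Values copied from the table above; names of unnamed datasets are placeholders.
DATASETS = [
    Dataset("LAION-DISCO-12M", 12_000_000, True),
    Dataset("Large unnamed dataset", 9_000_000, False),
    Dataset("Free Music Archive", 100_000, True),
    Dataset("Unnamed small dataset", 100_000, False),
]


def total_tracks(datasets):
    """Sum the per-dataset lower bounds."""
    return sum(d.min_tracks for d in datasets)


def with_public_origin(datasets):
    """Return the names of datasets whose origin is publicly documented."""
    return [d.name for d in datasets if d.public_origin]


if __name__ == "__main__":
    print(f"Combined lower bound: {total_tracks(DATASETS):,} tracks")
    print("Publicly documented:", ", ".join(with_public_origin(DATASETS)))
```

Under these assumptions the combined figure comes to a little over 21 million, in line with the article's total, and only two of the four collections have a documented public origin.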
1. LAION-DISCO-12M
The biggest publicly documented collection in the report is LAION's LAION-DISCO-12M, a dataset of more than 12 million tracks released in November 2024. It was built by the German nonprofit for research, not for shipping a commercial music product.

LAION says the dataset is for academic settings and warns against using it commercially or in its original form for finished products. It does not distribute audio files; it provides links to publicly available YouTube tracks plus metadata.
- 12 million-plus tracks
- Released in November 2024
- Research-only framing
- Metadata and links, not audio files
2. The 9 million-track collection
One of the two biggest datasets in the report holds roughly 9 million tracks, but The Atlantic did not identify a public origin for it in the article summary. That opacity is part of the problem for labels and artists trying to trace where training data comes from.
Its size matters because this is the scale where a dataset can influence model behavior across genres, eras, and artist catalogs. The report says the four datasets together include music by Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, and the Beatles.
- About 9 million tracks
- Reported by The Atlantic, with no public origin identified in the article summary
- Downloaded several thousand times, like the other three datasets
- Contains copyrighted music, according to the report
3. Free Music Archive
The Free Music Archive is the clearest example of a dataset that began as a research resource and later became useful for AI training. It was published by academic researchers in 2017 for music-information-retrieval work, the kind of software research that focuses on searching, sorting, and analyzing music.

The archive draws on a catalog run by WFMU, a freeform U.S. radio station, where artists had already released tracks under permissive Creative Commons licenses. That licensing history matters because the material was openly shared long before generative AI systems began training on it.
- 100,000-plus tracks
- Academic release in 2017
- Built from Creative Commons-licensed music
- Used by Google and Stability AI, according to The Atlantic
4. The other 100,000-plus dataset
The fourth collection is another dataset with roughly 100,000 tracks, but the report does not name it in the excerpted text. Even so, it helps show the range of sources AI developers have been drawing from: some datasets are openly documented, while others are far harder to audit.
That gap between public documentation and actual usage is why the legal dispute keeps widening. The Atlantic report notes that all four datasets have been downloaded several thousand times, yet the industry still keeps much of its training data hidden.
- 100,000-plus tracks
- Unnamed in the report excerpt
- Downloaded several thousand times
- Illustrates the audit problem in AI music training
5. What the lawsuits and licenses mean
The datasets matter because they sit inside a larger legal shift. Udio and Suno are facing at least 12 lawsuits, while major rightsholders have started moving from pure litigation to licensing. Universal Music Group settled with Udio in October 2025, and Warner Music Group followed with its own Udio deal and then a first-of-its-kind partnership with Suno.
Those agreements point to a future where some AI music tools may operate inside licensed systems rather than open training pipelines. At the same time, Sony Music remains in court, and independent artists and groups such as the American Federation of Musicians are still pressing claims over uncredited or uncompensated use.
- UMG settled with Udio in October 2025
- Warner settled with Udio in November 2025
- Warner then settled with Suno
- Sony Music remains in active litigation
How to decide
If you care most about scale, LAION-DISCO-12M is the headline dataset. If you care about provenance, the Free Music Archive is the clearest case of a research dataset with known licensing roots. If you care about where the market is heading, the Udio and Suno settlements matter more than any single dataset because they show the industry moving toward licensed AI music systems.
For readers tracking risk, the main signal is not just how many tracks sit in a dataset, but whether artists, labels, and platforms can see how that data was gathered and used. That transparency gap is now the core issue.