Alex Reisner, a reporter for The Atlantic, has identified four datasets of music used to train artificial intelligence models and made them fully searchable for the public through the publication's AI Watchdog site. Two of these datasets are enormous, containing 12 million and 9 million tracks respectively, while the other two are smaller but still substantial, each comprising over 100,000 songs.
Datasets Downloaded Thousands of Times
According to Reisner, the datasets have been downloaded thousands of times. Although it is impossible to determine exactly who has used them, Google and Stability AI have both confirmed in research papers that they have used these datasets. Some of the music, such as the tracks in the Free Music Archive dataset, is free to stream for personal use but requires a license for commercial use.
How AI Developers Access the Music
Although the datasets are, in theory, freely available on the internet, using them as training data is not as simple as downloading a ZIP file and feeding it to an AI model. Reisner explains: "Three of the datasets I found are distributed as a list of links to songs on YouTube or Spotify. AI developers download the actual audio using tools that automate the job, some of which allow developers to bypass logins, advertisements, and mechanisms that might earn money or subscribers for creators. Such tools violate the terms of service of these platforms."
Artists Included in the Datasets
Names that appear in the datasets range from pop stars like Lady Gaga and Fred Again.. to Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and experimental composer Hainbach. The Atlantic's AI Watchdog site allows users to search through the songs, books, and other media being used to train the world's AI models.



