I want to make a song recommender using MusicBrainz’s data. I’m not really concerned with scalability to many users, but I would like to have all United States (for starters) artist/song data so that my back end can be something non-trivial. What I am wondering, and would greatly appreciate anyone’s thoughts on, is: what are the best data collection methods (direct download, API, server?) and storage methods (CSV, SQL DB, etc.?) for this? I have read from SQL databases before but never maintained one myself, and it seems like a lot of overhead compared to CSVs, unsightly as those may be. Also, I would like to get lyrics to make subject matter a part of this recommender. Genius is an obvious place to go, but I’m no lawyer and their TOS doesn’t seem like it would welcome this. So, does anyone know of a good way or place to get lyrics to lots of songs? Thank you in advance for anything!
Edit: I don’t need to display the lyrics. I would just like to use something like NLP to get a summary or a few terms and use those as a factor in the recommendation process.
Lyrics are a legal mess and you almost certainly would need to pay someone to license them.
“All US artists and music” is a crazy amount of data to deal with in CSV form, so SQL would probably be better for you. If you want you could also use the JSON data dumps, but if you want the data to be kept up to date I’d personally still do SQL unless there’s a good reason you cannot.
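To give a feel for how little overhead a small SQL setup needs, here is a minimal sketch using SQLite from Python’s standard library. It assumes the data arrives as one JSON object per line; the sample records and the field names (`id`, `name`, `country`) are made up for illustration and should be checked against the dump you actually download. The table layout and the helper names `load_artists` and `artists_in_country` are hypothetical too.

```python
import json
import sqlite3

# Hypothetical sample records, one JSON object per line. The field names
# ("id", "name", "country") are assumptions, not a confirmed dump schema.
SAMPLE_LINES = [
    '{"id": "a1", "name": "Example Band", "country": "US"}',
    '{"id": "a2", "name": "Another Act", "country": "GB"}',
    '{"id": "a3", "name": "Third Artist", "country": "US"}',
]


def load_artists(conn, lines):
    """Create the artist table if needed and insert one row per JSON line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS artist ("
        "mbid TEXT PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
    )
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append((record["id"], record["name"], record.get("country")))
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO artist VALUES (?, ?, ?)", rows)
    # An index keeps per-country lookups fast once the table is large.
    conn.execute("CREATE INDEX IF NOT EXISTS artist_country ON artist (country)")


def artists_in_country(conn, country):
    """Return artist names for one country code, sorted alphabetically."""
    cur = conn.execute(
        "SELECT name FROM artist WHERE country = ? ORDER BY name", (country,)
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
load_artists(conn, SAMPLE_LINES)
print(artists_in_country(conn, "US"))
```

SQLite is a single file with no server to maintain, so it may be a reasonable middle ground between CSVs and a full database server for a one-person project.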
I don’t think that you will find a legal way to harvest existing lyrics sites for free.
If you are willing to pay for it, you could try https://developer.musixmatch.com/ (They call themselves, quite modestly, “…the world’s leading music data company” and “with the world’s largest lyrics database with more than 14 million lyrics in over 50 distinct languages”.)
Thanks very much for the link. I’ll look into this more myself and update if I find anything, but in case you know, and for anyone else who finds themselves asking these kinds of questions (which I think are reasonable):
How does Genius get away with hosting lyrics that anyone can fill in (and then go and take legal action over IP that’s not theirs either, which I believe is why the Supreme Court turned down their case against Google)?
Does it make a difference if I don’t need to display the lyrics? https://developer.musixmatch.com/plans makes a point of ensuring that the lyrics are displayed legally. I will be editing my question to reflect this, but my primary interest in lyrics for this project is to perform NLP on them and use the result as a factor in the recommendation process. Displaying lyrics would be a plus, but is not necessary.
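To make the NLP idea concrete, here is a rough sketch of one way it could work: weight each song’s words with TF-IDF, take the top terms as a summary, and use cosine similarity between songs as one factor in the recommendation. This is only an illustration using the Python standard library. The texts are made-up placeholders rather than real lyrics, the stopword list is deliberately tiny, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "in", "of", "to", "i", "you", "my", "on", "is", "it"}


def tokenize(text):
    """Lowercase the text and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Map each document name to a {term: tf-idf weight} dictionary."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized)
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def top_terms(vector, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(vector, key=lambda term: (-vector[term], term))[:k]


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if not norm_u or not norm_v:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up placeholder texts, not real lyrics.
docs = {
    "song_a": "rain on the highway, rain on my window, driving home alone",
    "song_b": "driving all night down the highway with the radio on",
    "song_c": "dancing in the kitchen, sugar and coffee, dancing till morning",
}
vectors = tfidf_vectors(docs)
for name, vector in vectors.items():
    print(name, top_terms(vector))
print(cosine(vectors["song_a"], vectors["song_b"]))
print(cosine(vectors["song_a"], vectors["song_c"]))
```

With an approach like this only the derived terms and scores would need to be stored, not the lyrics text. Whether that changes anything from a licensing point of view is a separate question that the code cannot answer.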
As far as I understand it, the difference between the two Musixmatch price plans “Free” and “Plus” comes down to this sentence:
Free: We [Musixmatch] ensure lyrics are legally displayed, managing worldwide copyrights
Pro: [You] Legally display lyrics, we take care of copyright worldwide
As you only get 30% of the lyrics per song in the “Free” plan, it is only usable for testing and evaluation. I don’t think you can feed NLP with only 30% of the lyrics text, but you would have to test it to be sure.
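One rough way to run that test: take a few texts you have in full, cut each down to 30%, and check how many of the key terms survive. The sketch below assumes the 30% means the first 30% of the lines, which is a guess on my part. The text is a made-up placeholder, and the word-length filter is only a crude stand-in for proper stopword handling.

```python
import re
from collections import Counter


def key_terms(text, k=5):
    """Return the k most frequent words longer than three letters."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 3)  # crude stopword filter
    return {term for term, _ in counts.most_common(k)}


def truncate_lines(text, fraction=0.3):
    """Keep the first `fraction` of the non-empty lines (assumed truncation)."""
    lines = [line for line in text.splitlines() if line.strip()]
    keep = max(1, round(len(lines) * fraction))
    return "\n".join(lines[:keep])


def term_overlap(full_text, fraction=0.3, k=5):
    """Share of the full text's key terms also found in the truncated text."""
    full = key_terms(full_text, k)
    partial = key_terms(truncate_lines(full_text, fraction), k)
    return len(full & partial) / len(full) if full else 0.0


# Made-up placeholder text, not real lyrics.
PLACEHOLDER = "\n".join([
    "river runs under the silver bridge",
    "river carries every letter home",
    "lanterns floating over water",
    "nobody waits beside the station",
    "winter closes every window",
    "river keeps the silver secret",
    "lanterns fading into morning",
    "nobody answers when I call",
    "winter river, silver water",
    "carry every lantern home",
])

print(term_overlap(PLACEHOLDER))
```

A value close to 1.0 across many songs would suggest the truncated text is still usable for this purpose, and a low value would suggest it is not.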