Using OCR on CAA images to get lyrics and seed relationships

This is a bit science fiction, but I’ve had an idea for quite some time…

We have the CAA, we have scans, and there are open-source OCR engines like Tesseract. So what about automatically generating lyrics from cover art of type “Booklet”? Human editors could correct them later.
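To make the idea a bit more concrete, here is a rough sketch of the first step: asking the CAA for a release's image index, picking out the images typed “Booklet”, and running an OCR pass on a downloaded scan. The helper names (`booklet_image_urls`, `fetch_caa_index`, `ocr_image`) are made up for illustration, and the OCR step assumes the third-party `pytesseract` and `Pillow` packages plus a local Tesseract install. Real booklet scans would almost certainly need preprocessing before the output is usable.

```python
import json
import urllib.request

# Public CAA endpoint that returns a JSON index of a release's images.
CAA_RELEASE_URL = "https://coverartarchive.org/release/{mbid}"


def booklet_image_urls(caa_index):
    """Return the URLs of all images tagged "Booklet" in a CAA release index."""
    return [
        img["image"]
        for img in caa_index.get("images", [])
        if "Booklet" in img.get("types", [])
    ]


def fetch_caa_index(mbid):
    """Download and parse the CAA image index for one release MBID."""
    with urllib.request.urlopen(CAA_RELEASE_URL.format(mbid=mbid)) as resp:
        return json.load(resp)


def ocr_image(path, lang="eng"):
    """Run Tesseract on a local image file and return the raw text.

    Assumption: pytesseract and Pillow are installed. They are imported
    lazily so the filtering helpers above work without the OCR stack.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), lang=lang)
```

The filtering is kept separate from the network and OCR calls so it can be tried on a saved index without hitting the Internet Archive at all.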

Maybe it’s even possible to auto-generate some relationships from CAA scans this way? A bot could create them, and they would only be accepted once human editors vote “yes” on them.
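As a very rough illustration of the relationship part, a bot could scan the OCR text for common credit phrases and turn each hit into a proposal for editors to vote on. Everything here is an assumption for the sake of the sketch: the phrases, the role labels (which are not actual MusicBrainz relationship type names), and the `propose_credits` helper. Matching the extracted names to real artists would be a separate and much harder problem.

```python
import re

# Hypothetical credit phrases mapped to illustrative role labels.
CREDIT_PATTERNS = {
    "producer": re.compile(r"^\s*produced by[:\s]+(.+)$", re.IGNORECASE),
    "mixer": re.compile(r"^\s*mixed by[:\s]+(.+)$", re.IGNORECASE),
    "recording": re.compile(r"^\s*recorded by[:\s]+(.+)$", re.IGNORECASE),
}


def propose_credits(ocr_text):
    """Return (role, name) pairs found in OCR text, one per matching line.

    These are only proposals; nothing is checked against the database.
    """
    proposals = []
    for line in ocr_text.splitlines():
        for role, pattern in CREDIT_PATTERNS.items():
            match = pattern.match(line)
            if match:
                proposals.append((role, match.group(1).strip()))
    return proposals
```

For example, `propose_credits("Produced by Jane Doe")` gives one `("producer", "Jane Doe")` pair, which a human would still have to confirm.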

Maybe a nice project for GSoC?


Just keep in mind that all copyright/legal matters related to the Cover Art Archive are handled by the Internet Archive; none of MetaBrainz’s infrastructure touches files uploaded to the CAA. I think your dream is definitely plausible (and you’re not the first to mention something like this either), but it cannot and will not be something that MetaBrainz hosts, since that would directly require MetaBrainz to “touch” the files. On top of that, lyrics are a big copyright/legal nightmare in their own right, so it’s just not something MetaBrainz would be interested in, even if it didn’t conflict with the legal setup of how MetaBrainz works.

That said, I would personally welcome anyone working on a third-party project for something like this. (AcoustID is a wonderful example of how you can build things on MB data without having any direct affiliation with it.) The data is made free for everyone precisely so that anyone can take it and work on cool stuff like this. That doesn’t mean MetaBrainz has to be directly involved with all offspring projects. :slight_smile: