So I’ve been wanting to mess around more with the “raw” data of MusicBrainz for a while, and I’ve finally gotten started on it. One of the things I wanted was to see if it was possible to revive some of the bots that we used to have scour the database and fix up obvious derps and outdated bits of data…
FresoBot is, code‐wise, a fork of @murdos’ fork of @lukz’ musicbrainz-bot written in Python (2). FresoBot has a few (so far minor) updates/touch-ups to the codebase, but most importantly, it has a new script/module: spotify_url_cleanup.py.
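The thread does not spell out what spotify_url_cleanup.py actually rewrites, so the following is only a rough sketch of what this kind of cleanup could look like. The accepted input shapes (`play.`/`open.` hosts, `spotify:` URIs), the canonical `https://open.spotify.com/<type>/<id>` target, and the function name `clean_spotify_url` are all assumptions for illustration, not the script's real rules.

```python
import re

# Assumption: canonical form is https://open.spotify.com/<type>/<id>.
# Assumption: IDs are 22 alphanumeric characters; the entity types
# listed here are illustrative, not taken from the bot's code.
_SPOTIFY_RE = re.compile(
    r"^(?:https?://(?:open|play)\.spotify\.com/|spotify:)"
    r"(?P<kind>artist|album|track)[/:](?P<id>[A-Za-z0-9]{22})"
)


def clean_spotify_url(url):
    """Return the assumed canonical URL, or None if the input is not recognised."""
    match = _SPOTIFY_RE.match(url.strip())
    if match is None:
        return None  # leave anything unexpected for a human to look at
    return "https://open.spotify.com/{kind}/{id}".format(**match.groupdict())
```

Returning `None` for anything unrecognised keeps the sketch conservative: a bot built this way would skip odd URLs rather than guess.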
After some testing, I’m now letting the bot loose on MusicBrainz.org. To begin with, I’ve only let it do 25 edits. I looked them over myself, and they all look good to me, but I’m not going to let it do more until everyone else has had a chance to look its edits over, so please give https://musicbrainz.org/user/FresoBot/edits a look.
If no one has complained about any of its edits by tomorrow morning (say, about 12 hours from now), I’ll let it do 50 more then. If no one has any complaints/objections 24 hours from that, I’ll let it do 100/day. @yvanzo tells me «there are about 5K URLs to be cleaned up», so it should be doable to get Spotify URLs cleared up in a couple of months at that rate, without flooding the edit queue at the same time.
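As a quick sanity check on the “couple of months” estimate, the ramp-up described above (25 edits, then 50, then 100/day) can be worked through in a few lines. The function name and the treatment of each ramp step as one day are assumptions made for the sake of the arithmetic.

```python
import math


def days_to_finish(total_urls, ramp=(25, 50), steady_per_day=100):
    """Count days when the first days follow `ramp` and the rest run at a steady rate."""
    remaining = total_urls
    days = 0
    for batch in ramp:
        if remaining <= 0:
            return days
        remaining -= batch  # one ramp-up batch per day (simplifying assumption)
        days += 1
    if remaining > 0:
        days += math.ceil(remaining / steady_per_day)
    return days


print(days_to_finish(5000))  # 52
```

With roughly 5K URLs that comes to about 52 days of bot runs, so a bit under two months if nothing else is queued alongside it.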
PS. This bot is written on my own time and not as part of my MetaBrainz contract. The MetaBrainz Foundation is not involved in the bot’s development in any way.
There’s one already listed at https://musicbrainz.org/user/FresoBot - I’ll try and add more as I come up with them. Murdos used to do a lot of Discogs data matching. Maybe I’ll try and revive some of those. I’ve also been thinking about using the Spotify data to add additional Spotify URLs to MusicBrainz (e.g., if a Release has a Spotify URL and the MusicBrainz and Spotify releases have the same number of tracks and they’re called roughly the same in the same order, add Spotify track URLs to all Recordings; if a Release has a Spotify URL but any Artist(s) involved do(es) not and their names are the same/similar, add Spotify links to artists). Of course, this latter part will be much easier if all the Spotify URLs are (fairly) uniform, which they are not at all right now…
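The “same number of tracks, called roughly the same, in the same order” idea could be sketched like this. It is only an illustration: the helper names, the use of `difflib` for fuzzy comparison, and the 0.85 threshold are all assumptions, and a real bot would want something stricter (and a human looking over the results).

```python
from difflib import SequenceMatcher


def titles_similar(a, b, threshold=0.85):
    """Case-insensitive fuzzy comparison of two track titles."""
    ratio = SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()
    return ratio >= threshold


def tracklists_match(mb_titles, spotify_titles, threshold=0.85):
    """True only if both lists are non-empty, equally long, and similar position by position."""
    if not mb_titles or len(mb_titles) != len(spotify_titles):
        return False
    return all(
        titles_similar(mb, sp, threshold)
        for mb, sp in zip(mb_titles, spotify_titles)
    )
```

Comparing position by position is what enforces the “same order” condition; two releases with the same titles in a different order would be rejected.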
I just ran another 50 edits of the Spotify URL cleanup script. If no one has any complaints by tomorrow, I’ll start doing 100/day (up to the limits set by the Bot Code of Conduct).
@Freso: Just one more thing: it should check that the new URL doesn’t already exist before making the edit! In that case, the best option is to edit the old URL’s relationships to use the existing clean URL instead.
Erratum: Actually, such an edit is fine even in this case, as it merges both URLs. This issue applies only to direct database updates, as in MBS-9597, not to edits entered by a bot, as here.
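For a bot that wants to log what each edit is expected to do, the distinction above could be made explicit with a small check. This is a hypothetical helper, not part of the bot: `plan_url_edit` and the three outcome labels are invented for illustration, and `existing_urls` stands in for whatever lookup the bot has available.

```python
def plan_url_edit(old_url, clean_url, existing_urls):
    """Describe what editing old_url into clean_url is expected to result in."""
    if old_url == clean_url:
        return "skip"    # already clean, nothing to do
    if clean_url in existing_urls:
        return "merge"   # the edit merges the old URL into the existing clean one
    return "rename"      # the edit simply changes the URL
```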
I’ve made and run a new script, exit_url_cleanup.py, which made around 200 edits, I guess. I double and triple and quadruple checked the output several times, incl. a handful of the actually created edits on test and on the live database when I let it loose there. Many of the URLs need more cleaning than what this does, but it’s a step closer to being usable URLs. (There are also lots of relationships that need fixing, which I also hope to be able to make the bot do.) Anyway, just a heads up.
I still haven’t gotten any complaints about any edits the bot has made, so I’ve been thinking about upping this to 200/day starting next Monday. If no one has any objections, I’ll Make It So™.
Could FresoBot also think about the Amazon Art issue? Would seem a useful task for a bot to handle.
When a CAA cover is missing and an ASIN link is in place, could the bot be the one running around downloading from Amazon and then uploading to CAA?
It could be really handy as it could mark it as “FresoBot Automatically Uploaded Art from Amazon”
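The “locating” half of this can be sketched without uploading anything. The example below assumes release data shaped like the MusicBrainz web service JSON, with an `asin` field and a `cover-art-archive` object carrying an `artwork` flag; the function names are hypothetical, and the sketch only lists candidates for someone to look at.

```python
def needs_cover_review(release):
    """True if the release has an ASIN but no artwork in the Cover Art Archive."""
    has_asin = bool(release.get("asin"))
    caa = release.get("cover-art-archive") or {}
    has_caa_art = bool(caa.get("artwork"))
    return has_asin and not has_caa_art


def candidates(releases):
    """Return the MBIDs of releases worth checking by hand."""
    return [r["id"] for r in releases if needs_cover_review(r)]
```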
It could, but I wouldn’t. Artwork from Amazon is in many instances not correct, and I have no way to verify every edit made by the bot, and I don’t want to make an automated instance upload wrong data.
If the ASIN link is already wrong, the displayed cover is actually also wrong.
It doesn’t get “wronger” if you upload this cover to CAA.
The big advantage would be that such a cover is available “forever” on CAA, not only for the time remaining until Amazon starts to charge MB and others for this service.
With a comment like @IvanDobsky’s “FresoBot Automatically Uploaded Art from Amazon”, everyone knows that the quality and matching for this cover isn’t 100% guaranteed.
Even if the ASIN is right, the cover may be wrong (and, by extension, the wrong ASIN may actually provide the right cover art!). Saying that “this Release corresponds to this ASIN” is making a statement that the Amazon Standard Identification Number corresponds to our MusicBrainz Identifier. Uploading the cover art to CAA is making a statement that that cover art is the correct cover art for the release. Those are two very different statements that are independent of each other.
But if below the cover you see “provided by Amazon” you already expect it to be wrong. You’ll think: Ah, that’s about what the cover of this album looks like, but nobody uploaded the correct cover for this very release yet.
I don’t believe that the majority of MB users really think this way.
They assume: What I get/see here is basically correct. It’s an encyclopedia. So every link to an external source is accurate.
Or do you really assume that every link to other external sources like Wikipedia, Facebook, Soundcloud, iTunes, Google Play, etc. is just a “looks like” or “maybe true”?
No, you’re right. Most users probably don’t think that way, and when they e.g. download the cover art via Picard they expect it to be correct. Then again, most of them probably don’t care much if the cover art is a bit different.
But I meant the editors and not the users. If an editor sees uploaded cover art, they probably expect it to be correct, but if they see cover art provided by Amazon, they might feel the need to find and upload the correct cover art.
Not really, because you can always switch it to the matching release .-)
The risk of CAA closing down is (IMHO) smaller than the risk of Amazon starting to charge for, or preventing, direct linking to artwork on their servers.
It is wronger to have wrong data.
Having no data is good, having wrong data is no good. Here is why:
Because when there is data (wrong or not), the chances that a human will edit it are lower than when there is no data.
Automatically setting wrong data will prevent human edits.
Human edits are either at the same low level (using the linked cover without checking) or at a high level (checking by eye that it’s the correct version of the cover).