Hi there, sorry for breaking things

I figured I should be a bit more public and responsive if I’m going to continue editing, and perhaps explaining more of what I’m doing will help clarify why I make some of the mistakes I do.

I have a handful of apps that I use which depend on MusicBrainz data to work properly. Over time, pressure has grown on these apps to support using Discogs instead of MusicBrainz because of perceived better quality of data, especially missing releases for electronic artists who publish a lot on SoundCloud. In my opinion, that is a bad pivot and is going to lead to (is already leading to) some instability in the projects. I’m not against Discogs at all and there is some use for aggregating services, but there are also a lot of advantages to improving an existing ecosystem.

The project I’m working on takes a list of artists and tries to do a range of audits on their Digital Media with as much automation as I can without making huge errors. Automation is supported by n8n and a range of userscripts (some “controllers”, some “workers”, and some “action panels”). Data is supported by a local MusicBrainz mirror and a local Harmony server with an API I built on top.

  1. Try to identify releases that have links but are missing barcodes. I use those links to pull release info from Harmony and then use that to help fill in the gaps (a rough sketch of this check follows the list).
  2. Try to identify releases with links that have conflicting barcodes. I’m still working on improving automation here as it’s more tedious.
  3. Identify releases with missing ISRCs and a valid barcode and link. Send those to Harmony Actions to import ISRCs.
  4. Audit the recordings page and merge together matching recordings based, as much as possible, on the filled ISRCs.
  5. Use any existing links to fill in missing cover art.
  6. Scan Deezer and Spotify via API for releases that do not exist in MusicBrainz. Send these releases to Harmony and start an import flow. This can be done by n8n workflows or by in-browser userscripts. Beatport and SoundCloud will be up next after that, but the Spotify and Deezer APIs are just easier to work with.
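To make the first check more concrete, here’s a rough sketch of what it could look like. It’s simplified: the mirror URL, pagination, and field handling are illustrative only, and it assumes the standard ws/2 JSON browse endpoint running on a local mirror.

```typescript
// Sketch: find an artist's releases that have URL relationships but no barcode,
// so they can be queued for a Harmony lookup. Assumes a local MusicBrainz
// mirror at http://localhost:5000 (ws/2 JSON API) and Node 18+ (built-in fetch).

const MIRROR = "http://localhost:5000";

interface UrlRelation { type: string; url: { resource: string } }
interface Release {
  id: string;
  title: string;
  barcode: string | null;
  relations?: UrlRelation[];
}

async function releasesMissingBarcode(artistMbid: string): Promise<Release[]> {
  const candidates: Release[] = [];
  let offset = 0;
  while (true) {
    const res = await fetch(
      `${MIRROR}/ws/2/release?artist=${artistMbid}&inc=url-rels&fmt=json&limit=100&offset=${offset}`
    );
    const page = await res.json();
    for (const release of page.releases ?? []) {
      const hasLinks = (release.relations ?? []).length > 0;
      const hasBarcode = Boolean(release.barcode);
      if (hasLinks && !hasBarcode) candidates.push(release);
    }
    offset += 100;
    if (offset >= (page["release-count"] ?? 0)) break;
  }
  return candidates;
}
```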

As I’ve gone through this process, I’ve made incremental improvements and learned a lot more about how MusicBrainz “works”. I still have a lot of gaps in my understanding and processes but I appreciate those who have helped me along, particularly afrocat and chaban.

The biggest snags in my processes:

  1. trusting existing data too much - if existing links or barcodes or artist attributions are wrong and Harmony isn’t able to detect that, I will often kick off a flow that makes some bad assumptions
  2. artist collision - I’m working on a flow in n8n that I think will improve artist collision but after a few weeks of looking at it, I see that this is really a problem the music industry as a whole is struggling with
  3. MB rate-limiting - My best run so far was about 21,000 edits in 18 hours and I hope to go much further than that, but I spend a lot of time just working out which requests route where (I have a “MusicBrainz gateway” I created to help make some of the routing decisions).
  4. flooding review/approval queues - if I introduce errors, it may be much harder for other editors to catch them unless they subscribe to a fairly narrow range of entities, and even then, it is hard to review 100s of changes at a time. While this is kind of backwards, I hope to turn to the edit approval processes once I have more of the automated adding working, so that I can then spend all my human time on reviewing the information being added or edited.

If you notice other patterns of mistakes, you should see in my comments that I’m working on those, but I’m very open to feedback and hope to learn more. If there is something you struggle with, we might be able to collaborate on that. I’m still working on understanding GitHub better, but if you find my flows helpful, I intend to contribute to the existing projects and userscripts so we can all benefit.


Am I understanding you correctly that this is a human-supervised process and not a fully automated bot?

ISRCs are pretty terrible identifiers, with music industry practices leading to lots of duplicate assignments but also shared assignments where they shouldn’t be (at least in the MusicBrainz definition of recordings).

That is quite a lot, and as a member of the community I will expect that you have at least looked at every edit you do. You can’t just fire off tens of thousands of edits and expect the community to review them for you.


It is human supervised.

To clarify, I don’t only use ISRCs. I also check for matching track lengths and acoustids. In a future version, I’m going to add deeper fingerprint comparisons for tricky examples. I would be surprised if I’ve done a single inappropriate recording merge but it’s possible.

I understand the desire for accuracy, but I think whether one has “looked” at every edit is a little hard to define. I visually track every step in the add/edit release process and most moves from step to step are controlled by hotkeys, but I still make mistakes sometimes. From my perspective, if the requirement is that every add or edit has every detail reviewed, that’s not a reasonable standard and doesn’t reflect the quality of data I already see and am correcting.


Nobody expects perfection, but we also shouldn’t blindly replicate bad data from streaming sites. If an artist is incorrectly merged on e.g. Spotify, tools like Harmony already make it quite easy to copy that mistake to our database, and the more you automate the process the less likely a user is to notice that in my opinion.

21k edits in 18 hours, that’s about three seconds per edit. Of course not all of those edits will be the same type, but at some point I have to doubt whether the editor actually had a look at the artists involved in adding a release and at least did a plausibility check for whether they’re actually by the same person.

When merging supposedly duplicate recordings, it’s perfectly fine to identify candidates based on title and length values (often that’s all we have, after all), but I don’t think it’s unfair to expect the editor who wants to do the merge to have a look at the releases the recordings appear on and check if there maybe is a reason they are separated (such as one of the recordings being on a DJ-mixed release, which is not always identified by a disambiguation comment).

On artist collision, that’s exactly the kind of thing I’m wary of and want to get better at. I messed a few up when I first started but I learned quickly what kinds of cases to watch out for. As a simple example, I don’t process any common mononyms and I don’t trust Deezer. Eventually, I’ll be able to compare all the releases across all the platforms to increase confidence further. I also have some supporting scripts that report to me plausible mixed up artists and steer clear of them until I’ve verified links.
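For what it’s worth, that guard is conceptually something like the sketch below. The word list and rules here are placeholders, not my real ones, and the names are purely illustrative.

```typescript
// Hedged sketch of the collision guard: skip artist credits that look
// collision-prone (common mononyms, Deezer-only evidence) until links are
// verified by hand.

type ArtistCandidate = {
  name: string;
  sources: ("spotify" | "deezer" | "beatport" | "soundcloud")[];
  verifiedLinks: boolean;
};

const COMMON_MONONYMS = new Set(["luna", "nova", "ghost", "shadow"]); // examples only

function safeToAutoCredit(artist: ArtistCandidate): boolean {
  const isMononym = !artist.name.trim().includes(" ");
  if (isMononym && COMMON_MONONYMS.has(artist.name.toLowerCase())) return false;
  // Deezer alone isn't trusted as evidence of identity.
  const onlyDeezer = artist.sources.every((s) => s === "deezer");
  if (onlyDeezer && !artist.verifiedLinks) return false;
  return true;
}
```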

Yes, that’s a lot of edits. A majority of those come from new adds, where a lot of recording links get added and, because of the artists I’m focusing on, often brand-new artists with no existing streaming links get created as well.

I do in fact look for things like release names, in particular trying to find matches between singles and the EP or Album they later appeared on. I steer clear of things that appear to be parts of a mix or compilation UNLESS that compilation is something like a label showcase where they’ve compiled singles from their artists.

That sounds good. Apologies if my posts seem discouraging, but this set off some alarm bells regarding mass imports and quantity over quality for me. MusicBrainz has had bad experiences with that sort of thing before.


Totally fair, that’s partly why I made the post. I’ve still only explained a small portion of how things work, but I needed to make myself visible so it’s clearer what’s actually happening. I have already panicked too many editors who are worried I’m going to destroy the entire database, but I think if you look at my edit history, you’ll be pleasantly surprised.

I’ll also add that I think every time someone has raised a concern or noticed an issue, I’ve fixed the immediate issue, tracked down the related errors, and then implemented fixes to stop it from happening. Sadly, a lot of the errors are still due to the human in the middle not catching things (e.g. a release date that aligns with the physical release but is not plausible for the digital release, as it’s earlier than the copyright date).

edit: I should also add that I recently changed my approach to some artist credit problems. Now, if I cannot confidently identify any artist and am not willing to check further, I don’t guess at all (previously, I might guess based on something like an “electronic artist” disambiguation, but that’s no longer sufficient). I record that outcome and bail on the edit, saving it for review later. This is most common on large compilation albums where the impact of wrong credits can be very high.

Hello ligeia! I’m happy to see you reaching out to the community about this.
You edit a lot of artists and labels I’m either subscribed to, or have some interest in, so I end up reviewing and voting on a lot of your edits, but as I mentioned a few months ago, I just can’t keep up with everything you do!
I suspected there was some form of automation going on, but for the most part it looked like pretty sensible edits that I would do as well, so I never really brought it up.

MB has had some bad experiences with automated editing in the past, so a lot of editors are naturally wary of fully automatic edits (see Editor “Likedis Auto” - MusicBrainz and Tag “likedis auto” - MusicBrainz)
I think as long as you yourself review the edits before they go through, it should be fine.

  1. flooding review/approval queues - if I introduce errors, it may be much harder for other editors to catch them unless they subscribe to a fairly narrow range of entities, and even then, it is hard to review 100s of changes at a time

Sadly there aren’t a lot of editors who actively vote on edits. I try to go through some people’s open edits at least once a week, but I’m not always in the mood to stare at scrolling text on my screen for a few hours at a time :laughing:
That’s just generally how MB editors are, the vast majority are not interested in voting and reviewing - you can see that fact reflected in the editor stats page.
To anyone else reading this: you should go vote on some edits!

if I cannot confidently identify any artist and am not willing to check further, I don’t guess at all; I record that outcome and bail on the edit, saving it for review later. This is most common on large compilation albums where the impact of wrong credits can be very high.

I think that’s a situation we’re all too familiar with… especially annoying when you’re tracking down an artist with the most common name imaginable that has only ever released 2 songs and has no social media at all :laughing:
I think in those situations it’s fair to just create a new artist with a descriptive disambiguation so someone else down the line can track them down a bit easier

I have a handful of apps that I use which depend on MusicBrainz data to work properly. Over time, pressure has grown on these apps to support using Discogs instead of MusicBrainz because of perceived better quality of data,

I don’t know what these apps could be, but if they have a large community, I’d imagine there are others interested in helping out with entering data to MB.
Since you seem to edit primarily electronic music, I’d be happy to lend a hand either importing or cleaning up releases, if you already have a list of things to do. There may be some people in the MB Discord server who are interested as well.


Is it possible to “human supervise” this amount and this duration? :wink:


If it helps folks understand, I can maybe do a video to show more of how it works right now. A note on the number of edits and duration of the run: as I mentioned, a lot of those are “easy” edits. Consider an example: I add a new LP with 10 songs, and there are 30 distinct artists associated with that release (for each song, 1 main, 1 feat., 1 remix). If we stitch together enough streaming platform information, that flow can easily generate:

  • 50 streaming links
  • 150 artist links
  • ISRCs and Cover Art

I do some basic verification of those, but none of those edits need to be “babysat” as (sorry) I am never going to verify every individual streaming link unless I suspect there’s a bigger problem. I can also generate “edit relationships” for remix artist credits very quickly because I have an improved version of the “semi-automated remix credits” script that many people use.

The links associated with those particular actions go into a link opener and queue manager. That means, for example, I can look at what’s coming up in the queue for a simple sanity check, walk away while a huge run of recording links are being added, come back and make sure nothing broke, and resume the human-necessary stuff for the next release, often while there are things still processing in the queue. I don’t need to physically sit there for the whole 18 hours, for example, clicking every single step, but I don’t just let everything run on its own, either.
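To give a feel for what “queue manager” means here, it’s roughly this shape. The names are illustrative; the real version opens links in worker tabs and tracks results rather than just logging.

```typescript
// Hypothetical sketch of the link opener / queue manager: edits are queued,
// the next few can be previewed for a sanity check, and processing can be
// paused and resumed without losing the queue.

type QueuedEdit = { url: string; description: string };

class EditQueue {
  private items: QueuedEdit[] = [];
  private paused = false;

  enqueue(edit: QueuedEdit) { this.items.push(edit); }
  peek(n = 5): QueuedEdit[] { return this.items.slice(0, n); } // sanity-check what's coming up
  pause() { this.paused = true; }
  resume() { this.paused = false; void this.drain(); }

  async drain() {
    while (!this.paused && this.items.length > 0) {
      const edit = this.items.shift()!;
      // In the real flow this opens the link in a worker tab / submits the edit.
      await submitEdit(edit);
      await new Promise((r) => setTimeout(r, 1000)); // simple pacing between edits
    }
  }
}

async function submitEdit(edit: QueuedEdit) {
  console.log(`processing: ${edit.description} -> ${edit.url}`);
}
```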

To afrocat’s point about others willing to help: yes, they are, but many complain the barrier to getting started on MusicBrainz is too high, and some have said they are not willing to learn something that, in their opinion, will end up getting replaced anyway. While all of the activity is really scary to the editors actually having to do the work and volunteer their time, I think it’s exciting to people who just want to use the site, and I had hoped to encourage more people to jump in.

My hope is to build tools so that there are versions for experienced editors who want to do a high volume of edits, experienced editors who want to identify and fix particularly thorny or “low ROI” problems, and then a set of tools for less experienced editors that includes more guardrails and/or warnings to reduce missing or poorly filled data. A lot of people find MusicBrainz kind of esoteric, and I think I can help them with that, even just by walking users through fields or by highlighting more clearly what seeded data (or session data, in my “monitor everything” version) can be entered in what order. That set of tools also includes things like improved field grabbing, faster and broader tracklist fixing, warnings about artist collision based on expected release dates and genres, and so on.

edit: also because afrocat has seen so many of the edits, I think they can attest to the fact that there were some errors in the beginning, mostly because I didn’t really understand some of how MusicBrainz works, but we’ve progressed to a new, narrower class of much more minor mistakes.


Not to sidetrack the conversation too much, but I’m surprised to hear that people are looking to discogs for the types of releases you describe, as digital/streaming seems to me one of its weakest areas. Is it a perception that discogs is easier to contribute to?


No, it’s mostly a bad inference from OTHER kinds of data. They see lots of Discogs coverage for certain popular artists and then infer that there must be good coverage in other areas, too. Sometimes people fail to distinguish between physical and digital releases and infer that Discogs’ good physical coverage means that it’s more reliable in general.

I’ll give an example that I first started this project with: the label bitbird. Virtually every release from bitbird is on Discogs and only maybe half of them were on MusicBrainz (I think it’s inverted now; MB is far more complete). At one point, bitbird was very “SoundCloud first”, but that changed over time.

For many of the artists I’m working on right now, neither site has very good coverage, in my opinion, but that’s because I’m chasing the low-hanging-fruit at this stage in the project.

edit: I should also add that the UI plays a big role; Discogs “looks” easier to many people. Personally, I don’t care that much about aesthetics, and I find MusicBrainz data and DOM structures really easy to work with, so that’s partly why I prefer MB.


Music eShops often report a wrong release date.

Shouldn’t your user be marked as a bot?

Or make your bot edits with a new bot user and your manual edits with your current user?


I’d be happy to do that. It does add some complication for me, because (at least right now) there are lots of edits I might make during a run that are far less autonomous, and then I’d have to switch back and forth between accounts or keep separate profiles open, which makes it harder to watch the flow of activity.

I also think that question is roughly the same as what we’ve already been struggling to define. To me, a bot runs with no human intervention and that’s not what’s happening here (and if that is accomplished, I would absolutely create a separate account that does all that work, and I’d go through the process of making a formal relationship with the MB team to establish better server interactions).

The definition is “programs which automatically enter edits on behalf of a human.” Does that mean it’s a bot if it clicks the “Enter edit” button on any page? If so, that implies a lot of userscripts out there are bots.
Does that mean it’s a bot if it does all the data “lifting”?
Does that mean it’s a bot if it advances through every screen but always requires the user to click “Enter edit” themselves?

I’m not trying to be difficult and I am happy to accommodate what works, I just don’t fully understand. I am genuinely trying to be helpful and I’m sorry if it’s not panning out that way.


Hello, ligeia! Thank you for contributing to MusicBrainz! I appreciate the care you put into it, as demonstrated by your careful explanation of your choices, and by the fact that you came to get feedback at all.

Have you considered setting up a replica MusicBrainz server? If you have looked into it, great! If you haven’t, I can tell you that I set up my own a few weeks ago and it went pretty smoothly (if you are good at system administrator type tasks). The benefit is that you have a MusicBrainz server which has data that is at most an hour out of date, and which won’t object if you make API calls to read data at a high rate. It doesn’t help with writing data to MusicBrainz, however. For that, you need to talk to the mother at her moderate rate.

I hope this is helpful,
—Jim DeLaHunt

Thanks for the kind response, Jim.

I do run a mirror (using the Docker image provided) and that saves me a lot of queries. However, that 1 hr window is difficult for certain kinds of flows. For example, if I edit a release to add a URL or barcode where there wasn’t one before, my Harmony server won’t see those changes unless it checks the production MusicBrainz server. During barcode audits, this happens to me a lot.

Two solutions for this:

  1. I use my “MusicBrainz gateway” to make routing choices (sketched below). If I’m just adding a release, that 1 hr window is easy to deal with: the mirror returns a 404 for that MBID and the gateway routes to production. If I’m editing a release, the gateway has to check production for a recent edit, and then decide if sending the request to production is more appropriate. The Harmony-to-MB-production calls have to be spaced out to avoid rate limits or Harmony will start failing to return data. This can be solved by programmatically reloading on failed states, but that’s just being lazy and making the problem worse.
  2. Drop any Harmony reliance or short-circuit Harmony Actions for certain kinds of checks, e.g. ISRC and cover art when I know I have those. However, Harmony has been up and running for a longer period of time, has many people watching it, and my plan is to help with that project instead of adding another layer that users would have to implement (because hopefully one day, all of this stuff is public). I built my own Harmony API and one thing I have on my scratchpad is to look at having the API return individual pieces of information, e.g. only streaming links, which would allow me to manage magicISRC and my cover art flows without visiting a Harmony page at all.
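For the curious, the gateway’s routing decision from point 1 boils down to something like this. It’s heavily simplified: the real gateway also spaces out production calls to respect rate limits, and the “recently edited” tracking is more involved than a flag. Hosts and names are illustrative.

```typescript
// Rough sketch of the routing decision between the local mirror and production.

const MIRROR = "http://localhost:5000";
const PRODUCTION = "https://musicbrainz.org";

async function routeReleaseLookup(mbid: string, recentlyEditedByMe: boolean): Promise<string> {
  const mirrorUrl = `${MIRROR}/ws/2/release/${mbid}?fmt=json`;
  const prodUrl = `${PRODUCTION}/ws/2/release/${mbid}?fmt=json`;

  // Edits I just made are only visible on production until the next replication run.
  if (recentlyEditedByMe) return prodUrl;

  // Newly added releases 404 on the (up to ~1 hour stale) mirror, so fall back to production.
  const probe = await fetch(mirrorUrl);
  return probe.status === 404 ? prodUrl : mirrorUrl;
}
```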

Could caching help here too?

One option could be to save your modifications to a local cache at the same time they’re written to MB. That plus a typical read-through cache could make this rather transparent to the rest of the workflow.
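Something along these lines, maybe (all names made up, just to illustrate the write-through plus read-through idea):

```typescript
// Minimal sketch: when an edit is submitted to MusicBrainz, record the new
// values locally at the same time, and let reads fall through to the mirror
// only on a cache miss.

type ReleaseFacts = { barcode?: string; urls?: string[] };

const localCache = new Map<string, ReleaseFacts>();

function submitBarcodeEdit(mbid: string, barcode: string) {
  // 1) submit the edit to MusicBrainz (omitted here)
  // 2) write-through: remember what we just changed so later steps in the
  //    same run don't have to wait for replication or hit production.
  const cached = localCache.get(mbid) ?? {};
  localCache.set(mbid, { ...cached, barcode });
}

async function getReleaseFacts(mbid: string): Promise<ReleaseFacts> {
  const cached = localCache.get(mbid);
  if (cached) return cached; // read-through: cache hit
  const res = await fetch(`http://localhost:5000/ws/2/release/${mbid}?fmt=json`);
  const release = await res.json();
  const facts: ReleaseFacts = { barcode: release.barcode ?? undefined };
  localCache.set(mbid, facts);
  return facts;
}
```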

ISRCs are bad, agreed! But this feels like a reasonable inference to make in a workflow. Basically: given a strong release identifier and a series of recording titles, query a DSP for their associated ISRCs. Hard to think of a much better way to do this, other than having direct access to DDEX ERNs and pulling them from there.

The really bad things you can do with ISRCs are:

  1. Assume a recording only has a single ISRC, and try to use it as a natural key for recording identity, esp. across systems.
  2. Try to derive an ISRC from (artist, recording title) alone, esp. if the goal is to use it as a recording identifier.

Treat ISRCs like strong identifiers and they’re gonna make you angry. Librarians have a useful concept called “access points”; I think it’s a much more realistic way to frame ISRCs.
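A toy example of pitfall 1, with made-up data:

```typescript
// Keying recordings by ISRC silently merges distinct recordings when an ISRC
// has been reused, and misses duplicates when the same recording carries more
// than one ISRC.

type Recording = { mbid: string; title: string; isrcs: string[] };

const recordings: Recording[] = [
  { mbid: "aaaa", title: "Track (original mix)", isrcs: ["USX9P1900001"] },
  { mbid: "bbbb", title: "Track (radio edit)",   isrcs: ["USX9P1900001"] }, // shared ISRC
  { mbid: "cccc", title: "Track (original mix)", isrcs: ["DEF051900123"] }, // second ISRC for the same recording
];

// Treating the ISRC as a natural key:
const byIsrc = new Map<string, Recording>();
for (const rec of recordings) {
  for (const isrc of rec.isrcs) byIsrc.set(isrc, rec); // "bbbb" overwrites "aaaa"
}
console.log(byIsrc.size); // 2 — one recording lost, one duplicate pair undetected
```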


Thanks for the suggestion. That is a possible implementation. I mentioned earlier a “monitoring mode”, and this is an approach where we basically build a cache of information related to a release as we go through the flow. By the time you arrive at later steps (e.g. Harmony’s Actions page), we already know how things used to look and how they look now. This approach also makes it easier for me to catch and resolve unexpected issues (for example, it takes some extra CORS handling to make Harmony update blank track lengths, so I’ve just left that alone for now).

On ISRCs, I currently use 1) matching ISRC, 2) track length within 2s, 3) no acoustid conflicts, 4) plausible release name. I toyed with a version that calls the acoustid API and does additional fingerprint checks (including using a file that might have been scraped from one of our sources), but that’s a ways down the road still.
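For transparency, the merge-candidate filter is roughly this shape. The release-name plausibility check shown here is a placeholder; the real logic is fuzzier, and all names are illustrative.

```typescript
// Sketch of the four merge-candidate checks above. Two recordings are only
// proposed as a merge candidate when all of them pass.

interface RecordingFacts {
  isrcs: string[];
  lengthMs: number;
  acoustIds: string[];
  releaseTitles: string[];
}

function isMergeCandidate(a: RecordingFacts, b: RecordingFacts): boolean {
  const sharedIsrc = a.isrcs.some((isrc) => b.isrcs.includes(isrc));
  const lengthClose = Math.abs(a.lengthMs - b.lengthMs) <= 2000; // within 2 seconds
  // "No conflicts" = if both sides have acoustids, they must overlap.
  const acoustidOk =
    a.acoustIds.length === 0 ||
    b.acoustIds.length === 0 ||
    a.acoustIds.some((id) => b.acoustIds.includes(id));
  const releasePlausible = plausibleReleasePair(a.releaseTitles, b.releaseTitles);
  return sharedIsrc && lengthClose && acoustidOk && releasePlausible;
}

// Placeholder: skip anything that looks like a DJ-mixed release.
function plausibleReleasePair(titlesA: string[], titlesB: string[]): boolean {
  const looksMixed = (t: string) => /\b(dj[- ]?mix|mixed|continuous mix)\b/i.test(t);
  return !titlesA.some(looksMixed) && !titlesB.some(looksMixed);
}
```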

Really enjoying the input everyone, thank you.