Shortened MBIDs ("M-88ad") can be useful in discussion

I have found shortened MBIDs to be useful in discussions about MusicBrainz edits, particularly when discussing two different Artists with identical names.

For instance, I once found two Artists named Paul Moore. I wrote an Annotation saying that I thought they were the same person. I referred to them using the first few digits of their MBIDs, as well as HTML hyperlinks to the full Artist URL:

Paul Moore artist/c030e4 and Paul Moore artist/88adb3 may be the same person.…. These two Artist entries should possibly be merged.

Why not write out the full MBID in the text? Because it is big, unwieldy… and not necessary.

I suggest we adopt a convention of abbreviated MBIDs in discussions where it makes writing easier. e.g.

“Paul Moore M-c030 and Paul Moore M-88ad may be the same person.….”

Rationale and statistics follow.

2 Likes

The universally unique identifier system underlying the MBID makes a probabilistic promise: create the MBID correctly, and the likelihood that it will be the same as another MBID, past or future, is vanishingly small. Not quite zero, but really, really unlikely.

The same also applies to abbreviated MBIDs. The shorter the abbreviation, the greater the risk of collision, but that risk is low even for short abbreviations.

How low? I was curious, so I counted. I exported the MBIDs of all 2 million+ artists, wrote a small program to compare each artist’s MBID to every other artist’s MBID, and counted how many digits of the MBID pair were enough to distinguish the two. I expressed this as a percentage:

  • 93.75% of pairs are distinguished by just 1 digit of MBIDs
  • 99.609% of pairs are distinguished by just 2 digits
  • 99.9756% of pairs are distinguished by just 3 digits
  • 99.99848% of pairs are distinguished by just 4 digits
  • On average, 1.067 digits of MBID are enough to distinguish any two Artists.

Out of 3.8 trillion pairs of 2.8 million Artist MBIDs, 57.9 million pairs require 5 or more digits, which is less than two thousandths of one percent. The most similar MBIDs are two pairs which each require 11 digits to disambiguate.
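
A minimal sketch of this kind of pairwise count, in Python. It is an illustration rather than the original program: it assumes hyphens are ignored when counting digits, the names `digits_to_distinguish` and `pair_histogram` are hypothetical, and the sample MBIDs are made up. A plain loop like this is quadratic in the number of MBIDs, so it only suits small samples.

```python
from collections import Counter
from itertools import combinations


def digits_to_distinguish(a: str, b: str) -> int:
    """Return how many leading hex digits are needed before two MBIDs differ."""
    a = a.replace("-", "").lower()
    b = b.replace("-", "").lower()
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    raise ValueError("identical MBIDs cannot be distinguished")


def pair_histogram(mbids):
    """Count pairs by the number of digits each pair needs."""
    return Counter(
        digits_to_distinguish(a, b) for a, b in combinations(mbids, 2)
    )


# Made-up sample values, not real MusicBrainz entities.
sample = [
    "c030e4a1-0000-4000-8000-000000000001",
    "88adb3f2-0000-4000-8000-000000000002",
    "c0a1b2c3-0000-4000-8000-000000000003",
]
hist = pair_histogram(sample)
for digits, pairs in sorted(hist.items()):
    print(f"{pairs} pair(s) need {digits} digit(s)")
```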

So, I could have written this about Paul Moore: “Paul Moore artist/c and Paul Moore artist/8 may be the same person.….”

And if I had compared only pairs of MBIDs where the Artists had the same sortname, or had the same first word of sortname, or the same first few letters, even fewer digits would have been necessary. I could modify my program to count this if there is interest.

I was also curious if including more MBIDs would change the numbers. To my surprise, the averages changed hardly at all, though the rare extremes got longer. I compared pairs of MBIDs from 2,754,845 Artists, 5,158,950 Releases, and 37,192,373 Recordings. The percentages were identical for the first 4 digits:

  • Again, 93.75% of pairs are distinguished by just 1 digit of MBIDs
  • 99.609% of pairs are distinguished by just 2 digits
  • 99.9756% of pairs are distinguished by just 3 digits
  • 99.99848% of pairs are distinguished by just 4 digits
  • Again, on average, 1.067 digits of MBID are enough to distinguish any two Artists or Releases or Recordings.
  • Out of 1.0 quadrillion pairs, one pair required 21 digits to distinguish; 6 pairs required 14 digits; and 49 pairs required 12 digits.
  • Comparing just Artists, 0.0000003570000% of pairs required 8 digits to distinguish, but comparing Artists, Releases, and Recordings, 0.000000349% of pairs required 8 digits. (This tiny difference reassures me that my program did not just spit out the same answer both times.)
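
One possible explanation for why the percentages barely move between data sets: if the leading digits behave like independent, uniformly random hex digits (an assumption, though it is what one would expect of randomly generated UUIDs), the figures depend only on the number of digits and not on how many MBIDs are compared. The short Python sketch below computes the values that model predicts; the function names are hypothetical.

```python
def pct_distinguished_within(k: int) -> float:
    """Percent of random pairs whose first k hex digits are not all equal."""
    return (1 - 16 ** -k) * 100


def pct_needing_exactly(k: int) -> float:
    """Percent of random pairs that match on k-1 digits and differ at digit k."""
    return 15 / 16 ** k * 100


# Expected digits needed: sum of k * P(exactly k digits); converges to 16/15.
expected_digits = sum(k * 15 / 16 ** k for k in range(1, 33))

for k in range(1, 5):
    print(f"{k} digit(s): {pct_distinguished_within(k):.5f}%")
print(f"average digits: {expected_digits:.3f}")
print(f"exactly 8 digits: {pct_needing_exactly(8):.9f}%")
```

Under that assumption the model gives 15/16 = 93.75% for one digit and an average of 16/15 digits, which lines up with the measured 93.75% and 1.067.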
1 Like

This seems clearer to me. Otherwise, on the rare occasions when a shorthand like M-c030 is used, it will cut people out of the conversation and take longer to explain each time.

Personally I’d describe the two people using their disambig. It is more naturally human understandable. If the disambig is not clear enough to tell them apart, then a better disambig is needed.

I suggest that we adopt some convention like this where abbreviated MBIDs make conversation easier. Seeing the percentages, I think a more concise convention would do fine:

  1. a prefix, say “M-” for any entity, “A-” to specify Artist, “Rel” for Release, “Rec” for Recording, etc
  2. four digits of MBID. Add more digits, if you happen upon the thousandth of a percent which requires more.
  3. Link one usage of the abbreviation to the full URL of the entity, so that readers can see the full MBID and the MusicBrainz entry, if they want

e.g. “Paul Moore M-c030 and Paul Moore M-88ad may be the same person.….”
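
The convention above could be sketched in Python roughly as follows. This is only an illustration: the function name `abbreviate`, its parameters, and the sample MBIDs are all made up, and it assumes you already have a list of the MBIDs the abbreviation has to be unique among.

```python
def abbreviate(mbid, known_mbids, prefix="M-", min_digits=4):
    """Return prefix + the fewest leading digits (at least min_digits)
    that no other MBID in known_mbids shares."""
    digits = mbid.replace("-", "").lower()
    others = [m.replace("-", "").lower() for m in known_mbids]
    others = [o for o in others if o != digits]
    for n in range(min_digits, len(digits) + 1):
        if not any(o.startswith(digits[:n]) for o in others):
            return prefix + digits[:n]
    return prefix + digits


# Made-up sample values, not real MusicBrainz entities.
known = [
    "c030e4a1-0000-4000-8000-000000000001",
    "c030ffff-0000-4000-8000-000000000002",
    "88adb3f2-0000-4000-8000-000000000003",
]
print(abbreviate(known[2], known))  # four digits are enough here
print(abbreviate(known[0], known))  # needs a fifth digit to avoid a clash
```

Four digits is the default, and the loop only adds more when another known MBID shares the same leading digits.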

1 Like

In the software development world, where groups of edits can be referred to by a different almost-always-unique string of digits called a “git commit digest”, this sort of usage is common. Instead of saying “commit 8760f6535c0b65cb9e4f4c6aa98d80ca6d8d464e”, it is common to say “commit 8760f65”. Even that abbreviation is usually unique in the context of a single project.

You seem to be overthinking a problem that is not really there. Remember this is a music database for music people. So things need to work for music people first, or those people doing the odd read of the database.

The suggestions you are making in this thread seem overly technical and would rely on an arcane knowledge of the database and database shorthand. That would make the database less understandable to someone walking past trying to look something up.

Quoting a Git commit code is a good example of using language that very few people understand. It makes zero sense to me, as I don’t spend any time in a git forum, and I would not have a clue where to go to understand what “commit 8760f6535c0b65cb9e4f4c6aa98d80ca6d8d464e” could possibly mean. Don’t make a shorthand that only a small number of people understand. You need language that is more inclusive to outsiders, not exclusive.

4 Likes

Like Ivan said, it’s always better to explain differences in annotations by understandable text than by a very obscure database ID, that can even change with time (with wrong direction merges).

I never felt the need to show the MBID in disambiguations; IMO it should even be avoided.

3 Likes

Personally I really like the idea of concise MBIDs (GitHub does it for commit hashes too), although I can only find one use for it: quick linking.

As with IRC and Jira tickets, it would be nice if typing “M-cee9d” were automatically converted to “M-cee9d [Technomancy by Caster]” when an MBID starting with those digits exists in the page context.

It would help a bit with disambiguation by providing the disambiguation of entities:
“M-2a481 [Moon (instrumental with violin) by Somebody (Artist from the UK)]”

It’s not the MBID itself being useful, but the fact that you can easily make a readable reference to an entity.
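
A rough sketch of how that expansion could work, in Python. Everything here is hypothetical: the `M-` token pattern, the `expand_references` name, and the idea that the page context is available as a mapping from full MBID to a display label. The MBIDs in the example are made up.

```python
import re

# "M-" followed by 4 to 32 lowercase hex digits.
ABBREV = re.compile(r"\bM-([0-9a-f]{4,32})\b")


def expand_references(text, page_entities):
    """Append '[label]' to each M-xxxx token that matches exactly one
    entity known in the page context (full MBID -> display label)."""

    def repl(match):
        digits = match.group(1)
        hits = [
            label
            for mbid, label in page_entities.items()
            if mbid.replace("-", "").startswith(digits)
        ]
        if len(hits) == 1:
            return f"{match.group(0)} [{hits[0]}]"
        return match.group(0)  # unknown or ambiguous: leave untouched

    return ABBREV.sub(repl, text)


# Made-up MBIDs standing in for entities found on the current page.
entities = {
    "cee9d0aa-0000-4000-8000-000000000001": "Technomancy by Caster",
    "2a481bbb-0000-4000-8000-000000000002": "Moon (instrumental with violin)",
}
print(expand_references("Compare M-cee9d with M-ffff0.", entities))
```

Tokens that match nothing, or more than one entity, are left as typed, so a mistyped abbreviation never gets a wrong label.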

3 Likes

When I’m browsing MB, the quickest way I can think of is right-clicking any entity link and pressing the C key to instantly copy its URL.

Then it would be nice if this URL were automatically converted, as you say, but with just a Ctrl+V (paste), without having to type M- followed by some weird characters.

At the moment, we have to rely on third party stuff, like the cool Annotation Converter bookmarklet (1 or 2 more clicks).

Actually? NVM. Fixed it myself

Turns [image] into [image].

Missing the disambiguation but no more time today.

5 Likes

Wow, it could be so great!

I get many undefined links when there are many raw entity links.

I guess it’s because of rate limiting. Some suggestions:

  • don’t break the original link on error
  • cache results to avoid making the same request multiple times
  • rate limit requests
  • show entity comments
  • handle all entity types, like works: see Rae and Gaga
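
The first three points could look something like this, sketched in Python rather than in the script’s own language. The `EntityLabeler` class and the `fetch` callable are hypothetical, and the one-second default interval is an assumption based on MusicBrainz asking clients to stay around one request per second.

```python
import time


class EntityLabeler:
    """Look up labels for entity URLs with a cache, a minimum delay
    between requests, and a fallback to the original link text."""

    def __init__(self, fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch  # hypothetical callable: url -> label, may raise
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.cache = {}
        self.last_request = None

    def label(self, url, original_text):
        if url in self.cache:  # cached: no new request
            return self.cache[url]
        if self.last_request is not None:
            wait = self.min_interval - (self.clock() - self.last_request)
            if wait > 0:  # rate limit: space requests out
                self.sleep(wait)
        self.last_request = self.clock()
        try:
            result = self.fetch(url)
        except Exception:
            return original_text  # on error, keep the original link
        self.cache[url] = result
        return result
```

Failed lookups are deliberately not cached, so a link that failed because of rate limiting can be retried later.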

You have an odd definition of “easily readable”. All I see is replacing a long hard to read string of characters by a shorter hard to read string of characters, with the added drawback that you can’t see with the naked eye that the short version belongs to the long version.

2 Likes

Before the thread wanders too far, let’s remember that the proposal is that a shortened MBID can (not always will) be useful in conjunction with entity names, if and when names aren’t enough: for example, when the entities have the same name and no disambiguation strings, and an additional, precise reference is helpful.

The full MBID has always been available to use as a reference, but the suggestion is that a shortened MBID might be more easily readable than the full MBID. A convention like “M-” or “artist-” could in time communicate that the digits are a shortened MBID. A convention to link to the full MB URL when first using the abbreviation is a way to provide the full MBID without disrupting the flow of text so much. The counting exercise confirms that just four digits is precise enough over 99.99% of the time.

IMO the best option on desktop and mobile is a plain full URL.
I don’t want to manually type a code (like the current annotation syntax), I just want to paste.

Then this URL should/can be embellished, ideally just once by MBS, at save time.

1 Like

@RustyNova or, easier than caching results: instead of looping over all links, build a list of the (MB entity) URLs without duplicates and loop over those.
And then once a URL info is fetched, convert all links using this URL at once.
So this request is done only once for the 50 edit page that may repeat the same links.
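
In Python terms (just to show the idea, the real script would do this on the page’s link elements), that could look like the sketch below. The `relabel_links` name, the dict-based link representation, and the `fetch_label` callable are all hypothetical.

```python
from collections import defaultdict


def relabel_links(links, fetch_label):
    """links: list of dicts with 'href' and 'text' keys.
    fetch_label: hypothetical callable, url -> label."""
    by_url = defaultdict(list)
    for link in links:
        by_url[link["href"]].append(link)  # group links by their URL
    for url, same_url_links in by_url.items():
        label = fetch_label(url)  # one request per distinct URL
        for link in same_url_links:
            link["text"] = label  # update every link using this URL
    return links
```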

Update: Many entity types are missing. I’m switching back to COOL ENTITY LINKS. :wink:

1 Like