The universally unique identifier system underlying the MBID makes a probabilistic promise: create the MBID correctly, and the likelihood that it will be the same as another MBID, past or future, is vanishingly small. Not quite zero, but really, really unlikely.
The same also applies to abbreviated MBIDs. The shorter the abbreviation, the greater the risk of collision, but that risk is low even for short abbreviations.
How low? I was curious, so I counted. I exported the MBIDs of all 2 million+ artists, wrote a small program to compare each artist’s MBID to every other artist’s MBID, and counted how many digits of the MBID pair were enough to distinguish the two. I expressed this as a percentage:
- 93.75% of pairs are distinguished by just 1 digit of MBIDs
- 99.609% of pairs are distinguished by just 2 digits
- 99.9756% of pairs are distinguished by just 3 digits
- 99.99848% of pairs are distinguished by just 4 digits
- On average, 1.067 digits of MBID are enough to distinguish any two Artists.
Out of 3.8 trillion pairs of 2.8 million Artist MBIDs, 57.9 million pairs require 5 or more digits, which is less than two thousandths of one percent. The most similar MBIDs are two pairs which each require 11 digits to disambiguate.
So, I could have written this about Paul Moore: “Paul Moore artist/c and Paul Moore artist/8 may be the same person…”
And if I had compared only pairs of MBIDs where the Artists had the same sortname, or the same first word of sortname, or the same first few letters, even fewer digits would have been necessary. I could modify my program to count this if there is interest.
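For anyone who wants to try this kind of count themselves, here is a minimal sketch. It is not the program that produced the numbers above, and the names `digits_to_distinguish` and `pair_counts` are made up for illustration. Instead of comparing every pair directly, it counts how many MBIDs share each k-digit prefix and derives the pair counts from that, which should give the same totals as a pair-by-pair comparison while staying practical for millions of MBIDs. It assumes the MBIDs arrive as a plain list of strings.

```python
from collections import Counter
from math import comb
import uuid


def digits_to_distinguish(a: str, b: str) -> int:
    """Leading hex digits needed to tell two MBIDs apart (hyphens ignored)."""
    a = a.replace("-", "").lower()
    b = b.replace("-", "").lower()
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    raise ValueError("MBIDs are identical")


def pair_counts(mbids, max_digits=32):
    """Return {k: number of pairs that need exactly k digits}.

    Hypothetical helper: counts pairs via shared prefixes rather than
    comparing each MBID to every other MBID.
    """
    hexes = [m.replace("-", "").lower() for m in mbids]
    # shared[k] = number of pairs that agree on their first k digits
    shared = [comb(len(hexes), 2)]
    for k in range(1, max_digits + 1):
        prefixes = Counter(h[:k] for h in hexes)
        shared.append(sum(comb(n, 2) for n in prefixes.values()))
        if shared[-1] == 0:
            break  # every pair is already distinguished
    # A pair needs exactly k digits if it agrees on k-1 digits but not on k.
    counts = {}
    for k in range(1, len(shared)):
        exactly_k = shared[k - 1] - shared[k]
        if exactly_k:
            counts[k] = exactly_k
    return counts


if __name__ == "__main__":
    # Random UUIDs stand in for a real MBID export here.
    sample = [str(uuid.uuid4()) for _ in range(5000)]
    counts = pair_counts(sample)
    total = sum(counts.values())
    average = sum(k * n for k, n in counts.items()) / total
    for k in sorted(counts):
        print(f"{k} digits: {counts[k] / total:.6%} of pairs")
    print(f"average digits needed: {average:.3f}")
```

Restricting the comparison to Artists with the same sortname would amount to grouping the MBIDs by sortname first and calling `pair_counts` on each group.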
I was also curious if including more MBIDs would change the numbers. To my surprise, the averages changed hardly at all, though the rare extremes got longer. I compared pairs of MBIDs from 2,754,845 Artists, 5,158,950 Releases, and 37,192,373 Recordings. The percentages were identical for the first 4 digits:
- Again, 93.75% of pairs are distinguished by just 1 digit of MBIDs
- 99.609% of pairs are distinguished by just 2 digits
- 99.9756% of pairs are distinguished by just 3 digits
- 99.99848% of pairs are distinguished by just 4 digits
- Again, on average, 1.067 digits of MBID are enough to distinguish any two Artists or Releases or Recordings.
- Out of 1.0 quadrillion pairs, one pair required 21 digits to distinguish; 6 pairs required 14 digits; and 49 pairs required 12 digits.
- Comparing just Artists, 0.000000357% of pairs required 8 digits to distinguish, but comparing Artists, Releases, and Recordings, 0.000000349% of pairs required 8 digits. (This tiny difference reassures me that my program did not just spit out the same answer both times.)
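One possible explanation for why the percentages above barely move when more MBIDs are added: if the leading hex digits behave like independent, uniformly random values (an assumption, and one that can only hold for the random part of the identifier), then the chance that two MBIDs differ within their first k digits is 1 - 16^-k, regardless of how many MBIDs are in the pool. The sketch below works out those expected values; the function names are made up for illustration.

```python
from fractions import Fraction


def expected_cumulative(k: int) -> Fraction:
    """Chance that two random hex strings differ within their first k digits."""
    return 1 - Fraction(1, 16) ** k


def expected_exact(k: int) -> Fraction:
    """Chance that two random hex strings first differ at digit k."""
    return Fraction(15, 16) * Fraction(1, 16) ** (k - 1)


# Mean digits needed: sum over k of k * expected_exact(k), a geometric
# distribution with success probability 15/16, so the mean is 16/15.
expected_mean = Fraction(16, 15)

if __name__ == "__main__":
    for k in range(1, 5):
        print(f"{k} digits: {float(expected_cumulative(k)):.6%}")
    print(f"exactly 8 digits: {float(expected_exact(8)):.12%}")
    print(f"mean digits: {float(expected_mean):.3f}")
```

Under that assumption the expected figures come out to roughly 93.75%, 99.609%, 99.9756%, and 99.9985% for 1 to 4 digits, about 0.000000349% for exactly 8 digits, and a mean of about 1.067 digits, which lines up with the measured numbers. Only the rare extremes depend on the size of the pool, since more pairs give more chances for a long shared prefix.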