I was exploring the MusicBrainz data with direct queries of my mirror database, and I noticed something curious: there appear to be over 2,000 Release Groups which have no Releases.
From a quick look at the edit histories, it seems that there are a couple of ways they got added. “Coffee EP” and “I Like You EP” were added years ago, seemingly just as a Release Group with no Release. (I always add a Release and let MB add the corresponding Release Group, so avoid this outcome.) The Max Payne RG was added together with a Release, but the Release was later edited to connect with a second RG, leaving this first RG stranded.
I don’t particularly want to take on the project to clean this up, but perhaps someone else will. FYI.
Here is the SQL query which counts the affected Release Groups, and its result:
select count(*) from release_group as RG left join release as R
on R.release_group = RG.id where R.release_group IS NULL;
count
-------
2090
Here is the query which lists the affected Release Groups, and its result:
select RG.gid, RG.name from release_group as RG left join release as R
on R.release_group = RG.id where R.release_group IS NULL LIMIT 5;
gid | name
--------------------------------------+--------------------------------------
4da05891-5073-4723-b78f-f3b007a72172 | Coffee EP
7717e5b4-47cc-480d-b4c9-e627df34f0f1 | I Like You EP
10b97063-3040-472d-8171-842a5829f26f | Shock
484a4f44-2595-4d86-af01-4e429894fb65 | Midnight -星を数える夜-
40dd2496-d483-4b35-a304-3ee29e91653e | Max Payne (Re-Engineered Soundtrack)
I found zero cases of the complementary situation: Release entries without a corresponding Release Group.
Good analysis and write-up. I want to add that there is not necessarily a problem with empty release groups. Upcoming albums yet to be released are common, and if a known released album exists (e.g. with a link to the Discogs master) I prefer it exists with no releases than be removed.
I don’t have a local copy running at the moment to generate the list, but I’m interested in helping clean it up. The example you noted where a release is moved to a duplicate group rather than the two groups merging should probably be the first focus—obvious duplicates, usually easy to spot by looking at the artist and RG edit history. Afterwards they could be inspected for being ‘upcoming’, ‘known-existing’, or ‘questionable/fake’.
If it’s helpful, I can run more queries. I have the local database on hand. The hard part is writing the query.
It’s easy to extend my query so that it makes a CSV file with all 2,090 Release Groups, including a link in the title column. Someone could open that CSV file in a spreadsheet or browser, click on each link in turn, and investigate.
What might be more targeted and useful, but also harder to craft, might be a query which searches for Release Groups RG1 which had an Add Release edit with a Release R, where R now is associated with a different Release Group RG2. Then the CSV file could have the MBIDs for and links to RG1, RG2, and R. If someone could identify what data is particularly helpful for coming up with a merge edit of RG1 into RG2 (just the hexadecimal MBID values? or links?) the query could include those columns.
I don’t promise to jump quickly to figure out that second query. But if someone figures it out, I’m happy to run it on my data, and pass back the resulting CSV file.
If you share a .csv file, I’m happy to begin looking through them and cleaning them up. The second query could isolate the list to ‘likely-duplicate’, but it’s not something I can do
Thank you for this reference! I guess I don’t need to share my simple list of these Release Groups. If I make the more sophisticated query, showing RGs which remain after their Releases were moved to a new RG, then I will update this thread.
This query appears difficult to run on my replica of the MusicBrainz database. The replication appears not to include the edit* tables, which would be a necessary pivot for this query. And, if I can’t run the query, I can’t be sure that I have written it correctly.
So, I don’t expect rapid progress on this aspect of the thread.