I’ve used Picard to identify the right track IDs for a lot of my music library. Now I want to get to work removing duplicates throughout. However, Picard doesn’t seem to have a duplicate (or triplicate, or quadruplicate…) removal tool.
Among tracks with the MusicBrainz Track Id tag populated, I want to consolidate multiple files down to one to save hard drive space and generally clean up my library. Ideally, I’d keep the largest (theoretically the best?) version of each file.
I found the beets Duplicates plugin (Duplicates Plugin — beets 1.6.0 documentation), but then I would need to import my library into beets, and that takes a lot of finagling. I wish there were some way to import into beets only the tracks that have a MusicBrainz track ID, but instead beets matches on info like album name, title, artist, etc., and still gets confused.
Which operating system are you using? If you just want to identify songs with duplicate track IDs (or recording IDs?), the easiest approach would probably be a short script that calls a tool like mutagen-inspect to print each file’s ID and then builds up a dictionary tracking the best file for each ID.
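For example, here’s a rough, untested sketch of that idea that uses the mutagen library directly instead of shelling out. It assumes MP3 files tagged by Picard, which (as far as I know) stores the recording ID in a UFID frame owned by http://musicbrainz.org — check one of your own files first.

#!/usr/bin/env python3
# Sketch: print each file's MusicBrainz recording ID via mutagen.
# The UFID owner string below is an assumption based on how Picard tags MP3s.
import sys
from mutagen.id3 import ID3, ID3NoHeaderError

for path in sys.argv[1:]:
    try:
        tags = ID3(path)
    except ID3NoHeaderError:
        print("%s: no ID3 tags" % path)
        continue
    for frame in tags.getall('UFID'):
        if frame.owner == 'http://musicbrainz.org':
            print("%s: %s" % (path, frame.data.decode('ascii')))

From there it’s a short step to a dictionary keyed by that ID, keeping only the largest file for each key.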
I’ve not used it yet, but I was about to give it a run on a selection of photos duplicated by cloud syncs. It has a music mode that should help with your task.
(The obvious comment: make a backup first…)
I assume this is a “heap of tracks” type situation and not full albums.
If everything is tagged properly, I think the following script should print the smaller copies of files that share a recording ID (as stored in their UFID frames). You’ll need the python3-mutagen package installed.
#!/usr/bin/env python3

import os
import re
import subprocess
import sys

if len(sys.argv) != 2:
    print("Usage: %s <dir>" % sys.argv[0])
    sys.exit(2)

class FileInfo:
    def __init__(self, path, mbid, size):
        self.path = path
        self.mbid = mbid
        self.size = size

# MBID to FileInfo of the largest file seen so far.
best = {}
# FileInfos corresponding to smaller files.
dupes = []

for root, dirs, files in os.walk(sys.argv[1]):
    for fn in files:
        p = os.path.join(root, fn)
        try:
            out = subprocess.check_output(['mutagen-inspect', p], text=True)
        except subprocess.CalledProcessError as e:
            print("Failed running mutagen-inspect on %s: %s" % (p, e))
            continue

        # Extract the MBID from a line like the following:
        # UFID=http://musicbrainz.org=b'288237d3-b4f1-43a0-8f41-c37f276a0fca'
        match = re.search(
            r"^UFID=http://musicbrainz\.org=b'([-a-f0-9]+)'$",
            out, flags=re.MULTILINE)
        if not match:
            print("Failed getting MBID for %s" % p)
            continue

        mbid = match[1]
        fi = FileInfo(p, mbid, os.stat(p).st_size)
        prev = best.get(mbid, None)
        if not prev:
            best[mbid] = fi
        elif fi.size > prev.size:
            best[mbid] = fi
            dupes.append(prev)
        else:
            dupes.append(fi)

if len(dupes) > 0:
    print("Duplicate files:")
    for fi in dupes:
        print("%s (%s)" % (fi.path, fi.mbid))
It doesn’t delete files, so you should be able to check the results yourself before doing anything.
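If you’re happy with the list after reviewing it, one possible follow-up (again, with a backup in place) would be to replace the final loop with one that actually removes the smaller copies:

# Sketch only: delete the smaller duplicates once you've reviewed the output.
for fi in dupes:
    print("Removing %s (%s)" % (fi.path, fi.mbid))
    os.remove(fi.path)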
If you want to use track IDs instead of recording IDs, you’ll need to change the regular expression. The files that I looked at don’t appear to contain track IDs, so I wasn’t sure offhand where Picard would store them.
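My unverified guess is that Picard writes track IDs as a TXXX frame with a description like MusicBrainz Release Track Id; if mutagen-inspect shows a line like that in your files, the search might become something like:

# Assumed TXXX description; verify against mutagen-inspect output on your own files.
match = re.search(
    r"^TXXX=MusicBrainz Release Track Id=([-a-f0-9]+)$",
    out, flags=re.MULTILINE)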
(Disclaimer: it’s been a long time since I’ve written much Python and I only tested this a tiny bit.)