Extremely large music collection needs advice on what dedupe program to use

bsac335 · November 9, 2022, 4:57pm

I have a very large music collection that needs deduping. Can someone recommend a music dedupe program that can tackle this issue?

Please advise.

derat · November 9, 2022, 6:58pm

There are some earlier discussions here about using Picard to identify duplicates:

I’m not sure if Picard will be able to load your whole collection into memory, though, and I suspect that looking up all the songs may take a long time if they aren’t already tagged with MBIDs. How large is “extremely large”?

If you’re looking for acoustic similarity, I wrote a program named soundalike that uses AcoustID’s chromaprint to scan a music collection. I described it a bit at Soundalike: a program for finding duplicate recordings in a music collection.

I run it periodically on a slow computer against a collection that’s currently just over 23,000 songs. The initial scan took about an hour but incremental scans are pretty fast.

If you give it a try, let me know if it works for you (or doesn’t).
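
For anyone curious what acoustic matching roughly involves, here is a minimal sketch. It is not soundalike's actual code. It assumes you already have two raw chromaprint-style fingerprints as lists of 32-bit integers (for example, parsed from the output of `fpcalc -raw`), and it compares them by counting differing bits. The function names and the 0.15 threshold are made up for illustration, and a real tool would also try different alignments between the two recordings.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits over the overlapping part of two raw fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 1.0  # nothing to compare, so treat as completely different
    # XOR each pair of 32-bit values and count the bits that differ.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b)
    )
    return differing / (32 * n)


def looks_like_duplicate(fp_a, fp_b, threshold=0.15):
    """Hypothetical cutoff: the threshold is an assumption, tune it on your own files."""
    return bit_error_rate(fp_a, fp_b) <= threshold
```

A lower bit error rate means the two recordings sound more alike; identical fingerprints give 0.0.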

bsac335 · November 9, 2022, 9:37pm

As Dr. Evil says… I have “One Million” dupes.

ulugabi · November 10, 2022, 12:00am

dupeGuru
https://dupeguru.voltaicideas.net/
It can match by filename, by tags, or by generating a fingerprint.

bsac335 · November 12, 2022, 12:20am

Thanks, but I could not for the life of me figure out the interface for dupeGuru. So I bought Duplicate Cleaner 5 and I’m happy with it. The UI is much better, plus it has better features.

dpr · November 12, 2022, 2:52pm

I’m not familiar with it. I’ve had some success with dupeGuru. It would be interesting to hear how using it has worked for you and anything you learned.

InvisibleMan78 · November 12, 2022, 3:50pm

Just for the record:
Can you confirm that DC5 is using some kind of acoustic fingerprinting (does it “hear the songs”) to detect and compare the same song in different formats?
Or does it read and compare the available metadata inside the music files, in addition to technical information like bitrate or length?

derat · November 13, 2022, 12:06pm

The docs at Audio mode make it sound like it supports using acoustic similarity:

This mode can match by embedded audio tags (Song name, Artist, etc) or by comparing the music data for similarities.

…

This menu option controls the type of matching used when searching for audio duplicates. There are several preset modes:

Match exact audio data (ignore tags)

Exact audio data is compared, ignoring tags. The audio must match exactly - with identical format, length and compression.

Similar - Compare first 2 minutes

The first two minutes of audio is compared for similarity. Small differences in length, quality and content are acceptable. Tags are ignored.

Similar - Compare full file (slower)

The full audio files are compared for similarity. Small differences in length, quality and content are acceptable. Tags are ignored.

Similar - Quick match (compare first 15 seconds)

The first 15 seconds of audio is compared for similarity. Small differences in length, quality and content are acceptable. Tags are ignored.

Similar audio - Custom

Allows custom settings for similar audio fingerprint matching.

Ignore content - match by tags or attributes

Files are matched using only their audio metadata (artist, title, etc). Useful on a fully tagged collection. Set the matching criteria in the Audio tags section (see below).

Note:
The similar audio matching settings will not work on audio files of less than 3 seconds.

I didn’t see any details about what it’s doing under-the-hood, though.

bsac335 · November 16, 2022, 3:35pm

I gave up on dupeGuru because I couldn’t figure out the UI, and my experience was so bad that I was happy to purchase Duplicate Cleaner 5, which has a superior interface and more features.

jwynn6 · November 17, 2022, 10:35pm

I’ve been using Czkawka with great success. It’s an open source tool and very fast.

InvisibleMan78 · November 17, 2022, 11:27pm

At first sight, Czkawka (going by its GitHub page) seems to look only at existing metadata (like Title, Artist, Year, Bitrate, Genre or Length) to compare “music” similarity?

Same Music

This is a mode to find identical music files through tags.
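
To make the difference from acoustic matching concrete, here is a rough sketch of what tag-only matching amounts to. This is not Czkawka's code (Czkawka is written in Rust); it assumes the tags have already been read into plain dicts by some tagging library, and the names `normalize` and `group_by_tags` are hypothetical.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivial tag differences don't matter."""
    return " ".join(str(value).lower().split())


def group_by_tags(tracks, keys=("artist", "title")):
    """Group track paths whose chosen tags match after normalization."""
    groups = defaultdict(list)
    for track in tracks:
        signature = tuple(normalize(track.get(key, "")) for key in keys)
        if all(signature):  # skip tracks that are missing any of the chosen tags
            groups[signature].append(track["path"])
    # Only signatures shared by more than one file are duplicate candidates.
    return {sig: paths for sig, paths in groups.items() if len(paths) > 1}


tracks = [
    {"path": "a.mp3", "artist": "ABBA", "title": "Waterloo"},
    {"path": "b.flac", "artist": "abba ", "title": "waterloo"},
    {"path": "c.mp3", "artist": "ABBA", "title": "Fernando"},
    {"path": "d.mp3", "title": "Waterloo"},  # no artist tag, so it is ignored
]
duplicates = group_by_tags(tracks)
```

Matching this way depends entirely on the tags being present and consistent, so an untagged or mistagged copy of the same song would be missed.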