Soundalike: a program for finding duplicate recordings in a music collection

derat · September 26, 2022, 8:35pm

Earlier this year, I wrote a command-line program unimaginatively named soundalike that uses chromaprint’s fpcalc utility to look for duplicate recordings within my music collection without talking to external services like AcoustID. It’s been working pretty well, so I figured I should finally get it into a state where other people can try it out.

The basic idea is that the program fingerprints all the audio files under a directory and then prints clusters of similar files:

% ./soundalike testdata
2022/09/26 16:03:23 Finished scanning 6 files
64/Fanfare for Space.mp3    0.47 MB  61.49 sec
orig/Fanfare for Space.mp3  2.35 MB  61.44 sec

64/Honey Bee.mp3    0.44 MB  57.16 sec
orig/Honey Bee.mp3  2.18 MB  57.10 sec
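
For readers wondering how files could be grouped into clusters like the ones above, here is a minimal sketch. It is not soundalike's actual code: it assumes pairwise similarity scores are already available and uses a union-find structure to merge any two files whose score meets a threshold. The `cluster` function, the 0.9 threshold, and the score dictionary are all hypothetical.

```python
def cluster(paths, scores, threshold=0.9):
    """Group paths into clusters of similar files.

    scores maps (path_a, path_b) pairs to a similarity in [0, 1].
    Both the score source and the threshold are assumptions.
    """
    parent = {p: p for p in paths}

    def find(p):
        # Follow parent links to the cluster representative,
        # shortening the chain along the way (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the clusters of every sufficiently similar pair.
    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in paths:
        groups.setdefault(find(p), []).append(p)

    # Only clusters with at least two files are interesting.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

Because merging is transitive, a file that resembles two otherwise dissimilar files pulls all three into one cluster, which is one reason a threshold like this needs tuning.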

You can also do one-off comparisons between two files:

% soundalike -compare -compare-interval 100 instrumental.mp3 vocals.mp3
 100: 0.983
 200: 0.984
 300: 0.983
 400: 0.989
 500: 0.819
 600: 0.757
 700: 0.751
 800: 0.753
...

There are various flags for tweaking thresholds, ignoring false positives in later runs, etc.

If you’re interested in trying it out, I’ve uploaded precompiled binaries for Linux and Windows (x86-64) at https://github.com/derat/soundalike/releases/tag/v0.1. All of my usage so far has been on Linux, although I verified that it at least runs on Windows 10. I don’t know of any reason why it wouldn’t work on macOS, but there’s no precompiled version (yet?) since cross-compiling seems like a nightmare and I wouldn’t have a good way of testing it.

I’ve tried to document how to install and use the program in the README.md file. Note in particular that you’ll probably need to copy fpcalc or fpcalc.exe from a Chromaprint release to a location where soundalike can find it.
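
To check that fpcalc is reachable before running anything, a small script like the following may help. It is only a sketch: it assumes fpcalc is looked up via PATH (the README is the authority on where soundalike actually searches) and that your Chromaprint release supports the `-raw` and `-json` flags, which print the duration and the fingerprint as a list of integers. The helper names are hypothetical.

```python
import json
import shutil
import subprocess


def parse_fpcalc_json(text):
    """Extract (duration_seconds, raw_fingerprint) from fpcalc's JSON output."""
    data = json.loads(text)
    return float(data["duration"]), list(data["fingerprint"])


def fingerprint(path, fpcalc="fpcalc"):
    """Run fpcalc on one file; assumes the executable is on PATH."""
    exe = shutil.which(fpcalc)
    if exe is None:
        raise FileNotFoundError(f"{fpcalc} not found on PATH")
    result = subprocess.run(
        [exe, "-raw", "-json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_fpcalc_json(result.stdout)
```

If `fingerprint("some-file.mp3")` raises `FileNotFoundError`, fpcalc is not on PATH yet.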

And if you run into any issues or have suggestions, please feel free to create an issue!

Anesidora · September 27, 2022, 2:33am

This is pretty cool. I wonder if you could measure the difference as permutation distance (as done in graph theory) to help you distinguish slight differences (like a different fade-out) from bigger differences (like a radio edit or a track in a beatmixed set).

This post explains how Endcrawl used this approach to measure differences in the order of on-screen credits at the end of different films.

derat · September 27, 2022, 1:23pm

Thanks, that Endcrawl post is interesting! I’m not sure how feasible it would be to use a similar approach when comparing recordings, though – I can’t think of any way to break a fingerprint up into nodes (analogous to the credits in the post) that could be put into a graph. Offset differences at the beginning of songs already make similarity comparisons pretty challenging, to the point where my code essentially just says ¯\_(ツ)_/¯ and brute-forces things by trying all possible alignments.
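
To make the brute-force idea concrete, here is a rough sketch. It is not the code soundalike uses: it assumes raw Chromaprint fingerprints (lists of 32-bit integers), scores each alignment by the fraction of matching bits, and keeps the best one. The function name and the `min_overlap` parameter are made up for illustration.

```python
def best_alignment(a, b, min_overlap=4):
    """Return (score, offset) for the best alignment of two raw fingerprints.

    A positive offset means b[0] lines up with a[offset]; a negative
    offset means a[0] lines up with b[-offset]. The score is the fraction
    of identical bits across the overlapping 32-bit values.
    """
    best_score, best_offset = 0.0, 0
    for offset in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        ia, ib = max(offset, 0), max(-offset, 0)
        n = min(len(a) - ia, len(b) - ib)
        if n < min_overlap:
            continue  # too little overlap to say anything useful
        # Count differing bits across the overlapping region.
        diff = sum(
            bin((a[ia + i] ^ b[ib + i]) & 0xFFFFFFFF).count("1")
            for i in range(n)
        )
        score = 1.0 - diff / (32.0 * n)
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset
```

This is quadratic in the fingerprint length, so a real implementation would probably cap the range of offsets it tries.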

Just for fun, I tried comparing the YouTube video “get lucky but beats 2 and 4 are swapped [CC]” against a 320 kbps CBR MP3 of the original recording (“Daft Punk - Get Lucky (Official Audio) ft. Pharrell Williams, Nile Rodgers”, also on YouTube). The similarity was only 0.424, though – I suspect that detecting swapped-beat remixes may not have been one of Chromaprint’s design goals.