GSoC proposal for file deduplication in Picard

I think Picard could be extended in two ways to allow for music deduplication:

  1. Using a combination of AcoustIDs and MBIDs.
  2. Using a simple check for whether the destination (when the move option is enabled) already contains a file of the same name. Picard currently handles this situation by transparently renaming the file with a numerical suffix.

I propose that we allow a deduplicate action which:

  1. Asks for the root directory to traverse.
  2. Iterates over all files (maybe skipping files without MBIDs to improve speed; I will need to test a prototype to find out) and creates a SQLite DB (or a simple plaintext CSV) with the filename, path, AcoustID and MBID.
  3. Performs deduplication using a scoring algorithm (I will need to design this and write a spec) that prefers, for example:
     • Higher bitrate
     • More tags compared to other candidates
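To make the steps above a bit more concrete, here is a rough sketch in plain Python. It is only an illustration and does not use any Picard API: `build_index` and `find_duplicates` are hypothetical names, the `bitrate` and `tag_count` columns are my addition on top of the fields listed in step 2 (the scoring needs them), and ranking by bitrate first and tag count second is just one possible scoring rule until the spec is written.

```python
import sqlite3
from collections import defaultdict


def build_index(records):
    """Store scanned files in an in-memory SQLite table.

    records: iterable of (path, acoustid, mbid, bitrate, tag_count) tuples.
    How these values are read from the files is out of scope here.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (path TEXT PRIMARY KEY, acoustid TEXT, mbid TEXT,"
        " bitrate INTEGER, tag_count INTEGER)"
    )
    db.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", records)
    db.commit()
    return db


def find_duplicates(db):
    """Group files that share both MBID and AcoustID.

    Returns {(mbid, acoustid): [paths]} with the preferred file first.
    Assumed ranking: higher bitrate, then more tags, then path as tie-breaker.
    Files missing either ID are skipped.
    """
    groups = defaultdict(list)
    rows = db.execute(
        "SELECT path, acoustid, mbid FROM files"
        " WHERE mbid IS NOT NULL AND acoustid IS NOT NULL"
        " ORDER BY bitrate DESC, tag_count DESC, path ASC"
    )
    for path, acoustid, mbid in rows:
        groups[(mbid, acoustid)].append(path)
    # Only groups with more than one file are duplicates.
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

The first path in each group would be the candidate to keep, and the rest would be offered to the user for removal. Whether matching should require both IDs or only one of them is something I would like to discuss.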

As for the second case, the fix is very simple: provide an option in the move files page to either “warn” or “silently continue” when such a case arises.
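A minimal sketch of how such an option could behave, again in plain Python and not Picard's actual code. `OnCollision` and `plan_move` are hypothetical names, I am reading “silently continue” as keeping the current rename-with-suffix behaviour, and the ` (n)` suffix format is an assumption for illustration.

```python
import os
from enum import Enum


class OnCollision(Enum):
    WARN = "warn"
    SILENT = "silently continue"


def plan_move(dest, policy, exists=os.path.exists):
    """Decide what to do when moving a file to dest.

    Returns (final_path, warning). final_path is None when the move
    should be held back so the user can be warned.
    """
    if not exists(dest):
        return dest, None
    if policy is OnCollision.WARN:
        return None, f"destination already exists: {dest}"
    # Silent mode: pick the first free name with a numerical suffix.
    root, ext = os.path.splitext(dest)
    n = 1
    while exists(f"{root} ({n}){ext}"):
        n += 1
    return f"{root} ({n}){ext}", None
```

The `exists` parameter is only there so the logic can be tried out without touching the filesystem, for example by passing `some_set.__contains__`.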

I’d like to discuss this idea further and flesh it out into a proper detailed proposal.

@Zas, @Bitmap