Remove duplicate files with Picard

feature
picard

#1

I would like to add a new feature to Picard that removes duplicate files. For example, if a music collection contains the same music multiple times, the feature would remove the extra copies. We discussed this earlier at Discussion. Ideas and discussion are welcome.


#2

Key points from previous discussion (which are not decisions and are up for further debate):

  1. Three things that need to be decided:
     • How to identify files as duplicates of one another
     • How to decide which of the duplicate files to keep and which to delete
     • When to do the deletion
  2. There was some consensus (but only by the two contributors Vishi and myself) that identification of duplicate tracks should be on the basis of Track MBID - but personally I now wonder whether perhaps it should also include being in the same directory (or same target directory if saved).

#3

Another possible test would be tracks linked to the same recording.


#4

Regarding deletion, I thought of something like this: we keep a data structure containing the MBID, the bitrate, and a pointer to the file. If the next music file's MBID matches an MBID already in the structure, we check which file has the lower bitrate and move that file to a temp folder. If it doesn't match any, we store it so that later files can be tested against it.
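
To make the idea concrete, here is a minimal sketch in Python. It is not Picard code: `TrackFile` and `dedupe_by_bitrate` are made-up names, and it assumes the MBID and bitrate of each file have already been read by something else.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TrackFile:
    mbid: str      # whichever MBID is used to detect duplicates
    bitrate: int   # kbit/s, assumed to be known already
    path: Path     # the "pointer to file"


def dedupe_by_bitrate(files, temp_dir):
    """Keep the highest-bitrate file per MBID; move the others to temp_dir."""
    temp_dir = Path(temp_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)
    best = {}    # MBID -> best file seen so far
    moved = []   # original paths of the files that were moved away
    for f in files:
        current = best.get(f.mbid)
        if current is None:
            # No match yet: store it to test against later files.
            best[f.mbid] = f
            continue
        # On a bitrate tie the file seen first is kept.
        loser = f if f.bitrate <= current.bitrate else current
        if loser is current:
            best[f.mbid] = f
        # Note: files with the same name would collide in temp_dir;
        # a real implementation would need to rename them.
        shutil.move(str(loser.path), str(temp_dir / loser.path.name))
        moved.append(loser.path)
    return best, moved
```

Moving to a temp folder rather than deleting means nothing is lost if the duplicate detection turns out to be wrong.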


#5

You might have two different albums containing the same recording - would you really want to delete one of these files leaving one of the albums incomplete?

I wish I understood what this means. :thinking:


#6

I wouldn’t, but I thought that’s what @vishichoudhary had in mind.


#7

I think he is talking about having an album in his library and then adding the same album a second time (which gets saved in the same directory as e.g. “1 Love Me Do.mp3” and “1 Love Me Do (1).mp3”).

Of course, a few people are not interested in albums, only in “singles” or single tracks, in which case recording MBID might be better (i.e. for Non-Album Tracks).


#8

That’s not what I understood when I read his ticket in jira (PICARD-311), which states:

I do have lots of files with doublettes.
I want to have from every song only one file.
With Picard i can identify and sort to releases.
But i dont care about releases, i care about songs.
Wouldnt it be great if Picard would remove all soundfiles which are available on the hard disc more often then one time?
Picard could check for the quality of the identical songs and keep the one with the best quality.
Picard could also check if the kept MP3 file is valid.

@vishichoudhary, could you please clarify? If you have two different releases (albums) with the same recording, are you only wanting to keep one of the files (and leave one release incomplete), or did you want to keep both releases complete, and only get rid of duplicate files within a release?


#9

I don’t want to disrespect @vishichoudhary, but this request has existed for some time, and is not just about one user’s requirement.

That said, I think that using Track MBIDs for albums and Recording MBIDs for NAT tracks should do it. Presumably anyone wanting to tag songs rather than albums has a tagging script that unsets all album-related tags to make every file a NAT song.
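
As a rough sketch of that rule (the dictionary keys `album_mbid`, `track_mbid` and `recording_mbid` are placeholders for illustration, not Picard's actual tag names):

```python
def dedup_key(tags):
    """Return the key under which a file counts as a duplicate.

    Files that belong to an album are compared by Track MBID;
    files without album data (NATs) are compared by Recording MBID.
    """
    if tags.get("album_mbid"):
        return ("track", tags.get("track_mbid"))
    return ("recording", tags.get("recording_mbid"))
```

Including the kind of ID in the key keeps album files and NAT files from ever being compared with each other.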


#10

Agreed, but I wouldn’t count on everyone having their scripts unset all album-related tags. A lot of people simply don’t know how to do that, and don’t care to learn.

Bizarre as it sounds, I can actually envision someone wanting to just keep one file, but wanting the tags for that file to contain a list of all the albums on which the recording appears. :slight_smile:


#11

The few times I’ve had duplicate releases, I’ve run Picard over the collection, then used PowerShell to search for and delete the files that ended with (#).mp3. You can also use File Explorer to search for *).mp3, but that’s a little slow and you have to be careful not to delete songs that have parentheses as part of their titles.
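
For anyone who prefers a script, the same search could look roughly like this in Python. It is only a sketch: it assumes copies are named "name (N).mp3" with a space before the parenthesis, and `find_numbered_copies` is a made-up helper. It only lists candidates and deletes nothing.

```python
import re
from pathlib import Path

# Matches stems like "1 Love Me Do (1)", but not "Song (Live)".
COPY_SUFFIX = re.compile(r"^(?P<stem>.+) \((?P<n>\d+)\)$")


def find_numbered_copies(root):
    """Yield (copy, original) pairs where "name (N).mp3" sits next to "name.mp3"."""
    for path in sorted(Path(root).rglob("*.mp3")):
        match = COPY_SUFFIX.match(path.stem)
        if not match:
            continue
        original = path.with_name(match.group("stem") + path.suffix)
        # Only report the copy if the unnumbered original is really there,
        # so a title that happens to end in "(2)" is left alone.
        if original.exists():
            yield path, original
```

Requiring digits inside the parentheses and an existing original avoids most of the false hits on titles that contain parentheses.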


#12

Here is what I was thinking: suppose we have a track, lean_on, saved multiple times in a folder under different names.
We could remove those unnecessary files…


#13

IMO it is better to work on new metadata MBIDs downloaded from MB rather than existing tags - because files may not yet be tagged.


#14

Another idea for identifying duplicates - AcoustID!!

Whatever the solution is, we should try to make it fit several use cases.

This feels to me like a plugin waiting to be written - with some options:

a. Use track / recording MBIDs / AcoustIDs for dedup.
b. Merge album data from all files into the file being kept - which probably implies that we need a function that avoids this data being lost if you update the tags.
c. Options for deciding which file to keep - largest, smallest, newest, oldest …
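
To illustrate how options (a) and (c) might fit together, here is a small sketch. It is not written against Picard's plugin API; `Candidate`, `KEEP_RULES` and `plan_dedup` are invented names, and the ID, size and modification time of each file are assumed to be collected elsewhere.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    dedup_id: str   # track MBID, recording MBID or AcoustID, per option (a)
    size: int       # bytes
    mtime: float    # modification time, seconds since the epoch


# Option (c): each rule picks the file to keep from a group of duplicates.
KEEP_RULES = {
    "largest": lambda group: max(group, key=lambda c: c.size),
    "smallest": lambda group: min(group, key=lambda c: c.size),
    "newest": lambda group: max(group, key=lambda c: c.mtime),
    "oldest": lambda group: min(group, key=lambda c: c.mtime),
}


def plan_dedup(candidates, keep="largest"):
    """Return {dedup_id: (file_to_keep, [files_to_remove])} for duplicate groups."""
    choose = KEEP_RULES[keep]
    groups = defaultdict(list)
    for c in candidates:
        groups[c.dedup_id].append(c)
    plan = {}
    for dedup_id, group in groups.items():
        if len(group) < 2:
            continue  # not a duplicate
        keeper = choose(group)
        plan[dedup_id] = (keeper, [c for c in group if c is not keeper])
    return plan
```

Returning a plan instead of deleting straight away would let the user review the result first, and leaves room for the merge step in option (b).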