I would like to add a new feature to Picard to remove duplicate files. For example, if I have a music collection that contains the same music multiple times, this feature would remove the extra copies. We discussed something about it at Discussion. Ideas and discussion are welcome.
Key points from previous discussion (which are not decisions and are up for further debate):
- Three things that need to be decided:
  - How to identify files as duplicates of one another
  - How to decide which of the duplicate files to keep and which to delete
  - When to do the deletion
- There was some consensus (but only by the two contributors Vishi and myself) that identification of duplicate tracks should be on the basis of Track MBID - but personally I now wonder whether perhaps it should also include being in the same directory (or same target directory if saved).
Another possible test would be tracks linked to the same recording.
Regarding deletion, I thought of something like this: we will have a data structure that contains the MBID, the bitrate, and a pointer to the file. If the next music file's MBID matches an existing MBID in our structure, we check which file has the lower bitrate and move that file to a temp folder. If it doesn't match any, we store it to test against the others.
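To make that idea concrete, here is a minimal sketch of such a structure. It is not Picard code: `TrackFile` and `split_duplicates` are hypothetical names, the MBID and bitrate are assumed to have been read already, and nothing is actually moved on disk; the function only reports which files would go to the temp folder.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TrackFile:
    mbid: str     # MBID used as the duplicate key (assumed already looked up)
    bitrate: int  # bitrate in kbit/s
    path: str     # "pointer" to the file on disk


def split_duplicates(files: List[TrackFile]) -> Tuple[Dict[str, TrackFile], List[TrackFile]]:
    """Return (files to keep, keyed by MBID) and (files to move to the temp folder)."""
    keep: Dict[str, TrackFile] = {}
    to_move: List[TrackFile] = []
    for f in files:
        current = keep.get(f.mbid)
        if current is None:
            # First file seen with this MBID: store it to test against the others.
            keep[f.mbid] = f
        elif f.bitrate > current.bitrate:
            # The new file is better, so the previously stored one becomes the duplicate.
            to_move.append(current)
            keep[f.mbid] = f
        else:
            # Lower or equal bitrate: the new file is the duplicate.
            to_move.append(f)
    return keep, to_move
```

On a tie in bitrate this sketch keeps whichever file was seen first, which is an arbitrary choice.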
You might have two different albums containing the same recording - would you really want to delete one of these files leaving one of the albums incomplete?
I wish I understood what this means.
I think he is talking about having an album in his library and then adding the same album a second time (which gets saved in the same directory as e.g. “1 Love Me Do.mp3” and “1 Love Me Do (1).mp3”).
Of course, a few people are not interested in albums, only in “singles” or single tracks, in which case recording MBID might be better (i.e. for Non-Album Tracks).
That’s not what I understood when I read his ticket in Jira (PICARD-311), which states:
I do have lots of files with doublettes.
I want to have from every song only one file.
With Picard i can identify and sort to releases.
But i dont care about releases, i care about songs.
Wouldnt it be great if Picard would remove all soundfiles which are available on the hard disc more often then one time?
Picard could check for the quality of the identical songs and keep the one with the best quality.
Picard could also check if the kept MP3 file is valid.
@vishichoudhary, could you please clarify? If you have two different releases (albums) with the same recording, are you only wanting to keep one of the files (and leave one release incomplete), or did you want to keep both releases complete, and only get rid of duplicate files within a release?
I don’t want to disrespect @vishichoudhary, but this request has existed for some time, and is not just about one user’s requirement.
That said, I think that using Track MBIDs for Albums and Recording MBIDs for NAT tracks should do it. Presumably anyone wanting to tag songs rather than albums has a tagging script that unsets all album related tags to make every file a NAT song.
Agreed, but I wouldn’t count on everyone having their scripts unset all album-related tags. A lot of people simply don’t know how to do that, and don’t care to learn.
Bizarre as it sounds, I can actually envision someone wanting to just keep one file, but wanting the tags for that file to contain a list of all the albums on which the recording appears.
The few times I’ve had duplicate releases, I’ve run Picard over the collection, then used PowerShell to search for and delete the files that ended with (#).mp3. You can also use file explorer to search for *).mp3, but that’s a little slow and you have to be careful not to delete the songs with parentheses as part of the titles.
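For anyone who would rather not use PowerShell, a rough Python equivalent of that search might look like the sketch below. The pattern is an assumption based on the “1 Love Me Do (1).mp3” example earlier in the thread (a space, then only digits inside the parentheses), so titles such as "Song (Live).mp3" are not matched. `find_numbered_copies` is a made-up helper name, and it only lists files; it does not delete anything.

```python
import re
from pathlib import Path
from typing import List

# Matches names such as "1 Love Me Do (1).mp3": a space, digits in
# parentheses, then the .mp3 extension at the very end of the name.
NUMBERED_COPY = re.compile(r" \(\d+\)\.mp3$", re.IGNORECASE)


def find_numbered_copies(root: str) -> List[Path]:
    """List files under root whose names end in " (N).mp3", for manual review."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and NUMBERED_COPY.search(p.name)
    )
```

A genuine title that ends in a number in parentheses would still match, so the list is worth checking by eye before deleting.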
What I was thinking: suppose we have a track lean_on saved multiple times in a folder under different names.
We can remove those unnecessary files…
IMO it is better to work on new metadata MBIDs downloaded from MB rather than existing tags - because files may not yet be tagged.
Another idea for identifying duplicates - Acoustid!!
Whatever the solution is we should try to make it fit several use cases.
This feels to me like a plugin waiting to be written - with some options:
a. Use track / recording MBIDs / AcoustIDs for dedup.
b. Merge album data from all files into file being kept - which probably implies that we need a function that avoids this data being lost if you update the tags.
c. Options for deciding which file to keep - largest, smallest, newest, oldest …
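Option (c) could be as simple as a table of named policies. The following is only a sketch of that idea using file size and modification time from the filesystem; `KEEP_POLICIES` and `choose_keeper` are hypothetical names and not part of any existing Picard plugin API.

```python
import os
from typing import Callable, Dict, List

# Each policy maps a path to a sort key; the file with the highest key is kept.
KEEP_POLICIES: Dict[str, Callable[[str], float]] = {
    "largest": lambda p: os.path.getsize(p),
    "smallest": lambda p: -os.path.getsize(p),
    "newest": lambda p: os.path.getmtime(p),
    "oldest": lambda p: -os.path.getmtime(p),
}


def choose_keeper(paths: List[str], policy: str = "largest") -> str:
    """Pick the one file to keep from a group of duplicates."""
    return max(paths, key=KEEP_POLICIES[policy])
```

Everything in the group other than the returned path would then be a candidate for deletion or for moving elsewhere.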
Is there an automated way to dedupe in Picard?
I use DuplicateFilesDeleter. It is really friendly and helpful.
Not sure if this is still being discussed but one option to consider is having Picard do some type of checksum/hash to compare files. This will find files whose contents are identical, so this would be a good default option to enable (if this functionality is ever added). You could then add other options to compare the bitrate, MBID, AcoustID, etc. for when the hashes don’t match.
Now, the checksum should be done without any metadata/tags, as those could cause the checksum to be different. I think Picard should be able to do that, since it already has an option to clear existing tags from files. So the duplicate check process could occur after this step and before the new tags are applied.
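As a rough illustration of hashing the audio without the tags, the sketch below skips a leading ID3v2 tag and a trailing ID3v1 tag before hashing. It is an assumption-heavy example: it only knows about ID3 on MP3 files (not APEv2 tags, Vorbis comments, MP4 atoms and so on), it is not how Picard handles tags, and `strip_id3` and `audio_checksum` are names invented here.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Remove a leading ID3v2 tag and a trailing ID3v1 tag, if present."""
    start, end = 0, len(data)
    if len(data) >= 10 and data[:3] == b"ID3":
        # The tag size is a "syncsafe" integer: 7 usable bits per byte.
        size = ((data[6] & 0x7F) << 21) | ((data[7] & 0x7F) << 14) \
            | ((data[8] & 0x7F) << 7) | (data[9] & 0x7F)
        start = 10 + size          # 10-byte header plus tag body
        if data[5] & 0x10:         # footer flag: another 10 bytes
            start += 10
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128                 # ID3v1 is a fixed 128-byte trailer
    return data[start:end]


def audio_checksum(data: bytes) -> str:
    """SHA-256 of the file contents with ID3 tags ignored."""
    return hashlib.sha256(strip_id3(data)).hexdigest()
```

Two copies of the same MP3 that differ only in their ID3 tags should then produce the same checksum, while re-encoded copies will not, which is where the bitrate/MBID/AcoustID comparisons come in.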
Windows even has a built-in utility to generate checksums that Picard might be able to hook into:
Or maybe use a tool like Hash My Files from Nirsoft to do it on multiple files/folders.
For a similar discussion see:
Another alternative not so far mentioned is that SongKong does duplicate deletion based on Acoustids/MusicBrainz Ids. It can find duplicates in the following scenarios:
Same song (Duplicate Recording Id)
Sounds the same (Duplicate Acoustid)
Sounds the same and same song (Duplicate Recording Id, Acoustid)
Same Release, same version (Duplicate TrackId)
Same Release, same version, sounds the same (Duplicate TrackId, AcoustId)
Same Release, any version (Duplicate RecordingId, ReleaseGroupId)
Same Release, any version, sounds the same (Duplicate RecordingId, ReleaseGroupId, AcoustId)
You can also configure which duplicate to keep based on audioformat, bitrate, fileCreationDate etc and whether to actually delete or just move the duplicates.
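The scenarios above all amount to grouping files by a different combination of identifiers. The sketch below shows that general idea only; it is not SongKong's implementation, and the scenario names, the field names (`recording_id`, `acoustid`, `track_id`, `release_group_id`) and `group_duplicates` are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each scenario is the tuple of identifiers that must all match.
SCENARIOS: Dict[str, Tuple[str, ...]] = {
    "same_song": ("recording_id",),
    "sounds_the_same": ("acoustid",),
    "same_song_and_sound": ("recording_id", "acoustid"),
    "same_release_same_version": ("track_id",),
    "same_release_same_version_and_sound": ("track_id", "acoustid"),
    "same_release_any_version": ("recording_id", "release_group_id"),
    "same_release_any_version_and_sound": ("recording_id", "release_group_id", "acoustid"),
}


def group_duplicates(files: List[Dict[str, str]], scenario: str) -> List[List[Dict[str, str]]]:
    """Group file records whose identifiers match for the chosen scenario."""
    fields = SCENARIOS[scenario]
    groups = defaultdict(list)
    for f in files:
        key = tuple(f.get(name) for name in fields)
        if all(key):  # skip files that are missing any required identifier
            groups[key].append(f)
    # Only groups with more than one file contain duplicates.
    return [g for g in groups.values() if len(g) > 1]
```

Which file in each group to keep, and whether to delete or move the rest, would be a separate step.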