How to proceed with large libraries?

Recently I cancelled my Google Music subscription and went back to the good old music-files-on-hard-drive approach. I took my HDDs from various years out of the vault and copied the media files over.

There are ~25000 tracks in total, and now I want to clean up and sort this collection. I’ve heard the recommendation to do it in batches of 5000 files or so, but what exactly should the procedure be?

Initially I tried to run it with all 25000 files. The initial processing took 5 hours, and then, after 20 minutes of manual sorting, Picard hung (when I right-clicked the changes pane and ticked the “Show changes first” checkbox – it’s a separate issue, and I can reproduce it, e.g. when “Clustered tracks” is selected in the left pane).

My intention was to have a clean directory as an output.

As the music comes from multiple sources, there is a lot of overlap, and it would be nice to reconcile e.g. partial albums from those different eras…

And finally, my actual question:

What’s the best way to do the sorting/tagging/de-duplication in batches?
Suppose I sort and tag the first 5000 tracks from the first batch.
Will the second batch of 5000 tracks automatically detect duplicates in the destination directory? Or should I add the destination directory as a source too? Or should I do that with a new, clean location?


IMHO, it highly depends on the existing tags inside your tracks.
If you have no tags – or only very few – the manual effort will be somewhere between huge and enormous.
If at least the basic tags (Artist, Album, Year, Title) are filled in correctly, you can do it with a still-significant effort: you will have to check album by album manually, with Picard’s help.

The best case is if your tracks contain some unique identifier, like one or more of the MusicBrainz identifiers for a track, album, artist and so on. That way Picard can help you the most.

So please have a look inside your tracks and let us know what kind of metadata you already have.


Using beets for the first pass can save a lot of time. beets is an automatic tagger with a command-line interface. It queries MusicBrainz, but it can also look at other sources (such as Discogs) if you install the plugins.

I would point beets at the entire collection, and set it to auto-tag albums that are a near-certain match (the threshold is adjustable). If you know where most of your tracks came from, you can improve the matching accuracy even more by setting your preferred release country and media (CD, digital media, etc.) in your beets config file.
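For reference, those preferences live in beets’ `config.yaml`. A minimal sketch, assuming defaults elsewhere – the keys below are from beets’ import and match configuration, and the values are just examples to adjust for your own collection:

```yaml
# ~/.config/beets/config.yaml (example values)
import:
  copy: yes                   # leave originals in place; write tagged copies to the library
match:
  strong_rec_thresh: 0.10     # auto-accept matches with a distance below this (default 0.04)
  preferred:
    countries: ['US', 'GB']   # prefer releases from these countries
    media: ['CD', 'Digital Media']
```

A lower `strong_rec_thresh` is stricter; raising it makes beets auto-accept more matches without prompting.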

The processing is pretty fast, and you can watch its progress. You can also stop and resume the whole tagging job. When it finishes, use Picard to make adjustments and to tag the remainder of your collection. beets makes that job much smaller by automating the easy decisions.

Here’s how beets handles duplicates. There is also a Duplicates plugin that you can use to audit your collection. These may or may not meet your needs depending on what you consider to be a duplicate. See also this thread and its links.

Even though Picard and beets both detect files that already have MusicBrainz tags, telling the tagger to move processed files into a new directory structure will probably make the job easier to manage. Keeping your inbox separate from the finished library is a more efficient way to work.


Whether to work track by track or album by album is another important question.

In any case, whatever you do, Picard or other program, try it on a backup of some of your files first!

Overwritten metadata is gone forever.


Thanks for your suggestions.
I played with the program a bit more and realized that it’s probably better to do some manual preprocessing first, e.g. manually sort the tracks by artist, and then add artists one by one.

I did try beets, but it wasn’t much help either. It’s pretty slow and hard to control (e.g. when a folder holds an album but one of the tracks is completely wrong, there’s no way to do anything about it). Picard is much better in this regard: you can go back and forth in any order, and there’s no implied workflow.

Duplicates do indeed seem to be an issue; it’s strange that Picard has nothing to handle them.
Even in the simplest case, when I have the same album in two locations and Picard sees that, there’s no option to keep just one copy of each track – you have to remove the tracks one by one.

I’ll probably write a plugin for that (isn’t there one already?) which removes all variants of a track except the “best” one (by match quality or bitrate).
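The selection logic such a plugin would need can be sketched in plain Python. This is only an illustration of the “keep the best variant” idea, not the actual Picard plugin API – the `match` and `bitrate` keys and the dict layout are hypothetical:

```python
# Sketch of "keep best variant" de-duplication (not Picard plugin code).
# Each variant is a dict with hypothetical 'match' (0..1) and 'bitrate'
# (kbps) keys; the best variant wins on match score first, bitrate second.

def pick_best_variant(variants):
    """Return the single variant to keep from a list of duplicates."""
    return max(variants, key=lambda v: (v["match"], v["bitrate"]))

def deduplicate(tracks):
    """tracks maps a track key (e.g. a recording MBID) to its variants.
    Returns (keepers, removable) where removable lists every other copy."""
    keep, remove = {}, []
    for key, variants in tracks.items():
        best = pick_best_variant(variants)
        keep[key] = best
        remove.extend(v for v in variants if v is not best)
    return keep, remove

tracks = {
    "recording-1": [
        {"path": "old-hdd/01.mp3", "match": 0.98, "bitrate": 192},
        {"path": "new-rip/01.mp3", "match": 0.98, "bitrate": 320},
    ],
}
keep, remove = deduplicate(tracks)
print(keep["recording-1"]["path"])  # the 320 kbps copy wins
```

A real plugin would additionally have to group Picard’s matched files by recording before applying a rule like this.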

Another plugin I’m thinking of writing (if the API allows it) would send all non-perfect AcoustID matches of an album back to the Unclustered section.
It often happens that e.g. live versions of songs get added to studio albums as lower-confidence matches, and it’s quite tedious to drag them back one by one – better to do it in one click.


That’s basically built in already. To tweak it:

  • In Options > Metadata > Preferred Releases, raise the preference for albums and lower it for compilations.
  • To reject lower-quality matches, raise the file-matching threshold under Options > Advanced.

You’ll have to play with the threshold a bit to see which value works best for you. If your existing basic metadata (mainly album, track title and artist) is already fairly good and not too different from MusicBrainz, you may get good results with a higher threshold. If you start getting no matches at all, lower the threshold again.