Question about listen identity / deduplication in ListenBrainz ingestion

deepak_80 · January 31, 2026, 4:53pm

Hi, I’ve been reading through the listen submission path in ListenBrainz (submit_listen → insert_payload) and wanted to sanity-check my understanding.

From what I can see at the API layer, there isn’t an explicit concept of “listen identity” or duplicate detection — the payload is augmented and then sent to the queue more or less as-is. I couldn’t find a place where identity or conflict handling is defined there, but I may be missing something downstream.

Is that understanding correct, or is there another layer where this is handled?

If this is accurate, would a small PR that:

refactors the augmentation step for better testability,
adds regression tests to document the current behavior,
and does not change semantics,

be acceptable?

I’d like to make sure I’m aligned with current design before working on anything.

mr_monkey · February 2, 2026, 3:57pm

Hello @deepak_80

You can look at the listens insertion on the database side:

github.com/metabrainz/listenbrainz-server

listenbrainz/listenstore/timescale_listenstore.py

08df47936


      
          def insert(self, listens):
              """
                  Insert a batch of listens. Returns a list of (listened_at, track_name, user_name, user_id) that indicates
                  which rows were inserted into the DB. If the row is not listed in the return values, it was a duplicate.
              """
          
              submit = []
              for listen in listens:
                  submit.append(listen.to_timescale())
          
              query = """
                  WITH inserted_listens AS (
                      INSERT INTO listen (listened_at, user_id, recording_msid, data)
                           VALUES %s
                      ON CONFLICT (listened_at, user_id, recording_msid)
                       DO NOTHING
                        RETURNING listened_at, user_id, recording_msid
                  ), metadata AS (

The gist of it is we deduplicate on a combination of timestamp, user name and recording MSID.
The MSID (MessyBrainz ID) is a representation of the dirty metadata we receive for a listen, and contains track, artist and release name, track number and duration.
It’s the closest thing you’ll find to an “identitiy” of a single listen.

github.com/metabrainz/listenbrainz-server

listenbrainz/messybrainz/init.py

08df47936


      
          msid = uuid.uuid4()  # new msid
          query = text("""
              INSERT INTO messybrainz.submissions (gid, recording, artist_credit, release, track_number, duration)
                   VALUES (:msid, :recording, :artist_credit, :release, :track_number, :duration)
          """)
          connection.execute(query, {
              "msid": msid,
              "recording": recording,
              "artist_credit": artist,
              "release": release,
              "track_number": track_number,
              "duration": duration
          })
          return str(msid)

With all that in mind, what are your proposed changes?

deepak_80 · February 3, 2026, 6:28pm

Thanks, that helps a lot — I had only been looking at the API submission path and missed that the effective “identity” is enforced at the database layer via (listened_at, user_id, recording_msid), with recording_msid representing the messy metadata fingerprint.

Based on that, I’m not proposing to change any deduplication semantics. My intent was limited to the API layer:

refactor the augmentation / submission step to make the data flow clearer and easier to test,
add regression tests that document the current behavior (including that duplicate handling happens downstream), and
keep the observable behavior unchanged.

Concretely, I was thinking of extracting the augmentation logic into a smaller unit that can be tested in isolation and adding tests that assert that the same payload results in the same augmented listens being sent to the queue, leaving duplicate handling entirely to the database layer as it is today.

Does that sound reasonable as a small, non-semantic-change PR, or is there a different area in the submission path you would prefer me to focus on?

mr_monkey · February 4, 2026, 11:51am

I would like to preface this by saying that your response reads like LLM output, and I will point you to our AI use policy, recently added to our contribution guidelines: GitHub - metabrainz/listenbrainz-server: Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

If you are using an LLM to generate or format these responses, I would much rather hear your authentic voice.
We want you to use your own reasoning skills and show us what you got.

You mention the augmentation step, which actually only adds the user id and name to the listen.
I don’t know how you would “make the data flow clearer and easier to test”.
Unless I am missing sometihng and you can point me more specificly to the code you are talking about, it just seems like a confused LLM suggestion.

add regression tests that document the current behavior

What do you mean by that? Can you be more specific?
Add tests where, to test what?

deepak_80 · February 4, 2026, 11:59am

Thanks for the clarifications, and understood about the AI policy — that’s fair. I’ll be more carefu;.

Part of the reason I was thinking about “identity” or conflicts at this layer is because in another MetaBrainz project (Picard) I recently worked on file identity and external change detection, so I was mentally mapping that idea onto the ListenBrainz submission path. That was my mistake here, I agree that the augmentation step itself is very thin and mainly just adds user id and user name.

deepak_80 · February 4, 2026, 12:03pm

I was mainly trying to understand whether a similar identity, conflict problem existed in the ListenBrainz submission path at all, rather than assuming it did — this is part of exploring the code and learning how the pipeline structure work.