Hi, I’ve been reading through the listen submission path in ListenBrainz (submit_listen → insert_payload) and wanted to sanity-check my understanding.
From what I can see at the API layer, there isn’t an explicit concept of “listen identity” or duplicate detection — the payload is augmented and then sent to the queue more or less as-is. I couldn’t find a place where identity or conflict handling is defined there, but I may be missing something downstream.
Is that understanding correct, or is there another layer where this is handled?
If this is accurate, would a small PR that:
refactors the augmentation step for better testability,
adds regression tests to document the current behavior,
and does not change semantics,
be acceptable?
I’d like to make sure I’m aligned with current design before working on anything.
You can look at the listens insertion on the database side:
The gist of it is we deduplicate on a combination of timestamp, user name and recording MSID.
The MSID (MessyBrainz ID) is a representation of the dirty metadata we receive for a listen, and contains track, artist and release name, track number and duration.
It’s the closest thing you’ll find to an “identity” of a single listen.
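To make sure I understood the dedup rule, here is a minimal sketch of that identity concept in Python. This is purely illustrative, not the actual ListenBrainz code — the database enforces this via its own constraint, and the field names here are assumptions:

```python
# Illustrative sketch only: deduplicating listens on the combination of
# timestamp, user name, and recording MSID, as described above.
# The dict keys (listened_at, user_name, recording_msid) are assumed
# field names, not necessarily the real schema.

def dedup_key(listen):
    """The closest thing to a listen's identity."""
    return (listen["listened_at"], listen["user_name"], listen["recording_msid"])

def deduplicate(listens):
    """Keep the first listen seen for each identity key."""
    seen = set()
    unique = []
    for listen in listens:
        key = dedup_key(listen)
        if key not in seen:
            seen.add(key)
            unique.append(listen)
    return unique
```

So two submissions of the same track by the same user at the same timestamp collapse to one listen, while the same track listened to at a different time is kept as a separate listen.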
With all that in mind, what are your proposed changes?
Thanks, that helps a lot — I had only been looking at the API submission path and missed that the effective “identity” is enforced at the database layer via (listened_at, user_id, recording_msid), with recording_msid representing the messy metadata fingerprint.
Based on that, I’m not proposing to change any deduplication semantics. My intent was limited to the API layer:
refactor the augmentation / submission step to make the data flow clearer and easier to test,
add regression tests that document the current behavior (including that duplicate handling happens downstream), and
keep the observable behavior unchanged.
Concretely, I was thinking of extracting the augmentation logic into a smaller unit that can be tested in isolation and adding tests that assert that the same payload results in the same augmented listens being sent to the queue, leaving duplicate handling entirely to the database layer as it is today.
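To make that concrete, roughly what I had in mind is something like the following. This is only a hypothetical sketch — `augment_listen` is an invented name, not an existing ListenBrainz function, and the field names are assumptions:

```python
# Hypothetical refactor sketch: the augmentation step extracted into a
# pure function so it can be unit-tested without the queue or database.
# augment_listen and the dict keys used here are invented for illustration.

def augment_listen(listen, user):
    """Return a copy of the listen with the user's id and name attached."""
    augmented = dict(listen)
    augmented["user_id"] = user["id"]
    augmented["user_name"] = user["musicbrainz_id"]
    return augmented
```

A test could then assert that the function attaches exactly these fields and leaves the original payload untouched, without needing to stand up the submission pipeline.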
Does that sound reasonable as a small, non-semantic-change PR, or is there a different area in the submission path you would prefer me to focus on?
If you are using an LLM to generate or format these responses, I would much rather hear your authentic voice.
We want you to use your own reasoning skills and show us what you got.
You mention the augmentation step, which actually only adds the user id and name to the listen.
I don’t know how you would “make the data flow clearer and easier to test”.
Unless I am missing something and you can point me more specifically to the code you are talking about, it just seems like a confused LLM suggestion.
add regression tests that document the current behavior
What do you mean by that? Can you be more specific?
Add tests where, to test what?
Thanks for the clarifications, and understood about the AI policy — that’s fair. I’ll be more careful.
Part of the reason I was thinking about “identity” or conflicts at this layer is that in another MetaBrainz project (Picard) I recently worked on file identity and external change detection, so I was mentally mapping that idea onto the ListenBrainz submission path. That was my mistake here; I agree that the augmentation step itself is very thin and mainly just adds the user id and user name.
I was mainly trying to understand whether a similar identity/conflict problem existed in the ListenBrainz submission path at all, rather than assuming it did — this is part of exploring the code and learning how the pipeline is structured.