Detecting bad actors/AcoustID submission groups

I actually think this is a good example of something a bot/the system could fix.

If it’s one user submitting all this junk, as you’ve surmised, then we can detect them as a bad actor (after a certain percentage of unlinks by editors) and remove all their submissions. That assumes we’re happy to drop a whole bunch of valid IDs, i.e. we think one incorrect link is more trouble than 100 correct ones are worth (I think so, tbh). Nothing too complicated, I think.

1 Like

Are submissions always linked to a user? There seem to be so many different routes to adding data to AcoustID.

You have a simple example in that one. Identify who uploaded those Depeche Mode tracks and I’ll bet you have the same name six times.

I am also curious - will that resolve to a username? Or an anonymous ID number?

But surely to spot this one “bad user”, you still need a human to say “that ain’t right, put that user on a naughty list”?

My assumption was not that one user was uploading all bad data. Just that they got that one release wrong a number of times. Maybe they tried six times before they realised where their error was, and after fixing it went on to add hundreds of good quality releases.

After seeing two bad tracks, it would have been good to have said “clear that user’s additions for this Release”, but that assumption cannot spread beyond one release without checking.

1 Like

No idea :grin:

Just some thoughts re. where automation could theoretically help…

You’ve done that, by unlinking their submission. My proposal is that if a user has had a certain number or percentage of their submissions unlinked (as you say, this requires a good human eye), the system unlinks all of their submissions.

Assuming: We can link users/IDs with submissions. And we are happy to throw away X amount of good submissions.

The reason I think this is a good idea is that it’s often mentioned that a real issue with AcoustID is someone with a crazy amount of files coming in and just hitting save and submit on everything… IMO, once identified, we could just take them out of the equation, if their submissions suck :+1:
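
For illustration only, here is a minimal sketch of that threshold rule in Python, assuming AcoustID (or a cleanup bot) could see per-submitter totals and editor-unlink counts. The function and parameter names are hypothetical, not anything AcoustID actually exposes:

```python
# Hypothetical sketch of the proposed rule: if enough of a submitter's
# AcoustID links have been unlinked by editors, flag the rest for removal.
# The inputs are assumptions; AcoustID's real data model may differ.

def should_purge_submitter(total_submissions: int,
                           unlinked_by_editors: int,
                           min_submissions: int = 50,
                           unlink_ratio: float = 0.5) -> bool:
    """True if the submitter's remaining submissions should be auto-unlinked."""
    if total_submissions < min_submissions:
        # Too little data to judge; don't penalise small or new submitters.
        return False
    return (unlinked_by_editors / total_submissions) >= unlink_ratio


# Example: 1,000 submissions, 500 unlinked by editors -> True
print(should_purge_submitter(1000, 500))
```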

I think you would be throwing away too much just because someone had a bad day. I know the first few times I used AcoustID I got it wrong.

Now if you can match an ALBUM or a short Time Period, then that could work. I just can’t see how you could write off a whole person’s data. Too much. Everyone has bad days.

There are better patterns to work on (and we are getting OT and having this conversation in the wrong thread)

1 Like

Continuing the convo here:

I mean, I’m not proposing we just scrub anyone who has bad additions… I definitely don’t want to throw out the baby with the bathwater/shitty AcoustIDs.

More productively, what threshold do you think would be throwing out too much? What percentage of good submissions? And how long a grace period for a user ID (if there is even such a thing attached to submissions) before it should be checked?

e.g. a user ID’s submissions are all 5 years old, they have 1,000 submissions, and 50% of them have, incredibly, been unlinked by very attentive and busy users such as yourself. Would this hit the threshold for you to assume their submissions suuuuuuuck? Note that any AcoustID with even a single submission count from another user ID would still stay.
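
A rough sketch of that last rule (keep anything another submitter has also sent), again with a hypothetical data shape: a mapping from AcoustID to the set of user IDs that submitted it, which is an assumption about what could be queried, not AcoustID’s real schema:

```python
# Hypothetical sketch: only unlink AcoustIDs submitted *solely* by the
# flagged user ID; anything corroborated by another submitter stays.

def acoustids_to_unlink(submitters_by_acoustid: dict[str, set[str]],
                        flagged_user: str) -> list[str]:
    return [
        acoustid
        for acoustid, submitters in submitters_by_acoustid.items()
        if submitters == {flagged_user}
    ]


# Example: only "id-1" is purged; "id-2" has a second submitter, so it stays.
print(acoustids_to_unlink({"id-1": {"user-x"}, "id-2": {"user-x", "user-y"}}, "user-x"))
```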

1 Like

Please don’t forget: AcoustID accepts contributions from various sources, not only from MusicBrainz/Picard. Picard, for example, “validates” the submissions to AcoustID, whereas other third-party applications don’t care at all.

Would it not be much easier if Picard refined its search and use of AcoustIDs? I can’t tell you exactly what numbers or dependencies would help most. But this way MB could act on its own. (I’m not sure how fast AcoustID would introduce tests to increase the data quality or detect and reject users as in your example.)

2 Likes

What is the validation?

You can do checks at the application end, but the more checks you do, the more calls have to be made to the rate-limited database, and there comes a point where things get too slow. Also, many of these cases would be too difficult to pick up.

Much better to fix it once at the source than have to handle the issue every time that track gets processed by Picard. And of course fixing the data is a solution for all MusicBrainz/AcoustID users, not just Picard users.

1 Like

I suppose @outsidecontext can explain these steps most accurately.

In theory: sure. But waiting for such a fix at the source for years doesn’t help you or anybody else. :wink:

But that’s what I’ve done with this report; we can fix it now.

1 Like

Is there a way to add data to AcoustID that bypasses the AcoustID API? The API requires user and client API keys, which are linked to an AcoustID user.

According to the AcoustID blog, it seems like there is/was some kind of anonymous account support:

There are also two new methods /v2/user/lookup?user=X and /v2/user/create_anonymous?client=X&clientversion=X, to support anonymous user accounts. They can be used from applications that can’t ask the users to log in on the Acoustid website, but there are a number of rules that such applications should follow:

You can find the Rules here.
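
For what it’s worth, those two methods are plain HTTP calls. Here is a hedged sketch using the Python requests library, assuming the usual api.acoustid.org host, a GET-style call as implied by the quoted URLs, and placeholder values for the client key, client version and user key:

```python
import requests

API_BASE = "https://api.acoustid.org/v2"  # standard AcoustID web service host (assumed here)

# Create an anonymous user account on behalf of a client application
# (the client key and version values below are placeholders).
resp = requests.get(f"{API_BASE}/user/create_anonymous",
                    params={"client": "YOUR_CLIENT_KEY", "clientversion": "1.0"})
print(resp.json())

# Look up a user by their API key (anonymous or regular).
resp = requests.get(f"{API_BASE}/user/lookup",
                    params={"user": "USER_API_KEY"})
print(resp.json())
```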

1 Like

Who changed the title to something attacking people? It totally misses the point of the original discussion. :frowning_face:

/unsubscribed.

1 Like

It was @freso, I think. I don’t really understand the need to split the topic.

1 Like

The title wasn’t changed, Freso moved this part of the discussion out into a completely new thread :+1:

I just realized I can edit the title! So have done so.

Why not? That’s the aim. What variables would capture this set of submissions?

The simplest scenario: ID X submitted 100 Stephen King AcoustIDs 10 years ago. ID X has never been active again; they submitted nothing else. How large a % of the AcoustIDs from that session would you manually want to unlink before the system decides to auto-unlink the rest?

Complex scenario: ID Y submitted their whole library of 10k files in one go 3 months ago, without checking anything. 10% were bad, including their hastily tagged Stephen King collection. Since then they have continued adding AcoustIDs for new additions, which are tagged correctly. Is there a threshold where we remove the 10k hurried submissions (anything submitted by that ID on that date, for instance), to save you x hours having to clean Stephen King entries? And accept that there are good ones that will be removed*?

Your answer may well be “no”, which is totally understandable. Personally I think a bad submission outweighs a lot of good ones. I am not talking about permanently banning or besmirching key/ID 255267’s good name, btw. Just questioning how we might be able to use what info we have to make the DB more reliable overall.

Note: This more nuanced session-based approach relies on AcoustID storing submission timestamps as well as IDs. Otherwise it would be a cruder, purely %-based approach, which might not be acceptable.

*Even the laziest Picard clicker is going to struggle to mistag an entire collection completely! I would be very impressed :stuck_out_tongue: 10% is huge, tbh.
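
To make the session idea concrete, a minimal sketch of the “anything submitted by that ID in that window” filter, assuming AcoustID stores a submitter ID and a timestamp per submission. The field names here are invented for illustration:

```python
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical sketch of the session filter: pick out one user's submissions
# inside a single time window. The field names ("user_id", "submitted_at")
# are assumptions about what AcoustID might store, not its real schema.

def submissions_in_session(submissions: Iterable[dict],
                           user_id: str,
                           session_start: datetime,
                           window: timedelta = timedelta(days=1)) -> list[dict]:
    """Return the flagged user's submissions that fall inside one session window."""
    end = session_start + window
    return [
        s for s in submissions
        if s["user_id"] == user_id and session_start <= s["submitted_at"] < end
    ]
```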

I would be concerned if it was too easy to pull out the user IDs from submissions. That kinda data gets personal, but I don’t want to go on a tangent or someone will split the thread again.

Timestamps would be far more useful. “IDs from a specific date that match a named artist”, or “IDs added to a specified release on a specific date”. That would allow a net to be spun around bad data without picking on a specific person.

There are way too many duff AcoustIDs from audiobooks for it to be only one person doing it. That person would have to have a large collection. Something is screwy with audiobooks. Or maybe they just stand out more.

1 Like

Just to clarify, I don’t imagine these unlinks would ever be linked to a MusicBrainz editor. It didn’t even occur to me tbh.

It would just be behind the scenes cleanup when a certain threshold is hit within certain parameters.

Side note, has there ever been an example of someone purposefully mass adding/vandalising with incorrect AcoustIDs? Would we have any way of knowing/finding out, as it stands?

How about this for a classic audiobook mess? Release “The Fortune of War” by Patrick O’Brian - MusicBrainz

NOTHING on there seems to be linked to the book. You can tell, as there are no consistent runs of numbers. This is an example of a Release that just needs “all AcoustIDs unlinked”. But at 9 CDs that is too much to do by hand.

Actually, now that I look closer, AcoustIDs only go up to CD3 and then stop.

I have seen AcoustIDs on the last medium.
It seems you are using INLINE STUFF, but it’s broken at the moment, sorry.
Maybe I should think about a status page, or a status system inside the scripts themselves, for when I know they are broken.

2 Likes

I see that as the site is broken and not letting your INLINE STUFF run properly. :grin: No need to apologise, as your scripts add so much that is missing from the site GUI. Wish I could just turn off that annoying default collapse thing, as it also makes a mess of the browser’s “Find in Page” and so many other items.

I fix many of these AcoustID dups thanks to your pink-highlighted AcoustIDs in INLINE STUFF.

1 Like

Same here, super useful for spotting and fixing obvious errors! All glory to @jesus2099 :clap: :clap: :clap:!
BTW, as a workaround for big multi-disc releases: open just a single medium spread at a time like https://beta.musicbrainz.org/release/16caa039-aba9-45e8-9ff1-8cca8635ff52/disc/12#disc12
(most often I just right-click on the medium link and open it in a new tab or window)

1 Like