Trying to avoid excessive polling of api

rhbecker · May 30, 2016, 8:05am

Hopefully this all makes sense, but please don’t hesitate to ask for clarifications! Thanks in advance to anyone who actually reads this.

context

I’m assessing whether to use musicbrainz for a project, still at proof-of-concept stage. I’d only be using a very small subset of the data (the scope of which will be pretty obvious if you review my editing history).

My present thinking on approach is to request the data I need once, via api, caching the responses into a local document store that my app would then use (instead of repeatedly hitting the api directly). This would improve the performance of my app and would essentially eliminate any impact of my app on the seemingly always stressed musicbrainz servers.
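
A minimal sketch of that read-through caching idea, assuming SQLite as a stand-in for the document store. `LocalCache` and the `fetch` callable are hypothetical names; `fetch` is whatever function performs the real API request.

```python
import json
import sqlite3
from typing import Callable


class LocalCache:
    """Read-through cache: serve from the local store, call the API only on a miss."""

    def __init__(self, fetch: Callable[[str], dict], path: str = ":memory:"):
        self.fetch = fetch  # hypothetical callable that performs the real API request
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS docs (key TEXT PRIMARY KEY, body TEXT)"
        )

    def get(self, key: str) -> dict:
        """Return the cached document, fetching it only if it is not stored yet."""
        row = self.db.execute(
            "SELECT body FROM docs WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        return self.refresh(key)

    def refresh(self, key: str) -> dict:
        """Re-fetch one document from upstream and overwrite the cached copy."""
        doc = self.fetch(key)
        self.db.execute(
            "INSERT OR REPLACE INTO docs (key, body) VALUES (?, ?)",
            (key, json.dumps(doc)),
        )
        self.db.commit()
        return doc
```

The app would only ever call `get`; `refresh` is the hook that a change notification, if one existed, would trigger.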

notification of change

The challenge of the above approach is that I’d need to know when the upstream data changes so I can refresh my locally cached copy. I could accomplish that with regularly scheduled polling, but that would not be a great solution for me or you.

Ideally, you’d tell me that, behind the scenes, your architecture is aligned with some standard messaging pattern - say, publish/subscribe - and that my client app could subscribe to pushed notifications of change, providing the trigger for when my cached version of data needs refreshing.

From what I’ve seen in dev-oriented docs, that does not seem to be the case. And given your limited resources, I can’t imagine a switch to such an architecture arriving anytime soon, even if you recognize the value (honestly, I think such a switch would go a long way towards helping you solve your present scaling issues, and in a way that isn’t just throwing more money at the problem).

existing mechanisms?

More realistically, I’m hoping that some existing mechanism could be leveraged in a more modest and less impactful polling approach.

For example, I’ve played around with user collections, and they seem promising. I get an email alerting me of edits to resources I’ve saved into my collections. I also see human readable interfaces like (not a real URL) https://musicbrainz.org/collection/{collection-id}/edits. Is the content of such a URL available in a machine readable format - ideally JSON?

If I could periodically make a single request to retrieve this “changelog”, my subsequent requests could then be targeted to retrieve only what is known to have changed, rather than blindly checking each resource to determine whether it has changed and needs refreshing.

other ideas?

Maybe someone has an alternative idea? Maybe even already successfully accomplished what I’m after?

andreivolgin · May 30, 2016, 8:46am

I am in exactly the same position. Using JSON API live is not an option due to its performance constraints. Caching data is the only feasible option but it requires regular batch updates. I just ran a small test update (~20,000 releases), and I had to spread it over 2 days to stay within the rate limit. How can I update a few hundred thousand entities?
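
As a back-of-the-envelope check, the pace of that test run (20,000 releases in 2 days) can be extrapolated linearly. The sketch assumes one request per entity and a constant rate; the entity counts in the loop are illustrative, not figures from this thread.

```python
def days_to_refresh(entities: int,
                    observed_entities: int = 20_000,
                    observed_days: float = 2.0) -> float:
    """Extrapolate refresh time from the observed test run, assuming a constant rate."""
    rate_per_day = observed_entities / observed_days  # 10,000 per day in the test above
    return entities / rate_per_day


for n in (100_000, 300_000, 500_000):  # illustrative sizes only
    print(f"{n:>7} entities: about {days_to_refresh(n):.0f} days")
```

At that pace, 300,000 entities would take about 30 days per full refresh.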

I assume it should not be too much of a challenge to add a few log statements when a key entity is changed (at a minimum, “artist”, “release” and “recording”). If you simply expose this change log, we can take it from there - parsing logs is trivial.

This will help prevent millions of unnecessary API calls, freeing bandwidth for editors.

If it helps, I can give you free cloud storage space (Google Cloud Storage) for such logs, and I can write a simple public JSON API for anyone to use - without touching your servers or bandwidth.

derobert · May 30, 2016, 8:56am

I would guess your best bet is to use the Live Data Feed … you could watch the replication stream for the stuff you’re interested in. Or… maybe you’d be better served by setting up a full replica. Depends on what portion of the data you need, I suppose.

rhbecker · May 30, 2016, 5:52pm

@derobert: Are you seeing content on the page you linked? I get a “Page Not Found”. I searched the documentation for that phrase and your link shows up as the top result, including a little teaser snippet, as though that page should exist.

At least based on what I imagine to be described by that page, I’m really hoping that’s not the best we can do for the use case I describe. Given that the particular changes I care about are already systematically cherry picked and presented via human-readable markup …

… I’m counting on a machine-readable version of that content as the floor for what is possible.

derobert · May 30, 2016, 6:17pm

Odd, http://musicbrainz.org/doc/Live_Data_Feed loads fine for me. No 404 here.

rhbecker · May 30, 2016, 6:49pm

I can see the content now - assumptions confirmed.

I appreciate you advancing the conversation, but I’ll be severely disappointed if that ends up as my “best bet”.

Keep in mind:

A tiny tiny fraction of the data in musicbrainz is actually relevant to my project (so any solution requiring me to replicate the entire database would seem excessive, given the existence of the api).
The user collection option I speculated about in my first post “proves” that the logic already exists to produce the content I need. It’s just not clear to me whether this content is available in a machine readable format.

Is it clear that supporting my use case would open up some options that might alleviate some of the stress the servers are currently under?

rhbecker · May 30, 2016, 8:01pm

Upon further review of the API documentation, one option could be the introduction of an edit resource.

Each edit appears to already be a discretely identifiable resource - e.g. http://musicbrainz.org/edit/37148588. And, based on the collection UIs I’ve already referenced, it seems like all of the data is properly hooked up in the schema to associate edits with collections.

I could then presumably execute a “browse” request for edits related to a collection, just like I can execute a “browse” request for works related to a collection …

already supported: works in a collection

http://musicbrainz.org/ws/2/work?collection=19d9574a-22ff-4bbf-919b-39ea379f854a

proposed: edits to a collection

http://musicbrainz.org/ws/2/edit?collection=19d9574a-22ff-4bbf-919b-39ea379f854a
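
A sketch of how a client might page through either request. `collection`, `limit`, `offset`, and `fmt=json` are existing ws/2 browse parameters; the `edit` entity is only the proposal above and is not part of ws/2. The response keys (`work-count`, `works`) are assumed to follow the usual `<entity>-count` / `<entity>s` pattern, and `fetch_json` is a hypothetical callable that performs the HTTP request and decodes the JSON.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

WS2_ROOT = "http://musicbrainz.org/ws/2"


def browse_url(entity: str, collection: str, limit: int = 100, offset: int = 0) -> str:
    """Build a ws/2 browse URL for the entities linked to a collection."""
    query = urlencode(
        {"collection": collection, "fmt": "json", "limit": limit, "offset": offset}
    )
    return f"{WS2_ROOT}/{entity}?{query}"


def browse_all(entity: str, collection: str,
               fetch_json: Callable[[str], dict], limit: int = 100) -> Iterator[dict]:
    """Yield every item of a paged browse result, one page request at a time."""
    offset = 0
    while True:
        page = fetch_json(browse_url(entity, collection, limit, offset))
        items = page.get(f"{entity}s", [])  # assumed key, e.g. "works"
        yield from items
        offset += len(items)
        # Stop on an empty page or once the reported total has been reached.
        if not items or offset >= page.get(f"{entity}-count", 0):
            break
```

If the proposed resource existed, `browse_all("edit", collection_id, fetch_json)` would be the single "changelog" request.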

aerozol · May 30, 2016, 9:58pm

The technical issues here are way over my head sorry! But a (not very helpful) note -

MB is open source and non-profit. If there’s any seemingly obvious or basic elements that you expect to be here, they can be added by you or someone you know.
Criticism and feedback are always welcome, but, unlike with other companies, you can’t really throw your hands into the air and say “why does MB not have this!”. There’s not really anybody or anything to be “disappointed” in that doesn’t include yourself. Makes life much harder.

Again, I don’t want to silence your criticism and feedback, people are paying attention! But it’s important to keep things in perspective
Cheers

rhbecker · May 30, 2016, 10:10pm

@aerozol: Understood. I’m not sure which aspects of my posts seem overly critical from your perspective. Perhaps …

To clarify, my disappointment in that scenario is purely hypothetical, as I’m fairly certain there are more direct ways to accomplish what I’m after. So, if that ends up as my “best bet”, that would seemingly indicate a more fundamental problem, such as an unanticipated architectural obstacle, some major flaw in schema design, or disinterest from the community in improving the api in a way that could have a positive impact on server load. Given that I don’t expect any of that to happen, any of that happening would be disappointing.

rhbecker · May 30, 2016, 10:15pm

Also, I recognize this is a possibility. Part of the point of my post was to determine whether there may be some existing functionality that I haven’t yet discovered that would satisfy my needs. Just as the core team has to be cautious with how they invest their time, I too need to ensure I’m not reinventing the wheel.

andreivolgin · May 30, 2016, 11:41pm

@rhbecker - I just went through the exercise of setting up my own copy of MB server. It took 2 hours. I confirmed that I can run the same JSON API calls against my own copy. This is a game-changer for me, as I can now run jobs 10-30 times faster compared to using a public API based on my initial tests.

I understand that this solution may be overkill for your needs, but for me these 2 hours were definitely a good investment.

rhbecker · May 30, 2016, 11:49pm

Cool, thanks for sharing @andreivolgin. If I do end up needing to contribute back code, as @aerozol suggested, I’ll inevitably need to take that plunge to test my work. ~~May I ask which route you went with to set it up?~~ I see in the other thread that you’re likely using this method:

https://github.com/metabrainz/musicbrainz-server/blob/master/INSTALL.md

Installing MusicBrainz Server
=============================

The easiest method of installing a local MusicBrainz Server may be to download the
[pre-configured virtual machine](https://musicbrainz.org/doc/MusicBrainz_Server/Setup),
if there is a current image available. In case you only need a replicated
database, you should consider using [mbslave](https://bitbucket.org/lalinsky/mbslave).

If you want to manually set up MusicBrainz Server from source, read on!

Prerequisites
-------------

1.  A Unix based operating system

    The MusicBrainz development team uses a variety of Linux distributions, but
    Mac OS X will work just fine, if you're prepared to potentially jump through
    some hoops. If you are running Windows we recommend you set up a Ubuntu virtual
    machine.

This file has been truncated.

dns_server · May 31, 2016, 12:05am

MusicBrainz does send out emails based on subscriptions. So potentially you could parse these emails looking for items to refresh in your cache.

I have set up my own database this weekend and it was not that hard.

rhbecker · May 31, 2016, 12:36am

I get emails, but they only contain a count - e.g. 5 things have changed in collection x. Are you receiving emails with more info than that?

Hopefully I’ve not given the impression that this is an issue? However simple the initial setup may be, maintaining a full database copy only to be notified of changes to a tiny fraction of that data seems like a waste of resources, and one more thing that can break for a project that doesn’t have the resources to manage unnecessary complexity.

IMO, one of the biggest benefits of exposing data over HTTP is that you spare many clients from dealing with the complexity and cost of data management and hosting - basically, an “economies of scale” benefit.

I’m not sure whether anyone in the discussion so far is representing the viewpoint of the core team. I’d like to know whether I’m off base in thinking this is an issue we should be able to solve at an api level, whether via some yet-to-be unveiled existing functionality, or an enhancement that I may or may not be involved in producing.

mulamelamup · May 31, 2016, 1:11pm

How “tiny” is your fraction exactly?
Is it centered on releases / release groups / recordings?
How fast do you need changes? There aren’t a lot of changes in most release groups, so it makes a HUGE difference whether you can wait two months for changes to reach you or just a day.

Freso · May 31, 2016, 1:16pm

I’m not one of the developers, but it is my understanding that ws/2 is basically in maintenance mode. Smaller new features might get added (e.g., accompanying schema changes), but larger things are not likely to get done by anyone in the “core team”. You could open a ticket about it at http://tickets.musicbrainz.org though, and even if it doesn’t get scheduled for ws/2, it might get considered for ws/3 whenever that actually gets close to being a real thing (not going to be this year - and likely not next year either).

Of course, if you decided to make a PR including this functionality in ws/2 it might make it in, but the employed developers are not going to spend their paid time on this any time soon. Either way, making a ticket is a good first step.

rhbecker · May 31, 2016, 3:30pm

30-50ish composers, focused only on their solo piano compositions, and the recordings thereof, along with the releases on which those recordings appear.

I’ve been considering this question for several weeks, and have come to realize that without doing more in-depth user research, I’ll remain uncertain what will be acceptable to our user base.

I’m a bit surprised at the range you present - 2 months versus “just” a day. Would you be willing to elaborate a bit on why you chose those durations, and the relative complexity of meeting a requirement at either end of that scale?

To be honest, I’d think that in 2016, we’d be talking about 1 day as the floor, and then calculating why near real time is probably too expensive for the immediate future, but something to start working towards in small increments.

hypothetical usage scenario

Suppose, for a moment, that the api enhancement I suggested as one possible solution is deemed worth pursuing by the community. I’d expect it to be utterly non-impactful to make a single request for edits related to a collection on, say, an hourly basis. Maybe the response to such a request indicates 3 changes, but only 2 are relevant to the particular data attributes I’ve cached - leading to 2 follow-up requests to get that refreshed data, assuming the original response didn’t already contain sufficient information for me to know what to update in my local cache.

And keep in mind that these 1-3 hypothetical hourly requests are all in place of each of my users hitting the api directly, possibly numerous times per minute. If other client developers were to start adopting a similar architecture (i.e. local caches of data), we could reduce the load on the api by a few orders of magnitude.
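
The hourly cycle described above could be sketched as follows. Everything here is hypothetical: `fetch_changes` stands for the single "edits to a collection" request (which does not exist today), each change is assumed to carry an `"entity"` id, and relevance is simplified to "is this entity in my cache" rather than a per-attribute check.

```python
from typing import Callable, Iterable


def poll_once(fetch_changes: Callable[[], Iterable[dict]],
              fetch_entity: Callable[[str], dict],
              cache: dict) -> list:
    """Run one polling cycle and return the ids that were refreshed."""
    # One request for the changelog; collapse repeated edits to the same entity.
    changed = {change["entity"] for change in fetch_changes()}
    # Keep only the changes that touch something already cached locally.
    relevant = sorted(cache.keys() & changed)
    for entity_id in relevant:
        cache[entity_id] = fetch_entity(entity_id)  # one targeted follow-up request
    return relevant
```

With 3 reported changes of which 2 touch cached entities, this makes exactly 2 follow-up requests, matching the scenario above.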

rhbecker · May 31, 2016, 3:35pm

@Freso: Thanks for that info … very helpful.

Is there anywhere I can read more about a roadmap for either or both of the api and general system architecture?

Would that be more likely to lead to a discussion of the relative value to the community of the sort of functionality I’m pursuing? I can’t afford to invest a significant amount of time in producing a PR only to discover after the fact that the core team has some irreconcilable issue with the principle behind what I’ve done, or that it’s redundant with respect to some other solution already in the works (or already existent).

mulamelamup · May 31, 2016, 4:11pm

I still don’t understand the usage scenario. From what information I’ve seen in this thread so far, I can’t figure out any reason why any kind of polling would be needed. It looks more like a job for a tiny bash script or a few lines of proxy / cache configuration.

So what am I missing / what other information do we need to understand what you require? Are you trying to use this database to notify you when new recordings / remixes of works come out or something like that?

You’re probably thinking about constantly changing and interacting social networks and communication platforms and stuff like that. This is a mostly static, “semi-finite” data set. Frequent changes / updates / syncing aren’t really part of “normal” usage. Think “wikipedia for a much smaller and less complex world that doesn’t change as fast”. Or “man pages”. Argh, I can’t come up with a good example.

Freso · May 31, 2016, 4:13pm

No idea. @Bitmap?

A ticket would be where the developers would say that they don’t like the idea of the feature at all. A ticket on http://tickets.musicbrainz.org/ is a step before making a pull request on GitHub.

Also, a better place to get in touch with the developers is on the MetaBrainz channel on IRC, where you can point them to your ticket and ask if they’d be willing to merge it if you made a PR.