Trying to avoid excessive polling of api



Cool, thanks for sharing @andreivolgin. If I do end up needing to contribute back code, as @aerozol suggested, I’ll inevitably need to take that plunge to test my work. May I ask which route you went with to set it up? I see in the other thread that you’re likely using this method:


MusicBrainz does send out emails based on subscriptions, so potentially you could parse these emails looking for items to refresh in your cache.

I have set up my own database this weekend and it was not that hard.


I get emails, but they only contain a count - e.g. 5 things have changed in collection x. Are you receiving emails with more info than that?

Hopefully I’ve not given the impression that this is an issue? However simple the initial setup may be, maintaining a full database copy only to be notified of changes to a tiny fraction of that data seems like a waste of resources, and one more thing that can break for a project that doesn’t have the resources to manage unnecessary complexity.

IMO, one of the biggest benefits of exposing data over HTTP is that you spare many clients from dealing with the complexity and cost of data management and hosting - basically, an “economies of scale” benefit.

I’m not sure whether anyone in the discussion so far is representing the viewpoint of the core team. I’d like to know whether I’m off base in thinking this is an issue we should be able to solve at an API level, whether via some yet-to-be-unveiled existing functionality, or via an enhancement that I may or may not be involved in producing.


How “tiny” is your fraction exactly?
Is it centered on releases / release groups / recordings?
How fast do you need changes? There aren’t a lot of changes in most release groups, so it makes a HUGE difference whether you can wait two months for changes to reach you or just a day.


I’m not one of the developers, but it is my understanding that ws/2 is basically in maintenance mode. Smaller new features might get added (e.g., accompanying schema changes), but larger things are not likely to get done by anyone in the “core team”. You could open a ticket about it, though, and even if it doesn’t get scheduled for ws/2 it might get considered for ws/3 whenever that actually gets close to being a real thing (not going to be this year - and likely not next year either).

Of course, if you decided to make a PR including this functionality in ws/2 it might make it in, but the employed developers are not going to spend their paid time on this any time soon. Either way, making a ticket is a good first step.


30-50ish composers, focused only on their solo piano compositions, and the recordings thereof, along with the releases on which those recordings appear.

I’ve been considering this question for several weeks, and have come to realize that without doing more in-depth user research, I’ll remain uncertain what will be acceptable to our user base.

I’m a bit surprised at the range you present - 2 months versus “just” a day. Would you be willing to elaborate a bit on why you chose those durations, and the relative complexity of meeting a requirement at either end of that scale?

To be honest, I’d think that in 2016, we’d be talking about 1 day as the floor, and then calculating why near real time is probably too expensive for the immediate future, but something to start working towards in small increments.

hypothetical usage scenario

Suppose, for a moment, that the api enhancement I suggested as one possible solution is deemed worth pursuing by the community. I’d expect it to be utterly non-impactful to make a single request for edits related to a collection on, say, an hourly basis. Maybe the response to such a request indicates 3 changes, but only 2 are relevant to the particular data attributes I’ve cached - leading to 2 follow-up requests to get that refreshed data, assuming the original response didn’t already contain sufficient information for me to know what to update in my local cache.

And keep in mind that these 1-3 hypothetical hourly requests are all in place of each of my users hitting the api directly, possibly numerous times per minute. If other client developers were to start adopting a similar architecture (i.e. local caches of data), we could reduce the load on the api by a few orders of magnitude.
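The scenario above can be sketched in a few lines. To be clear, this is purely illustrative: the "edits for a collection" endpoint, the response shape, and the field names (`entity_id`, `changed_fields`) are all assumptions about an API enhancement that does not exist yet.

```python
# Hypothetical hourly poll: given edit events from an imagined
# "changes to collection X" response, decide which entities need a
# follow-up fetch because they touch attributes we actually cache.

def plan_refreshes(edit_events, cached_attributes):
    """Return entity IDs whose changed fields overlap our cached fields."""
    to_refresh = []
    for event in edit_events:
        # Only changes that overlap with our cached attributes matter.
        if cached_attributes & set(event["changed_fields"]):
            to_refresh.append(event["entity_id"])
    return to_refresh

# Example matching the scenario: 3 reported changes, but only 2 touch
# attributes we cache, so only 2 follow-up requests would be made.
events = [
    {"entity_id": "rec-1", "changed_fields": ["title"]},
    {"entity_id": "rec-2", "changed_fields": ["disambiguation"]},
    {"entity_id": "rel-3", "changed_fields": ["date", "title"]},
]
cached = {"title", "date", "artist-credit"}
print(plan_refreshes(events, cached))  # ['rec-1', 'rel-3']
```

So one poll plus at most a couple of targeted follow-ups per hour, instead of every end user hitting the live API directly.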


@Freso: Thanks for that info … very helpful.

Is there anywhere I can read more about a roadmap for either or both of the api and general system architecture?

Would that be more likely to lead to a discussion of the relative value to the community of the sort of functionality I’m pursuing? I can’t afford to invest a significant amount of time in producing a PR only to discover after the fact that the core team has some irreconcilable issue with the principle behind what I’ve done, or that it’s redundant with respect to some other solution already in the works (or already existent).


I still don’t understand the usage scenario. From what information I’ve seen in this thread so far, I can’t figure out any reason why any kind of polling would be needed. It looks more like a job for a tiny bash script or a few lines of proxy / cache configuration.
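For what it’s worth, the “tiny script” approach might look something like this: take a snapshot of the collection, hash it, and only re-sync when the hash changes. This is a sketch, not a working MusicBrainz client - the payloads here are stand-ins for whatever a real ws/2 request would return.

```python
# Change detection for a cached collection snapshot: hash the payload
# and compare against the digest stored from the previous run.
import hashlib
import json

def snapshot_digest(payload):
    """Stable hash of a JSON-serializable payload."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def has_changed(previous_digest, payload):
    return snapshot_digest(payload) != previous_digest

# Usage: on each run, compare against the stored digest.
old = snapshot_digest({"releases": ["a", "b"]})
print(has_changed(old, {"releases": ["a", "b"]}))       # False -> skip re-sync
print(has_changed(old, {"releases": ["a", "b", "c"]}))  # True  -> refresh cache
```

Of course this still requires fetching the full collection each run to know whether anything changed, which is exactly the inefficiency being discussed.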

So what am I missing / what other information do we need to understand what you require? Are you trying to use this database to notify you when new recordings / remixes of works come out or something like that?

You’re probably thinking about constantly changing and interacting social networks and communication platforms and stuff like that. This is a mostly static, “semi-finite” data set. Frequent changes / updates / syncing aren’t really part of “normal” usage. Think “wikipedia for a much smaller and less complex world that doesn’t change as fast”. Or “man pages”. Argh, I can’t come up with a good example.


No idea. @Bitmap?

A ticket would be where the developers would say that they don’t like the idea of the feature at all. Making a ticket is a step before making a pull request on GitHub.

Also, a better place to get in touch with the developers is on the #metabrainz channel on IRC, where you can point them to your ticket and ask if they’d be willing to merge it if you made a PR.



  1. It did take me 2 hours to set up a new server using the VirtualBox solution, but it took me all night to pull the latest version of the database. I had to update/upgrade a ton of things, including installing a new PostgreSQL, editing many config files, applying new schema changes, and finally pulling the latest data. I agree with you that this is not how things should work in 2016 when it comes to APIs.

  2. There were 17,000+ edits to the database yesterday:

There are 16m recordings, 1.6m releases, and 1m artists in the database. This gives you some perspective on the frequency of changes. I assume that many edits apply to newly created entities, while most older entities stay unchanged for long periods of time.
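A quick back-of-envelope using those figures, assuming (almost certainly incorrectly) that edits are spread evenly across all entities - it only illustrates the order of magnitude:

```python
# Rough daily change rate from the numbers quoted above.
edits_per_day = 17_000
entities = 16_000_000 + 1_600_000 + 1_000_000  # recordings + releases + artists

daily_change_rate = edits_per_day / entities
print(f"{daily_change_rate:.4%} of entities touched per day")
# -> 0.0914% of entities touched per day
```

Under 0.1% per day, i.e. a given entity changes roughly once every ~3 years on average - consistent with most older entities staying unchanged for long stretches.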


A commercial project, I presume?


Yeah, the addition of new recordings - both new to the world, and old recordings whose entry is new to musicbrainz - would be among the cases with respect to which I’d expect to see the most activity.

I think I get your point, though some of the numbers @andreivolgin shared later in the thread, and some of my own direct experiences with collection subscriptions seem somewhat inconsistent with your premise. Regardless, I’d suggest that frequency of change shouldn’t necessarily be directly correlated with the importance of timely notification of change.

If you’ll allow me to add to your inventory of inexact metaphors: Imagine a University creates an emergency alert system to warn folks on campus of dangerous situations. Perhaps it’s an exceedingly safe campus, and occasions for usage occur only once every few months. Surely you’d not argue that the infrequency of emergencies implies that notifications of the few that occur can lag by several months, weeks, days, or even hours.

Though we’re obviously not dealing with issues of public safety, users do typically have fairly high expectations for product quality, data integrity and freshness, etc. Though we don’t all have the resources of Google, Facebook, etc., the experiences users have with products developed by those big companies set the bar pretty high for the rest of us.

If I’m developing a product that relies on users trusting the quality of its data, then of course I’d want to offer them the best data available, as close to when it becomes available as is reasonably achievable. And if I’m using MusicBrainz, and I tell my users I’m using MusicBrainz, and they go add new data directly to MusicBrainz, and it doesn’t show up in my system for several days, that reflects poorly on my product.

In any case, I feel like most of this discussion is somewhat moot. If it were exceedingly complex to support my use case, or if my intended usage were going to put excessive strain on MusicBrainz resources, I’d understand the hesitance. But unless I’m really off base in my thinking, I believe the options I have in mind would reduce the strain, and at least some would be relatively simple to implement (and, as alluded to elsewhere in this thread, possibly contributed by me).

If you still need further elaboration, after reading the above, ping me back and I’ll try to outline some sort of example data flow … or something. I’m fully invested in making sure I’m making myself clear!


Excellent. I’ll probably wait for discussion to die down in this thread so that the ticket I ultimately create incorporates all of the good information getting shared here, and so that my message gains clarity from all of the critical feedback being generously offered by thread participants.


@andreivolgin: Really appreciate the information and experiences you’re sharing. Thanks!

A small clarification: I hope it doesn’t sound like I’m critiquing the existing product(s). What I meant by my 2016 comment was more about how we should be thinking about the next place we want to go, and not a comment on where we presently are. Because where we presently are is a whole lot further along than I’d be on my own, without the cool stuff the *brainz community is doing.

I suspect we’re on the same page, but given that there’s already been a bit of push-back in this thread, along those lines, I wanted to take the opportunity to make my position clear.


I wish I had a more certain answer for you, but assuming that by “commercial” you mean “making or intended to make a profit”, I’d not commit to a response more specific than “maybe, but I’m not counting on it.”

I suppose I should have pretended that I’m on the brink of a rich revenue stream in order to maintain interest in this thread?


That depends on whether you’re trying to get someone else to implement / support this. :wink:

You might really be better off just implementing it yourself and pushing a clean + small (no changes to data structures, modifications in as few different places as possible…) pull request. Discussions like this can take a while and lead nowhere :wink:

That’s why I was trying to find (other) use cases - I’m not sure many people (at least the ones contributing to the database or project) want to go that way. Most people are probably more concerned with very different things.

[quote=“rhbecker, post:23, topic:60665”]
I believe the options I have in mind would reduce the strain, and at least some would be relatively simple to implement[/quote]
One of the main concerns here would probably be that some sort of backwards compatibility would need to be retained somehow, and a feature like that could easily become a future obstacle to improvements of the data structure.

I’d guess that kind of start would have gotten you rather a lot of negative attention, but I could be wrong.

A link to a working fork or a patch would probably have been a really good way to start :smiley:


[quote=“rhbecker, post:26, topic:60665”]
I suppose I should have pretended that I’m on the brink of a rich revenue stream in order to maintain interest in this thread?[/quote]
If you plan to make heavy use of the API it’s probably polite to talk to MB and get on this page at some point:

No pressure, but if you need MB to be flexible / change something for your convenience, especially for commercial usage, I think some sort of exchange (financial or otherwise) would be expected.

Apart from that I don’t think you would be treated differently if you said you were sitting on a goldmine - at least I hope not :stuck_out_tongue:


@mulamelamup, @aerozol

I think you are missing the key point of the original poster: the idea is to reduce API usage, which benefits everyone - the core team, editors, and other API consumers. Right now the only way to update data is to (a) pull all edits, whether you need them or not, or (b) hit the API very often. This is inefficient on both the MB and consumer sides, but large consumers can deal with it. It represents a major hurdle for small and individual consumers.

I assume, of course, that MB is interested in growing its ecosystem and influence, and therefore takes note of users’ concerns. Regardless, a more efficient mechanism for edit notifications would benefit MB itself.

Also, it’s not easy to create a pull request for a project that represents 13 years’ worth of code and a very complex data model. Even a very experienced developer will hesitate to jump in because of the steep learning curve and the major time commitment, even for a relatively simple new feature. In my own projects I can implement such a feature in a few hours. In a different project, using a different set of technologies, it may easily take 10 times longer.


I actually think you’re missing my point, all I’m doing is answering a specific question he posed, and I’ve previously stated that feedback and criticism are welcome and encouraged.

Being aware of how MB works and what you can expect surely can’t hurt, and that’s what I’m trying to provide… apologies if it comes across as “so do it yourself” and “so pay for it” respectively, that’s not the intention :wink:


I was trying to assess what you intend to bring to the table. Up to now I see precious little meat, much talk, and a little snark here and there.