Trying to avoid excessive polling of API

andreivolgin · May 31, 2016, 5:34pm

It did take me 2 hours to set up a new server using the VirtualBox solution, but it took me all night to pull the latest version of the database. I had to update/upgrade a ton of things, including installing a new PostgreSQL, editing many config files, applying new schema changes, and finally pulling the latest data. I agree with you that this is not how things should work in 2016 when it comes to APIs.
There were 17,000+ edits to the database yesterday:

https://beta.musicbrainz.org/statistics

There are 16m recordings, 1.6m releases and 1m artists in the database. This gives you some perspective on the frequency of changes. I assume that many edits apply to the newly created entities, while most older entities stay unchanged for long periods of time.

docdem · May 31, 2016, 8:39pm

A commercial project, I presume?

rhbecker · June 1, 2016, 4:07am

Yeah, the addition of new recordings - both new to the world, and old recordings whose entry is new to musicbrainz - would be among the cases with respect to which I’d expect to see the most activity.

I think I get your point, though some of the numbers @andreivolgin shared later in the thread, and some of my own direct experiences with collection subscriptions seem somewhat inconsistent with your premise. Regardless, I’d suggest that frequency of change shouldn’t necessarily be directly correlated with the importance of timely notification of change.

If you’ll allow me to add to your inventory of inexact metaphors: Imagine a University creates an emergency alert system to warn folks on campus of dangerous situations. Perhaps it’s an exceedingly safe campus, and occasions for usage occur only once every few months. Surely you’d not argue that the infrequency of emergencies implies that notifications of the few that occur can lag by several months, weeks, days, or even hours.

Though we’re obviously not dealing with issues of public safety, users do typically have fairly high expectations for product quality, data integrity and freshness, etc. Though we don’t all have the resources of Google, Facebook, etc., the experiences users have with products developed by those big companies set the bar pretty high for the rest of us.

If I’m developing a product that relies on users trusting the quality of data, then of course I’d want to offer them the best data available, as close to as soon as it becomes available as is reasonably achievable. And if I’m using musicbrainz, and I tell my users I’m using musicbrainz, and they go add new data directly to musicbrainz, and it doesn’t show up in my system for several days, that reflects poorly on my product.

In any case, I feel like most of this discussion is somewhat moot. If it were exceedingly complex to support my use case, or my intended usage was going to produce excessive strain on musicbrainz resources, I’d understand the hesitance, but unless I’m really off base in my thinking, I believe the options I have in mind would reduce the strain, and at least some would be relatively simple to implement (and as alluded to elsewhere in this thread, possibly contributed by me).

If you still need further elaboration, after reading the above, ping me back and I’ll try to outline some sort of example data flow … or something. I’m fully invested in making sure I’m making myself clear!

rhbecker · June 1, 2016, 4:11am

Excellent. I’ll probably wait for discussion to die down in this thread so that the ticket I ultimately create incorporates all of the good information getting shared here, and so that my message gains clarity from all of the critical feedback being generously offered by thread participants.

rhbecker · June 1, 2016, 4:17am

@andreivolgin: Really appreciate the information and experiences you’re sharing. Thanks!

A small clarification: I hope it doesn’t sound like I’m critiquing the existing product(s). What I meant by my 2016 comment was more about how we should be thinking about the next place we want to go, and not a comment on where we presently are. Because where we presently are is a whole lot further along than I’d be on my own, without the cool stuff the *brainz community is doing.

I suspect we’re on the same page, but given that there’s already been a bit of push-back in this thread, along those lines, I wanted to take the opportunity to make my position clear.

rhbecker · June 1, 2016, 4:21am

I wish I had a more certain answer for you, but assuming that by “commercial” you mean “making or intended to make a profit”, I’d not commit to a response more specific than “maybe, but I’m not counting on it.”

I suppose I should have pretended that I’m on the brink of a rich revenue stream in order to maintain interest in this thread?

mulamelamup · June 1, 2016, 8:51am

That depends on whether you’re trying to get someone else to implement / support this.

You might really be better off just implementing it yourself and pushing a clean + small ( no changes to data structure, modifications in as few different places as possible… ) pull request. Discussions like this here can take a while and lead nowhere

That’s why I was trying to find (other) use cases - I’m not sure many people (the ones who are contributing to the database or project at least) want to go that way. Most people are probably more concerned with very different things.

[quote=“rhbecker, post:23, topic:60665”]
I believe the options I have in mind would reduce the strain, and at least some would be relatively simple to implement[/quote]
One of the main concerns here would probably be that some sort of backwards compatibility would need to be retained somehow, and a feature like that could easily be a future obstacle to improvements of the data structure.

I’d guess that kind of start would have gotten you rather a lot of negative attention, but I could be wrong.

A link to a working fork or a patch would probably have been a really good way to start

aerozol · June 1, 2016, 9:42am

[quote=“rhbecker, post:26, topic:60665”]
I suppose I should have pretended that I’m on the brink of a rich revenue stream in order to maintain interest in this thread?[/quote]
If you plan to make heavy use of the API it’s probably polite to talk to MB and get on this page at some point:
https://metabrainz.org/supporters

No pressure, but if you need MB to be flexible / change something for your convenience, especially for commercial usage, I think some sort of exchange (financial or otherwise) would be expected.

Apart from that I don’t think you would be treated differently if you said you were sitting on a goldmine - at least I hope not

andreivolgin · June 1, 2016, 10:15am

@mulamelamup, @aerozol

I think you are missing the key point of the original poster: the idea is to reduce API usage, which benefits everyone - core team, editors and other API consumers. Right now the only way to update data is to (a) pull all edits, whether you need them or not, or (b) hit the API very often. This is inefficient on both the MB and consumer sides, but large consumers can deal with it. It does represent a major hurdle for small and individual consumers.
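
To put rough numbers on option (b), here is a back-of-the-envelope sketch. The only figure taken from this thread is the 17,000 edits per day; the size of the watched subset and the polling frequency are made-up assumptions for illustration, not measurements.

```python
# Back-of-the-envelope comparison of polling volume against actual change volume.
# Only EDITS_PER_DAY comes from the thread; the other two numbers are hypothetical.

EDITS_PER_DAY = 17_000           # "17,000+ edits to the database yesterday"
WATCHED_ENTITIES = 5_000         # hypothetical: entities a small consumer cares about
POLLS_PER_ENTITY_PER_DAY = 24    # hypothetical: re-check every entity once an hour


def polling_requests_per_day(watched_entities, polls_per_entity_per_day):
    """Requests needed per day if every watched entity is re-fetched on a schedule."""
    return watched_entities * polls_per_entity_per_day


daily_requests = polling_requests_per_day(WATCHED_ENTITIES, POLLS_PER_ENTITY_PER_DAY)
ratio = round(daily_requests / EDITS_PER_DAY, 1)

print(daily_requests)  # 120000
print(ratio)           # 7.1
```

Under these assumed numbers, a single small consumer would send about seven times more requests per day than there are edits in the whole database, and most of those requests would return unchanged data.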

I assume, of course, that MB is interested in growing its eco-system and influence, and therefore takes notice of users’ concerns. Regardless, a more efficient mechanism for edit notifications will benefit MB itself.

Also, it’s not easy to create a pull request for a project that represents 13 years’ worth of code and a very complex data model. Even a very experienced developer will hesitate to jump in because of the steep learning curve and a major time commitment even for a relatively simple new feature. In my own projects I can implement such a feature in a few hours. In a different project using a different set of technologies, it may easily take 10 times longer.

aerozol · June 1, 2016, 10:24am

I actually think you’re missing my point, all I’m doing is answering a specific question he posed, and I’ve previously stated that feedback and criticism are welcome and encouraged.

Being aware of how MB works and what you can expect surely can’t hurt, and that’s what I’m trying to provide… apologies if it comes across as “so do it yourself” and “so pay for it” respectively, that’s not the intention

docdem · June 1, 2016, 10:48am

I was trying to assess what you intend to bring to the table. Up to now I see precious little meat, much talk, and a little snark here and there.

CallerNo6 · June 1, 2016, 6:38pm

It’d be great to find ways to make the data easier to use for people who are working with small subsets.

As a (hopefully!) temporary workaround, I wonder how hard it would be to make a scraper bot that reads and json-ifies /collections/{id}/edits
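
A minimal sketch of what such a bot's parsing step might look like, using only the Python standard library. It assumes that the edits page links to each edit with an `href` ending in `/edit/<number>`; that assumption, the `EditLinkParser` and `edits_page_to_json` names, and the sample HTML are all hypothetical, and the real page markup may differ. Fetching the page is left out on purpose.

```python
import json
import re
from html.parser import HTMLParser

# Assumption: each edit is linked with an href ending in /edit/<number>.
EDIT_HREF = re.compile(r"/edit/(\d+)$")


class EditLinkParser(HTMLParser):
    """Collect unique edit ids from anchor tags, in order of first appearance."""

    def __init__(self):
        super().__init__()
        self.edit_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = EDIT_HREF.search(href)
        if match:
            edit_id = int(match.group(1))
            if edit_id not in self.edit_ids:
                self.edit_ids.append(edit_id)


def edits_page_to_json(html):
    """Turn the HTML of an edits page into a small JSON document."""
    parser = EditLinkParser()
    parser.feed(html)
    return json.dumps({"edit_ids": parser.edit_ids})


# Hypothetical markup standing in for a downloaded edits page.
SAMPLE_HTML = """
<div class="edit-list">
  <a href="/edit/101">Edit #101</a>
  <a href="https://example.org/edit/102">Edit #102</a>
  <a href="/edit/101">duplicate link</a>
  <a href="/artist/abc">not an edit</a>
</div>
"""

print(edits_page_to_json(SAMPLE_HTML))  # {"edit_ids": [101, 102]}
```

Because it keys on link targets rather than page layout, this would survive some markup changes, but it still depends on a URL pattern that no API contract guarantees.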

In any case, thanks for being a conscientious data consumer

rhbecker · June 1, 2016, 11:35pm

To those who responded to this (@mulamelamup, @aerozol, @docdem): Apologies for wasting your time, as it was supposed to be a joke … I’ll try to make the next one actually funny.

rhbecker · June 1, 2016, 11:37pm

Though I agree that these conversations can be painful, I think they are a necessary evil and that jumping into code without first presenting the issue, confirming it’s actually an issue, and confirming which is the preferred approach to solving the issue is a much more significant waste of everyone’s time.

Please keep in mind that the core musicbrainz team are not the only ones with limited time and resources. @andreivolgin provided a good explanation of some of the expenses of jumping into code (3rd paragraph of his most recent post prior to this one).

Very true, and I didn’t mean to imply everyone should care about what I’m saying. But, for those who do care, if a solution is to be implemented, we’d certainly want to think about our options and select the one that returns the most value, per the time we invest. I can’t see that being a very controversial point.

Understood - thanks!

rhbecker · June 2, 2016, 12:01am

Well, part of the point of this whole thread is that I’m trying to reduce my quantity of API requests to a negligible number …

… but I get your point.

Like I’ve said, I’m still at proof-of-concept for this particular project - part of which is determining whether musicbrainz is the right system to build upon. Once I know what’s possible and what the expenses will be, we’ll start thinking about whether it’s worth the time investment, whether we can come up with a revenue model, whether we’re merely doing it for the love of it, etc.

My assumption is that if there is profit, there will be money exchange. And if profit doesn’t happen, we can at least contribute code, and certainly data (I’ve been actively contributing data content for about 1.5 years now).

Not at all … all information is welcome!

rhbecker · June 2, 2016, 12:04am

Can you elaborate on where you’d like to see more meat?

Do you understand the problem I presented in post #1? the option I presented in post #7? the usage scenario I offered in post #17? Some other information you’d expect to see from me?

As I said in an earlier post …

rhbecker · June 2, 2016, 12:06am

That’s a great way of presenting the “theme” of my inquiries. Who knows what new clients we might enable to do great things if we are able to lower the barrier to entry, or more gracefully support use cases that were not previously feasible.

In my experience (and I think the experience of many others), the problem with screen scraping is that, unlike an API, there’s no “contract” to rely on with respect to a UI, which means the assumptions of structure that you build your script against are fragile. The expense of writing the initial script, along with the likelihood of breakage and subsequent fixes, ends up costing you more than it would have cost to enhance the API.

As far as I’m concerned, all options are on the table, and I’ll at least look at what it would require if other options don’t seem feasible, but that would be towards the “last resort” end of the spectrum, and maybe even to the extent that we’d determine our project to be unsustainable.

rhbecker · June 2, 2016, 12:11am

Yes, yes, and yes. I’m glad you’re here. I could probably be doing a better job of explaining myself, but your responses assure me that my posts aren’t completely unintelligible.

rhbecker · June 9, 2016, 6:12am

I finally got around to creating a ticket for this:
http://tickets.musicbrainz.org/browse/MBS-8975

Wish me luck! Or, if you hated this idea, root against me!

Either way, thanks again for participating in the discussion.