Trying to avoid excessive polling of api

Upon further review of the API documentation, one option could be the introduction of an edit resource.

Each edit appears to already be a discretely identifiable resource - e.g. http://musicbrainz.org/edit/37148588. And, based on the collection UIs I’ve already referenced, it seems like all of the data is properly hooked up in the schema to associate edits with collections.

I could then presumably execute a “browse” request for edits related to a collection, just like I can execute a “browse” request for works related to a collection …

already supported: works in a collection

http://musicbrainz.org/ws/2/work?collection=19d9574a-22ff-4bbf-919b-39ea379f854a

proposed: edits to a collection

http://musicbrainz.org/ws/2/edit?collection=19d9574a-22ff-4bbf-919b-39ea379f854a
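To make the comparison concrete, here is a minimal Python sketch of how the two browse URLs could be built. The `work` browse is the request that is already supported; the `edit` resource is only my proposal and, as far as I can tell, does not exist in ws/2. The helper name `browse_url` and the `fmt`, `limit` and `offset` values are my own illustrative choices.

```python
from urllib.parse import urlencode

WS2_ROOT = "http://musicbrainz.org/ws/2"
COLLECTION = "19d9574a-22ff-4bbf-919b-39ea379f854a"


def browse_url(entity, collection_mbid, fmt="json", limit=100, offset=0):
    """Build a ws/2 browse URL for entities linked to a collection.

    `browse_url` is a hypothetical helper, not part of any MusicBrainz client.
    """
    query = urlencode(
        {"collection": collection_mbid, "fmt": fmt, "limit": limit, "offset": offset}
    )
    return f"{WS2_ROOT}/{entity}?{query}"


# Already supported: works in a collection.
works_url = browse_url("work", COLLECTION)

# Proposed only: edits to a collection (assumed resource, not in ws/2 today).
edits_url = browse_url("edit", COLLECTION)

print(works_url)
print(edits_url)
```

The point is that a client would only need to swap the entity name; everything else about the request would stay the same.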

The technical issues here are way over my head, sorry! But a (not very helpful) note -

MB is open source and non-profit. If there are any seemingly obvious or basic elements that you expect to be here, they can be added by you or someone you know.
Criticism and feedback are always welcome, but, unlike with other companies, you can’t really throw your hands into the air and say “why does MB not have this!”. There’s not really anybody or anything to be “disappointed” in that doesn’t include yourself. Makes life much harder :frowning:

Again, I don’t want to silence your criticism and feedback, people are paying attention! But it’s important to keep things in perspective :slight_smile:
Cheers

@aerozol: Understood. I’m not sure which aspects of my posts seem overly critical from your perspective. Perhaps …

To clarify, my disappointment in that scenario is purely hypothetical, as I’m fairly certain there are more direct ways to accomplish what I’m after. So, if that ends up as my “best bet”, that would seemingly indicate a more fundamental problem, such as an unanticipated architectural obstacle, some major flaw in schema design, or disinterest from the community in improving the api in a way that could have a positive impact on server load. Given that I don’t expect any of that to happen, any of that happening would be disappointing.

Also, I recognize this is a possibility. Part of the point of my post was to determine whether there may be some existing functionality that I haven’t yet discovered that would satisfy my needs. Just as the core team has to be cautious with how they invest their time, I too need to ensure I’m not reinventing the wheel.

@rhbecker - I just went through the exercise of setting up my own copy of MB server. It took 2 hours. I confirmed that I can run the same JSON API calls against my own copy. This is a game-changer for me, as I can now run jobs 10-30 times faster compared to using a public API based on my initial tests.

I understand that this solution may be overkill for your needs, but for me these 2 hours were definitely a good investment.

Cool, thanks for sharing @andreivolgin. If I do end up needing to contribute back code, as @aerozol suggested, I’ll inevitably need to take that plunge to test my work. May I ask which route you went with to set it up? I see in the other thread that you’re likely using this method:

Musicbrainz does send out email based on subscriptions. So potentially you could parse these emails looking for items to refresh from your cache.

I have set up my own database this weekend and it was not that hard.

I get emails, but they only contain a count - e.g. 5 things have changed in collection x. Are you receiving emails with more info than that?

Hopefully I’ve not given the impression that this is an issue? However simple the initial setup may be, maintaining a full database copy only to be notified of changes to a tiny fraction of that data seems like a waste of resources, and one more thing that can break for a project that doesn’t have the resources to manage unnecessary complexity.

IMO, one of the biggest benefits of exposing data over HTTP is that you spare many clients from dealing with the complexity and cost of data management and hosting - basically, an “economies of scale” benefit.

I’m not sure whether anyone in the discussion so far is representing the viewpoint of the core team. I’d like to know whether I’m off base in thinking this is an issue we should be able to solve at an api level, whether via some yet-to-be unveiled existing functionality, or an enhancement that I may or may not be involved in producing.

How “tiny” is your fraction exactly?
Is it centered on releases / release groups / recordings?
How fast do you need changes? There aren’t a lot of changes in most release groups, so it makes a HUGE difference whether you can wait two months for changes to reach you or just a day.

I’m not one of the developers, but it is my understanding that ws/2 is basically in maintenance mode. Smaller new features might get added (e.g., accompanying schema changes), but larger things are not likely to get done by anyone in the “core team”. You could open a ticket about it at http://tickets.musicbrainz.org though, and even if it doesn’t get scheduled for ws/2 it might get considered for ws/3 whenever that actually gets close to being a real thing (not going to be this year - and likely not next year either).

Of course, if you decided to make a PR including this functionality in ws/2 it might make it in, but the employed developers are not going to spend their paid time on this any time soon. Either way, making a ticket is a good first step.

30-50ish composers, focused only on their solo piano compositions, and the recordings thereof, along with the releases on which those recordings appear.

I’ve been considering this question for several weeks, and have come to realize that without doing more in-depth user research, I’ll remain uncertain what will be acceptable to our user base.

I’m a bit surprised at the range you present - 2 months versus “just” a day. Would you be willing to elaborate a bit on why you chose those durations, and the relative complexity of meeting a requirement at either end of that scale?

To be honest, I’d think that in 2016, we’d be talking about 1 day as the floor, and then calculating why near real time is probably too expensive for the immediate future, but something to start working towards in small increments.

hypothetical usage scenario

Suppose, for a moment, that the api enhancement I suggested as one possible solution is deemed worth pursuing by the community. I’d expect it to be utterly non-impactful to make a single request for edits related to a collection on, say, an hourly basis. Maybe the response to such a request indicates 3 changes, but only 2 are relevant to the particular data attributes I’ve cached - leading to 2 follow-up requests to get that refreshed data, assuming the original response didn’t already contain sufficient information for me to know what to update in my local cache.

And keep in mind that these 1-3 hypothetical hourly requests are all in place of each of my users hitting the api directly, possibly numerous times per minute. If other client developers were to start adopting a similar architecture (i.e. local caches of data), we could reduce the load on the api by a few orders of magnitude.
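A rough Python sketch of that hourly flow, under loud assumptions: the edit browse request does not exist, so the shape of each edit entry (`entity_type`, `entity_mbid`) is invented for illustration, and `refresh_cache` and `fake_fetch` are hypothetical names. The fetcher is passed in so the logic can be tried without touching the real api.

```python
from typing import Callable, Dict, Iterable, List

# Assumed shape of one entry from the proposed (non-existent) edit browse
# request: {"id": ..., "entity_type": ..., "entity_mbid": ...}
Edit = Dict[str, str]


def refresh_cache(
    cache: Dict[str, dict],
    edits: Iterable[Edit],
    fetch_entity: Callable[[str, str], dict],
) -> List[str]:
    """Refetch only the cached entities touched by edits.

    Returns the MBIDs that were refreshed, in the order they were seen.
    """
    refreshed: List[str] = []
    for edit in edits:
        mbid = edit["entity_mbid"]
        # Skip edits to entities we never cached, and avoid refetching twice.
        if mbid in cache and mbid not in refreshed:
            cache[mbid] = fetch_entity(edit["entity_type"], mbid)
            refreshed.append(mbid)
    return refreshed


# Stand-in for one hourly run: 3 edits reported, only 2 touch cached data.
cache = {"work-1": {"title": "old"}, "recording-1": {"title": "old"}}
edits = [
    {"id": "e1", "entity_type": "work", "entity_mbid": "work-1"},
    {"id": "e2", "entity_type": "recording", "entity_mbid": "recording-1"},
    {"id": "e3", "entity_type": "release", "entity_mbid": "release-9"},
]
calls = []


def fake_fetch(entity_type: str, mbid: str) -> dict:
    """Placeholder for a real lookup request; records each call."""
    calls.append((entity_type, mbid))
    return {"title": "fresh"}


refreshed = refresh_cache(cache, edits, fake_fetch)
print(refreshed, len(calls))
```

In this stand-in run the single "what changed?" request leads to two follow-up lookups, which matches the 1-3 requests per hour described above.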

@Freso: Thanks for that info … very helpful.

Is there anywhere I can read more about a roadmap for either or both of the api and general system architecture?

Would that be more likely to lead to a discussion of the relative value to the community of the sort of functionality I’m pursuing? I can’t afford to invest a significant amount of time in producing a PR only to discover after the fact that the core team has some irreconcilable issue with the principle behind what I’ve done, or that it’s redundant with respect to some other solution already in the works (or already existent).

I still don’t understand the usage scenario. From what information I’ve seen in this thread so far, I can’t figure out any reason why any kind of polling would be needed. It looks more like a job for a tiny bash script or a few lines of proxy / cache configuration.

So what am I missing / what other information do we need to understand what you require? Are you trying to use this database to notify you when new recordings / remixes of works come out or something like that?

You’re probably thinking about constantly changing and interacting social networks and communication platforms and stuff like that. This is a mostly static, “semi-finite” data set. Frequent changes / updates / syncing aren’t really part of “normal” usage. Think “wikipedia for a much smaller and less complex world that doesn’t change as fast”. Or “man pages”. Argh, I can’t come up with a good example.

No idea. @Bitmap?

A ticket is where the developers would say so if they don’t like the idea of the feature at all. A ticket on http://tickets.musicbrainz.org/ is a step before making a pull request on GitHub.

Also, a better place to get in touch with the developers is on the MetaBrainz channel on IRC, where you can point them to your ticket and ask if they’d be willing to merge it if you made a PR.

@rhbecker

  1. It did take me 2 hours to set up a new server using the VirtualBox solution, but it took me all night to pull the latest version of the database. I had to update/upgrade a ton of things, including installing a new PostgreSQL, editing many config files, applying new schema changes, and finally pulling the latest data. I agree with you that this is not how things should work in 2016 when it comes to APIs.

  2. There were 17,000+ edits to the database yesterday:

https://beta.musicbrainz.org/statistics

There are 16m recordings, 1.6m releases and 1m artists in the database. This gives you some perspective on the frequency of changes. I assume that many edits apply to the newly created entities, while most older entities stay unchanged for long periods of time.
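As a back-of-the-envelope check on those figures, here is a small Python calculation. It assumes, unrealistically, that each edit touches exactly one entity and that edits are spread evenly across recordings, releases and artists; since edits likely cluster on new entities, the true figure for older entries would be lower still.

```python
# Rounded figures quoted above from the statistics page.
edits_per_day = 17_000
entities = 16_000_000 + 1_600_000 + 1_000_000  # recordings + releases + artists

# Share of entities edited on a given day, under the even-spread assumption.
daily_fraction = edits_per_day / entities

# Average number of days between edits to any one entity.
days_between_edits = entities / edits_per_day

print(f"{daily_fraction:.4%} of entities edited per day")
print(f"about {days_between_edits:.0f} days between edits per entity")
```

Under those assumptions, roughly 0.09% of entities change on a given day, or about one edit per entity every three years.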

A commercial project, I presume?

Yeah, the addition of new recordings - both new to the world, and old recordings whose entry is new to musicbrainz - would be among the cases with respect to which I’d expect to see the most activity.

I think I get your point, though some of the numbers @andreivolgin shared later in the thread, and some of my own direct experiences with collection subscriptions seem somewhat inconsistent with your premise. Regardless, I’d suggest that frequency of change shouldn’t necessarily be directly correlated with the importance of timely notification of change.

If you’ll allow me to add to your inventory of inexact metaphors: Imagine a University creates an emergency alert system to warn folks on campus of dangerous situations. Perhaps it’s an exceedingly safe campus, and occasions for usage occur only once every few months. Surely you’d not argue that the infrequency of emergencies implies that notifications of the few that occur can lag by several months, weeks, days, or even hours.

Though we’re obviously not dealing with issues of public safety, users do typically have fairly high expectations for product quality, data integrity and freshness, etc. Though we don’t all have the resources of Google, Facebook, etc., the experiences users have with products developed by those big companies set the bar pretty high for the rest of us.

If I’m developing a product that relies on users trusting the quality of data, then of course I’d want to offer them the best data available, as close to as soon as it becomes available as is reasonably achievable. And if I’m using musicbrainz, and I tell my users I’m using musicbrainz, and they go add new data directly to musicbrainz, and it doesn’t show up in my system for several days, that reflects poorly on my product.

In any case, I feel like most of this discussion is somewhat moot. If it were exceedingly complex to support my use case, or my intended usage was going to produce excessive strain on musicbrainz resources, I’d understand the hesitance, but unless I’m really off base in my thinking, I believe the options I have in mind would reduce the strain, and at least some would be relatively simple to implement (and as alluded to elsewhere in this thread, possibly contributed by me).

If you still need further elaboration, after reading the above, ping me back and I’ll try to outline some sort of example data flow … or something. I’m fully invested in making sure I’m making myself clear!

Excellent. I’ll probably wait for discussion to die down in this thread so that the ticket I ultimately create incorporates all of the good information getting shared here, and so that my message gains clarity from all of the critical feedback being generously offered by thread participants.

@andreivolgin: Really appreciate the information and experiences you’re sharing. Thanks!

A small clarification: I hope it doesn’t sound like I’m critiquing the existing product(s). What I meant by my 2016 comment was more about how we should be thinking about the next place we want to go, and not a comment on where we presently are. Because where we presently are is a whole lot further along than I’d be on my own, without the cool stuff the *brainz community is doing.

I suspect we’re on the same page, but given that there’s already been a bit of push-back in this thread, along those lines, I wanted to take the opportunity to make my position clear.

I wish I had a more certain answer for you, but assuming that by “commercial” you mean “making or intended to make a profit”, I’d not commit to a response more specific than “maybe, but I’m not counting on it.”

I suppose I should have pretended that I’m on the brink of a rich revenue stream in order to maintain interest in this thread?