I’m inclined to say - yes, definitely, such things should be allowed. If someone wants to add metadata to VideoBrainz about their kid making weird sounds and banging on a pot, why should we not allow it? What harm does it do? One response to “What harm does it do?” is that it may clutter the database with data that most people find irrelevant and/or annoying, thus hindering the usefulness and appeal of VideoBrainz. But I think that is resolvable by establishing good categorization and attribution policies together with good search/filter capabilities that employ those category and attribution data.
Exactly. One of the key phrases in the Purpose statement is: “about every video ever produced”. Perhaps that should be changed to: “about every video ever published”?
I need to walk that back a bit. For movie buffs, and TV buffs too I suppose, titles are advertised well in advance of actual publication. It seems to me that collecting data about yet-to-be-released videos is something we want to support as well. So perhaps the more precise definition is: “every video ever published or with a reasonable expectation of publication”. And publication simply means that the video is made available to a reasonably sized audience/market. Essentially, all except “personal” videos shared with a small circle of friends.
But again I have to ask - what is the harm of even including “personal” videos? I’m not saying that we should go out of our way to support such. But should we go out of our way to exclude them?
I’d like to examine that point a bit more closely. Assume that we can ensure videos are properly categorized and attributed, so that search/filter mechanisms work well. Do the videos entered into the database need to be accessible? Do they need to physically exist? I certainly don’t think we should support things that are purely fictional. But what about old movies that have been lost? What about videos that are not publicly available, but are known about publicly? I think I’m convincing myself that “every video ever produced” is the right phrase. I’m certain the vast majority of entries will be about published items. But why should we exclude videos that have not been published? What is the harm in allowing them?
I think that the categories we devise will have very precise qualification standards. To be a “Movie”, the video must meet the precise definition of a movie. But that doesn’t mean that we can’t have a category: “Other”; which simply means - it doesn’t fit any of the other categories; but it is a video.
I agree that making a GitHub organisation is a good idea. I think the next logical step would be to try to arrange a meeting (or series of meetings) on IRC involving the key players and try to work out your ontology. You could use Doodle to find a time that works well for everybody.
Having “Artist”, “Creator”, “Publisher” and “Label” entities isn’t really optimal. These entity names refer to roles in a relationship with content. Ideally from my point of view we would instead have “Person” and “Collective”, which could then be used for all four of the above, with MBIDs shared between the different Brainz. Similarly, all works could be made database-independent.
I think once you’ve got your ontology set out, then you should create an SQL schema, like we have in bookbrainz-sql. This is fairly technology independent, and would help get everybody on the same page. I’d suggest using the same vote-less, revertible revision system as BB, because the MB editing system is moving towards that with more and more auto-edits.
I like the idea of being able to record metadata for any YouTube video. Not so keen on storing information for private videos - if they’re not going to be accessible to anyone else, I don’t see the point in storing the metadata for them for others.
I’m still very uncomfortable with IRC. Just not my thing. I’ll go there when I must. I logged in today and added VideoBrainz to the weekly topic list, for example. But I prefer other means of communication - strongly. I tend to be much more of a contemplative developer / collaborator. Which is a polite way to say that I think slowly and tend to express my thoughts in depth (some would say all my posts should start with TL;DR;). Chat oriented collaboration environments tend to make me shut up; which might be a good thing - I don’t know.
I tend to agree with you there. I prefer the names “Person” and “Organization”; but I guess “Collective” is ok - but it gives me visions of the Borg. I view Artist, Creator, Publisher, Label, Best-Boy, Actor, etc as part of a vocabulary of roles - a taxonomy.
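To make the “entities plus a vocabulary of roles” idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the `Party`, `Credit` and `Role` names, the handful of roles listed, and the example data are all hypothetical and not taken from any existing Brainz schema.

```python
from dataclasses import dataclass
from enum import Enum
from uuid import UUID, uuid4


class Role(Enum):
    # Hypothetical role vocabulary; the real taxonomy is still to be agreed.
    ARTIST = "artist"
    CREATOR = "creator"
    PUBLISHER = "publisher"
    LABEL = "label"
    ACTOR = "actor"


@dataclass(frozen=True)
class Party:
    """A Person or a Collective/Organization; the role is not part of the entity."""
    id: UUID
    name: str
    kind: str  # "person" or "collective"


@dataclass(frozen=True)
class Credit:
    """Links one party to one piece of content in one role."""
    party: Party
    content_title: str
    role: Role


# One entity, two roles: no separate "Publisher" and "Creator" records needed.
studio = Party(uuid4(), "Example Studio", "collective")
credits = [
    Credit(studio, "Example Film", Role.PUBLISHER),
    Credit(studio, "Example Film", Role.CREATOR),
]
roles = {c.role for c in credits if c.party.id == studio.id}
```

The point of the sketch is only that the same identifier can appear in any number of roles, which is what would let it be shared between the different Brainz databases.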
Yup - I think we’re all on the same page. I need to learn more about the “same vote-less, revertible revision system as BB”. Beyond the “use the source, Luke” approach to learning something - is the design described somewhere?
So we would also exclude data on lost videos (old movies that have been lost due to fire, chemical deterioration, etc.) that are of historical interest? But I’ll come back with my question: what’s the harm? I guarantee that I can put private music in MusicBrainz. I bet you that some folks already have. Consider the following use cases:
Collection Management. MusicBrainz has some support for this. Why not embrace it 100% and let users store information about private videos that are in their collection. So long as it’s labeled/attributed properly - what’s the harm?
Usage data. Lots of folks want to collect data on listening/watching their media. Even if the video isn’t available publicly; someone may want to share the fact that they listened to / watched something they own privately.
There does seem to be a fairly widespread and consistent opinion that private items should not be included. But I really haven’t seen a convincing reason given as to why. So I ask again: what’s the harm?
Seems like a good idea to me. I think our priorities will diverge somewhat. For example, I think associated imagery is more important to VideoBrainz than it is for BookBrainz. But that’s not a bad thing.
I’d like to come to a better understanding of the technology selections that have been made in BB. Python, for example, was described by @Zastai as “a language i hate with a passion”. Personally I love Python and have used it off and on for decades. Personal likes and dislikes have a role in these decisions; but I prefer to make such decisions mostly on the basis of sound technical reasons. So - why have you chosen to use the languages, frameworks, and packages that you have?
I only prefer “Collective” over “Organization” because it seems to work better for describing a band, for instance - a collective can be a collective of musicians or a collective of businessmen in a company, while organization is more suited to the latter.
It’s sort of like NES, but it has diverged a bit. We don’t have any documentation of the exact system we’ve implemented (yet).
The difference here for me is that lost movies were at one point publicly available, so recording them has historical usefulness. My problem with including private videos is that it pollutes the publicly useful data, potentially making it less useful. For example, if a user searches through the database for a certain phrase, and there are a large number of private videos with titles containing the phrase, the useful public metadata might not be found. And metadata for private content will be of lower quality than the public metadata, since editors won’t be collaborating on improving it. If those matters were solved or made irrelevant, I wouldn’t be against storing private metadata.
We definitely want covers in BB, but the process of getting that set up has stalled a little (@reosarevok - let’s discuss that soon!). I guess VB might also want promotional posters (although BB could use them too), promotional stills, perhaps any bonus concept art included in the release? The big obstacle to getting all of this done is communicating what we want with the Internet Archive and getting stuff set up there.
I mostly agree with this. When you look at the “rules” in MusicBrainz, meaning what the majority of auto-editors enforce, publicly available bootlegs (private), even if available online, sometimes do not qualify as a release. So as not to call anyone out, I will just say I have examples of this being strictly enforced. I have come to learn that this does not matter to me either way. I was more pro private data when I first started, now I am more against private data, even mildly public data…
The mindset I have come to have, based on and molded by mostly auto-editors in MusicBrainz, is to look at the release and ask a few things. First, is, or was, this release available to the public in a manner that would give it distribution? Second, would the adding of this release benefit anyone aside from myself? And lastly, is the data I have to add complete enough that if no one ever is able to add anything to this release, does my add fairly portray a release that can be identified or of any use?
The last question is more open. If all I have is a track list and recording times with nothing else to offer, to me, that is not a release as it is useless. There is no cover, label, specific release info, source, etc. How would anyone ever accurately match this to their content?
I hope I have explained what I have been taught here. Although I have more or less signed onto this logic, I am not dismissive of the idea of going against it. I just wanted to toss this logic out there for all to consider in addition to what @LordSputnik stated above. It is my belief now that a private video (someone’s kid playing, a graduation, people at a shooting range for fun, etc) is no different than a home-made compilation of music.
Thank you. I do think these issues could be resolved; but it would take a lot of time and effort and would introduce concepts that are currently foreign to the Brainz community; and probably would impact the performance of certain queries. I think that to work properly, private metadata would need to appear as if it is not there at all unless you explicitly choose to include it. So far as reviews and editors are concerned; it’s certainly true that the community cannot verify the accuracy of private data. But it can enforce certain standards of quality such as completeness, internal consistency, spelling and grammar, quality of images provided, and proper attribution to a limited extent.
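As a rough illustration of “appear as if it is not there at all unless you explicitly choose to include it”, here is a minimal Python sketch. The `Video` class, the `private` flag, the `include_private` parameter and the sample titles are all hypothetical; a real implementation would live in the query layer, not in application code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    title: str
    private: bool = False  # hypothetical flag marking private metadata


def search(videos, phrase, include_private=False):
    """Return videos whose title contains phrase; private entries are skipped by default."""
    needle = phrase.lower()
    return [
        v for v in videos
        if needle in v.title.lower() and (include_private or not v.private)
    ]


# Made-up catalog: one public entry and one private entry sharing a phrase.
catalog = [
    Video("Example Birthday Party (public short film)"),
    Video("Example birthday party at home", private=True),
]
public_hits = search(catalog, "birthday party")
all_hits = search(catalog, "birthday party", include_private=True)
```

With the default arguments the private entry never shows up in results, which is the behaviour that would keep the publicly useful data from being drowned out.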
Another approach would be to address the “Collection Management” use case directly. By that I mean that there would be completely separate and simplified tables to handle private data. This data would only be visible to the account that created them and would be provided purely as a service to that user. Cloud storage for personal media metadata, if you will. But that definitely goes counter to the established purpose of the Brainz community. As an aside, I really want to refer to the community as the MediaBrainz community.
So I guess I’ll have to find some other avenue to satisfy the Collection Management use case. I fully expect to maintain the idea of personal collections as MusicBrainz does. But that would still be inadequate because it will lack certain critical features. Together with the Media Player use case; these two use cases are very important to me. I care as much or more about my private media as I do about the public data. I want to collect extensive information about them; just as much as I want to about the public items. I want that data to be on the internet so I can share it with those few people who care (my extended family and friends), and with myself - I love the cloud paradigm. But currently I have few options; and all are inadequate in several ways. Sad Panda
Yeah - rate limits and such are absolutely critical. And yes, we’ll need to support various mechanisms of the OAuth 2.0 protocol including a token/secret approach. I also think that much of the interface will work without authentication. Essentially, unauthenticated connections would be read-only. No edits, no comments, no ratings, no usage data submissions.
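A minimal sketch of the “unauthenticated means read-only” rule, in Python. The action names are hypothetical placeholders and the function says nothing about how OAuth 2.0 tokens would actually be validated; it only shows the shape of the check.

```python
# Hypothetical action names; the real API surface has not been designed yet.
READ_ACTIONS = {"lookup", "search", "browse"}
WRITE_ACTIONS = {"edit", "comment", "rate", "submit_usage"}


def is_allowed(action, authenticated):
    """Reads are open to everyone; writes need an authenticated connection."""
    if action in READ_ACTIONS:
        return True
    if action in WRITE_ACTIONS:
        return authenticated
    return False  # unknown actions are denied by default


anonymous_can_search = is_allowed("search", authenticated=False)
anonymous_can_edit = is_allowed("edit", authenticated=False)
```

Denying unknown actions by default is a design choice in this sketch, not something settled in the discussion.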
Speaking of OAuth 2.0. How is it deployed/integrated in existing MetaBrainz servers and services? Is this deployment/integration evolving? What is the end goal?
The mention of free-form relationships so anyone can just add a role doesn’t sit well with me. Even leaving aside the quality problems (misspellings, case differences, …), this is very English-centric. In order for the data to be localizable properly, I think relationships need to be curated properly.
Hmmm. I’m trying to follow this thought in the context of this thread and I don’t see it discussed in the context of VideoBrainz much. Are you bringing in topics that have occurred elsewhere into this discussion - because I’m confused. However, your comment does bring in several topics worthy of discussion. I see issues of editorial rights, schema design approaches, and localization entering in with your comment.
On Editorial rights. I don’t see why every account should have equal access to the database. I believe our purpose is to create a free and open database of video metadata of the highest quality, accuracy, and breadth. If that is the case, we have several problems we need to confront.
First, we need to encourage people to become contributors. The more contributors we have, the more data we will accumulate. The videos I’m most interested in will differ widely from other people’s. And by pulling from as broad a base as possible we will have a database that spans as many interests and cultures as possible. However, that comes with its own problem. Not all contributors do good work.
And that brings me to the second problem: we need to encourage high quality contributions. We cannot rely entirely on the editorial process to keep the data of high quality. As Oliver Charles pointed out years ago in this blog post, submissions are outpacing the ability of reviewers to look everything over. While the new editing system is designed to make matters better, I highly doubt it will eliminate the problem. We will need a hierarchy of accounts with increasing levels of access to the database. What those tiers are, what rights go with them, and how people move up and down the tiers is not something that I’m ready to discuss. But the existence of different levels of access and editorial rights I think is clearly needed.
Schema Design Approaches. Part of the ontology will be various taxonomies that categorize things. Some of these will occur in relationships. For example, suppose we have a “Movie” entity, and a “Person” entity. We may then have relationships between those two kinds of entities. A specific Movie will have a whole host of Persons who contributed to that Movie in one form or another: the Cast and Crew to employ the terms typically used. So we’ll have Cast relationships which will also include what “Role” that Actor performed. We’ll also have Crew relationships which will describe the Job that Person had in the production of the Movie. Will we have a semi-fixed taxonomy of jobs? If we do, how is that taxonomy created and maintained? Will it evolve over time? I think that this list will exist, and that it will change over time, and that the process for changing that list will be similar to changing any other part of the schema - in other words - very difficult and with significant impact. Changing any taxonomy that classifies items will imply that all data created using the older version of the list may need to be changed to use the new categorization system.
One approach that we can have to managing such taxonomies is to represent the taxonomy in the database itself. Not as part of the database definition; but in tables of its own. But having those represented in the database does not mean that they are freely editable.
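Here is a minimal sketch of what “in tables of its own, but not freely editable” could look like, using SQLite from Python so it runs anywhere. The `job` table, its columns, the sample rows and the `editor_can_manage_taxonomy` flag are all assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (
    id        INTEGER PRIMARY KEY,
    code      TEXT NOT NULL UNIQUE,       -- stable, language-neutral key
    parent_id INTEGER REFERENCES job(id)  -- NULL for top-level entries
);
INSERT INTO job (id, code, parent_id) VALUES
    (1, 'CREW', NULL),
    (2, 'DIRECTOR', 1),
    (3, 'BEST_BOY', 1),
    (4, 'OTHER', 1);
""")


def add_job(conn, code, parent_id, editor_can_manage_taxonomy):
    """Insert a job row only for editors holding the (hypothetical) taxonomy right."""
    if not editor_can_manage_taxonomy:
        raise PermissionError("taxonomy changes are restricted")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO job (code, parent_id) VALUES (?, ?)", (code, parent_id)
        )


# Adding a job is a data change, not a schema change.
add_job(conn, "GAFFER", 1, editor_can_manage_taxonomy=True)
crew_jobs = [
    row[0]
    for row in conn.execute("SELECT code FROM job WHERE parent_id = 1 ORDER BY code")
]
```

Extending the taxonomy here needs no `ALTER TABLE`, yet the permission check keeps ordinary editors from changing it.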
Localization. Localization needs to be a first tier capability. What I mean by that is that localization of all data should be incorporated in all of our designs from the very beginning. The discussion on instruments with disambiguation comments has been very instructive. As in the case of instruments, job titles in VideoBrainz will have need of translations. We are in a good position to ask the MusicBrainz folks: “If you could design the schema from scratch now - what would you do differently?” Maintenance of the MusicBrainz schema is significantly burdened with a large existing database and mature collection of software built up. Fundamental changes to the schema represent a HUGE undertaking. VideoBrainz currently doesn’t have that burden so has the opportunity to make fundamental changes to the approach.
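As a small illustration of translated job titles, here is a Python sketch that looks up a label by a language-neutral job code and falls back to a default locale. The `JOB_LABELS` mapping, the `job_label` function and the choice of English as the fallback are assumptions for the example, not a proposal for the actual storage format.

```python
# Hypothetical label store keyed by job code, then by locale.
JOB_LABELS = {
    "DIRECTOR": {"en": "Director", "de": "Regisseur", "fr": "Réalisateur"},
    "BEST_BOY": {"en": "Best boy"},  # not yet translated
}


def job_label(code, locale, default_locale="en"):
    """Return the label for locale, else the default locale's label, else the raw code."""
    labels = JOB_LABELS.get(code, {})
    return labels.get(locale) or labels.get(default_locale) or code


german_director = job_label("DIRECTOR", "de")
german_best_boy = job_label("BEST_BOY", "de")  # falls back to the English label
```

Because credits would store only the code, adding a translation later never requires touching the credits themselves.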
My other reply really went on several tangents and didn’t directly address this comment. I have a couple questions: How is the use of free-form relationships “very English-centric”? How does structure of the schema become language specific?
What do you mean by “free-form relationships”? Where was that discussed? Can you point me to the discussion?
That sounds a bit like you want to have the user just write down what the person did rather than pick from a closed set of options, which makes it hard (if not impossible) to translate it simply.
This is a bad idea IMO. MusicBrainz is trying to move away from votes and auto-editorship and towards a Wikipedia style “just revert errors” philosophy, because of multiple reasons, but among them, that if you expect most people add good (or at least not bad) information, putting roadblocks on them is not ideal, and it discourages additions - which, when they happen, either go unnoticed, or are policed too strictly (since some users would rather reject any submission that isn’t perfect, which is clearly problematic because not having any data is worse than having imperfect data).
There are a few things that we do intend to keep limited (who can add new relationship types, probably areas and instruments) but the general idea is that there should be as little a difference as possible between all users, and that we shouldn’t roadblock people (like we currently do with the 7 days voting phase).
I think you read too much into what I said. In fact, you said: “There are a few things that we do intend to keep limited”. So you agree that there are multiple levels of access? In fact, I agree with all that you said. But it’s important to recognize that, however limited, not all editing is equal. Some things must be controlled. I would even go so far as to give certain “moderators” the power to, very judiciously, lock certain entries or to restrict certain editors. It’s a rare occurrence; but sometimes there can be “editing” wars that degrade the database or rogue editors that seem to want to do what they want to do regardless of the rules and guidelines. Having the ability to intervene in such cases is a necessary ability; but should be used as a last resort. As a rule I am an optimist who believes that most of our editors seek the best interests of the community; and that some data is better than no data.
That wasn’t my intent at all. In fact, I believe that we should have a carefully worked out taxonomy of jobs. Regarding the “Roles” - well, the editor would need to provide that. Unless we wanted to have a new entity for “Roles”. That could be interesting because some roles are recurrent; how many “Batmans” have we had? How many “Sherlock Holmes”, or whatever. Or perhaps “Person” should include fictitious persons. But I digress.
On the topic of “Jobs” for crew. There does seem to be a discernable taxonomy. But that taxonomy is itself evolving. I have an idea on how to handle the seemingly conflicting desire to control changes to taxonomies such as “Jobs”, as well as reduce roadblocks to editing as much as possible. My idea is that we should have a taxonomy - with very tight controls on making changes to that taxonomy. But that taxonomy should always include an “Other” category. When a user uses the “Other” crew job; they should also provide (or be able to provide) additional data so that those who update the Jobs taxonomy can use it as input for extending it.
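A minimal Python sketch of that “Other” idea follows. The job codes, the `CrewCredit` class and the `taxonomy_suggestions` helper are hypothetical, and the sketch makes the description mandatory for “Other”, which is one possible reading of “provide (or be able to provide)”.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled list; "OTHER" is always part of it.
KNOWN_JOBS = {"DIRECTOR", "BEST_BOY", "OTHER"}


@dataclass(frozen=True)
class CrewCredit:
    person: str
    job: str
    other_description: Optional[str] = None  # free text, only used with OTHER

    def __post_init__(self):
        if self.job not in KNOWN_JOBS:
            raise ValueError(f"unknown job: {self.job}")
        if self.job == "OTHER" and not (self.other_description or "").strip():
            raise ValueError("an 'OTHER' credit needs a description")


def taxonomy_suggestions(credits):
    """Collect the distinct free-text descriptions for whoever maintains the taxonomy."""
    return sorted({c.other_description.strip() for c in credits if c.job == "OTHER"})


credits = [
    CrewCredit("Example Person A", "DIRECTOR"),
    CrewCredit("Example Person B", "OTHER", "drone operator"),
    CrewCredit("Example Person C", "OTHER", "drone operator"),
]
suggestions = taxonomy_suggestions(credits)
```

Editors are never blocked from entering a credit, and repeated descriptions give the taxonomy maintainers evidence for which jobs to add next.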
I’m totally with you here. This is exactly the thing I most like about MusicBrainz. I don’t see any problem with this. All of the things you just listed are a variation of a single kind of relationship - associating a person with the video of interest. It’s a simple relationship that can be stated as a triple; e.g. <“Joe Dancer”, “My Dance Video”, “Dancer”>. There will certainly be many discussions on what kinds of relationships there should be; much like there is for MusicBrainz. But in the end, these are just enumerations and not all that difficult to implement and support.
I may have misunderstood that, but it sounded like the job would be just a piece of entered data. For a role that makes sense (but can also be subject to a need for localization, e.g. in the case of children’s films that frequently have dubs and where characters will often have different names than in the original; the Harry Potter films are an especially good example of this). But for jobs that just makes for a messy database.
Having the jobs be a database entity (like instruments and areas) is fine (so no schema change needed for adding them). And having an Other where the UI would enforce the addition of extra information sounds good too (avoids the “add an annotation” solution used by MusicBrainz).
I was simply presenting a conceptual construct without addressing how “Dancer” would be represented. Perhaps I should have said <“Joe Dancer”, “My Dance Video”, DANCER>. The actual mechanism we use to represent jobs in the database wasn’t addressed in that response. I do in fact support the idea of a taxonomy of jobs that is controlled and that supports localization. In fact, my lines “There will certainly be many discussions on what kinds of relationships there should be” and “these are just enumerations” seem to support the idea that these are NOT just user data entries.
I’m in the process of making yet another media streamer/organizer/blah app myself and would love to see a central place to get good JSON metadata from an API. While the big three (themoviedb, tvdb and tvmaze) have decent data they all have their little quirks. I still don’t get what tvdb’s issue is with shows like WWE and the like. If people are providing the data why would you purge it?
Anywho, some of the limits I see using all the providers.
Not as easy to tie cast/crew together via uuid and the like. Therefore I have a lot of “duplicate” rows containing person data. Having 600k rows isn’t a big deal but not having matching IDs makes “also in…” queries problematic.
Lack of sporting data…not really interested in game stats like yards, players, etc. But who played what/when and final score seems reasonable.
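On the cast/crew point above, this is roughly what an “also in…” query looks like once every credit points at a single shared person ID. It is a sketch only: the `person` and `credit` tables, the short `p1` style IDs standing in for UUIDs, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE credit (
    person_id TEXT NOT NULL REFERENCES person(id),
    title     TEXT NOT NULL
);
-- Placeholder data: one person row shared by credits from several sources.
INSERT INTO person VALUES ('p1', 'Example Actor');
INSERT INTO credit VALUES ('p1', 'Show A'), ('p1', 'Movie B'), ('p1', 'Show C');
""")


def also_in(conn, person_id, current_title):
    """List the other titles credited to the same person ID."""
    rows = conn.execute(
        "SELECT title FROM credit WHERE person_id = ? AND title <> ? ORDER BY title",
        (person_id, current_title),
    )
    return [title for (title,) in rows]


other_titles = also_in(conn, "p1", "Show A")
```

With duplicate person rows per provider, the same query would need fuzzy name matching instead of a simple equality on the ID.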