GSoC 2018: Importing data into BookBrainz

dataimporting
gsoc
bookbrainz
gsoc-2018

#1

BookBrainz: Data Importing


Personal information

Nickname: shivamt
IRC nick: bukwurm
Email: shivamtripathi0108@gmail.com
GitHub: https://github.com/shivam-tripathi
Twitter: aShivamTripathi
Blog: shivam-tripathi.github.io


Overview

This proposal is aimed at importing data into the BookBrainz database from third party sources, while at the same time ensuring that the quality of the data is maintained.

Data sources

Data can be imported into BookBrainz in the following two major ways:

  1. Mass import the data using the third party data dumps/APIs
  2. Manual imports by users from various sources like online bookstores, review sites etc.

In this project, I will mass import the data from openlibrary.org, and if there is time left in the import phase I will work on the Library of Congress dumps. The licensing information regarding the dumps can be found here and here. To aid the manual imports, I will write userscripts for goodreads.com and bookdepository.com and, if time permits, for amazon.com or any other source the MetaBrainz community prefers.

Maintaining the quality of the imported data

To ensure that the quality of the data is maintained, each imported value will have to be validated by the BookBrainz editors. To do this, the imported data is not added right away as an entity object in the database, but is instead saved as an import object. Unlike the entity object, the import object cannot be revised; it can only be promoted to the corresponding entity type or deleted. Promotion to an entity object happens after approval from the editors. Similarly, the import is deleted if the editors discard it.

Proposed data flow

The imported data travels from its source to finally become one of the entities in the BookBrainz database. This data life cycle has the following steps:

  • Data sources
          These are the data dumps and the various websites that have the data in the raw form, which needs to be cleaned up and moulded according to the BookBrainz database.
  • The import object
          Once cleaned up, the data will be inserted into the database as an import object. This object cannot be revised, and simply serves as a temporary store of data before it is either accepted or discarded.
  • The entities object
          The import data is presented to the user on the bookbrainz-site. The editor then makes one of the following choices:
    • Approves the import
      This then creates a new entity of the corresponding type in the database.
    • Discards the import
      In such a case the import data is deleted from database.
    • Decides to make some changes and then approve
      In such a case, the editor is presented with a form with pre-filled details, with which s/he can edit the details of the import object. Once satisfied with the details, the editor can either promote the import to an entity or cancel the operation and go back.

The completion of the third step marks the end of the data flow. The proposed data flow could be better understood by the following diagram:

Figure: The imported data life-cycle


Implementation

The data to be imported has two attributes:

  • Structure
    How the data would reside inside the database.
  • Behavior
    How the data would flow and interact with the editors, and how its state would change from import to entity.

Milestone 1

Milestone 1 would involve working on definition of the structure of the data that is imported. This involves properly working out schema design and then making relevant changes to bookbrainz-sql and bookbrainz-data code.

The proposed import object is similar to the entities, except that it cannot be revised: it can only be upgraded to the status of an entity or deleted from the database. Some of the attributes the import object may hold are:

  • Id
  • Votes by the editors to discard/approve the imported data
  • Source of the imported data (for example, goodreads.com or openlibrary.org)
  • Date added

Roughly, each entity presently has entitytype_header, entitytype_revision and entitytype_data tables, which respectively store the relevant information about the present master revision, all entity revisions and the complete entity data per revision. The proposed import object cannot be revised, so it does not need an entitytype_revision table. There can also be no master revision, so the entitytype_header table is not required either. Only the entitytype_data table is required, and the existing table can be reused for this.

Therefore, the changes to the schema roughly could be:

  1. Addition of a table import with the following fields:
    a. id: Incremental index
    b. type: The type (string) of the imported entity, for example creator, work, publisher etc.
    c. date_added: Date of import
    d. source: Source of data (string)
  2. Addition of tables entitytype_import, where entitytype signifies the type of the entity (namely creator, work, edition, publisher and publication) with the following fields:
    a. import_id : The foreign key to table import(id)
    b. data_id : The foreign key to table *_data(id)
  3. Addition of a table discard_votes, which stores all the votes cast to discard an import:
    a. id: Incremental index
    b. import_id: Id of the import
    c. editor_id: Id of the editor who cast the vote
    d. vote_number: The sequence number of the vote cast
    e. date_voted: Date of casting the vote
    The discard_votes table ensures that no two votes to discard an import are cast by the same editor. This is done by making the tuple (import_id, editor_id) the primary key.

Figure: Overview of proposed schema change

Here, the use of the discard_votes table is based on the conservative approach to changing the state of imported data, which I explain in detail later in this proposal.

Code structure

For the schema change, most of the code would reside inside bookbrainz-sql/schemas/bookbrainz.sql file. Commands to do this can be (creating tables and views):

CREATE TABLE bookbrainz.import (
	id SERIAL PRIMARY KEY,
	type bookbrainz.entity_type NOT NULL,
	date_added DATE,
	/* Votes are recorded per editor in bookbrainz.discard_votes */
	source TEXT NOT NULL DEFAULT ''
);

/* Commands for creator import */
CREATE TABLE bookbrainz.import_creator (
	import_id INT PRIMARY KEY,
	data_id INT
);

/* Commands for voting table */
CREATE TABLE bookbrainz.discard_votes (
    import_id INT,
    editor_id INT,
    date_voted DATE,
    PRIMARY KEY (
		import_id,
		editor_id
	)
);

/* Foreign keys */
ALTER TABLE bookbrainz.import_creator ADD FOREIGN KEY (import_id) REFERENCES bookbrainz.import (id);
ALTER TABLE bookbrainz.import_creator ADD FOREIGN KEY (data_id) REFERENCES bookbrainz.creator_data (id);
ALTER TABLE bookbrainz.discard_votes ADD FOREIGN KEY (import_id) REFERENCES bookbrainz.import (id);
ALTER TABLE bookbrainz.discard_votes ADD FOREIGN KEY (editor_id) REFERENCES bookbrainz.editor (id);

/* Sample view for creator */
CREATE VIEW bookbrainz.import_creator_view AS
	SELECT
		i.id, cd.id AS data_id, cd.annotation_id, cd.disambiguation_id,
		als.default_alias_id, cd.begin_year, cd.begin_month, cd.begin_day,
		cd.begin_area_id, cd.end_year, cd.end_month, cd.end_day,
		cd.end_area_id, cd.ended, cd.area_id, cd.gender_id, cd.type_id,
		cd.alias_set_id, cd.identifier_set_id, cd.relationship_set_id,
		i.type, i.date_added, i.source,
		/* Number of discard votes cast so far for this import */
		(
			SELECT COUNT(*) FROM bookbrainz.discard_votes dv
			WHERE dv.import_id = i.id
		) AS discard_votes
	FROM bookbrainz.import_creator ci
	LEFT JOIN bookbrainz.import i ON i.id = ci.import_id
	LEFT JOIN bookbrainz.creator_data cd ON ci.data_id = cd.id
	LEFT JOIN bookbrainz.alias_set als ON cd.alias_set_id = als.id
	WHERE i.type = 'Creator';

Case of edition_data

Edition includes fields containing BBID of publication. This would require some rewiring, or creating a separate table to store edition data.
There are two possible solutions to the problem:

  • Make publication_bbid field nullable for all imports, and enforce that it has to be non-null in the code for entities. This essentially means that the imported editions will not have publications attached. The editors can add them when upgrading to entity in the edit and approve page.
  • Use UUIDs for imports as well, and add a separate column publication_object_type in the edition_data table to mark whether the referenced publication is an import or an entity. Upon promotion, assign the UUID of the import to the new entity as well, and update every matching publication_object_type from import to entity.

Keeping other relevant data

While importing the data, we may come across some relevant information about the objects which we cannot yet store with the import data in the database, because the import lacks the status of a full entity. For example, import objects will have no relationships (as proposed). They will exist as single standalone objects, which will gain all the privileges once they are upgraded to entities. Any form of relationship existing in the dump object would be lost.
To avoid this, a separate table could be constructed which holds all the metadata extracted from the dump in a single JSON field. PostgreSQL 9.4 and later can store JSON as “Binary JSON” (or JSONB), which strips out insignificant whitespace (with a tiny bit of overhead when inserting data) but provides a huge benefit when querying it: indexes.
It needs to be decided how to keep this metadata table linked to the object after it is promoted to the state of an entity.
An example could be:

CREATE TABLE import_metadata (
  import_id INT,  /* For imports */
  entity_id UUID, /* For promoted entities */
  data JSONB
);

/* Querying */
SELECT data->>'publisher' AS name FROM import_metadata;
/* Output */
name
----------------
IGC Code
International Code of Signals 
IAMSAR Manual Volume III
Nautical Charts & Publications
(4 rows)

Interacting with the import data

The bookbrainz-data module will have to be updated with functions to access and modify import objects. As the present direction of the module is to write a separate function per query, returning Immutable.js objects, I will follow that approach. This will also aid testing, as the result of each function can then be tested easily, and it allows plain SQL queries to be used to interact with the database. Roughly, functions for accessing the different import objects, upgrading an import object, deleting an import object and adding a new import object will need to be added. Apart from that, additional tables such as discard_votes will have to be handled.


Milestone 2

The second step would be to import actual data into the BookBrainz database. I will be utilising openlibrary.org and Library of Congress data dumps for this purpose. For this, I will write scripts to read the data, clean it up and then execute SQL queries to insert data from the dumps into the database. To facilitate bulk insertion of data, I will consider the best practices to populate the database. I will cover some of them in more detail later in this proposal.

Handling large dump files

Dumps can easily grow to enormous sizes upon unzipping. For example, edition data from OL is a TSV file which exceeds 30GB when expanded. To manage this, I propose to break each file down into small manageable chunks (around 250MB to 500MB each) and run the code separately on each of them. As each line is independent of the others, we can split the entire document line by line. This can be done easily on a UNIX system using:

split -l [number of lines] source_file dest_file

This will split the source_file into multiple files with dest_file prefix. Once converted into manageable chunks, we can extract the data from them one by one, transform them into the desired format and then load them into the database.

Approaching dumps

One of the first things to do would be to find out which fields exist in a particular dump. Once the fields are known, it can be worked out how to classify them and fit them into the BookBrainz data model. This would also help in understanding the edge cases and the different forms of data that could creep in.
This would greatly help in validating the data, and in testing the validation functions.

Structure of the dumps

In this section I will analyze the structure of the various dumps.

  • Open Library
    The dumps are generated monthly, and have been divided into the following three subparts for convenience

  • Library of Congress

    • Books All (A series of 41 dumps)
      • Size: ~500 MB to 800 MB per dump
      • Relevant xml in the data dump:
        books_all.xml
    • Classification
    • … Other dumps

Code structure

Theoretically, the entire toolkit to import data can be separated into two parts:

  1. bookbrainz.import.generate_import
    Data dump specific tools, which cleans up the data into a predetermined format.
  2. bookbrainz.import.push_import
    A tool which works on a predetermined data format, validates it and imports the data into the database.

In a nutshell, the role of the bookbrainz.import.push_import can be summarised as follows:

  • Validate individual data fields
  • Encapsulate the data field into data objects
  • Validate the data objects
  • Push the data object into the database
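
The four steps above could be sketched as a small pipeline. This is only an illustration under stated assumptions, not the actual module: the validator rules, the shape of the import object, the list of entity type names and the injected `push` callback are all hypothetical.

```javascript
// Hypothetical per-field validators; real ones would likely build on the
// existing entity-editor validators in bookbrainz-site.
const fieldValidators = {
  name: (value) => typeof value === 'string' && value.trim().length > 0,
  source: (value) => typeof value === 'string' && value.trim().length > 0,
  // Assumed entity type names, following the 'Creator' value used in the view.
  type: (value) =>
    ['Creator', 'Work', 'Edition', 'Publisher', 'Publication'].includes(value)
};

// Step 1: validate individual fields; returns the names of the invalid ones.
function invalidFields(record) {
  return Object.keys(fieldValidators)
    .filter((field) => !fieldValidators[field](record[field]));
}

// Step 2: encapsulate the validated fields into an import object.
function createImportObject(record) {
  return {
    data: {name: record.name.trim()},
    source: record.source.trim(),
    type: record.type
  };
}

// Step 3: validate the assembled object as a whole.
function isValidImportObject(importObject) {
  return Boolean(
    importObject.type && importObject.source && importObject.data.name
  );
}

// Step 4: push the object using an injected function, for example a
// database insert. Nothing is pushed if any validation step fails.
function pushImport(record, push) {
  const errors = invalidFields(record);
  if (errors.length > 0) {
    return {errors, pushed: false};
  }
  const importObject = createImportObject(record);
  if (!isValidImportObject(importObject)) {
    return {errors: ['object'], pushed: false};
  }
  push(importObject);
  return {errors: [], pushed: true};
}
```

Injecting `push` keeps the validation logic testable without a database, which fits the function-per-query direction of bookbrainz-data.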

The role of bookbrainz.import.generate_import can be summarised as:

  • Read each text file of the specific dump.
  • Extract the relevant data fields from the text file.
  • Construct a generic object from the fields and pass it to a sub-module of bookbrainz.import.push_import to construct the import object.

The bookbrainz.import.push_import tool could readily be reused for importing other dumps in the future. One possible breakdown is to develop bookbrainz.import.push_import within bookbrainz-data, which already exists as an npm module; it could then be used to validate the data generated by bookbrainz.import.generate_import and push it into the BB database.
Another option is to write a separate module, in JavaScript or any other language, which could be used by all bookbrainz.import.generate_import scripts.

I plan to write bookbrainz.import.push_import in JavaScript, containing the following subparts (roughly):

bookbrainz.import.push_import

  • validators
    * common (including for encapsulated objects)
    * validators specific to each import type
  • create_objects
    * create import objects of each type using the validators per field
  • push_objects
    * push the constructed valid object into the database

This can be placed either inside the bookbrainz-data module or it can exist as a separate module. Using JavaScript here would automatically mean using Node to read data from the dumps.
The dump-facing scripts would need to include the bookbrainz.import.push_import module to perform validation and construction of import objects. As their structure mostly depends on the kind of dump being handled, a generic module structure (based on logic flow) could be:

bookbrainz.import.generate_import

  • read data: Run the script to input each text file.
  • extract data: Extract the required data fields from the objects
  • construct a generic object: Construct the generic object to be used to construct the import object.
  • construct and push import object: Call the relevant submodules of the bookbrainz.import.push_import module to construct the relevant import object and push it into the database.
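
As a rough sketch of the "extract data" and "construct a generic object" steps, the functions below parse one line of a dump chunk and map an author record to a generic object. The five-column line layout (type, key, revision, last modified, JSON record) and the field names of the generic object are assumptions made for illustration; they would need to be checked against the actual dump.

```javascript
// Assumed layout of one tab-separated dump line:
// type, key, revision, last_modified, JSON record.
function parseDumpLine(line) {
  const columns = line.split('\t');
  if (columns.length < 5) {
    return null;
  }
  // Treat everything from the fifth column onwards as the JSON record.
  const json = columns.slice(4).join('\t');
  try {
    return JSON.parse(json);
  }
  catch (err) {
    // Malformed records are skipped rather than aborting the whole chunk.
    return null;
  }
}

// Map an author record to a hypothetical generic object that a
// push_import-style module could consume.
function toGenericCreator(record) {
  const isAuthor =
    record && record.type && record.type.key === '/type/author';
  if (!isAuthor || !record.name) {
    return null;
  }
  return {
    entityType: 'Creator',
    lastModified: record.last_modified ? record.last_modified.value : null,
    name: record.name,
    originId: record.key,
    source: 'openlibrary.org'
  };
}
```

Returning `null` for unusable lines lets the caller count and log rejected records per chunk instead of failing the import midway.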

Validators and testing

The bookbrainz.import.push_import module validates the individual data fields to form a complete, valid import object. This requires the validation functions to be robust to the type of data thrown at them. Thus, thorough testing needs to be done to ensure that they can handle any form of data we throw at them.
Writing validators will be greatly helped by the existing bookbrainz-site/src/client/entity-editor/validators. All these validators would require unit testing to check that they perform as intended.
Secondly, we also need to check that the insertion of import objects occurs as required. This would require populating the bookbrainz-test database using sample input data and the push_objects submodule, and then querying it to check that the final value inside the database is as expected. Similarly, other unit tests can be written to ensure that the system is robust to different forms of data and performs as intended.

One of the most important parts of testing the validators is the sample prototype dataset used as input. This dataset should contain all corner cases, so as to ensure that the validators are thoroughly tested and do not pass any invalid data into the system. It would be built by sampling the dumps and adding some extra records which cover all the different forms of values the data can take. It is necessary to work on this thoroughly, as the tests written now would apply not just to the dumps presently being imported, but also to future ones.
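
A minimal sketch of how such a prototype dataset could be assembled is shown below. The sampling step and the list of edge cases for a `name` field are hypothetical; a real dataset would carry hand-crafted cases for every field.

```javascript
// Take every n-th record so the sample spans the whole dump subset.
function sampleEvery(records, step) {
  return records.filter((record, index) => index % step === 0);
}

// Hand-crafted edge cases for a hypothetical `name` field.
const nameEdgeCases = [
  {name: ''},                 // empty string
  {name: '   '},              // whitespace only
  {name: null},               // explicit null
  {},                         // field missing altogether
  {name: 42},                 // wrong type
  {name: 'Jürgen Siemund'},   // valid non-ASCII name
  {name: 'x'.repeat(10000)}   // unusually long value
];

// Combine sampled real records with the artificial edge cases.
function buildPrototypeDataset(records, step) {
  return [...sampleEvery(records, step), ...nameEdgeCases];
}
```

Each validator test can then iterate over the combined dataset, so that adding a new edge case automatically extends every test that uses it.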

Bulk populating the database

Some techniques to bulk import data can be:

  1. Create a batch of SQL queries and execute them once a fixed batch size is reached. In this case, I will disable autocommit and commit at the end of each batch. If there is an error, the entire batch will be redone.
  2. Write the validated import objects to a flat file per dump subset, and then use COPY. Since it is a single command, there is no need to disable autocommit.
  3. Before COPYing the data, drop foreign key constraints and indexes; importing without them leads to a faster data load. The constraints can be added back later. It would be programmatically ensured, while creating the import object, that the data being imported satisfies the constraints. The recreation of foreign key constraints and indexes can be greatly sped up by increasing the maintenance_work_mem configuration variable.
  4. Run ANALYZE after altering the data contents of a table. ANALYZE (or VACUUM ANALYZE) ensures that the planner has up-to-date statistics about the table.
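
The first technique could look roughly like the sketch below, which splits records into batches and builds one parameterised multi-row INSERT per batch. The function names are hypothetical, and the `{text, values}` result shape is chosen on the assumption that a PostgreSQL client such as node-postgres executes the query inside a transaction.

```javascript
// Split an array of records into batches of at most `batchSize` items.
function toBatches(records, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}

// Build one multi-row INSERT for a batch, using PostgreSQL-style
// $1, $2, ... placeholders. `table` and `columns` are interpolated
// directly, so they must be trusted constants, never user input.
function buildBatchInsert(table, columns, batch) {
  const values = [];
  const rows = batch.map((record) => {
    const placeholders = columns.map((column) => {
      values.push(record[column]);
      return `$${values.length}`;
    });
    return `(${placeholders.join(', ')})`;
  });
  const text =
    `INSERT INTO ${table} (${columns.join(', ')}) VALUES ${rows.join(', ')}`;
  return {text, values};
}
```

Running each generated query between BEGIN and COMMIT means a failure leaves the database untouched, so the whole batch can simply be retried.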

Import Endpoint

I will add an import endpoint route to the site, which will come in handy when adding data via HTTP POST requests. If an API is developed later, this endpoint can be removed. The endpoint matters because every client that sends a data object to it, by whatever means, needs that data to become an import rather than an entity. Building it would involve designing a fixed data object which encapsulates all the details necessary for building one or more import or entity objects. This would require changes in the bookbrainz-site/src/server/routes and bookbrainz-site/src/server/helpers modules.


Milestone 3

The third milestone will be to add the behaviour of the data. Once inside the database, the import objects have to travel to the editors via the bookbrainz-site so that they can reach the next stage of their life cycle.
I propose to do this via two means:

  • Show the imported data in the search results
  • Separately allow users to see which data has been imported and is waiting for review, and to choose what to review.

Search results

Presently, BookBrainz uses Elasticsearch to facilitate search. I will update the Elasticsearch indexing so that imported objects appear in the search results on the bookbrainz-site. The front-end on the website will be modified to reflect these changes, so that upon search the users of the site would know if the search result is one added by another editor or if it has been automatically added and awaits review. One of the plausible ways to do this can be:

Image: Proposed search results page with imported data highlighted

Upon clicking an imported item, the user will be directed to the imported entity page.

Updating the Elasticsearch indexing would mean updating the generateIndex function in bookbrainz-site/src/server/helpers/search.js so that it also adds the imports when the ES index is rebuilt after a restart.
Separate functions would also be required to insert new imports into the Elasticsearch index and to delete approved imports from it. The function that inserts new imports into the index would also be useful for the import endpoint, so that as soon as an import is added to the database, it is also added to the ES index.
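
The helpers below sketch how the payloads for inserting and deleting imports in the index could be built. They only construct plain objects in the alternating action/document layout that the Elasticsearch bulk API takes; the index name, the `import-` id prefix and the document fields are assumptions, and older Elasticsearch versions would additionally need a `_type` in each action.

```javascript
// Assumed index name; the real one is whatever search.js already uses.
const INDEX_NAME = 'bookbrainz';

// Hypothetical search document for an import. The `imported` flag lets the
// front-end highlight results that still await review.
function importToSearchDocument(importObject) {
  return {
    defaultAlias: {name: importObject.name},
    imported: true,
    source: importObject.source,
    type: importObject.type
  };
}

// Bulk body for indexing: each import yields an action entry followed by
// its document. Ids are prefixed so they cannot clash with entity BBIDs.
function buildBulkIndexBody(imports) {
  const body = [];
  for (const importObject of imports) {
    body.push({index: {_id: `import-${importObject.id}`, _index: INDEX_NAME}});
    body.push(importToSearchDocument(importObject));
  }
  return body;
}

// Bulk body for removing promoted or discarded imports; delete actions
// are not followed by a document.
function buildBulkDeleteBody(importIds) {
  return importIds.map((id) => (
    {delete: {_id: `import-${id}`, _index: INDEX_NAME}}
  ));
}
```

Keeping these as pure functions means the same code can serve generateIndex, the import endpoint and the promotion or discard handlers.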

Imported entity page

This is the landing page of the imported data. It will be very similar to the regular entity, except with the following options instead:

  • Approve
  • Edit and approve
  • Discard
  • Data source

One possible way to implement the page can be:

Image: The proposed ‘Imported entity’ page

Changing state of an import

As importing data takes time and effort and discarding an import leads to permanent deletion, I propose a conservative approach towards discarding an import, but a liberal one towards upgrading, i.e. we make it easy for an import to be included in the database as an entity, but difficult for it to be removed. I propose that we attach a ‘votes against’ field which is incremented every time an editor says that the import should be discarded, and discard an import only if the number of votes reaches a threshold. However, we immediately promote an import if an editor marks it approved. Whenever an import object is deleted or upgraded, we delete its entry from the search index to ensure it doesn’t turn up in the results. Upon promotion, the new entity also needs to be inserted into the ES index.
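
The asymmetric rule described above can be sketched as a small state-transition function. The state shape, the action names and the threshold value are hypothetical; the one-vote-per-editor check mirrors the (import_id, editor_id) primary key of the discard_votes table.

```javascript
// Hypothetical threshold: distinct discard votes needed to delete an import.
const DISCARD_THRESHOLD = 2;

// Returns the new state of an import after an editor acts on it.
// The input state is never mutated.
function applyEditorAction(importState, editorId, action) {
  if (importState.status !== 'pending') {
    // Promoted or discarded imports no longer change state.
    return importState;
  }
  if (action === 'approve') {
    // A single approval promotes the import immediately.
    return {...importState, status: 'promoted'};
  }
  if (action === 'discard') {
    // An editor's repeated discard vote is ignored.
    if (importState.discardVoters.includes(editorId)) {
      return importState;
    }
    const discardVoters = [...importState.discardVoters, editorId];
    const status =
      discardVoters.length >= DISCARD_THRESHOLD ? 'discarded' : 'pending';
    return {...importState, discardVoters, status};
  }
  throw new Error(`Unknown action: ${action}`);
}
```

With this rule a single editor can never delete an import alone, while any editor can still promote it at once.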

Review page

Another way for the editors to interact with the imported data is to provide them with a page that displays only the imported data. That way, the editors could choose to review imported data from the list. This would include adding a ‘review’ button at the top, next to the ‘create’ button. Upon clicking the ‘review’ button, the user would be directed to a page containing a list of imported data, with a limit on the number of imports shown per page. The user can browse through the list and choose which import s/he intends to review. Upon clicking any item in the list, the user would be directed to the imported entity page of that particular import.


Image: Addition of the ‘Review’ option


Image: The ‘Review’ page

Imported data and non-editors

Presently, users who are not logged in can search the BookBrainz database and view the results. Adding the imported data can affect this search in two ways:
● Do not display imported data in the search results to a user who is not an editor.
● Display all the data BookBrainz has in the search results. Upon clicking a result, the user will be led to the login page. Only after logging in can the user visit the imported entity page and approve, discard, or edit and approve the import.
I would like to go with the latter implementation, unless decided otherwise.


Milestone 4

The last step would be to write userscripts to facilitate seeding of data from third party websites like goodreads, amazon and bookdepository. The import endpoint at the bookbrainz-site mentioned in the second milestone is designed specifically with this in mind. The userscripts would be run using the Tampermonkey or Greasemonkey extensions in the user's browser.

Functionality

The aim of this tool is to reduce the effort of an editor when s/he decides to add an entity’s information to the database, by sending the data directly to the BookBrainz database. If the user intends to make changes, s/he can do so at any point of time on the bookbrainz-site.

  • When a user opens up a site containing data about books (for example, bookdepository.com), the browser fetches the website content.
  • If the URL of the website is the one on which the userscript has to become active, the userscript starts and collects relevant data from the page (by analyzing the DOM).
  • The userscript then constructs relevant import objects from the collected data
  • The userscript also adds an ‘Import to BB’ button to the page at an easily visible location.
  • Upon clicking the button, a modal opens up containing a list of all the entities (creator, work, publisher, publication etc.) which could be imported from the page, each followed by a button to search for the entity on the bookbrainz-site and a button to import the entity.
  • Upon clicking any one search button, a new tab opens up of bookbrainz-site search page with query of the entity’s name.
  • Upon clicking any import button, the site either sends the data directly to the BookBrainz-site or opens up a search page in a new tab with query as name of the entity.

Figure: The sequence diagram on a timeline for the userscripts

The userscripts would be written separately in a different repository with name bookbrainz-scripts and made available to the users with a single click.
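
The data-collection and search steps of this flow could be sketched as follows. The search URL, the CSS selectors passed in and the shape of the collected entities are assumptions for illustration; each site-specific userscript would supply its own selectors, and in a browser `doc` would simply be `document`.

```javascript
// Assumed location of the BookBrainz search page.
const BB_SEARCH_URL = 'https://bookbrainz.org/search';

// Build the URL opened by a 'search' button for the given entity name.
function buildSearchUrl(entityName) {
  return `${BB_SEARCH_URL}?q=${encodeURIComponent(entityName)}`;
}

// Collect candidate entities from a page. `selectors` maps an entity type
// to a CSS selector; types whose element is missing or empty are skipped.
function collectEntities(doc, selectors) {
  const entities = [];
  for (const [type, selector] of Object.entries(selectors)) {
    const node = doc.querySelector(selector);
    const name = node && node.textContent ? node.textContent.trim() : '';
    if (name) {
      entities.push({name, searchUrl: buildSearchUrl(name), type});
    }
  }
  return entities;
}
```

A userscript would call `collectEntities(document, {...})` once the page has loaded and render one row per returned entity in the modal.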

Code structure

Roughly, overall code can be broken into following two parts for every userscript:

  1. The part which returns UI components (button, form etc.) to be inserted into the relevant URL. This is common to all userscripts.
  2. The part specific to each URL - where to inject the button, from where to get data, how to construct the data object etc.

The first part will be implemented separately. It can be included in all the userscripts either by bundling them together or by using // @require in the code. I intend to use React to build the relevant components. A bundler is not strictly necessary, but it could be beneficial for writing more modular ES6 code.

More specifically, the code can be split up as:

bookbrainz-userscripts

  • src
    • lib
      • UI Components
      • Application’s business logic to wrap up the extracted data according to the end point
    • all userscripts

This design is taken from the MusicBrainz userscripts. It is robust to future additions, as the lib contains the generic scripts, including UI component design and other data cleaning and encapsulation methods, which are used by the other scripts. The userscripts are written alongside the lib and can include it directly, in a CDN-like manner, using a // @require lib/{script_name}.js tag in the header, provided // @namespace is also specified.
Any new userscript can be written alongside the lib package if it aims at providing site-specific tools; otherwise it can be added to the lib package, which holds all the common-case scripts.

Documentation

Schema changes would require proper documentation for future purposes, which I will write in the official developer docs. Most of the remaining work would be in JavaScript, which will be documented using JSDoc. Additionally, in every buffer week I will spend time recording all the work carried out in the previous phase, which will cover the developer side of the documentation.
The addition of imported objects is a huge overhaul of the present state of the site, and some users (old and new) might easily become confused about how to use the new features. To tackle this, I plan on writing a concise but complete, user-friendly explanation of the new import object, which could be added to the website itself (with links to it throughout the website) or hosted on Read the Docs. Writing proper user documentation is as useful as the new feature itself: if the editors are unaware of how to use it, the addition would be rendered useless. One way to easily integrate FAQs with an embedded search option into the site could be the GitBook static site generator, which converts developer-friendly markdown into a static site.
Additionally, I will open multiple topics on the MetaBrainz community forum which answer some frequently asked questions regarding the new features and imports. I will also work on documenting some of the present features.


Timeline, Alternate Ideas, Detailed Information about yourself.


#2

Maybe a 4th (iv’th?) option: merge into existing entity, which would copy any information missing from the entity already existing in BB (maybe the source has a more detailed birthdate, maybe there are identifiers not in BB already, etc.). This would also require a way to compare the two and, in case of conflicts (e.g., BB says birthyear is 1984, import source says 1985), a way to choose whether to keep foreign or local data.

We have a challenge with not enough of MusicBrainz’s users engaging in voting, which means that far, far most of MusicBrainz edits go through without anyone casting any sort of vote on them. If you make it a precondition for an item to get imported into a “full” BB entity that it needs enough “votes”, you may end up with a system where people will just create the entity directly in BB rather than pulling it in from an import, because it takes so/too long for imports to get moved to the “real” system.

I didn’t see any MBIDs or BBIDs listed in this section… ?


#3

Thank you @Freso for your review! :smiley:

I completely agree with that, and that’s why I propose that the promotion of an import occurs straight away on the very first approval by an editor - it becomes “full” BB entity on the very first vote. However, deleting an import doesn’t occur right away, as that would lead to permanent deletion of the import. Perhaps, if two editors point out that the data is not fit to become a “full” BB entity, and should be discarded - it will be deleted; but not on the very first vote.

That would be really helpful! Perhaps, another button “Merge” alongside “Approve” and others, which would open up a modal requiring one to enter the BBID of the entity to be merged with - which on submission opens the form prefilled with data from both sources. I am not sure how to manage conflicts though. It could be that the editor wishes to keep one piece of data from one source and another from the second. So I’m unsure about it - maybe something like a git diff? :thinking:

Woops! I thought I had converted all the links when I ported from docs to markdown. Fixed now! :sweat_smile:


#4

Just a side note to the above proposal, I am in the process of downloading the dumps and carrying out a rough analysis of the data in them. Once that is complete, I will update the proposal with what type of data would be available, how would we be able to clean them and construct a common data object and where specifically it would go in the bookbrainz-schema. :slight_smile:
All of this would come under the section The structure of the data dumps. For example, a sample entry of openlibrary.authors is as follows:

{
	"name": "Jürgen Siemund",
	"personal_name": "Jürgen Siemund",
	"last_modified": {
		"type": "/type/datetime",
		"value": "2008-08-20T20:07:43.066631"
	},
	"key": "/authors/OL1017592A",
	"type": {
		"key": "/type/author"
	},
	"revision": 2
}

#5

Good choices, I like how you’ve clearly defined where you’re going to import from. Stretch goal: find one other place to mass import from - in particular, it would be good to get a load of data about non-English books.

Agree with @Freso that it would be good to allow editors to merge data. We’d probably want to have the system try to identify matches in the existing data, and then based on that, allow the editor to merge into one of the matched entities. However, it might be best to wait to do this until you can merge existing entities in BookBrainz, since the UI should be very similar. Nice diagram :slight_smile:

You’ll probably need a table of votes. Otherwise, there’s no way of knowing who has voted - the same editor could vote twice to delete the data, and it would just increment the count at the moment.

Probably better to use a text field for the source, if the length is not knowable when the table is being designed.

Could do, but I’d like to see a function-based approach to handling this data, as we’re moving in that direction with the data-js package. Functions running custom SQL queries, and returning or receiving Immutable.js objects is the idea we have at the moment.

I don’t think we should allow relationships involving import objects. Let the editor make it a proper entity if they want to create a relationship.

Very, very good point, and something I hadn’t thought of. Please try to think of some solutions (ideally not introducing another table), and if you can, include these in your proposal.

It’s probably going to be difficult to get a prototype dataset that contains all edge cases just from sampling the superset of data. I’d recommend taking a subset, then appending some artificial records with hand-crafted edge cases (you’d want to think about all the weird values that the data could take, and then try each different variety for each field). You could automate this as part of testing.
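
A minimal sketch of this idea in Python, assuming records are plain dicts: keep the sampled subset as is, then append one artificial record per field and edge value, plus one record with the field missing. The `EDGE_VALUES` list and the function name `with_edge_cases` are hypothetical and would need to be extended per field type.

```python
import copy

# Hypothetical set of awkward values to try in every field.
EDGE_VALUES = [
    "",                      # empty string
    "   ",                   # whitespace only
    None,                    # explicit null
    "a" * 10_000,            # very long value
    "Ünïcödé 名前",           # non-ASCII text
    "O'Brien; DROP TABLE",   # quotes and SQL-like content
    0,                       # wrong type
]


def with_edge_cases(base_records, fields):
    """Yield the sampled records, then hand-crafted edge-case records.

    For every field, the first sampled record is used as a template and
    the field is replaced by each edge value in turn, then removed.
    """
    yield from base_records
    template = base_records[0]
    for field in fields:
        for value in EDGE_VALUES:
            record = copy.deepcopy(template)
            record[field] = value
            yield record
        # Also cover the case where the field is absent altogether.
        missing = copy.deepcopy(template)
        missing.pop(field, None)
        yield missing


subset = [{"name": "Jürgen Siemund", "key": "/authors/OL1017592A"}]
dataset = list(with_edge_cases(subset, ["name", "key"]))
```

A test suite could then feed `dataset` through each importer processing function and assert that every record is either cleaned as expected or rejected with a clear error.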

Where will these be added? What programming language? And why have two tools when you could just move the common stuff to data-js, and then have your source-specific importers interfacing directly with that?

OK, this may not be necessary if we develop a good API as well this summer, and if that’s the case, we might want to adapt this section of the proposal when you get to it. But I think having this planned works well as a backup, in case the API doesn’t get developed. We might also have issues with anonymous data submission through the API.

I like all of this.

I like this too. I think voting is unavoidable here, since we probably don’t just want to lose data forever if a single editor discards it. We should make it so that if a user has discarded an import, it no longer shows up in search results for them, otherwise it might be annoying to see the discarded entity continuing to appear.
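
The per-user hiding could be sketched roughly as follows in Python. The names `visible_results` and `discarded_by_editor`, and the shape of a search result, are assumptions for illustration; in practice this would more likely be a filter in the search query itself.

```python
def visible_results(results, discarded_by_editor, editor_id):
    """Drop imports that this editor has already voted to discard.

    results: list of dicts, each with an "import_id" key (hypothetical shape).
    discarded_by_editor: dict mapping editor id -> set of import ids.
    """
    hidden = discarded_by_editor.get(editor_id, set())
    return [r for r in results if r["import_id"] not in hidden]


results = [{"import_id": 1}, {"import_id": 2}, {"import_id": 3}]
discards = {42: {2}}

for_editor_42 = visible_results(results, discards, 42)
for_editor_7 = visible_results(results, discards, 7)
```

The import itself stays in the database for other editors; only the discarding editor stops seeing it.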

I think this is definitely a real problem, but I don’t think it has anything to do with importing - it’s an issue that we have right now with entities. If we don’t allow relationships for imports, then you can just remove this bit from the proposal (maybe turn it into a JIRA?)

Good idea, I like it. Eventually it would be good to have this tailored to the user somehow. Perhaps we can have some sort of filtering mechanism - for example, only show works from OpenLibrary that are tagged as science fiction.

If we don’t allow relationships with imported data, showing imports to visitors is less useful. We could perhaps show them in search results, but have a “log in to do something with this import” link, rather than directing visitors to the unknown-quality imported data.

This is milestone 4! :wink:

At what point in the sequence will the user log in, if they’re logging in to create the new data as an entity rather than an import? Is there a way we can make this persistent, so that the user only has to log in once per session?

Would be good to add some community-based stuff here, like editing BookBrainz/MusicBrainz, hanging out on IRC, talking to BookBrainz users to find out what they’re interested in and whether there are any particular features they’d like to see from your GSoC project. Rest of the timeline looks pretty good.

Overall, this is a fairly solid proposal. However, I’d like to see much more information about how you plan to test the components of the data importing system, and a little bit more about how you’ll document things. As I mentioned inline, generating edge case data and checking that this passes through the importer processing functions as expected would be a useful part of testing. It would also be good if you could include how you plan to structure the new repository for userscripts, bearing in mind we’ll probably have different authors contributing userscripts not necessarily related to data import in the future.


#6

Thanks for the review.

I am presently unsure how much time each dump will take to handle. My plan is to commit firmly to importing data from openlibrary.org and, as soon as that is done, move on to the next dump. In my opinion, as important as importing is, it is equally important to develop the infrastructure to handle it. That is why I plan to spend no more than one month on importing (perhaps extending it by a couple of weeks if it’s almost done and just needs some finishing), which will hopefully be enough to import as many dumps as possible. If there is still time after the importing targets are complete (in the first phase), I will certainly look for and import other dumps from non-English sources. I am still actively looking for good quality dumps for non-English sources. :slight_smile:

I have tried to make all the changes, fix mistakes and cover everything else in detail. Awaiting round two!


#7

A few things I’ve picked up on on the second read-through:

No need for a voting number or ID if you’ve already got a composite primary key - this table is just designed to link editors and imports in the form of a vote, much like our other link tables (eg. editor__language).
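
A minimal sketch of such a link table, using an in-memory SQLite database from Python only so that the example is self-contained (the BookBrainz schema itself is not shown in this thread). The table name `discard_vote`, its columns, and the helper functions are hypothetical. The composite primary key is what stops the same editor from being counted twice for the same import.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical vote link table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE discard_vote (
            import_id INTEGER NOT NULL,
            editor_id INTEGER NOT NULL,
            PRIMARY KEY (import_id, editor_id)
        )
        """
    )
    return conn


def cast_vote(conn, import_id, editor_id):
    """Record a vote; return True if it was new, False if a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO discard_vote (import_id, editor_id) VALUES (?, ?)",
        (import_id, editor_id),
    )
    return cur.rowcount == 1


def vote_count(conn, import_id):
    """Count distinct editors who voted to discard this import."""
    row = conn.execute(
        "SELECT COUNT(*) FROM discard_vote WHERE import_id = ?", (import_id,)
    ).fetchone()
    return row[0]


conn = make_db()
first = cast_vote(conn, 10, 1)    # editor 1 votes on import 10
repeat = cast_vote(conn, 10, 1)   # same editor votes again: ignored
other = cast_vote(conn, 10, 2)    # a different editor votes
total = vote_count(conn, 10)
```

Because the count is derived from the rows, there is no separate counter to drift out of sync with who actually voted.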

We could format this nicely and store it in the entity annotation - this is what people normally do in MusicBrainz if they have information that can’t currently be represented by the schema. The annotation is, in one way, kind of like a note from editors to other editors. Your way is probably more space-efficient though, and we could probably display the data in a similar way after processing.

It would be good to be able to re-import based on the saved metadata if we add fields. Then the amount of additional data could be automatically reduced and in some cases eliminated for an entity when enough fields are added to the BookBrainz schema.

What is the interface to bookbrainz.import.generate_import and push_import? CLI? Will there be a program written, and what do you expect the Git repository structure will look like for the importer? Will the generate/push modules live within the same repository as the importer scripts, or with data-js?

Searching for entities seems a bit complex and time-consuming. Why don’t we just do a name conflict check (like you’ve already worked on as a PR), and then get the matched entity if something is found?
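
For illustration, a name conflict check of this kind could be sketched in Python as below. The logic of the existing PR is not shown in this thread, so the normalization used here (ignoring case, accents and extra whitespace) and the names `normalize_name` and `find_name_conflicts` are assumptions.

```python
import unicodedata


def normalize_name(name):
    """Reduce a name to a comparison key: no accents, case or extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


def find_name_conflicts(import_name, existing):
    """Return ids of existing entities whose normalized name matches.

    existing: iterable of (entity_id, name) pairs (hypothetical shape).
    """
    target = normalize_name(import_name)
    return [eid for eid, name in existing if normalize_name(name) == target]


entities = [(1, "Jürgen Siemund"), (2, "Jane Austen")]
matches = find_name_conflicts("Jurgen  SIEMUND", entities)
```

Any entity returned here would be offered to the editor as a possible match; an empty list means the import can be treated as new.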

Some nice improvements and good detail, but it seems that the text is less clear now. Please spend some time before the final proposal submission going through and making sure everything is clear, consistent and makes sense :slight_smile:


#8

As I hit the word limit on the post size, I cannot edit the existing post. The final proposal I submitted can be found at http://shivam-tripathi.github.io/pdf/importing.pdf