GSoC 2018: Importing Data into BookBrainz from Third-party Sources

Google Summer of Code Proposal

Name. Sumit Ghosh
MB Username + IRC Nickname. SkullTech
Github. SkullTech
Website + Blog. https://sumit-ghosh.com
E-mail. sumit@sumit-ghosh.com
Time-zone. UTC +5:30

Importing data into BookBrainz from third-party sources

Project Overview

As of now, BookBrainz only supports adding data to the database manually through the web interface. Building a database of considerable size by this method alone would be highly inefficient. To solve this problem, we could develop a complete system for easily importing third-party data into the BookBrainz database.

The following text, taken from here, sheds light on the plan developed by Leftmost and LordSputnik regarding this matter.

At last year’s summit, the two BookBrainz lead developers, Leftmost and LordSputnik, worked on a plan for importing third-party data into BookBrainz. This plan has several stages. First, data sources need to be identified, including mass import sources with freely available data, such as libraries, and manual import sources, such as online bookstores and other user-contributed databases. The next stage is to update the database to introduce an “Import” object, which can be used to distinguish mass imported data from (usually better quality) user contributions. Then, actual import bots for mass import and userscripts for manual import will need to be written. Finally, it would be desirable (but not necessary if time is short) to introduce an interface to the BookBrainz site to allow users to review automatically imported data, and approve it.

As mentioned in the text above, we can work on this project by dividing it into a few distinct stages.

  1. Updating the database schema.
  2. Designing and creating the pipeline for importing and inserting data into the DB.
  3. Identifying data sources and writing the importer for those sources.
  4. (if time permits) Making a web interface for users to review conflicts.

Updating the database schema

On the database side, a new table for storing the sources will be created. It will store an identifying name for each source, along with the corresponding URL and a description. A rough schema of the table is outlined below.

CREATE TABLE bookbrainz.source (
	id SERIAL PRIMARY KEY,
	name VARCHAR(64) NOT NULL UNIQUE CHECK (name <> ''),
	url VARCHAR(64) NOT NULL DEFAULT '',
	description TEXT NOT NULL DEFAULT ''
);

As we will be creating new entities from the imported data, we will need to record the source in the stored data. For that, we will modify the entity table, adding two fields to it. One field will indicate whether the entity is imported or user-contributed; the other will point to the source in case it is imported. The corresponding schema is outlined below.

CREATE TABLE bookbrainz.entity (
	bbid UUID PRIMARY KEY DEFAULT public.uuid_generate_v4(),
	type bookbrainz.entity_type NOT NULL,
	imported BOOLEAN NOT NULL DEFAULT FALSE,
	source_id INT,
	CHECK ((imported AND source_id IS NOT NULL) OR (NOT imported AND source_id IS NULL))
);
ALTER TABLE bookbrainz.entity ADD FOREIGN KEY (source_id) REFERENCES bookbrainz.source (id);

Here we add a CHECK constraint in the definition of entity, which takes care of the interdependent constraints between source_id and imported: an imported entity must reference a source, and a user-contributed entity must not.
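
To make the constraint concrete, here is a minimal sketch of what inserting an imported entity could look like, assuming the psycopg2 driver; the connection parameters, entity type value and source name are placeholders, not actual values from the BookBrainz database.

import psycopg2

# A sketch only: connection parameters are placeholders.
conn = psycopg2.connect(dbname='bookbrainz', user='bookbrainz')
with conn, conn.cursor() as cur:
    # Look up the source this entity was imported from.
    cur.execute('SELECT id FROM bookbrainz.source WHERE name = %s', ('OpenLibrary',))
    source_id = cur.fetchone()[0]

    # imported = TRUE together with a non-NULL source_id satisfies the
    # CHECK constraint above; any other combination would be rejected.
    cur.execute(
        'INSERT INTO bookbrainz.entity (type, imported, source_id) '
        'VALUES (%s, TRUE, %s) RETURNING bbid',
        ('Work', source_id),
    )
    print(cur.fetchone()[0])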

Designing the data import pipeline

We will be creating a modular structure for the importer, so that new sources and their importers can be added easily later. The basic project structure will look something like the following.

.
├── dumps
│   └── librarything-2018-03-15T20:34:17+00:00.json
├── importers
│   ├── librarything.py
│   └── openlibrary.py
└── insert.py

Here, the scripts inside the importers directory fetch data from the corresponding source and dump it, in a specified format, into a JSON file inside the dumps folder. The insert.py script then reads a dump and inserts the data into the database. An example invocation would be the following.

$ python insert.py -d librarything-2018-03-15T20:34:17+00:00.json
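
A minimal sketch of what the insert.py entry point might look like, using argparse (mentioned in the infrastructure section below); the insert_item helper is a placeholder to be filled in with the actual database logic.

import argparse
import json

def insert_item(item):
    # Placeholder: validate the item and insert it into the database,
    # recording its source as described in the schema section above.
    raise NotImplementedError

def main():
    parser = argparse.ArgumentParser(
        description='Insert a JSON dump into the BookBrainz database.')
    parser.add_argument('-d', '--dump', required=True,
                        help='path to a JSON dump file inside dumps/')
    args = parser.parse_args()

    with open(args.dump) as f:
        items = json.load(f)

    for item in items:
        insert_item(item)

if __name__ == '__main__':
    main()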

Things to note here:

  • The JSON files inside the dumps folder follow a well-specified format and are machine-generated (a rough sketch of an importer and its dump format follows this list).
  • The importer scripts are free to import the data in whatever way they prefer. Possibilities include
    • Downloading and then reading a data dump file, such as the ones provided here by OpenLibrary.
    • Accessing a web API.
    • Scraping the content off web pages (very unreliable; we would avoid this).
  • The modularity of this approach lies in the fact that the insert.py script does not care how the data was imported; it just reads the dump JSON file and inserts its contents into the database.
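
To illustrate the first point above, here is a rough sketch of what an importer script and the dump format could look like; the field names and the record contents are hypothetical and would be specified precisely during the project.

# importers/openlibrary.py -- rough sketch; the dump format shown here
# is hypothetical and would be pinned down during the project.
import json
from datetime import datetime, timezone

def fetch_records():
    # Placeholder: read an OpenLibrary data dump, call a web API, etc.,
    # normalising each record into the common dump format.
    yield {
        'type': 'Work',
        'name': 'Example Title',
        'identifiers': {'openlibrary': 'OL0000000W'},  # dummy value
        'source': 'OpenLibrary',
    }

def main():
    timestamp = datetime.now(timezone.utc).isoformat()
    with open('dumps/openlibrary-{}.json'.format(timestamp), 'w') as f:
        json.dump(list(fetch_records()), f, indent=2)

if __name__ == '__main__':
    main()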

Handling duplicates and conflicts

While inserting entries into the database from the data dumps, we will have to take care of the following:

  • Avoid duplicate entries
  • Resolve conflicting entries

For this, the insert.py script can do the following: before inserting, check whether the database already has an entry for the same item. If it does, and the data is conflicting (i.e. the values stored in the database and the imported values do not match), save the imported entry separately for manual review by a user. It could be stored either in a separate table in the database or in a JSON file (something like conflicts.json).
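
A rough sketch of that logic follows; find_existing and insert_item are hypothetical helpers (the former could, for example, match entries on identifiers such as ISBN).

import json

def find_existing(item):
    # Placeholder: look the item up in the database, e.g. by identifiers
    # such as ISBN, and return the stored record or None.
    raise NotImplementedError

def insert_item(item):
    # Placeholder insert helper (see the insert.py sketch above).
    raise NotImplementedError

def insert_or_defer(item, conflicts):
    existing = find_existing(item)
    if existing is None:
        insert_item(item)   # new entry, safe to insert
    elif existing == item:
        pass                # exact duplicate, nothing to do
    else:
        # Conflicting values: defer to manual review instead of overwriting.
        conflicts.append({'existing': existing, 'imported': item})

def save_conflicts(conflicts):
    # Written out after processing a dump, for manual review by a user.
    with open('conflicts.json', 'w') as f:
        json.dump(conflicts, f, indent=2)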

Identifying data sources and writing importers

This step involves researching available data sources that have a permissive license. Writing the importer scripts for those sources will have to be done in parallel with this. This stage can also continue indefinitely, i.e. new developers joining the project later can keep extending this part of the pipeline.

Infrastructure to be used

  • We will be writing all the scripts (importers and insert.py) in Python.
  • For CLI facilities of the scripts, the argparse library will be used.
  • For interaction with the database, we can use something like peewee, a simple and clean ORM for Python (see the sketch after this list). If it turns out to be inadequate, we can fall back to SQLAlchemy.
  • For writing test suites, pytest could be used.
  • For documentation, Sphinx along with Read the Docs is an excellent solution.
  • For any potential scraping and crawling, the Scrapy framework can be used.
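
If we do go the ORM route, a minimal sketch of the source table as a peewee model might look like the following; the connection parameters are placeholders.

from peewee import Model, CharField, TextField, PostgresqlDatabase

# Placeholder connection parameters.
db = PostgresqlDatabase('bookbrainz', user='bookbrainz')

class Source(Model):
    name = CharField(max_length=64, unique=True)
    url = CharField(max_length=64, default='')
    description = TextField(default='')

    class Meta:
        database = db
        schema = 'bookbrainz'
        table_name = 'source'

# Example usage: Source.create(name='OpenLibrary', url='https://openlibrary.org')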

Timeline breakdown

The total coding time for SoC is 12 weeks, divided into 3 phases.

Timeline

  1. Week 1. Updating the database schema.
  2. Weeks 2 to 3. Designing and creating the data import pipeline.
  3. Week 4. Writing test cases and documentation for the above.

Phase 1 evaluation.

  1. Weeks 5 to 8. Identifying data sources and writing importers, along with their test cases and documentation in parallel. We would try to cover at least one data source per week, so that we have working importers for at least 4 data sources by the end of this phase.

Phase 2 evaluation.

  1. Weeks 9 to 10. Complete any backlog from the earlier phases; otherwise, make a web interface for users to review conflicts.
  2. Weeks 11 to 12. Finishing up the project: making sure the codebase is complete with test cases and documentation, and that the code is deployable in production.

Final evaluation.

  1. Beyond. Keep writing code for various MetaBrainz projects!

About Me

I am a Computer Science undergraduate studying at IIT Delhi. Check out my resume here.

Question: Do you have Github, Twitter or any other public profiles you think we might find interesting?
Answer: Yes, I do have some public profiles I think you might find interesting.

  • Github: I put up most of the projects I work on here. I built many of these for small freelancing gigs that I did through Upwork.
  • Personal Website: My blog, double-serving as my personal website.
  • Twitter: Disclaimer, I am not an avid Twitter user.
  • LinkedIn: For jobs and internships and stuff. The chore.
  • Goodreads: I wanted to include this because I’m an avid reader, and that’s one of my primary motivations for wanting to contribute to BookBrainz. I log all of my reading activity on my Goodreads account, so you can get to know my taste from there.

Question: Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
Answer: Yes, I have contributed to some open source projects, listed below.

  • MusicBrainz Picard Website. This is the project through which I got exposed to the MetaBrainz developer community for the first time, almost a year ago. Here are my pull requests.
  • Mozilla’s Webcompat Project. Below are the Mozilla repos related to their webcompat project to which I’ve made contributions recently.
    • Webcompat Issue Parser. This is a tool for downloading and parsing the webcompat issues. Here are my pull requests.
    • Autowebcompat. This project aims to automatically detect web compatibility issues using Selenium and Neural Networks. Here are my pull requests.
  • ProxyMan. This is a handy tool to configure proxy settings on Linux easily. Here are my pull requests.

Question: What sorts of programming projects have you done on your own time?
Answer: I like to make small but useful projects using the concepts I learn, turning newly acquired knowledge into something practical. Along the way I usually end up learning a bunch of new things, so that’s a bonus! Here are some of the projects I have recently done in my own time.

  • PearSend. A simple CLI client for peer-to-peer file or message sending, written in Python. It implements protection against transmission errors using a CRC32 checksum.
  • Goodgives (WIP). A CLI tool + web app for auto-entering giveaways on Goodreads. I have always been a fan of books, and who wouldn’t love some free books!
  • Journalist. An app for writing journals digitally. It lets you write your journal in Markdown in your favorite text editor and view the entries (Markdown rendered as HTML) in the browser. It stores written journals in a well-organized directory structure.

Other than these, I’ve made a lot of small CLI apps and web apps, for freelance gigs or simply for learning. Check out my Github for the full list.

Question: Tell us about the computer(s) you have available for working on your SoC project!
Answer: I have an ASUS laptop (ZenBook UX430UA). It has an Intel Core i7 processor, 8 GB of RAM and a 256 GB SSD. I dual-boot Windows 10 and Ubuntu 17.10 on it, doing all my programming-related work on Ubuntu. I also use DigitalOcean VPSes for long-running scripts and for testing.

Question: When did you first start programming?
Answer: I was introduced to programming at the age of 12, through QBASIC, in my secondary school. From my first exposure I was hooked, but not having a personal computer was an inhibiting factor. Later, when I was around 15, I got a computer, and I have been programming passionately ever since.

Question: What type of music do you listen to?
Answer: I love listening to all kinds of music, but in particular I have a weakness for alternative and psychedelic rock. Some of my favorite rock bands are Pink Floyd, Muse and Coldplay. I also like listening to hip-hop and R&B once in a while; artists like Kendrick Lamar, Eminem, and The Weeknd are some of my favorites too.

Question: What type of books do you read?
Answer: Recently I’ve been reading a lot of non-fiction, especially books related to philosophy, psychology, self-help, business, finance, etc., some of them being Man’s Search for Meaning, The Mythical Man-Month, and Homo Deus: A Brief History of Tomorrow. Among fiction, I recently read The Stranger by Albert Camus and 1984 by George Orwell, and I loved both of them!

I read a lot of fiction when I was younger, especially high-fantasy works such as the Harry Potter series, the Narnia series, and The Hobbit. I also read and loved all the books by Michael Crichton, Dan Brown, and Indian author Satyajit Roy.

Question: What aspects of BookBrainz interest you the most?
Answer: Most of the already available book databases don’t link to other databases, but BookBrainz does, through its different identifiers, such as Wikidata ID, ISBN, ISNI, VIAF, LibraryThing ID, etc. I think this aspect of having a single database that links to all the different existing databases is exciting. Also, giving users the power to add and edit data makes it a community-driven project, which is an interesting concept, different from something like LibraryThing (although similar to Wikipedia and Wikidata).

Question: Have you ever used MusicBrainz to tag your files?
Answer: Yes, I have a large collection of MP3 files. And I use MusicBrainz Picard to properly tag them.

Question: How much time do you have available, and how would you plan to use it?
Answer: The summer vacation at my college starts on May 18, which almost exactly coincides with the start of the official coding period. So I will be available full-time throughout the SoC period. I plan to work at least 8 hours every weekday, and about 4 hours per day at the weekends. I will possibly put many more hours into this project, as I won’t have anything else to do.

Question: Do you plan to have a job or study during the summer in conjunction with Summer of Code?
Answer: No, if I get selected for SoC I won’t be doing any other job or studying this summer.


This is a draft of my proposal for GSoC 2018. Feedback and suggestions for improvement would be very helpful. :slight_smile:

Hey SkullTech, sorry it’s taken so long to get you feedback; I should hopefully be able to give you a proper review today. Thanks for waiting!


Good idea, we will definitely want some way of tracking the sources of data. I’d probably say use TEXT for the urls and names, since we have no idea of the length of these fields in advance. Perhaps VARCHAR(255) could work for name, but I would definitely use TEXT for the url.

I don’t like the idea of modifying the entity table. Currently, all entities can be linked to revisions through the header tables, and the revision links the entity to its versioned data. However, revisions can only be created by editors, and have an “author_id” field, which is a FK to the editor table. Instead of trying to modify this whole system to allow authorless revisions, it would probably be easier to create an “import” table which links to data directly through a set of “_import” tables. This keeps the entity system exactly as it is now, ensuring that nothing breaks on the site because of the modification, and also ensures the new “source” field is only added where it’s needed (and removes the need for an “imported” field). Let me know if you’ve already worked out how to get around this problem with your current plan.

I’d prefer not having to generate the JSON files as an intermediate step. Having all data in the database ensures that we only have to keep backups of one thing, and reduces the complexity of the system. I think you’d be able to have very similar modularity, by just making the “insert.py” into a python package which can be called by the individual importer scripts.

How will it determine if an entry is “the same”? This approach also has the problem that we’re mixing good, user-added data, and unknown-quality imported data - it’s important to be able to distinguish between these, so updating existing entities based on imported data without human checking is a bit risky.

Rather than do this, I’d recommend focusing on detecting potential matches, and working out how to store this information in the database. We can then add an interface allowing editors to view the potential matches when they’re reviewing the imported data, and optionally merge the imported data into one of the matched entities.

Could also consider “click”, but argparse is fine for this too. You’ll be using Python 3, I assume?

Hmm. I’d prefer avoiding an ORM for this - it would be nicer if we could just use plain old SQL, created by a set of functions - this is the approach we’re moving to for the site ↔ db connection, after starting out with an ORM, as ORMs are mostly unnecessary and can also be quite inefficient, as you’re limited to accessing data in a particular way. That said, I don’t feel too strongly about this, it’s just a recommendation.

Sites may not like us doing automated scraping of their pages, which is why the userscript method was preferred for getting data where an API or data dumps aren’t made available. How much experience do you have with writing JavaScript, as that would also be useful for writing the user-facing aspects of the import system?

I’d like to see more detail in your timeline. Please expand on what you’ll be doing for each of the weeks, and explain each point a bit more.

I think this would be a good task to get completed during the community bonding phase and first few weeks of the project.

These 4 weeks seem to essentially just be completing the stuff you started in the first 8 weeks. Please try to estimate how long each aspect will take with more accuracy, and then be more specific about when each aspect of the project will be completed. You’ve said in week 4 you’ll be writing test cases and documentation for the data import pipeline, so what will you be writing tests and documentation for these last two weeks?

Great to hear!!

Nice taste in books and music too!

I like this too, and think it’s an important aspect of BookBrainz. We’re hoping to expand this in the future, adding more identifiers and having identifiers as links rather than just text.

Overall, I’d like to see more detail in this proposal. Please think about the technical challenges you’re likely to encounter (text encoding, field format conversion, resuming interrupted imports and incremental imports are some examples), and then explain how you’re planning to solve them. Please also consider the suggestions I’ve made about the architecture of the import schema, and see whether you agree. Maybe also include some example queries that will be executed when data is first imported, and then when an editor reviews it and updates the data.


@LordSputnik Thank you very much for the feedback! I will modify my proposal soon, keeping your input in mind; at the moment I have an exam going on at my uni. Sorry for the delay.