GSoC 2022: Import the (now defunct) Bookogs database

Name: Arun Maurya

IRC: asymmentric

Github: asymmentric

Email: arunmaurya621@gmail.com

LinkedIn: Arun Maurya

Timezone: GMT +5:30

Project Objectives

The Bookogs database dumps were made publicly available for download in JSON format right after the project closed. In order to prevent the loss of the Bookogs contributions, and of similar contributions from other sources, the goal is to create a backend system to parse and import data from large-scale databases (SQL, NoSQL, OLAP, etc.).

  • Parse large database dump files efficiently
  • Create adapters to transform entities from one database schema to another (a rough adapter sketch follows this list)
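
As an illustration of what such an adapter could look like, here is a minimal TypeScript sketch. The record shapes on both sides (`BookogsBook`, `BookBrainzEditionImport`) and their field names are hypothetical placeholders, not the actual Bookogs dump format or the BookBrainz schema.

```typescript
// Illustrative adapter sketch only: the field names on both sides
// are hypothetical and do not reflect the real Bookogs dump or
// BookBrainz schema.
interface BookogsBook {
  id: number;
  title: string;
  authors: string[];
  isbn_13?: string;
}

interface BookBrainzEditionImport {
  externalId: string;     // id from the source database, used to avoid duplicates
  name: string;
  creatorNames: string[];
  isbn13: string | null;
  source: 'bookogs';
}

// Transform one Bookogs record into the shape expected by the import tables.
function adaptBook(record: BookogsBook): BookBrainzEditionImport {
  return {
    externalId: `bookogs-${record.id}`,
    name: record.title.trim(),
    creatorNames: record.authors ?? [],
    isbn13: record.isbn_13 ?? null,
    source: 'bookogs',
  };
}
```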

Parsing Database Dumps (Producer)

The main problem encountered while importing such databases is handling large files that cannot be processed directly, keeping in mind the memory limitations. This can be addressed by streaming the large files and processing the data in chunks. The data chunks coming in from the files can be converted into objects, making them easy to manipulate and push to the message queue.
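
As a rough producer sketch, assuming the dump has been converted to newline-delimited JSON and that the `kafkajs` client is used (the broker address, topic name and batch size below are placeholders):

```typescript
// Minimal sketch: stream a newline-delimited JSON dump and push each
// record to Kafka without loading the whole file into memory.
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'bookogs-importer', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function streamDump(path: string): Promise<void> {
  await producer.connect();
  const lines = createInterface({ input: createReadStream(path) });

  let batch: { value: string }[] = [];
  for await (const line of lines) {
    if (!line.trim()) continue;
    batch.push({ value: line });         // each line is one Bookogs record
    if (batch.length >= 500) {           // flush in chunks to bound memory use
      await producer.send({ topic: 'bookogs-import', messages: batch });
      batch = [];
    }
  }
  if (batch.length) {
    await producer.send({ topic: 'bookogs-import', messages: batch });
  }
  await producer.disconnect();
}

streamDump('./bookogs-dump.ndjson').catch(console.error);
```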

Inducting entities into the BookBrainz database (Consumer)

To make effective use of the data objects arriving at the consumer from the message queue, each object should be filtered so that the appropriate SQL template can be created. Based on this filtering, the object is then mapped to the corresponding entity.

  • Scripts to filter and scrape objects for entity data.
  • Insert templates should be created for different entities.

Objects would first be inserted into separate import entities, and would only be inducted into the main database entities after manual confirmation from editors.
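
As a rough consumer sketch, assuming `kafkajs` and a `pg` connection pool; the table and column names (e.g. `author_import`) are illustrative placeholders rather than the real BookBrainz schema:

```typescript
import { Kafka } from 'kafkajs';
import { Pool } from 'pg';

const kafka = new Kafka({ clientId: 'bookogs-consumer', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'bookogs-import' });
const db = new Pool({ connectionString: process.env.DATABASE_URL });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: 'bookogs-import', fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const record = JSON.parse(message.value?.toString() ?? '{}');

      // Route each record to an insert template based on its type.
      if (record.type === 'author') {
        await db.query(
          'INSERT INTO author_import (external_id, name) VALUES ($1, $2)',
          [record.id, record.name]
        );
      }
      // ...other entity types (work, publisher, edition) handled similarly
    },
  });
}

run().catch(console.error);
```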

TIMELINE

Pre-GSoC (Before 19th May)

  • Explore the BookBrainz project to gain a clearer understanding of the codebase.
  • Set up the environment.
  • Create tickets and work on solving some of them.
  • Discuss the project with my mentor and community members to gather suggestions.

My PRs: Check Out

Community Bonding Period

  • Increase interaction with my mentor and the community, and introduce my project to them.
  • Analyse the BookBrainz schema and understand entity relationships.

Phase 1

Week 1-2

  • Explore the Bookogs data dump to understand how the data is organised and categorised.
  • List the types and categories present in the Bookogs database that are not in BookBrainz, and discuss them with my mentor.

Week 3-4

  • Set up ZooKeeper and Kafka on Docker.

Week 5

  • Create a parser to extract data from various file formats and convert it into objects.

Week 6

  • Discuss with my mentor and community members possible modifications to the BookBrainz database to accommodate the types, categories and extra data that cannot yet be added to the existing BB database.

First Evaluation Deliverables

  • Set up and deploy Kafka on Docker.
  • A system to efficiently parse large files and push the data to the message queue.

Phase 2

Week 7

  • Map imported data/entities to those of the main BookBrainz database, transforming imported entities into BookBrainz entities.

Week 8

  • Write scripts to filter object data for different entities

Week 9-10

  • Create insert templates.

Week 11-12

  • Buffer time for unexpected delays and issues.

Final Evaluation Deliverables

  • A system to filter incoming objects and insert them into the database accordingly.

Post GSoC

  • Be active in the community and continue contributing to BookBrainz to add more features and enhance functionality.
  • Work on improving the BookBrainz schema to allow more features.

About me

I am a sophomore pursuing a bachelor's degree in Computer Science at Vidyavardhaka College of Engineering, Mysore. I have a background in backend development and enjoy experimenting with APIs. I like digging into how things work, and I love building things that channel innovation.

I have been contributing to the community consistently and am enjoying the process very much.

Other Information

Tell us about the computer(s) you have available for working on your SoC project!

  • Ideapad Slim 5
  • Processor: Ryzen 7 5000 series
  • RAM: 16 GB

When did you first start programming?

It has been 4 years since I started programming.

What type of music do you listen to?

My music taste varies a lot, from Nusrat Fateh Ali Khan and Manoj Tiwari to OneRepublic, The Chainsmokers, Juice Wrld, Drake and The Weeknd.

What type of books do you read?

I enjoy reading Hindi poetry and non-fiction books.

What aspects of the project you’re applying for interest you the most?

What I like most about BookBrainz and its community is the will to preserve contributions and make them available to all.

Have you ever used MusicBrainz to tag your files?

As of now I have not, but I will definitely do so in the future.

If you have not contributed to open source projects, do you have other code we can look at?

Yes, I have various projects such as Unexplored Places API, Beer Search and an Educational ERP, all of which are available on my GitHub.

What sorts of programming projects have you done on your own time?

I have done some IoT projects like automating RC Cars and UAVs for fun.

How much time do you have available, and how would you plan to use it?

I plan to dedicate 28+ hours per week.

Hey @mr_monkey,
please review it when you're free and suggest any further improvements that can be made.

Hello @asymmentric!
Thank you for your proposal, it looks good 🙂

My concern at this point is more that I have not seen a great deal of contributions from you to attest to your understanding of the codebase and schema, which can be quite complex at times but is crucial for this project.

We have in place special entities for importing (search for “_import” in the schema file), which allow us to show them clearly set apart in the UI while awaiting manual confirmation.
So a separate imports schema shouldn’t be needed.

They are basically copies of regular entities, with an extra id.
With that id, we should be able to build a system to re-run imports without creating duplicates. This would allow updating outdated information.

We also need to think about infrastructure; where is this going to run? How do we make it stable, resilient and scalable?
I would imagine for example that using RabbitMQ or some such system would be useful to ensure processes can be stopped without having to start from the beginning.
If we wanted to import from another format (MARC for example), how could we reuse this infrastructure? How could we make it more modular?

@mr_monkey Thank you for the suggestions.
I will have to change my approach a bit.

I am planning to use Kafka, keeping scalability in mind, and deploy it on Docker.

I’m currently working on ways of parsing MARC files so that they can be converted into objects on which insertion operations can be performed.
The same can be achieved for CSV files as well.
The final objective is that parsing, not only of JSON files but of all formats, extracts and converts the data into objects, as sketched below.
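
To illustrate the modularity question raised above, here is a rough sketch of a common parser interface behind which JSON, CSV or MARC sources could all produce the same stream of objects. The names below are purely hypothetical and not existing project code.

```typescript
// Sketch of a modular parser layer: every source format implements the
// same interface, so the rest of the pipeline only ever sees an async
// stream of plain objects.
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

interface DumpParser {
  // Yield one record at a time so even very large dumps stay memory-safe.
  parse(path: string): AsyncGenerator<Record<string, unknown>>;
}

// Newline-delimited JSON implementation (the Bookogs dump case).
class NdjsonParser implements DumpParser {
  async *parse(path: string): AsyncGenerator<Record<string, unknown>> {
    const lines = createInterface({ input: createReadStream(path) });
    for await (const line of lines) {
      if (line.trim()) {
        yield JSON.parse(line);
      }
    }
  }
}

// A CsvParser or MarcParser would implement the same interface, so the
// queueing, adapting and inserting stages stay unchanged regardless of
// the source format.
async function produceAll(parser: DumpParser, path: string): Promise<void> {
  for await (const record of parser.parse(path)) {
    // push `record` to the message queue here
    console.log(record);
  }
}
```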

I’ll make the proposed changes in the mock-up as well as the proposal.

I’ve been through the codebase and schema thoroughly but have not been able to contribute due to my university exams. I have taken note of a few tickets and will start working on them, as my exams will be over in 2 days.