GSoC 2022: Import the (now defunct) Bookogs database
Name: Arun Maurya
IRC: asymmentric
Github: asymmentric
Email: arunmaurya621@gmail.com
LinkedIn: Arun Maurya
Timezone: GMT +5:30
Project Objectives
The Bookogs database dumps were made publicly available for download in JSON format right after the project closed. To prevent the loss of Bookogs contributions, and of contributions to similar projects, this project will create a backend system that parses and imports data from large-scale database dumps (SQL, NoSQL, OLAP, etc.). The main objectives are:
- Parse large database dump files efficiently
- Create adapters to transform entities from one database schema to another
Parsing Database Dumps (Producer)
A common problem when importing databases is that large dump files cannot be loaded into memory all at once. This can be solved by streaming the files and processing the data in chunks. Each incoming chunk can be converted into objects, which are easy to manipulate and push to the message queue.
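The streaming approach described above can be sketched as follows. This is only a minimal illustration, not the final design: it assumes the dump is newline-delimited JSON (one object per line), and `publish` is a hypothetical callback standing in for whatever message queue client ends up being used (for example, a Kafka producer's send method).

```python
import io
import json
from typing import Callable, Iterable, Iterator, List, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, never loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_chunks(records: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Group records into lists of at most chunk_size items."""
    chunk: List[dict] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


def produce(stream: TextIO, publish: Callable[[bytes], None], chunk_size: int = 1000) -> int:
    """Stream the dump chunk by chunk and hand every record to publish().

    publish is a hypothetical stand-in for a message queue client.
    Returns the number of records sent.
    """
    sent = 0
    for chunk in iter_chunks(iter_records(stream), chunk_size):
        for record in chunk:
            publish(json.dumps(record).encode("utf-8"))
            sent += 1
    return sent


if __name__ == "__main__":
    # Tiny in-memory "dump" with made-up fields; a real run would pass open(path).
    dump = io.StringIO('{"id": 1, "name": "A"}\n{"id": 2, "name": "B"}\n')
    messages: List[bytes] = []
    print(produce(dump, messages.append, chunk_size=1), "records published")
```

Because only one chunk is held in memory at a time, memory use depends on `chunk_size` rather than on the size of the dump file.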
Inducting entities into BookBrainz database (Consumer)
To make effective use of the data objects arriving at the consumer from the message queue, each object should be filtered before the SQL insert templates are filled in. Based on this filtering, the object is mapped to the corresponding BookBrainz entities.
- Scripts to filter and scrape objects for entity data.
- Insert templates for the different entity types.
Objects would first be inserted into separate import entities and would be inducted into the main database entities only after manual confirmation from editors.
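The consumer steps above could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the object fields (`type`, `id`, `name`, `title`), the staging tables `import_author` and `import_work`, and the use of an in-memory SQLite database in place of the real BookBrainz database. It is not the actual BookBrainz schema.

```python
import sqlite3
from typing import Iterable, Optional, Tuple

# One parameterized insert template per entity type (hypothetical staging tables).
INSERT_TEMPLATES = {
    "author": "INSERT INTO import_author (source_id, name) VALUES (:source_id, :name)",
    "work": "INSERT INTO import_work (source_id, title) VALUES (:source_id, :title)",
}


def create_staging_tables(conn: sqlite3.Connection) -> None:
    """Create the hypothetical staging tables that hold unconfirmed imports."""
    conn.execute("CREATE TABLE import_author (source_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE import_work (source_id INTEGER, title TEXT)")


def filter_object(obj: dict) -> Optional[Tuple[str, dict]]:
    """Map a queue object to (entity type, template parameters), or None to skip it."""
    kind = obj.get("type")
    if kind == "author" and obj.get("name"):
        return "author", {"source_id": obj.get("id"), "name": obj["name"].strip()}
    if kind == "work" and obj.get("title"):
        return "work", {"source_id": obj.get("id"), "title": obj["title"].strip()}
    return None  # unknown type or missing required field


def consume(conn: sqlite3.Connection, objects: Iterable[dict]) -> int:
    """Filter each object and insert it through the matching template."""
    inserted = 0
    for obj in objects:
        mapped = filter_object(obj)
        if mapped is None:
            continue
        entity, params = mapped
        conn.execute(INSERT_TEMPLATES[entity], params)
        inserted += 1
    conn.commit()
    return inserted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_staging_tables(conn)
    incoming = [
        {"type": "author", "id": 1, "name": " Example Author "},
        {"type": "work", "id": 2, "title": "Example Work"},
        {"type": "unknown", "id": 3},
    ]
    print(consume(conn, incoming), "objects staged for editor review")
```

Keeping the templates parameterized means the database driver handles quoting, and keeping the inserts in staging tables leaves the main entities untouched until an editor confirms the import.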
TIMELINE
Pre-GSoC (Before 19th May)
- Explore the BookBrainz project to gain a clearer understanding of the codebase.
- Set up the environment
- Create tickets and solve some.
- Discuss with mentor and community members about the project for suggestions.
My PRs: Check Out
Community Bonding Period
- Extend mentor and community interaction, and introduce my project to the community
- Analyse the BookBrainz schema and understand entity relationships.
Phase 1
Week 1-2
- Explore the Bookogs data dump to understand how the data is organised and categorised.
- List the types and categories present in the Bookogs database that are not in BookBrainz, and discuss them with the mentor.
Week 3-4
- Set up ZooKeeper and Kafka on Docker
Week 5
- Create a parser to extract data from various file formats and convert it into objects.
Week 6
- Discuss with the mentor and community members how to modify the BookBrainz database to accommodate the types, categories and extra data that cannot yet be added to the existing BB database.
First Evaluation Deliverables
- Set up and deploy Kafka on Docker
- System to efficiently parse large files and push the parsed objects to the message queue
Phase 2
Week 7
- Map imported data/entities to those of the main BookBrainz database in order to transform imported entities into BookBrainz entities.
Week 8
- Write scripts to filter object data for different entities
Week 9-10
- Create insert templates.
Week 11-12
- Buffer time for unexpected delays and issues
Final Evaluation Deliverables
- System to filter incoming objects and insert them into the database accordingly
Post GSoC
- Stay active in the community and contribute to BookBrainz to add more features and enhance functionality
- Work on improving the BookBrainz schema to allow more features.
About me
I am a sophomore pursuing a bachelor’s degree in Computer Science at Vidyavardhaka College of Engineering, Mysore. I have a background in backend development and like to play around with APIs a lot. I am curious about how things work, and I love building things that channel innovation.
I am working consistently to contribute to the community and am enjoying the process very much.
Other Information
Tell us about the computer(s) you have available for working on your SoC project!
- Ideapad Slim 5
- Processor: Ryzen 7 5000 series
- RAM: 16 GB
When did you first start programming?
It has been 4 years since I started programming.
What type of music do you listen to?
My music taste varies a lot, from Nusrat Fateh Ali Khan and Manoj Tiwari to OneRepublic, The Chainsmokers, Juice Wrld, Drake and The Weeknd.
What type of books do you read?
I enjoy reading Hindi poetry and non-fiction novels.
- Cb61de22-6eba-4cde-afdf-44ba252ef3ec
- 95aaad89-c372-4a87-9e2a-571208a1e85e
- 48a73c7f-5cfe-4cb8-92d1-8a623bbddce0
What aspects of the project you’re applying for interest you the most?
What I like most about BookBrainz and its community is the will to preserve contributions and keep them available to all.
Have you ever used MusicBrainz to tag your files?
Not yet, but I will certainly do so in the future.
If you have not contributed to open source projects, do you have other code we can look at?
Yes, I have various projects, such as Unexplored Places API, Beer Search and Educational ERP, along with many others available on my GitHub.
What sorts of programming projects have you done on your own time?
I have done some IoT projects like automating RC Cars and UAVs for fun.
How much time do you have available, and how would you plan to use it?
I plan to dedicate 28+ hours per week.