Data importing app
Discussion about an app that automatically collects data from various sources, matches it, and lets the user send it to the BB import queue.
- The European Library
Sample data (from the National Library of Portugal): https://gist.github.com/anonymous/cb373492831a63dcf826
- Their data is CC0-licensed, but only the database dumps; data fetched through the API is not licensed for commercial purposes.
- A lot of entries (a little more than 95 million)
- Data in various languages (because it is from European libraries)
- Duplication is handled by checking whether an identifier, such as the one used by The European Library, is already in the database. To do so, the app will download a list of previously rejected identifiers. That list will cover most of the entries in BB, so there is no problem with requesting every entry that is not on the list.
- The app will use pybb for searching, downloading, and sending data to and from BB.
- Authors will be matched by name and then by identifier. For example, to find Orwell with the European Library id='abcd123', the app would first search BB for all Orwells and then check the identifiers of each one.
- Python will be the language and PyGTK the UI toolkit (not PyQt, because it lacks good docs).
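The duplicate check and the author matching described in the list above could look roughly like the following. This is only a sketch: the record layout (dicts with `name` and `identifiers` keys), the function names, and the in-memory candidate list are assumptions made for illustration. In the real app the candidates would come from a name search in BB through pybb, whose API is not shown here.

```python
def is_known(identifier, known_identifiers):
    """Return True if this source identifier is on the downloaded list."""
    return identifier in known_identifiers


def match_author(name, source, identifier, candidates):
    """Find the candidate whose name matches and who carries the identifier.

    `candidates` stands in for the result of a name search in BB
    (hypothetical layout: {"name": str, "identifiers": {source: id}}).
    """
    wanted = name.casefold()
    for candidate in candidates:
        # Step 1: keep only candidates whose name contains the search term.
        if wanted not in candidate["name"].casefold():
            continue
        # Step 2: compare the identifier issued by the given source.
        if candidate.get("identifiers", {}).get(source) == identifier:
            return candidate
    return None


# Example data, invented for illustration.
known = {"abcd123"}
authors = [
    {"name": "George Orwell", "identifiers": {"european_library": "abcd123"}},
    {"name": "Sonia Orwell", "identifiers": {"european_library": "zzzz999"}},
]

match = match_author("Orwell", "european_library", "abcd123", authors)
```

Keeping the known identifiers in a set makes the lookup cheap even for a long list, and returning `None` when no candidate carries the identifier leaves the decision (create a new author or ask the user) to the caller.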
Matters under consideration, ideas
- The app should be able not only to add new entities but also to fill in existing ones with more data. Simply checking whether an entry is in a list is therefore no longer sufficient.
- The European Library doesn't seem to distinguish between Editions and Publications, so the solution should be to add both at once when importing: first search for publications with the given alias, ask the user which publication it is, then ask the user whether the edition being added matches any of those already created.
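The two ideas above, filling in existing entities and picking a publication before adding an edition, might be sketched like this. All names (`fill_missing_fields`, `find_publications_by_alias`) and the dict-based entity layout are hypothetical. Asking the user is left out; the functions only prepare the data a dialog would show.

```python
def fill_missing_fields(existing, incoming):
    """Return a copy of `existing` with empty fields filled from `incoming`.

    Values already present in `existing` are never overwritten.
    """
    empty = (None, "", [])
    merged = dict(existing)
    for key, value in incoming.items():
        if merged.get(key) in empty and value not in empty:
            merged[key] = value
    return merged


def find_publications_by_alias(alias, publications):
    """Return publications that carry the given alias (case-insensitive)."""
    wanted = alias.casefold()
    return [
        pub for pub in publications
        if any(wanted == a.casefold() for a in pub.get("aliases", []))
    ]


# Example data, invented for illustration.
existing = {"title": "1984", "language": None}
incoming = {"title": "Nineteen Eighty-Four", "language": "eng", "pages": 328}
merged = fill_missing_fields(existing, incoming)

publications = [
    {"id": 1, "aliases": ["Nineteen Eighty-Four", "1984"]},
    {"id": 2, "aliases": ["Animal Farm"]},
]
candidates = find_publications_by_alias("1984", publications)
```

The candidate list would then be shown to the user, who either picks one of the publications or chooses to create a new one before the edition is added.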
Feel free to post your ideas here