Allowing users to run jobs on their local machines and then upload the results to the server

The idea of this project is to reduce the load on the AcousticBrainz servers by allowing users to run machine learning jobs on their local machines rather than on the AcousticBrainz server, and then to upload the results to the server. We want to build a simple client that does this. For the scope of GSoC 2016, we aim to build a client for Ubuntu 14.04 (the latest Ubuntu long-term support release).
This client will download the data from the server, install the necessary libraries (such as Gaia2), run the machine learning job, and then submit the results to the server, where they will be saved for future use.

Hey Kartik,
We’ve already talked about this idea, so you know that I’m interested in it too. If you want to work on it, however, you will also need to show that you’ve thought about the problem yourself a little and about how we could approach it. We like our projects to be a collaboration between the student and the mentors, rather than the mentors telling the student exactly what to do, even at the project-discussion stage.

> This client will download the data from the server, install the necessary libraries (such as Gaia2), run the machine learning job, and then submit the results to the server, where they will be saved for future use.

I’d like to see if you can suggest some specific steps that we’re going to have to take here. Here are some questions that your suggestion leaves unanswered:

  • How should people install the client? Will we install Gaia from source, or provide packages for it? If we provide packages, who will build them?
  • What changes will be needed to get data from the main AcousticBrainz server to this remote server?
  • Should the remote server send periodic status updates to the main AcousticBrainz server during the training process?
  • Do we need a method of preventing malicious submissions from users? Since this will probably be some sort of HTTP POST, we want to make sure that we get the type of file that we are expecting.
  • How will the user run the client? Will there be a command-line tool, or a web interface that we automatically set up? Will the user be able to log off the machine and keep the process running?

Thank you, Alastair, for having a look.
What I have in mind is something similar to the AcousticBrainz client: a command-line tool, simple to install and simple to run. The installer script for the client will itself install Gaia (https://github.com/MTG/gaia), using the instructions given on the GitHub page. The user may be a non-developer, so we should keep it as easy as possible to install and use.
To get the data from the server to the client, I initially thought of serializing the data into some form other than JSON, but since the data is already stored as JSON in the lowlevel_json table on the server, sending it directly as JSON saves the serialization/deserialization time. So we need to make a simple API on the server which serves the details of the dataset for a given ID, similar to db/dataset.py but over an HTTP GET request.
Once we have downloaded the data, there will be a module similar to dataset/evaluate.py, which will actually call the functions in Gaia and pass the dataset details to them, and we will also have to ship a file similar to dataset/gaia_wrapper.py with the client.
Once we have results from Gaia, we can send them to our servers using a POST request. We will have to validate the format of the results, which will then be published on the dataset evaluation page.
I want to make it a command-line tool, run in a similar fashion to the AcousticBrainz client. Let’s say the name of the client is “run_job”; then it should be as simple as running “run_job <dataset_id>” from the command line, and the script should do all the work. And yes, this should keep running even when the user logs off, since the jobs might take a long time. And since it does not require any kind of authentication, anyone can run this from any machine (their own machine or a cloud service). A sketch of such an entry point follows.
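
To make the interface concrete, here is a minimal sketch of what such an entry point could look like. Everything here is hypothetical: the “run_job” name comes from the paragraph above, and the body is just a placeholder for the download/train/submit steps described later.

```python
#!/usr/bin/env python
"""Hypothetical entry point for the proposed run_job client (a sketch only)."""
import argparse


def main():
    parser = argparse.ArgumentParser(
        prog="run_job",
        description="Run an AcousticBrainz dataset evaluation job locally.")
    parser.add_argument("dataset_id", help="ID of the dataset to evaluate")
    args = parser.parse_args()

    # Placeholder: the real client would download the dataset, run the
    # Gaia training job, and submit the results, as described below.
    print("Would evaluate dataset %s" % args.dataset_id)


if __name__ == "__main__":
    main()
```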

You say that it doesn’t require any kind of authentication. Does this mean that any user can submit evaluation results for any dataset?

Here is a detailed description of the implementation of this client:

Client-Side Files
Installation Script (sketch below):
1. Install YAML and its dependencies automatically, in the same way as admin/install_hl_extractor.sh
2. Install Gaia2 and its dependencies
3. Install the client system-wide
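
Roughly, and only as a sketch: the package list below is an assumption modelled on admin/install_hl_extractor.sh, and the waf commands follow the Gaia README; all of it would need verifying.

```python
#!/usr/bin/env python
"""Sketch of an installer for the client; package names and build steps
are assumptions and would need to be checked."""
import subprocess


def run(cmd, cwd=None):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd, cwd=cwd)


def install():
    # 1. System dependencies (assumed set).
    run(["sudo", "apt-get", "install", "-y",
         "build-essential", "swig", "libyaml-dev", "python-dev", "git"])
    # 2. Build and install Gaia2 from source, as its README describes.
    run(["git", "clone", "https://github.com/MTG/gaia.git"])
    run(["./waf", "configure", "--with-python-bindings"], cwd="gaia")
    run(["./waf"], cwd="gaia")
    run(["sudo", "./waf", "install"], cwd="gaia")
    # 3. Install the client system-wide (assumes this repo has a setup.py).
    run(["sudo", "python", "setup.py", "install"])


if __name__ == "__main__":
    install()
```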

Get_Dataset Script (sketch below):
1. Takes the dataset ID as an argument
2. Sends the dataset author token
3. Sends a request to the AcousticBrainz server consisting of the dataset ID and the dataset author token
4. Receives the dataset
5. Returns it to the calling function
6. The server also returns a job ID along with it, which will be used as the ID of the new job
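
A minimal sketch of this script, assuming a hypothetical endpoint URL, header name, and response shape (none of these exist on the server yet):

```python
import requests

SERVER = "https://acousticbrainz.org"  # assumed base URL


def get_dataset(dataset_id, author_token):
    """Fetch the dataset and a fresh job ID from the (hypothetical) API."""
    resp = requests.get(
        "%s/api/dataset/%s/evaluate" % (SERVER, dataset_id),
        headers={"Authorization": author_token})
    resp.raise_for_status()  # fail loudly on 4xx/5xx responses
    payload = resp.json()
    return payload["job_id"], payload["dataset"]
```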

Evaluation Script (sketch below):
This is the one the user will call, with the dataset ID and dataset author token as command-line arguments.
1. Calls the Get_Dataset script to get the data
2. Converts it into YAML
3. Passes it to the training-model functions in the gaia_wrapper script
4. Gets the best result
5. Calls the Send_Result script to send the results back to the server
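
Glued together, and reusing the Get_Dataset sketch above and the Send_Result sketch below, the evaluation script might look roughly like this (the module names and the YAML filename are made up for illustration):

```python
import yaml  # PyYAML

# Hypothetical client modules: the Get_Dataset and Send_Result sketches
# in this post, packaged as importable files.
from get_dataset import get_dataset
from send_result import send_result


def train_model(project_file):
    """Placeholder for the gaia_wrapper training call (step 3)."""
    raise NotImplementedError


def evaluate(dataset_id, author_token):
    # Step 1: fetch the dataset and a job ID from the server.
    job_id, dataset = get_dataset(dataset_id, author_token)

    # Step 2: Gaia's tools consume YAML, so dump the JSON we received.
    with open("dataset.yaml", "w") as f:
        yaml.safe_dump(dataset, f)

    # Steps 3-4: train and keep the best result.
    best_result = train_model("dataset.yaml")

    # Step 5: send the results back to the server.
    return send_result(dataset_id, job_id, best_result, author_token)
```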

Gaia_wrapper:
Same as the original one in the server code.

Send_Result Script (sketch below):
1. Checks the form of the result (checking for anything malicious)
2. Sends the dataset author token
3. Sends a request to the server with the dataset ID, job ID and job results
4. The job ID is the same as the one received from the server by the Get_Dataset script
5. Receives back an acknowledgement of the submission
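
A sketch of this step, assuming the same hypothetical base URL and endpoint naming as in the Get_Dataset sketch:

```python
import requests

SERVER = "https://acousticbrainz.org"  # assumed base URL


def send_result(dataset_id, job_id, results, author_token):
    """Sanity-check the results, then POST them back (hypothetical endpoint)."""
    if not isinstance(results, dict):  # step 1: basic form check
        raise ValueError("results must be a JSON-serialisable dict")
    resp = requests.post(
        "%s/api/dataset/%s/results" % (SERVER, dataset_id),
        json={"job_id": job_id, "results": results},
        headers={"Authorization": author_token})
    resp.raise_for_status()
    return resp.json()  # step 5: acknowledgement from the server
```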

Server-Side Files
Serve_Data Script (sketch below):
1. Checks for the dataset author token
2. Checks that the user is the author of the dataset
3. Retrieves the dataset details for the given dataset ID
4. Sends them to the client
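
The AcousticBrainz server is a Flask application, so the Serve_Data endpoint could be sketched like this; the route, the in-memory DATASETS stand-in, and the response shape are all assumptions for illustration:

```python
import uuid

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# Stand-in for the real database layer; purely illustrative.
DATASETS = {}  # dataset_id -> {"token": ..., "data": ...}


@app.route("/api/dataset/<dataset_id>/evaluate", methods=["GET"])
def serve_data(dataset_id):
    dataset = DATASETS.get(dataset_id)
    if dataset is None:
        abort(404)
    # Steps 1-2: only the dataset's author knows the author token.
    if request.headers.get("Authorization", "") != dataset["token"]:
        abort(401)
    # Steps 3-4: return the dataset, plus a job ID for this evaluation run.
    return jsonify({"job_id": uuid.uuid4().hex, "dataset": dataset["data"]})
```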

Accept_Result Script (sketch below):
1. Checks for the dataset author token
2. Checks that the user is the author of the dataset
3. Checks the form of the result
4. Accepts the result and stores it in the database
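
A matching sketch for Accept_Result, with the same caveats; the expected payload shape is an assumption, and the boilerplate is repeated so the snippet stands alone:

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
DATASETS = {}  # same stand-in for the database layer as in the sketch above


@app.route("/api/dataset/<dataset_id>/results", methods=["POST"])
def accept_result(dataset_id):
    dataset = DATASETS.get(dataset_id)
    if dataset is None:
        abort(404)
    # Steps 1-2: verify the author token.
    if request.headers.get("Authorization", "") != dataset["token"]:
        abort(401)
    # Step 3: reject anything that is not the JSON structure we expect.
    payload = request.get_json(silent=True)
    if (not isinstance(payload, dict) or "job_id" not in payload
            or not isinstance(payload.get("results"), dict)):
        abort(400)
    # Step 4: storing the results in the database is left out of this sketch.
    return jsonify({"status": "ok"})
```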

Authentication:
Whenever a user creates a dataset, we will generate a token for that dataset which is visible only to that user, on the dataset page. The user will need to pass that token as an argument to the client, and it will be used to check whether the person requesting the dataset is its author, since only the author can see the author token for that particular dataset. We can’t use the MetaBrainz authentication since it requires a URL to redirect to. A sketch of the token generation follows.
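
For illustration, one possible way to generate such a per-dataset token (the exact scheme is open for discussion):

```python
import binascii
import os


def generate_dataset_token():
    """Random 40-character hex token, created once when a dataset is made."""
    return binascii.hexlify(os.urandom(20)).decode("ascii")
```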

This is a pretty good detailed list of what the project could involve; however, I would like to see a list that is a little higher level.
For example, I think in this project there are 5 main processes:

  • System setup
  • Preparing the job on the main AcousticBrainz server to be “picked up” by the remote server
  • Retrieving the job
  • Processing the job
  • Submitting the job back to the main server

Notice how I’m not talking about them in terms of script name, or details, but rather the general action that they need to perform. It might be that each step needs more than 1 script, or 1 script could perform 2 steps, but at this stage I don’t think we need to think in as much detail.

Each step has a set of requirements. You’ve made a good start at listing a number of steps, but again, I’d like to see some more detail about general requirements instead of specific steps. You should also think about possible failure cases, and what we might want to do. These requirements can change during the application period, and even during the project, so don’t worry about getting it right immediately.
Here are some questions that I’ve thought about for things that might happen during the process:

  • How does a user indicate that they want to process the job themselves instead of using the training system on AB.org?
  • If a user is using their server for other things, how can we isolate our runner system so that it is not affected by existing software?
  • If the dataset is large, copying JSON files one at a time will take a long time. How can we speed up this process, perhaps by compressing the data somehow before sending it?
  • The training process can take a long time; how can a user log out of their server and leave the process running?
  • Should the process be completely automated, or should the user perform each action themselves?

Try to update this proposal with some of these suggestions, especially describing each step as a series of general requirements, and thinking not just about the main steps you want to take, but also about possible error conditions and other things that could happen.


Thanks, Alastair, for your feedback.

  1. We will give the user an option on the dataset evaluation page (something like the artist-filter option) to run the evaluation either on the AB server or on their own machine. If they choose the AB server, the workflow stays as it is now; if they choose to run it on their own machine, an authentication token will be generated and given to them.
  2. I will use Screen (https://help.ubuntu.com/community/Screen) to isolate the training process. It also allows the process to keep running in the background when the user logs out. Once the user runs the command, we will do everything through Screen, and once the evaluation is complete it will automatically submit the results to the server (using the authentication token). We will also have to install Screen through the installation script.
  3. We can instead use Google Protocol Buffers for sending the data: a single protobuf file containing the data for all the songs. This adds the cost of serializing the currently stored JSON (in the lowlevel_json table) into protobuf, but deserializing should not be an issue, because instead of deserializing JSON we will now deserialize protobuf. I also plan on sending a file hash, just to verify that the file did not get corrupted (a sketch of this check follows the list).
  4. Screen will handle this. If the user logs out but the machine is still on, Screen will keep the job running for us.
  5. Once the user has run the command “executable_name dataset_id authentication_token”, it is our job to do everything else. I want to make it friendly for non-developers too, so everything should be automated.
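
To illustrate the file-hash check from point 3, a minimal sketch; SHA-256 is my assumption here, and any strong hash the server and client agree on would do:

```python
import hashlib


def verify_download(payload_bytes, expected_digest):
    """Raise if the downloaded blob does not match the server-sent hash."""
    digest = hashlib.sha256(payload_bytes).hexdigest()
    if digest != expected_digest:
        raise IOError("download corrupted: hash mismatch")
```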

I have uploaded a draft on the Google Summer of Code website and would love to have the community’s feedback.