GSoC 2018: SpamBrainz



Hello people,


If the above heading interests you, then please keep reading!
Here I share with you my proposal for SpamBrainz. Please excuse me for not typing out the proposal here, as I feel it’s easier to get feedback on Google Docs than here. I shared it earlier with a potential mentor (@yvanzo) and have received valuable feedback from him. I intend to gather as much feedback as I can before submitting it. Please feel free to comment on any improvements you see fit, or in case you have any doubts.

Project description wiki link (last on the page): link

Thank you!


Hello! Thank you for sharing your draft publicly; however, it is still very incomplete.


Thank you @yvanzo for the reply (again!). I think you are the only one who hates spam at MetaBrainz :smiley:
About the proposal, it is still a work in progress and I am constantly trying to improve it! It would be really awesome if you could be direct and point out the mistakes/improvements/alternatives in the shared document. If it is more convenient for you, let me know and I will post the proposal here on the forum.

Sorry to keep spamming you for feedback, but it won’t happen after GSoC is over. :wink:



The most obvious: no timeline, and unanswered questions. The implementation part is actually very abstract: it doesn’t mention deployment or extensibility to the other *Brainz projects. Additionally, it doesn’t really consider the specifics of MusicBrainz. For example, how does that apply to URLs? Another example: an Artist is connected to many other entities through Artist Credits and Relationships, which seems potentially useful. What are the boundaries of the entities to be analyzed? And how do we determine that?


[Update] Hi @yvanzo!

As indicated, I have updated the timeline, answered the questions, and added the missing implementation parts like deployment and testing/evaluation methods. Please have a look!

I designed this model keeping in mind that the data will be text. It doesn’t matter whether it comes from MusicBrainz or BookBrainz. The idea is to let the Bayesian classifier find the features that decide whether the given information is spam or genuine, rather than us providing the guidelines. For example, suppose one feature is that data coming from the URL field is generally spam when it contains more than 5 ‘!’ characters. Bayes will learn that when it observes that, in 100 spam emails, one common thing was 5 ‘!’. There could be 100 features like that, which may or may not be good indicators of spam. If we decide to add these relationships specifically, it could lead to overfitting.
So the basic overview of the algorithm looks like:

  1. Extract as many features as we can from the input data.
  2. Let Bayes use these features to learn and predict labels for new data.
  3. Reduce the dimensionality. For this, we use the chi-square test to score the features based on our training data. This is done specifically to avoid overfitting our model.
To be honest with you, it is very difficult to say whether this model will surely work or give 99% accuracy. It could be 99.5% or even 50%, and the majority of that depends on the underlying data used to train the classifier. If, for example, the training data has 1000 rows of 1000 entirely different spam messages, then it is as good as tossing a coin and labeling the message spam on heads and genuine on tails. Thus, if you look at my proposed timeline, only the first milestone is devoted to developing a model, but two are devoted to improving it.
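To make steps 1 and 2 concrete, here is a minimal pure-Python sketch of such a bag-of-words Bayesian classifier. The tokenizer, labels, and training examples are made up for illustration; a real implementation would extract far richer features (and add the chi-square selection from step 3):

```python
from collections import Counter
import math

def tokenize(text):
    # crude feature extraction: lowercase whitespace-separated tokens
    return text.lower().split()

def train(examples):
    # examples: list of (text, label) pairs, label is "spam" or "ham"
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in examples:
        docs[label] += 1
        counts[label].update(tokenize(text))
    return counts, docs

def classify(counts, docs, text):
    vocab = set(counts["spam"]) | set(counts["ham"])
    total_docs = sum(docs.values())
    scores = {}
    for label in ("spam", "ham"):
        total = sum(counts[label].values())
        # log prior + log likelihood of each token
        score = math.log(docs[label] / total_docs)
        for tok in tokenize(text):
            # Laplace smoothing so unseen tokens don't zero out the probability
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# toy training data, purely for illustration
data = [
    ("buy cheap pills now!!!!!", "spam"),
    ("click this link now!!!!!", "spam"),
    ("new artist release added", "ham"),
    ("fixed a relationship edit", "ham"),
]
counts, docs = train(data)
```

The point is that nothing here is MusicBrainz-specific: the same `train`/`classify` pair works on any text field, which is what lets the model generalize to BookBrainz data as well.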

Thanks for your constant feedback and please keep it coming!



Ok, thank you. A few additional questions in passing:

  • May one of the many existing Bayesian filters in Python be used/extended for this project?
  • Will the Bayesian filter be trained only once or will it keep learning from further input?
  • May URL filtering based on third-party blacklists be used as a feature?


As we use a bag-of-words to store our feature frequency counts, calculating the probability of each feature with Bayes’ theorem isn’t a big task, and it gives us the additional flexibility to replicate that for each entity source without much effort.
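The per-entity replication could be as simple as keeping one pair of counters per entity source. This is a hypothetical sketch (the class and entity names are mine, not from the proposal):

```python
from collections import Counter, defaultdict

class EntityFilter:
    """One bag-of-words counter pair per entity source (artist, release, ...)."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"spam": Counter(), "ham": Counter()})

    def update(self, entity, label, tokens):
        # accumulate token frequencies for this entity source and label
        self.counts[entity][label].update(tokens)

    def token_prob(self, entity, label, token):
        # Laplace-smoothed P(token | label) within one entity source
        c = self.counts[entity][label]
        total = sum(c.values())
        vocab = len(set(self.counts[entity]["spam"])
                    | set(self.counts[entity]["ham"]))
        return (c[token] + 1) / (total + vocab)

f = EntityFilter()
f.update("artist", "spam", ["buy", "now"])
f.update("artist", "ham", ["new", "release"])
```

Adding a new entity source is just a new key in the dictionary, with no changes to the probability calculation itself.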

To make the Bayesian filter learn from new inputs, we need to manually correct its predictions, and I have included two weeks (weeks 12 and 13) in my timeline for that. I shall continue to do so whenever I get time after that.

This can be done as a first-step lookup before feeding the text into the model, as I am assuming that if a blacklisted URL is present in the text, it is 100% spam.
If we add this as a feature instead, then, depending on our threshold for classifying a text as spam, our model might classify it as non-spam: these URLs won’t be present in every spam text in the training data, so the outcome would then depend on the other features of the text.
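The hard-rule-before-model idea could look roughly like this. The blacklist domains and the `classify_fn` callback here are hypothetical stand-ins; a real deployment would query a third-party blacklist service:

```python
import re

# hypothetical blacklist; stands in for a third-party blacklist lookup
BLACKLIST = {"evil.example.com", "spam.example.net"}

def extract_domains(text):
    """Pull host names out of any http(s) URLs in the text."""
    return set(re.findall(r"https?://([^/\s]+)", text))

def is_spam(text, classify_fn):
    # hard rule first: any blacklisted domain means spam, no model needed
    if extract_domains(text) & BLACKLIST:
        return True
    # otherwise fall back to the probabilistic classifier
    return classify_fn(text) == "spam"
```

This keeps the deterministic rule out of the training data entirely, so the model’s threshold never gets a chance to override it.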


Hello @yvanzo,

My quarter-end exams will be over by the end of this week, and I will be completely jobless for a week after that. So, I was wondering if it is possible to look at the spam data (already collected), or is it only available if the proposal is accepted?



> My quarter end exams will get over by the end of this week and I will be completely jobless for a week after that. So, I was wondering if it is possible to look at the spam data (already collected) or is it only available if the proposal is accepted?

Hey @yvanzo @Freso, sorry to keep bugging you, but is there any update on the above? :slight_smile:



No, spam data won’t be available before the announcement on April 23, 2018.


Hello @yvanzo,

Hope you are doing well and enjoying the summer!

I am writing to you again to show a new project I have been working on for the past month. It is a tiny bit similar to SpamBrainz; you can find a demo here(demo) and the git repository here(git).

This is an NLP project which retrieves the name of a movie from a tweet! It is quite a challenging task, as there can be multiple movie names in a tweet and we need to extract the one used in the context of a movie. For example, consider one tweet like ‘I hate frozen food’ and another tweet ‘I am in love with songs from Frozen #newFoundLove’. Our model should ideally return Frozen for the second tweet but nothing for the first one.

We developed a pipeline that starts with tokenizing tweets, normalizing the identified tokens, and then identifying candidates by matching them against our movie gazetteer. The next step is to classify each candidate as a movie based on features like n-grams, orthographic features, etc. We used an SVM model (because we were reproducing this paper for our coursework) with the specified parameters.
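The first three stages of that pipeline can be sketched in a few lines. The gazetteer here is a tiny made-up stand-in, and the real project uses richer tokenization; note how gazetteer matching alone would also flag ‘frozen’ in the food tweet, which is exactly why the SVM classification stage is needed afterwards:

```python
import re

# tiny stand-in for the real movie gazetteer
GAZETTEER = {"frozen", "dallas buyers club"}

def tokenize(tweet):
    # word tokens, keeping simple contractions like "I'm" together
    return re.findall(r"\w+(?:'\w+)?", tweet)

def normalize(tokens):
    # lowercase so matching against the gazetteer is case-insensitive
    return [t.lower() for t in tokens]

def candidates(tokens, max_len=4):
    """Match token n-grams (up to max_len tokens) against the gazetteer."""
    found = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            phrase = " ".join(tokens[i:j])
            if phrase in GAZETTEER:
                found.append(phrase)
    return found
```

For example, `candidates(normalize(tokenize("I am in love with songs from Frozen")))` finds the single candidate `"frozen"`, which the classifier would then accept or reject based on context features.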

How to use the demo? Add a tweet like

RT @LeighMcManus1 : 85 minutes into Dallas Buyers Club and I 'm only realising now that Jarred Leto is the tranny
in the text input and click Retrieve. It will show all the intermediate steps of the pipeline.

Let me know if you have any doubts/feedback etc etc!



Hello @anandkanav,

Sorry, but your application has not been retained, mostly because we cannot handle two incompatible applications for the same topic. I can provide more detailed feedback either here or by email if you want. If you are interested in applying for GSoC 2019, there will be a lot of other possibilities involving machine learning in MetaBrainz projects. If so, please get in touch sooner.

Thank you!


Hi @yvanzo!

No problem :slight_smile:
Though I would love to have the detailed feedback, and either mail or here is fine by me.



Your proposal was technically sound, but the decision was mainly made on the grounds that you had less insight into MusicBrainz than the other applicant. There are some indications of that in your proposal compared to the other: no differentiation between fields; no connection with editors, edits, and entities; manually tweaked offline learning; quite abstract deployment; and so on. Other hints are that you discovered the community quite late and did not even publicly get in touch before making your proposal. I have no doubt you have the technical skills to participate in such a program, but this time you were a bit unprepared. Still wishing to see you around again!