[GSoC 2026 Proposal] Semantic Music Discovery via High-Dimensional Embeddings

Pakeeza · March 31, 2026, 2:29pm

Hi everyone! I’m Pakeeza.

I’m really excited about ListenBrainz. My background is in Python and AI research, and I want to help solve the “cold-start” problem for new artists. Right now, recommendations depend heavily on listening history, so artists with few listens rarely surface. I want to use MusicBrainz metadata (tags/genres) to build a semantic similarity engine using Sentence-BERT and pgvector.
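
To make the idea concrete, here is a minimal sketch of the similarity step only. The toy 3-dimensional vectors and the artist names are made up for illustration; in the actual project the vectors would come from a Sentence-BERT model and the nearest-neighbour search would run inside Postgres via pgvector rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k=2):
    """Return the k artists closest to query_id, best match first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, hand-written for this example (not model output).
toy_embeddings = {
    "new_shoegaze_artist": [0.9, 0.1, 0.0],
    "established_shoegaze_artist": [0.8, 0.2, 0.1],
    "jazz_trio": [0.0, 0.1, 0.9],
}

neighbours = most_similar("new_shoegaze_artist", toy_embeddings, k=2)
for artist_id, score in neighbours:
    print(f"{artist_id}: {score:.3f}")
```

The point is that a new artist with no listens can still be placed next to established artists as long as its metadata produces a meaningful vector.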

You can find my full proposal draft here: [GSoC 2026] Improving ListenBrainz Recommendations with Semantics - Google Docs

I’ve also submitted the final PDF to the GSoC portal. I’d love to hear any thoughts or feedback from the mentors if you have a moment!

Thanks for the help!

gbw614 · April 12, 2026, 12:34am

Hi Pakeeza,

This is an interesting proposal, but I suspect that MusicBrainz tag density follows a heavy-tailed power law, i.e. the artists who most need content-based discovery are exactly the ones with the sparsest tags. If an artist’s metadata is just [“Artist Name”, “USA”, “Rock”], a 384-dimensional vector generated from it won’t carry much meaning. So I’m not convinced this would solve the cold-start problem.

Then again, I haven’t analyzed the metadata myself. If you haven’t yet done any clustering analysis on the tags, it may be worth doing before investing time in the rest of the project.
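
Even before any clustering, a simple count of distinct tags per artist would show how sparse things are. Below is a rough sketch of that check. It is a density summary, not a clustering analysis, and the `tags_by_artist` mapping, the artist keys, and the `sparse_threshold` cutoff are placeholders I made up; real input would have to be pulled from the MusicBrainz tag data.

```python
from collections import Counter


def tag_density_summary(tags_by_artist, sparse_threshold=3):
    """Summarise how many distinct tags each artist has.

    sparse_threshold is an arbitrary cutoff: artists with fewer
    distinct tags than this are counted as "sparse".
    """
    # Count distinct tags per artist so duplicates do not inflate density.
    counts = sorted(len(set(tags)) for tags in tags_by_artist.values())
    n = len(counts)
    if n == 0:
        raise ValueError("no artists supplied")

    middle = n // 2
    median = counts[middle] if n % 2 else (counts[middle - 1] + counts[middle]) / 2
    sparse = sum(1 for c in counts if c < sparse_threshold)

    return {
        "artists": n,
        "median_tags": median,
        "max_tags": counts[-1],
        "sparse_share": sparse / n,
        # Maps tag count -> number of artists with that many tags.
        "histogram": dict(sorted(Counter(counts).items())),
    }


# Made-up sample data, only to show the shape of the input.
toy_tags = {
    "artist_a": ["rock"],
    "artist_b": ["rock", "usa"],
    "artist_c": [],
    "artist_d": ["rock", "shoegaze", "dream pop", "noise pop", "british", "90s"],
    "artist_e": ["jazz", "jazz"],
}

summary = tag_density_summary(toy_tags)
for key, value in summary.items():
    print(f"{key}: {value}")
```

If the sparse share on the real data turns out to be very high among artists with few listens, that would support my concern; if not, I would be happy to be wrong.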

Regards,
Gerrit