Relating content-based music audio recommendation thesis to Annoy proof of concept

Hi, in working through my GSoC proposal I have spent a lot of time familiarizing myself with the thesis paper and code on content-based music audio recommendation ( and respectively). I’ve also dedicated time to understanding the Annoy algorithm and the Annoy Python API.

The code from the previous implementation of content-based music audio recommendation uses 12 different metrics to quantify the similarity to (or distance from, in the case of nearest neighbours) other recordings. I believe I have a good understanding of how I can use these metrics in Annoy to find nearest neighbours. However, I noticed that in the previous implementation, similarity only ever seems to be determined or tested based on one metric at a time:
e.g. when timing the query in probe_endpoints in similarity/

# Excerpt: each metric is queried and timed separately.
for metric, _ in metric_list:
    time = timeit.timeit("get_similar_recordings('{}', '{}')".format(mbid, metric),
                         setup='from similarity.api import get_similar_recordings')
    print(mbid, metric, time)

I was wondering whether any previous work considered ways of combining the 12 metrics into a single metric that accounts for all of the high- and low-level content data when determining recording similarity. If this has been considered and isn’t possible or doesn’t make sense to do, could anyone explain why?
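To make the question more concrete, here is a minimal sketch of one way the metrics might be combined: normalise each metric's vector, scale it by a weight, and concatenate everything into one vector. The metric names, weights, and feature values below are placeholders I made up for illustration, not the ones used in the existing code.

```python
import math

# Placeholder metric names and weights: assumptions for illustration only.
METRIC_ORDER = ["mfccs", "bpm"]
WEIGHTS = {"mfccs": 1.0, "bpm": 0.5}


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_metrics(per_metric_vectors):
    """Concatenate weighted, normalised per-metric vectors in a fixed order."""
    combined = []
    for name in METRIC_ORDER:
        unit = l2_normalize(per_metric_vectors[name])
        combined.extend(WEIGHTS[name] * x for x in unit)
    return combined


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical recordings with made-up feature values.
rec_a = {"mfccs": [1.0, 2.0, 2.0], "bpm": [0.5, 0.5]}
rec_b = {"mfccs": [2.0, 1.0, 2.0], "bpm": [0.5, 0.5]}

vec_a = combine_metrics(rec_a)
vec_b = combine_metrics(rec_b)
distance = euclidean(vec_a, vec_b)
print(len(vec_a), round(distance, 4))
```

With this construction, the squared distance between two combined vectors is the sum of the per-metric squared distances, each scaled by its squared weight, so the weights control how much each metric contributes. Whether a fixed weighting like this makes sense for the 12 real metrics is exactly what I am unsure about.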

If it is possible to translate the 12 metrics into one compact metric, this could have many benefits for the further tasks related to recording similarity. For example, scaling down to one metric could make it much simpler to create and constantly update a single index within Annoy. It could also mean less data to store in an offline matrix, etc.
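As a rough illustration of the single-index idea, the sketch below builds one Annoy index from a handful of made-up combined vectors and queries it. The vectors, the number of trees, and the choice of the "euclidean" metric are all assumptions on my part; the import is guarded only so the snippet still runs where Annoy is not installed.

```python
def build_index(vectors, n_trees=10):
    """Build one Annoy index over pre-combined vectors (one vector per recording)."""
    from annoy import AnnoyIndex  # third-party package: pip install annoy

    dim = len(vectors[0])
    index = AnnoyIndex(dim, "euclidean")
    for item_id, vec in enumerate(vectors):
        index.add_item(item_id, vec)
    index.build(n_trees)
    return index


# Made-up combined vectors: items 0 and 1 are close, items 2 and 3 are close.
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]

try:
    index = build_index(vectors)
    # The query item itself comes back first, followed by its nearest neighbour.
    neighbours = index.get_nns_by_item(0, 2)
except ImportError:
    neighbours = None  # Annoy is not available in this environment.

print(neighbours)
```

One caveat, as far as I understand the Annoy documentation: items cannot be added after build() has been called, so keeping a single index up to date would mean rebuilding it periodically rather than appending to it.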

Am I on the right track here, or missing an important detail that supports using the metrics separately?

Thank you in advance for your time - sorry for the long message!!