Recently, several foundation models for MS2 spectra have been developed [1]. These models can be used for compound identification or for predicting chemical properties of unidentified compounds. The differences between these models can be assessed using either labelled or unlabelled spectra. The assessment is easier with labelled spectra because the structure of the compound is known. However, the number of labelled spectra is much lower than the number of unlabelled spectra.
In this application we have 25 million unlabelled spectra, each of which has a position in the embedding space. We want to look at the difference in position between each pair of spectra to explore the differences between embedding spaces. Besides UMAP, t-SNE, PCA, etc., other approaches [2] have also been suggested. In this project we want to explore these approaches to understand how they can help reveal the differences between embedding spaces.
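As a rough illustration of the pairwise comparison described above, the sketch below samples random pairs of spectra and checks how well their distances agree between two embedding spaces. With 25 million spectra there are roughly 3 × 10^14 possible pairs, so the sketch samples pairs rather than enumerating them. The function names, the choice of cosine distance, and the use of a Pearson correlation as the agreement score are all assumptions made for illustration; they are not part of the project or of any of the cited tools.

```python
import numpy as np


def sampled_pair_distances(emb, pairs):
    """Cosine distance for each (i, j) index pair in `pairs` (hypothetical helper)."""
    a = emb[pairs[:, 0]]
    b = emb[pairs[:, 1]]
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - dot / norms


def compare_spaces(emb_a, emb_b, n_pairs=10_000, seed=0):
    """Correlate pairwise distances of the same spectra in two embedding spaces.

    emb_a and emb_b are (n_spectra, dim) arrays whose rows refer to the same
    spectra in the same order; the two dimensions may differ.
    Assumption: no embedding is the all-zero vector.
    """
    rng = np.random.default_rng(seed)
    n = emb_a.shape[0]
    # Sample random index pairs and drop the few where i == j.
    pairs = rng.integers(0, n, size=(n_pairs, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    d_a = sampled_pair_distances(emb_a, pairs)
    d_b = sampled_pair_distances(emb_b, pairs)
    # Pearson correlation between the two sets of distances.
    return np.corrcoef(d_a, d_b)[0, 1]
```

A score close to 1 would suggest that the two models place the sampled spectra in a similar relative arrangement, while a score near 0 would suggest that the spaces are organised very differently. A rank-based correlation could be substituted if only the ordering of distances matters.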
[1] https://dreams-docs.readthedocs.io/en/latest/index.html
[2] https://apple.github.io/embedding-atlas/
Study program(s)
MSc Bioinformatics and Systems Biology
MSc Computational Science
