Unified Genomic and Chemical Representations Enable Bidirectional Biosynthetic Gene Cluster and Natural Product Retrieval
Liu, G.; Li, Y.; Ong, G.; Wong, F. T.; Tay, D. W. P.; Lim, Y. H.; Foo, C. S.; Koh, L. C. W.
Abstract
Natural product discovery is increasingly driven by the ability to analyze microbial genomes for biosynthetic gene clusters (BGCs) that encode secondary metabolites. While existing approaches have successfully linked BGCs to broad classes of chemical products, they typically operate in a single modality (genomic or chemical), which limits the scope of bidirectional prediction. In this work, we propose a multimodal framework that integrates genomic and chemical information by projecting embeddings derived from pretrained language models into a common representation space. We embed genomic sequences using a BGC foundation model and represent molecules through a chemical language model, then use a metric learning model to co-embed BGCs and their associated chemical structures. This co-embedding space allows us to quantify the similarity between BGCs and compounds, enabling both forward and inverse retrieval tasks. Beyond retrieval, we show that the shared space can guide strain selection for targeted compounds. By identifying the BGCs closest to a query compound in the embedding space, we prioritize microbial strains that encode similar clusters, thereby streamlining genome mining and retrobiosynthetic design efforts. This approach represents a generalizable, scalable strategy to bridge biological and chemical modalities in natural product discovery.
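The retrieval and strain-prioritization steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: random vectors stand in for the BGC and chemical language model embeddings, the projection heads are untrained linear maps (a metric learning objective would fit them in practice), cosine similarity is chosen as one possible similarity measure, and the names `project`, `retrieve`, `rank_strains`, and the strain labels are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def project(embeddings, weight):
    """Hypothetical linear projection head into the shared space."""
    return l2_normalize(embeddings @ weight)

def retrieve(query, candidates, k=3):
    """Return indices and cosine similarities of the k nearest candidates."""
    sims = candidates @ query          # rows are unit vectors, so this is cosine
    top = np.argsort(-sims)[:k]        # highest similarity first
    return top, sims[top]

def rank_strains(query, bgc_vectors, bgc_strains, k=3):
    """Rank strains by the best-matching BGC they carry among the top k hits."""
    top, sims = retrieve(query, bgc_vectors, k)
    best = {}
    for i, s in zip(top, sims):
        strain = bgc_strains[i]
        best[strain] = max(best.get(strain, -1.0), float(s))
    return sorted(best.items(), key=lambda kv: -kv[1])

rng = np.random.default_rng(0)

# Placeholder inputs: random vectors stand in for pretrained-model embeddings.
bgc_raw = rng.normal(size=(5, 16))   # 5 BGCs, 16-dim genomic embeddings
mol_raw = rng.normal(size=(5, 8))    # 5 compounds, 8-dim chemical embeddings

# Untrained random projections; a metric learning objective would fit these.
w_bgc = rng.normal(size=(16, 4))
w_mol = rng.normal(size=(8, 4))

bgc_vecs = project(bgc_raw, w_bgc)
mol_vecs = project(mol_raw, w_mol)
strains = ["strain_A", "strain_A", "strain_B", "strain_C", "strain_C"]

# Query with a BGC to retrieve compounds, and with a compound to retrieve BGCs.
bgc_to_mol_idx, bgc_to_mol_sims = retrieve(bgc_vecs[0], mol_vecs)
mol_to_bgc_idx, mol_to_bgc_sims = retrieve(mol_vecs[0], bgc_vecs)

# Strain prioritization for a target compound.
ranked = rank_strains(mol_vecs[0], bgc_vecs, strains)
print(ranked)
```

Because both modalities are normalized into the same space, a single `retrieve` function serves both directions; only the roles of query and candidate set are swapped. With untrained projections the rankings are meaningless, so the sketch shows only the mechanics of the lookup.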