ADD-ON: Adenylation Domain Database and Online Benchmarking Platform

Visual for ADD-ON; ADD-ON addresses the lack of reliable data for predicting how microbial enzymes assemble peptide-based natural products. By enabling accurate AI-driven structure prediction, it accelerates the discovery of new bioactive compounds and ultimately supports efforts to combat antimicrobial resistance.

Image: ADD-ON

What is the project about?

ADD-ON creates an open, reliable dataset of bacterial enzymes and the amino-acid building blocks they recruit to assemble nonribosomal peptides, an important natural product class that includes drugs like penicillin. It combines new standardized experimental data with curated public data and AI to accelerate the discovery of new bioactive molecules.

What gap in the scientific community led to the creation or expansion of this benchmarking dataset?

Predicting chemical structures directly from genome sequences of their microbial producers remains a major bottleneck in natural product discovery. For nonribosomal peptides, an important natural product class for the discovery of urgently needed novel chemical entities, this challenge centers on adenylation domains, the enzyme components that determine which amino acids are incorporated into the final compound. Existing datasets describing the substrate specificity of adenylation domains are small, biased, and inconsistent, which hinders the prediction of chemically novel compounds. ADD-ON will fill this gap by expanding and standardizing experimental data to create a reliable benchmark for AI-driven structure prediction.

What is the project’s structure — from data curation to expected outputs such as publications or competitions?

ADD-ON follows a clear five-step structure. We first curate and standardize existing public data on adenylation domains and select diverse candidates for new experimental measurements. Next, we measure substrate specificities using our in-house high-throughput platform and integrate these results into a unified, well-annotated dataset. This dataset will then support the development of an open benchmarking platform, including defined prediction tasks and evaluation metrics. The final phase focuses on community engagement through a publication and an open competition inviting AI and bioinformatics groups to test and improve their models on the dataset.

What impact does the project aim to achieve — within Helmholtz and across the broader research and industry community?

ADD-ON will provide the first open, standardized reference dataset for predicting enzyme–substrate relationships in nonribosomal peptide biosynthesis. Within Helmholtz, it will strengthen collaboration between AI, bioinformatics, and experimental biology groups, creating a model for cross-domain data-driven research. In the wider community, it will establish a reproducible benchmark that encourages fair comparison of machine learning methods and supports the development of more accurate genome-to-structure prediction tools. In the long term, this will accelerate the discovery of new natural compounds and contribute to global efforts against antimicrobial resistance and other pressing health challenges.

Project partners

Helmholtz Centre for Infection Research (HZI), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS)

Saarland University

Myria Biosciences AG

Primary Contact

Jun.-Prof. Dr. Alexey Gurevich, Helmholtz Centre for Infection Research (HZI), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Human-Microbe Systems Bioinformatics research group

Other projects

GRIDMARK – Generating Reproducible Insights through Data Benchmarking for AI in Energy Systems

Transforming energy systems toward climate neutrality: Distribution grids have the potential to be catalysts for the energy transition. Unfortunately, most Distribution System Operators lack the resources to fully monitor their systems. Therefore, there is an urgent need for more high-quality data, particularly to develop and test machine learning models.

Image: Open white spruce forest with glacier in background in the Chugach Mountains, Alaska, US ©Stefan Kruse, AWI

ForestUNLOCK: A multi-modal Multiscale Benchmark Dataset for AI-Driven Boreal Forest Monitoring and Carbon Accounting

Building the first consistent multi-modal single tree benchmark for forest structure and carbon stock assessments of the northern boreal forest

Visual for SCHEMA; Metastases represent a significant exacerbation of tumor severity. If one could predict the likelihood of tumors metastasizing, this could inform treatment decisions to avoid or delay this outcome. SCHEMA develops a benchmark dataset of primary tumor samples and metadata on whether the tumor has metastasized at different time points after sampling. With this dataset, a challenge for machine learning scientists will be defined to build prognostic models for the likelihood of tumors metastasizing, promoting innovation in prognostic modeling for a clinically relevant task.

Image: Hellmut Augustin, DKFZ (BSIC 2021 contribution)

SCHEMA – profiling Spatial Cancer HEterogeneity across modalities to benchmark Metastasis risk prediction

SCHEMA creates a benchmark dataset linking tumor samples with metastasis outcomes to enable machine-learning models that predict metastasis risk and support clinical decision-making.