HMN 2025: How challenging negative data helps AI models better identify effective antibodies

How can we train AI to develop new vaccines and medicines?
Training data composition determines ML generalization and biological rule discovery. Credit: Nature Machine Intelligence (2025). DOI: 10.1038/s42256-025-01089-5

Imagine you are developing antibodies—drugs precisely aimed at a target, for example a viral protein or onco-marker. You test a series of antibodies and find that some work, while others do not.

You would like to continue modifying them and see if you can make them even better. However, you do not want to waste time testing those that certainly will not work. To test only those that might work, you need to filter out the antibodies that do not bind to your target before moving on to costly and time-consuming experiments.

One way to do this is to train a machine learning model that can support you in the process. Today, such models are already helping experimental scientists narrow down their search.

“Moreover, machine learning models, once shown the data, can learn what makes an antibody bind: what features set binders apart from those that do not. Without such models, this is not obvious at all, as it lies beyond intuition,” says Aygul Minnegalieva, a Ph.D. candidate at the University of Oslo.

She investigates how to best train AI models at the Greiff Lab. Minnegalieva and colleagues have recently published a study on this in Nature Machine Intelligence.

“However, not all machine learning models will do that correctly. Only if models are trained with the right data can we use them to gain an understanding of biological determinants, for example, what makes an antibody a binder,” she explains.

“One approach to accomplish this is to present the models with examples of both correct and incorrect responses regarding what we want them to recognize,” explains the Ph.D. candidate.

Such incorrect examples or errors are referred to as negative data, while the correct examples are classified as positive data.

The errors must pose a challenge for the models to recognize

In the latest study, Minnegalieva and her colleagues discovered that the negative data the models are exposed to must be sufficiently challenging.

“We need to show the models incorrect examples that closely resemble the correct ones. This way, the models learn more effectively,” Minnegalieva points out.

Specifically, the researchers presented the models with negative data consisting of antibodies that still bind to the target protein, for instance a viral protein, but do so suboptimally.

“In this manner, the models improved their ability to accurately tell apart antibodies that would be effective in combating a pathogen from those that would not,” she explains.

Most importantly, this method enabled the models to capture the underlying sequence determinants in antibodies that help them bind to a protein in a pathogen.

“Those determinants made more biological sense,” Minnegalieva states. “Essentially, the models became better at reasoning.”
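The idea of using weak binders as challenging negatives can be illustrated with a small Python sketch. It is not the pipeline used in the study: the `Antibody` record, the `affinity` score, the toy sequences, and the two thresholds are all hypothetical, chosen only to show how an "easy" and a "hard" training set might differ in their negative examples.

```python
from typing import NamedTuple


class Antibody(NamedTuple):
    sequence: str    # toy amino-acid sequence (made up for illustration)
    affinity: float  # hypothetical binding score, higher means stronger binding


def split_by_affinity(antibodies, strong=0.8, weak=0.3):
    """Partition antibodies into strong binders, weak binders and non-binders.

    The thresholds are illustrative assumptions, not values from the study.
    """
    positives = [a for a in antibodies if a.affinity >= strong]
    hard_negatives = [a for a in antibodies if weak <= a.affinity < strong]
    easy_negatives = [a for a in antibodies if a.affinity < weak]
    return positives, hard_negatives, easy_negatives


def build_training_set(positives, negatives):
    """Return (sequence, label) pairs: 1 for positives, 0 for negatives."""
    labeled = [(a.sequence, 1) for a in positives]
    labeled += [(a.sequence, 0) for a in negatives]
    return labeled


toy = [
    Antibody("CARDYW", 0.95),  # strong binder
    Antibody("CARDFW", 0.55),  # weak binder, closely resembles the strong ones
    Antibody("CGGGSW", 0.05),  # non-binder, clearly different
    Antibody("CARDYF", 0.85),  # strong binder
]

pos, hard, easy = split_by_affinity(toy)
hard_set = build_training_set(pos, hard)  # negatives are weak binders
easy_set = build_training_set(pos, easy)  # negatives are non-binders
```

In `hard_set`, the negative example differs from the positives by a single residue, so a classifier trained on it cannot rely on coarse differences between the two classes. In `easy_set`, the negative is so dissimilar that a superficial cue could be enough to separate them.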

Accelerating the development of antibodies and medicines with AI

Machine learning is increasingly being employed in the development of new medicines, allowing researchers to reduce the number of experimental tests required.

“We can reduce the number of errors when developing new candidates of antibodies or medicines for targeting pathogens or cancer,” says Minnegalieva. “The models we use must be both accurate and reliable. They must truly understand what matters from a biological point of view. Only then can we make sound predictions and save time.”

The new study outlines how the models can be trained to better meet these requirements.

Though the study specifically focused on antibodies, the results can be broadly generalized across various fields where machine learning is applied.

“Fields such as language modeling, protein design, and the prediction of molecular properties also depend on the sampling of negative data. All these areas face the risk of models taking shortcuts if the negative examples are too simplistic,” concludes Minnegalieva.

Professor Victor Greiff, head of the Greiff Lab, also highlights the relevance and potential impact of the study. “Our work shows that data curation is not a preprocessing step, it’s a scientific choice that encodes assumptions and determines what machine learning can discover.

“For immunology and beyond, careful dataset design may be the key to building machine learning models that generalize and reveal true biological principles,” Greiff says.

More information:
Eugen Ursu et al, Training data composition determines machine learning generalization and biological rule discovery, Nature Machine Intelligence (2025). DOI: 10.1038/s42256-025-01089-5

Wesley Ta et al, The importance of negative training data for robust antibody binding prediction, Nature Machine Intelligence (2025). DOI: 10.1038/s42256-025-01080-0

Provided by
University of Oslo


