HMN 2026: How a benchmarking framework reveals major safety risks of using AI in lab experiments

Benchmarking framework reveals major safety risks of using AI in science experiments
Overview of LabSafety Bench. Credit: Nature Machine Intelligence (2026). DOI: 10.1038/s42256-025-01152-1

While artificial intelligence (AI) models have proved useful in some areas of science, like predicting 3D protein structures, a new study shows that they should not yet be trusted in many lab experiments. The study, published in Nature Machine Intelligence, revealed that all of the large language models (LLMs) and vision-language models (VLMs) tested fell short on lab safety knowledge. Overtrusting these AI models for help in lab experiments can put researchers at risk.

LabSafety Bench for AI use in labs

The research team involved in the new study initially sought to answer whether LLMs can effectively identify potential hazards, accurately assess risks and make reliable decisions to mitigate laboratory safety threats. To help answer these questions, the team developed a benchmarking framework, called “LabSafety Bench.”

The framework included 765 multiple-choice questions, 404 realistic lab scenarios and 3,128 open-ended tasks on topics regarding hazard identification, risk assessment, and consequence prediction in biology, chemistry, physics and general labs.

Altogether, the team used LabSafety Bench to evaluate 19 AI models: eight proprietary models, seven open-weight LLMs and four open-weight VLMs. For VLMs, 133 text-with-image multiple-choice questions were used. The open-ended tasks included a hazard identification test (HIT), which measured risk perception, and a consequence identification test (CIT), which measured outcome prediction.

Critical gaps in AI’s experimental science knowledge

While some models, such as GPT-4o (86.55% accuracy) and DeepSeek-R1 (84.49% accuracy), performed well on structured tasks, they still struggled with open-ended, scenario-based reasoning. On multiple-choice questions, even these otherwise top-performing models performed poorly on radiation hazards, physical hazards, equipment usage and electricity safety.
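To make the per-topic comparison concrete, here is a minimal Python sketch of how multiple-choice answers could be scored by hazard category. It is an illustration only, not the study's evaluation code: the function name, the record layout and the sample answers are all hypothetical, and only the category names come from the article.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return the fraction of correct answers for each category.

    `records` is an iterable of (category, predicted, correct) tuples.
    This layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, predicted, correct in records:
        totals[category] += 1
        if predicted == correct:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical graded answers, invented for illustration (not study data).
sample = [
    ("radiation hazards", "A", "B"),
    ("radiation hazards", "C", "C"),
    ("equipment usage", "D", "D"),
    ("equipment usage", "B", "B"),
    ("electricity safety", "A", "C"),
]

scores = accuracy_by_category(sample)
# The category with the lowest accuracy is the one a lab would want to review first.
weakest = min(scores, key=scores.get)
print(scores)
print("Weakest category:", weakest)
```

Breaking a single headline accuracy figure down this way is what exposes weak spots such as radiation or electricity safety, which an overall score in the mid-80s can hide.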

Most concerningly, none of the models evaluated surpassed 70% accuracy in hazard identification tasks. In the HIT and CIT tests, models generally performed better in biology and physics scenarios, but struggled with chemistry, cryogenic liquids and general laboratory safety.

“Notably, several models scored below 50% on ‘improper operation issues,’ while for ‘most common hazards,’ even the worst-performing model scored 66.55%,” the study authors write.

Vicuna models stood out as particularly poor performers across multiple tasks. In text-only multiple-choice questions, Vicuna's performance was almost as bad as random guesswork. InstructBLIP-7B, which is based on Vicuna-7B, also had the weakest performance on text-with-image multiple-choice questions.
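For a sense of what "random guesswork" means as a baseline, the short Python sketch below computes the expected accuracy of blind guessing and checks it with a simulation. The article does not state how many answer options each question has, so the four options used here are an assumption, as are the function names and the simulated question count.

```python
import random


def expected_random_accuracy(n_options: int) -> float:
    """Expected accuracy when one of n_options is picked uniformly at random."""
    if n_options < 1:
        raise ValueError("n_options must be at least 1")
    return 1.0 / n_options


def simulate_random_guessing(n_questions: int, n_options: int, seed: int = 0) -> float:
    """Estimate guessing accuracy empirically with a seeded generator."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        answer = rng.randrange(n_options)  # position of the correct option
        guess = rng.randrange(n_options)   # blind guess
        if guess == answer:
            correct += 1
    return correct / n_questions


# Assumption: four options per question (not stated in the article).
baseline = expected_random_accuracy(4)
simulated = simulate_random_guessing(10_000, 4)
print(f"Expected: {baseline:.2%}, simulated: {simulated:.2%}")
```

Under that assumption, a model scoring close to 25% on multiple-choice questions would be adding essentially nothing beyond chance.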

The team attempted fine-tuning to explore methods for improving safety awareness. This improved smaller models, but advanced strategies like retrieval-augmented generation (RAG) did not consistently help. The researchers say that training on individual subsets yielded a performance improvement of around 5–10%.

Can AI be used safely in science experiments?

Most AI models will likely continue to improve over time. However, current models' tendency to hallucinate and provide incorrect information makes them risky when working with dangerous materials, where mistakes can result in explosions, injuries and loss of life. The findings from this study highlight the need for human oversight and improved AI safety training in research environments.

“Our analysis also identifies key failure modes—including poor risk prioritization, hallucination and overfitting—to guide future research. This work provides a foundation for safer AI integration in laboratories by underscoring the urgent need for safety-aware model development,” the study authors write.

The team says their results indicate that even top-performing models do not guarantee safe and reliable answers in laboratory settings, and that larger, newer or more advanced models are not necessarily safer. Instead, they encourage other researchers to employ benchmarking tools like LabSafety Bench. Additionally, they say that AI use in labs should always include rigorous human oversight, at least until AI exhibits extensive improvement in lab safety knowledge.

Written by Krystal Kasal, edited by Gaby Clark.

More information:
Yujun Zhou et al, Benchmarking large language models on safety risks in scientific laboratories, Nature Machine Intelligence (2026). DOI: 10.1038/s42256-025-01152-1