A roadmap for large language models in chemical research
In a Q&A, Gabe Gomes discusses how combining human creativity with machine capability could transform chemical research.
“There is a common misconception that using large language models in research is like asking an oracle for an answer. The reality is that nothing works like that,” says Gabe Gomes.
Gomes, assistant professor of chemical engineering and chemistry, does believe that large language models (LLMs) can transform chemical research, if they are adopted thoughtfully. In Nature Computational Science, Gomes and his coauthors offer a roadmap toward more strategic implementations of LLMs.
Chemical research today is generally divided between computer modeling and laboratory experiments. Some scientists might spend months using computers to predict how a molecule can be made and how it will behave. Other scientists might spend months in the lab actually making and testing that molecule. The two approaches are not well integrated.
“This is where LLMs become exciting,” says Robert MacKnight, a Ph.D. student in chemical engineering. LLMs have the potential to remove the silos between computer predictions and real-world testing, ultimately accelerating discovery.
In 2023, Gomes and his research group published Coscientist, an LLM-based system that can autonomously plan, design, and execute complex scientific experiments. As LLMs are increasingly implemented in scientific research, Gomes sees the role of the researcher shifting toward higher-level thinking: defining research questions, interpreting results in broader scientific contexts, and making creative leaps that artificial intelligence (AI) can’t make. Rather than replace human creativity and intuition, AI systems can amplify our ability to explore chemical space systematically.
Here, Gomes and MacKnight answer several questions about where LLMs can make an impact and where they might fall short.
How has your experience developing Coscientist influenced your views of the future of chemical research?
Developing Coscientist revealed to us that LLMs have tremendous potential to accelerate the pace of chemical research, particularly in data collection. It also showed us that LLMs alone aren’t enough. The real breakthrough comes when you combine them with external tools, like databases, laboratory instruments, or computational software. Without tools, you’re limited by what the model learned during training, and you risk hallucination. Tools help ground the LLM’s responses in reality. One of the things we are most excited about is the move toward what we call “active” environments, where LLMs interact with tools and data rather than merely responding to prompts.
What is the difference between deploying LLMs in an “active” or a “passive” environment?
In a “passive” environment, LLMs answer questions or generate text based on what they learned during training. In an “active” environment, LLMs can interact with databases and instruments to gather real-time information and take concrete actions. This distinction is crucial in chemistry. A “passive” LLM might hallucinate a synthesis procedure or give you outdated information. An “active” LLM can search current literature, check chemical databases, calculate properties using specialized software, or even control laboratory equipment to run actual experiments. Instead of being limited to its training data, the LLM can coordinate different tools and data sources to solve real research problems. This transforms how we think about the role of the researcher. Instead of someone who executes experiments, the researcher becomes more like a director of AI-driven discovery.
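To make the distinction concrete, here is a minimal sketch of the “active” pattern in Python. Everything in it is a hypothetical stand-in: query_llm is a placeholder for a real model call, and the single “tool” is a toy lookup table rather than an actual chemical database.

```python
# A minimal sketch of the "active" pattern: the LLM does not answer from
# memory; it requests tools, and a harness executes them and feeds the
# results back. All names here are hypothetical stand-ins.

TOOLS = {
    # Stand-in for a chemical database lookup keyed by SMILES string.
    "lookup_boiling_point": lambda smiles: {"CCO": 78.37}.get(smiles),
}

def query_llm(prompt: str) -> dict:
    """Placeholder for a real LLM call. A passive model would return an
    answer from training data; an active one can return a tool request."""
    if "TOOL RESULT" not in prompt:
        return {"tool": "lookup_boiling_point", "arg": "CCO"}
    return {"answer": prompt.split("TOOL RESULT: ")[-1] + " °C, per database"}

def active_loop(question: str, max_steps: int = 5) -> str:
    prompt = question
    for _ in range(max_steps):
        reply = query_llm(prompt)
        if "answer" in reply:                              # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["arg"])        # run requested tool
        prompt = f"{question}\nTOOL RESULT: {result}"      # ground next turn
    return "step budget exceeded"

print(active_loop("What is the boiling point of ethanol (SMILES: CCO)?"))
```

The point is the loop: the model requests a tool, the harness executes it, and the grounded result is fed back before the model commits to an answer.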
What unique considerations are there for applying LLMs in chemistry, compared to other domains?
First, there are safety considerations. Hallucinations in chemistry aren’t just an annoyance. They can be dangerous. If an LLM suggests mixing incompatible chemicals or provides wrong synthesis procedures, you could have serious safety hazards or environmental risks. Second, chemistry has very specific technical languages that general LLMs struggle with. Third is the precision problem. Chemistry requires exact numerical reasoning, and LLMs aren’t naturally good at that. A small error in molecular representation or spectral interpretation can completely change a result. Finally, chemical research is inherently multimodal. We work with text procedures, molecular structures, spectral images, and experimental data all at once. Because most LLMs are primarily text-based, incorporating all these types of chemical information is a particular challenge.
All of these constraints mean that the field of chemistry really benefits from the “active” LLM approach we advocate, where the model works with specialized tools and databases rather than trying to do everything from its training alone.
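As one illustration of that grounding, the sketch below delegates numerical property calculations to RDKit, an open-source cheminformatics library, instead of relying on the model’s own arithmetic. The workflow shown (an LLM proposing a SMILES string, a verified tool validating and computing on it) is an assumed example, not a description of any specific system.

```python
# A sketch of delegating exact numerical work to specialized software
# (here RDKit) rather than trusting an LLM's arithmetic. The model's only
# job is to propose a SMILES string; the tool does the computing.
from rdkit import Chem
from rdkit.Chem import Descriptors

def molecular_properties(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                       # catches hallucinated/invalid SMILES
        raise ValueError(f"invalid SMILES: {smiles}")
    return {
        "molecular_weight": Descriptors.MolWt(mol),
        "logP": Descriptors.MolLogP(mol),
    }

# Caffeine, as an LLM might propose it:
print(molecular_properties("Cn1cnc2c1c(=O)n(C)c(=O)n2C"))
```

Because MolFromSmiles returns None for malformed input, this kind of tool call also doubles as a check against hallucinated structures.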
What are the biggest challenges you see for the adoption of LLMs in chemical research?
The biggest challenge is perceived trustworthiness. Researchers are rightfully cautious about adopting AI tools when safety and accuracy are paramount, and current methods for evaluating LLMs are insufficient.
Beyond trust, there are several technical hurdles. Hallucination is a major concern, as noted above. There is also the challenge of integrating LLMs with existing laboratory infrastructure and specialized chemical software, which often requires significant technical expertise. On the practical side, there is a learning curve. Many researchers lack experience with AI tools and may not know how to implement them effectively. Finally, there are ethical and resource considerations, such as the environmental cost of training and running these models, potential biases in chemical knowledge, and questions about how these tools might change the nature of scientific work itself.
If we can first improve evaluation methods to demonstrate that these systems are trustworthy and reliable, we will likely unlock progress on many of these other challenges.
How do you propose to better evaluate LLM capabilities in chemical research?
Current evaluations often test only knowledge retrieval. We see a need to evaluate the reasoning capabilities that real research requires; to that end, we co-founded a consultancy firm focused on scientific evaluations of AI models.
To ensure we’re testing actual reasoning rather than memorization, we need to design evaluation tasks using information that became available after the model’s training. For LLMs that use tools, we should test whether they choose the right tools in logical sequences and adapt when tools fail. Finally, we should incorporate human expert judgment alongside automated benchmarks. Chemical reasoning has subtle nuances that fixed tests miss. The goal is to have frameworks that predict how useful an LLM will be in real chemical research, not just how well it performs on standardized tests.
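Here is a toy sketch of what evaluating tool use could look like, assuming an agent’s tool calls are logged as a simple trace; the task, tool names, and prerequisite rules are all hypothetical.

```python
# A toy sketch of evaluating tool use rather than knowledge retrieval:
# compare the sequence of tools an LLM agent actually called against a
# logically valid ordering for the task. All names are hypothetical.

def valid_tool_order(trace: list[str], required_before: dict[str, str]) -> bool:
    """Check that each tool call happens only after its prerequisite."""
    seen = set()
    for tool in trace:
        prereq = required_before.get(tool)
        if prereq is not None and prereq not in seen:
            return False
        seen.add(tool)
    return True

# For a synthesis-planning task, say literature search must precede route
# design, which must precede any instrument run:
PREREQS = {"design_route": "search_literature", "run_instrument": "design_route"}

print(valid_tool_order(
    ["search_literature", "design_route", "run_instrument"], PREREQS))  # True
print(valid_tool_order(["run_instrument"], PREREQS))                    # False
```

A check like this says nothing about chemical correctness on its own, which is why it would sit alongside, not replace, human expert judgment.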
Where do you see the most promising applications for LLMs in chemical research?
LLMs can help researchers navigate vast literature, extract relevant information, and identify research gaps or contradictions across papers. They also show great potential for planning tasks, such as designing experiments and generating testable hypotheses. Automation is another key area. LLMs can translate between natural language and programming languages: they can take an English description of an experiment and convert it into executable code, making it easier to control laboratory equipment and cloud labs.
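As a rough illustration of that translation step, the sketch below shows the kind of program an LLM might emit from an English instruction. The LabDeck class and its methods are hypothetical stand-ins for a real cloud-lab or liquid-handler API, and here the “generated” calls are written out by hand.

```python
# A toy sketch of the natural-language-to-code pattern: in a real system an
# LLM would emit the function calls at the bottom. LabDeck and its methods
# are hypothetical stand-ins for a lab-automation API.

class LabDeck:
    def dispense(self, reagent: str, volume_ul: float, well: str) -> None:
        print(f"dispense {volume_ul} uL of {reagent} into {well}")

    def heat(self, well: str, temp_c: float, minutes: int) -> None:
        print(f"heat {well} to {temp_c} C for {minutes} min")

# English request: "Add 50 uL of buffer to well A1, then heat it to 37 °C
# for 30 minutes." The LLM's job is to translate that into the calls below:
deck = LabDeck()
deck.dispense("buffer", 50.0, "A1")
deck.heat("A1", 37.0, 30)
```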
The common thread is that LLMs excel when they are orchestrating existing tools and data sources. The most powerful implementations leverage their natural language capabilities to make complex research workflows more accessible and integrated.