Can Language Models Produce New Scientific Ideas? Introducing Contextualized Literature-Based Discovery (C-LBD)


The core principle of literature-based discovery (LBD) is the generation of hypotheses from the existing literature. LBD focuses on hypothesizing links between concepts that have not previously been studied together (such as novel drug–disease relationships), with drug development as its primary application area.
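
Classic LBD is often traced to Swanson's "ABC" model: if concept A co-occurs with B in some papers, and B co-occurs with C in others, but A and C have never been studied together, then A–C is a candidate hypothesis (Swanson's famous example linked fish oil to Raynaud's disease via blood viscosity). Below is a minimal sketch of that pattern over toy data; real systems mine co-occurrences from large corpora and rank the candidates, but the core mechanism is just this.

```python
# Minimal sketch of "ABC-model" literature-based discovery: propose a
# link A--C when A and C never co-occur in any paper but both co-occur
# with some bridging concept B. Toy data only.
from itertools import combinations

# Hypothetical corpus: each paper is the set of concepts it mentions.
papers = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "Raynaud's disease"},
    {"magnesium", "vascular spasm"},
    {"vascular spasm", "migraine"},
]

def abc_candidates(papers):
    # Record which concept pairs have already been studied together.
    seen_together = set()
    neighbors = {}
    for concepts in papers:
        for a, b in combinations(sorted(concepts), 2):
            seen_together.add((a, b))
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    # Propose A--C pairs that share a bridge B but were never co-mentioned.
    for a, c in combinations(sorted(neighbors), 2):
        if (a, c) in seen_together:
            continue
        bridges = neighbors[a] & neighbors[c]
        if bridges:
            yield a, c, bridges

for a, c, bridges in abc_candidates(papers):
    print(f"candidate link: {a} -- {c} (via {', '.join(sorted(bridges))})")
```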

These systems have since evolved into machine-learning pipelines, but the setup has significant limitations. When the "language of scientific ideas" is reduced to bare concept pairs, the resulting hypotheses cannot be very expressive. In addition, LBD does not replicate the factors that human scientists weigh during ideation, such as the setting of the intended application, prerequisites and constraints, motivations, and open problems. Finally, the creative, inductive nature of science, where new ideas and their recombinations constantly emerge, is not captured in the transductive LBD setting, where all concepts are known a priori and merely need to be connected.

Researchers at the University of Illinois at Urbana-Champaign, the Hebrew University of Jerusalem, and the Allen Institute for Artificial Intelligence (AI2) address these shortcomings with Contextualized Literature-Based Discovery (C-LBD), a new setting and modeling framework. They are the first to use a natural-language context to constrain the generation space for LBD, and they also break away from classic LBD on the output side by having the system generate sentences rather than concept links.

Inspiration for C-LBD comes from the vision of an AI-powered assistant that can offer suggestions in plain English, including new ideas and connections. The assistant accepts as input (1) background context, such as current challenges, motivations, and constraints, and (2) a seed term that should be the primary focus of the generated scientific idea. Given this input, the team investigates two forms of C-LBD: one that generates a full sentence describing an idea and another that generates only a salient component of the idea.
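
To make the setting concrete, here is a minimal sketch of what a C-LBD instance and its two output variants might look like. The field names and example texts are hypothetical illustrations, not the paper's actual data format.

```python
# Hypothetical C-LBD instance: background context plus a seed term,
# with the two task variants shown as example outputs.
from dataclasses import dataclass

@dataclass
class CLBDInput:
    background: str  # context: current challenges, motivations, constraints
    seed_term: str   # concept the generated idea should center on

example = CLBDInput(
    background=("Current dialogue systems struggle to stay consistent "
                "over long conversations under limited memory budgets."),
    seed_term="retrieval augmentation",
)

# Variant 1 -- full-sentence hypothesis generation: a complete sentence
# describing a new idea grounded in the background and the seed term.
hypothesis_sentence = (
    "Use retrieval augmentation over a compressed conversation store to "
    "keep long-horizon dialogue consistent within a fixed memory budget."
)

# Variant 2 -- salient-span generation: only the key component of the
# idea, rather than a full sentence.
hypothesis_span = "retrieval over compressed conversation memory"
```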

To this end, they introduce a novel modeling framework for C-LBD that can retrieve inspirations from heterogeneous sources (such as a scientific knowledge graph) and use them to form new hypotheses. They also introduce an in-context contrastive model that uses the background sentences as negatives, discouraging the model from simply echoing its input and encouraging more creative output. Unlike most LBD research, which targets biomedical applications, these experiments are conducted on computer-science articles. From the 67,408 papers in the ACL Anthology, the team automatically curated a new dataset using IE systems, complete with task, method, and background sentence annotations.
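
The in-context contrastive idea can be sketched as an InfoNCE-style objective: the representation of the generated hypothesis is pulled toward a reference hypothesis and pushed away from embeddings of the input background sentences, penalizing mere input emulation. The following is an illustrative reconstruction under assumed tensor shapes and a cosine-similarity formulation, not the paper's exact loss.

```python
# Sketch of an "in-context contrastive" objective with background
# sentences as negatives. Shapes and the cosine/InfoNCE formulation
# are assumptions for illustration.
import torch
import torch.nn.functional as F

def in_context_contrastive_loss(pred_emb, gold_emb, background_embs, tau=0.1):
    """
    pred_emb:        (d,)    embedding of the generated hypothesis
    gold_emb:        (d,)    embedding of the reference hypothesis
    background_embs: (n, d)  embeddings of the input background
                             sentences, used as negatives
    """
    # Cosine similarities of the prediction to the gold hypothesis and
    # to each background sentence, scaled by a temperature.
    candidates = torch.cat([gold_emb.unsqueeze(0), background_embs], dim=0)
    sims = F.cosine_similarity(pred_emb.unsqueeze(0), candidates, dim=-1) / tau
    # InfoNCE: the gold hypothesis (index 0) is the positive class, so
    # copying a background sentence raises the loss.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random embeddings:
d, n = 16, 4
loss = in_context_contrastive_loss(torch.randn(d), torch.randn(d), torch.randn(n, d))
print(float(loss))
```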

Because the dataset focuses on the NLP field specifically, researchers in that area can more easily analyze the results. Experimental results from automated and human evaluations reveal that retrieval-augmented hypothesis generation significantly outperforms previous methods, but that current state-of-the-art generative models are still inadequate for this task.

The team believes that expanding C-LBD to include multimodal analysis of formulas, tables, and figures, providing a more comprehensive and enriched background context, is an intriguing direction for future work. The use of more advanced LLMs like GPT-4, currently in development, is another avenue to investigate.