The “Thought Experiments” proposed by Google AI to improve language models’ moral reasoning


Language models have made significant strides in natural language processing tasks. However, deploying large language models (LLMs) in real-world applications requires addressing their deficit in moral reasoning capabilities. To tackle this challenge, a Google research team introduces a groundbreaking framework called “Thought Experiments,” which utilizes counterfactuals to improve a language model’s moral reasoning. This innovative approach has demonstrated an impressive 9-16% increase in accuracy in the Moral Scenarios task.

The Thought Experiments Framework

The Thought Experiments framework is a multi-step prompting approach that iteratively refines the model’s responses. The researchers summarize the framework’s steps as follows:

1. Pose counterfactual questions: The model is presented with Moral Scenarios questions without answer options.

2. Answer counterfactual questions: Questions generated in the previous step are presented to the model, which is prompted to answer them.

3. Summarize: The model is asked to summarize its thoughts using the counterfactual questions and answers.

4. Choose: Multiple decodes from the previous step are provided, and the model selects the best one. This step is necessary due to the multiple ways of considering a situation morally.

5. Answer: The chosen summary and original answer choices are presented to the model, allowing it to provide a final zero-shot answer.

To evaluate the effectiveness of the Thought Experiments framework, the research team conducted experiments on the Moral Scenarios subtask within the MMLU benchmark. They compared their framework to four baselines for the zero-shot prompting approach: direct zero-shot, zero-shot Chain-of-Thought (CoT) with and without self-consistency.

The results were promising. The zero-shot Thought Experiments framework achieved an accuracy of 66.15% and 66.26% without and with self-consistency, respectively. This marks a significant improvement of 9.06% and 12.29% over the direct zero-shot baseline, as well as 12.97% and 16.26% over the CoT baseline.

The research showcases the effectiveness of the Thought Experiments prompting framework in enhancing moral reasoning within the Moral Scenarios task. It emphasizes the potential for future work to explore open-ended generations for addressing more ambiguous cases, such as moral dilemmas.

In summary, the Google research team’s innovative Thought Experiments framework presents a promising solution to augment the moral reasoning capabilities of language models. By incorporating counterfactuals and a multi-step prompting approach, this framework demonstrates significant improvements in accuracy. As the development of language models continues, it is crucial to prioritize responsible and ethical AI implementations, ensuring their alignment with human moral values.