Stanford AI Releases Stanford Human Preferences (SHP) Dataset: A Collection Of 385K Naturally Occurring Collective Human Preferences Over Text

Machine learning and deep learning models are pervasive in almost every sector today, and improving these models remains one of the main obstacles in ML and DL projects across industries. Reinforcement Learning from Human Feedback (RLHF) is a technique that improves a language model directly from human feedback using methods from reinforcement learning. Thanks to RLHF, language models trained on large corpora of text can begin to align with complex human values. Human feedback is used to train models like ChatGPT; however, acquiring this data is quite expensive.

New Stanford research has released Stanford Human Preferences (SHP), a dataset of 385,000 collective human preferences over responses to questions and instructions in 18 distinct subject areas on Reddit, ranging from cooking to legal advice. Each SHP preference indicates the helpfulness of one response over another, given a certain context and two alternative responses.

Each instance consists of a question/instruction posted on Reddit and two top-level comments, one of which is (collectively) preferred over the other. SHP exploits the fact that if comment A was written after comment B yet has a higher score, then A is ostensibly the more preferred response. Had A been written before B, its higher score could simply be the result of greater visibility, so no such conclusion could be drawn.
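As a rough illustration of that rule, here is a minimal sketch of how such preference pairs might be constructed from a post's comments. It is not the team's actual pipeline, and the field names are hypothetical:

```python
# Hypothetical sketch of the pairing rule described above, not the authors'
# actual data pipeline. The keys 'text', 'score', and 'created_utc' are
# assumptions for illustration.

def make_preference_pairs(post_text, comments):
    """comments: list of dicts with keys 'text', 'score', 'created_utc'."""
    pairs = []
    for a in comments:
        for b in comments:
            if a is b:
                continue
            # A counts as preferred over B only if A has the higher score
            # AND was written later, so its score advantage cannot be
            # explained by longer visibility.
            if a["score"] > b["score"] and a["created_utc"] > b["created_utc"]:
                pairs.append({
                    "context": post_text,
                    "preferred": a["text"],
                    "dispreferred": b["text"],
                    "score_ratio": a["score"] / max(b["score"], 1),
                })
    return pairs
```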

This gives researchers two distributions to work with: the data in SHP is naturally occurring and human-written, while the responses in Anthropic's HH-RLHF dataset are machine-written.

The team also published preference models, called SteamSHPs, that are trained to predict which response is more likely to be helpful. The SteamSHP preference models are finetuned FLAN-T5 models, ready to use for RLHF reward modeling and natural language processing (NLP) evaluation. SteamSHP-XL predicts human preference labels with 72.8% accuracy across all domains, doing better on topics such as legal advice (80.7%) than philosophy (69.1%).
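As a hedged example of how such a FLAN-T5-based preference model could be queried with Hugging Face transformers, the sketch below assumes the SteamSHP-XL checkpoint is published under the model ID stanfordnlp/SteamSHP-flan-t5-xl and uses a simple POST / RESPONSE A / RESPONSE B prompt; both the ID and the template should be checked against the official model card:

```python
# Minimal inference sketch (not an official snippet). The model ID and the
# prompt template are assumptions; verify them against the SteamSHP model card.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "stanfordnlp/SteamSHP-flan-t5-xl"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

context = "How do I keep my sourdough starter alive while I am away for two weeks?"
response_a = "Refrigerate it; at fridge temperature a weekly feeding is enough."
response_b = "You must feed it twice a day no matter what, or it will die."

# Assumed input format: the post plus the two candidate responses, with the
# model generating "A" or "B" for the preferred one.
prompt = (
    f"POST: {context}\n\n"
    f"RESPONSE A: {response_a}\n\n"
    f"RESPONSE B: {response_b}\n\n"
    "Which response is better? RESPONSE"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "A"
```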

As SteamSHPs can be used as scalar reward models, SHP and SteamSHP together should be extremely useful for RLHF. The team believes SHP will help determine which kinds of human preference data are most effective for developing and refining a preference model, which could ultimately make collecting additional human preference data much quicker and less expensive. For instance, finetuning the preference model only on strong preferences, where one response scores far higher than the other, reportedly improved performance, because such examples contain more V-usable information about the preference label and offer a stronger signal.
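One way to read a scalar reward out of such a preference model is to take the probability it assigns to choosing a given response over a fixed baseline. The sketch below illustrates that idea under the same assumed model ID and prompt format as above; it is an illustrative approach, not necessarily the team's recommended recipe:

```python
# Hedged sketch: derive a scalar reward by reading the probability of the
# first decoded token being "A". The model ID, prompt format, and the
# fixed-baseline comparison are assumptions for illustration.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "stanfordnlp/SteamSHP-flan-t5-xl"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def reward(context: str, candidate: str, baseline: str = ".") -> float:
    """Return P(model prefers `candidate` over `baseline`) as a reward in [0, 1]."""
    prompt = (
        f"POST: {context}\n\nRESPONSE A: {candidate}\n\n"
        f"RESPONSE B: {baseline}\n\nWhich response is better? RESPONSE"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    decoder_input = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    token_a = tokenizer("A", add_special_tokens=False).input_ids[0]
    return probs[token_a].item()
```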


Check out the Dataset. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 14k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.


Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.