Pairwise Ranking Prompting (PRP) Is a New Technique That Lightens the Load on LLMs

Compared to their supervised counterparts, which may be trained on millions of labeled examples, Large Language Models (LLMs) like GPT-3 and PaLM have shown impressive performance on various natural language tasks, even in the zero-shot setting. However, using LLMs to solve the basic text ranking problem has yielded mixed results: existing approaches often perform noticeably worse than trained baseline rankers. The lone exception is a new strategy that relies on the massive, black-box, commercial GPT-4 system.

They argue that relying on such black-box systems is not ideal for academic researchers because of significant cost constraints and limited access, although they acknowledge the value of such explorations in demonstrating that LLMs can handle ranking tasks. In this study, they first explain why LLMs struggle with ranking problems under the pointwise and listwise formulations of current approaches, which are also brittle: ranking metrics can drop by over 50% when the input document order changes. Pointwise techniques require LLMs to produce calibrated prediction probabilities before sorting, which is known to be exceedingly challenging, and generation-only LLM APIs (like GPT-4) do not expose such probabilities at all.
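As a point of reference, here is a minimal sketch of the pointwise formulation they critique. The prompt text and the `yes_logprob()` scorer are illustrative assumptions, not the paper's exact setup; the key requirement is the one noted above, namely access to calibrated token probabilities, which generation-only APIs do not provide.

```python
# Minimal sketch of pointwise ranking: score each document independently by
# the model's probability of answering "Yes", then sort. The scores must be
# well calibrated across documents for the resulting order to be meaningful.
import math
from typing import List

POINTWISE_PROMPT = (
    'Passage: {doc}\nQuery: {query}\n'
    'Does the passage answer the query? Answer "Yes" or "No".'
)

def yes_logprob(prompt: str) -> float:
    """Placeholder: log P("Yes" | prompt) from a scoring LLM API
    (hypothetical helper; requires a model that exposes token log-probabilities)."""
    raise NotImplementedError("plug in a scoring LLM API here")

def rank_pointwise(query: str, docs: List[str]) -> List[str]:
    """Rank documents by each one's independent relevance probability."""
    scores = [math.exp(yes_logprob(POINTWISE_PROMPT.format(doc=d, query=query)))
              for d in docs]
    return [d for _, d in sorted(zip(scores, docs), reverse=True)]
```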

For listwise techniques, LLMs frequently produce inconsistent or meaningless outputs, even with instructions that seem extremely obvious to humans. Empirically, they find that listwise ranking prompts from prior work yield entirely meaningless results on moderate-sized LLMs. These findings suggest that current, widely used LLMs do not fully comprehend ranking tasks, possibly because their pre-training and fine-tuning procedures lack ranking awareness. To considerably reduce task complexity for LLMs and address the calibration issue, researchers from Google Research propose the pairwise ranking prompting (PRP) paradigm, which uses the query and a pair of documents as the prompt for ranking tasks. PRP is built on a straightforward prompt design and supports both generation and scoring LLM APIs by default.
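To make the pairwise formulation concrete, here is a minimal sketch assuming a generic `llm()` text-completion call. The prompt wording, the both-orderings trick, and the all-pairs win-count aggregation are illustrative simplifications, not the paper's verbatim recipe.

```python
# Minimal sketch of pairwise ranking prompting (PRP): ask the LLM which of two
# passages is more relevant, query both orderings to reduce order sensitivity,
# and aggregate pairwise wins into a ranking.
from itertools import combinations
from typing import List

def llm(prompt: str) -> str:
    """Placeholder for any generation LLM API (e.g. FLAN-UL2 served locally)."""
    raise NotImplementedError("plug in your model call here")

PRP_PROMPT = (
    'Given a query "{query}", which of the following two passages is more '
    "relevant to the query?\n\n"
    "Passage A: {doc_a}\n\nPassage B: {doc_b}\n\n"
    "Output Passage A or Passage B:"
)

def pairwise_preference(query: str, doc_a: str, doc_b: str) -> float:
    """Return 1.0 if doc_a is preferred, 0.0 if doc_b is, 0.5 on disagreement.
    Both orderings are queried so the result does not depend on input order."""
    ab = llm(PRP_PROMPT.format(query=query, doc_a=doc_a, doc_b=doc_b))
    ba = llm(PRP_PROMPT.format(query=query, doc_a=doc_b, doc_b=doc_a))
    a_wins = ("Passage A" in ab) + ("Passage B" in ba)
    return a_wins / 2.0

def rank_all_pairs(query: str, docs: List[str]) -> List[str]:
    """All-pairs variant: score each document by its total pairwise wins, then sort."""
    wins = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        p = pairwise_preference(query, docs[i], docs[j])
        wins[i] += p
        wins[j] += 1.0 - p
    return [d for _, d in sorted(zip(wins, docs), reverse=True)]
```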

They discuss several PRP variants to address efficiency concerns. PRP results are the first in the literature to achieve state-of-the-art ranking performance on traditional benchmark datasets using moderate-sized, open-sourced LLMs. On TREC-DL2020, PRP based on the 20B-parameter FLAN-UL2 model exceeds the previous best method in the literature, which is based on the black-box, commercial GPT-4 with an estimated 50x larger model size, by more than 5% at NDCG@1. On TREC-DL2019, PRP beats existing solutions, such as the 175B-parameter InstructGPT, by over 10% on practically all ranking measures, and falls behind the GPT-4 solution only on the NDCG@5 and NDCG@10 metrics. Additionally, they present competitive results with 3B- and 13B-parameter FLAN-T5 models to illustrate the effectiveness and applicability of PRP.
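The article does not spell out these variants, so the following is only a plausible sketch of how linear cost per pass could be achieved: a bubble-sort-style sliding pass that takes any pairwise comparison function (such as `pairwise_preference` above) and promotes one strong document toward the top on each pass.

```python
from typing import Callable, List

def rank_sliding_passes(
    query: str,
    docs: List[str],
    compare: Callable[[str, str, str], float],
    k: int = 10,
) -> List[str]:
    """Run k bubble-sort-style passes over an initial ordering, where
    compare(query, doc_a, doc_b) > 0.5 means doc_a is preferred. Each pass
    makes O(N) pairwise LLM calls and bubbles one more strong document toward
    the top, so a good top-k is obtained in O(k * N) calls instead of O(N^2)."""
    docs = list(docs)
    for _ in range(k):
        # One backward pass: compare adjacent documents from the bottom of the
        # current list upward, swapping whenever the lower-ranked one wins.
        for i in range(len(docs) - 1, 0, -1):
            if compare(query, docs[i], docs[i - 1]) > 0.5:
                docs[i], docs[i - 1] = docs[i - 1], docs[i]
    return docs
```

This kind of pass is most useful when only the top of the ranked list matters, since documents far down the list may never be compared against the leaders.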

They also review PRP's additional advantages, such as its support for both generation and scoring LLM APIs and its insensitivity to input order. In conclusion, this work makes three contributions:

• They demonstrate, for the first time, that pairwise ranking prompting works well for zero-shot ranking with LLMs. Their findings are based on moderate-sized, open-sourced LLMs, in contrast to existing systems that employ black-box, commercial, and considerably larger models.

• PRP produces state-of-the-art ranking performance using straightforward prompting and scoring mechanisms, a finding that should make future research in this area more accessible.

• They examine several efficiency improvements that achieve linear complexity and demonstrate good empirical performance.