Large Language Models can Accurately Predict Searcher Preferences

2 minute read

Published: January 02, 2025

Search Preference LLM labeling paper

We focus on the search preference paper. Incidentally, the authors work on the Bing search engine at Microsoft.

Basic Idea

The main idea of the paper is to use an LLM as a judge, something initially described by researchers working on LLMs rather than IR. The technique is nonetheless often found useful, and there are probably other ways to leverage it that have yet to be discovered.

Notes

Using LLMs as judges is a large and varied topic; the approach may work well in some domains and poorly in others. It really depends on the specifics of the task requiring labels and the specifics of the system used to generate the labels. This paper focuses on the search domain with queries and relevance assessments.

In the paper, the authors investigate five prompt variations called R, D, N, A, and M. Each variation corresponds to some text that may be included in the prompt, and the authors looked at the various combinations of these. For performance metrics they look at Cohen's kappa, MAE, and AUC using a stratified sample of 3000 query-document pairs, cf. Table 1. To get confidence intervals on these metrics the authors bootstrapped. The DNA prompt variant, which adds the D, N, and A substrings to the prompt text, had the highest Cohen's kappa of 0.64.
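To make the agreement metric and the bootstrap concrete, here is a minimal sketch in plain Python. It is not the authors' code: the toy labels, the function names `cohen_kappa` and `bootstrap_ci`, the percentile-style interval, and the number of resamples are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case (both raters used one identical label everywhere).
        # Returning 1.0 here is a convention picked for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(labels_a, labels_b, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(labels_a, labels_b).

    Pairs are resampled together so each human label stays attached
    to the LLM label for the same query-document pair.
    """
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(stat([labels_a[i] for i in idx],
                          [labels_b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): 1 = relevant, 0 = not relevant.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

kappa = cohen_kappa(human, llm)
ci_low, ci_high = bootstrap_ci(human, llm, cohen_kappa)
print(f"kappa={kappa:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

With only ten pairs the interval is very wide; on a sample of the size used in the paper it would be much tighter. The same `bootstrap_ci` helper accepts any paired statistic, so MAE could be passed in place of `cohen_kappa`.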

In Section 4.5 they also investigate whether query difficulty correlates with human assessments using the Precision@10 metric. The idea is interesting, and one critique here is that in practical search scenarios you may not have ground truth relevance assessments with which to determine relevance. They retrieve to depth 100 on a TREC dataset and use rank-biased overlap (RBO) to compare the different lists. For the Bing search engine the authors report:

“We have been using LLMs, in conjunction with expert human labellers, for most of our offline metrics since late 2022.”

and find good utility in doing so. The ground truth corpus they use comprises queries, descriptions of need, metadata like location and date, and at least two example results per query. Results are tagged, again by the real searcher, as being good, neutral, or bad.

The Bing team monitors the health of the system by sampling LLM-judged results and reviewing them to track error and similar metrics. They also mention that in both the TREC and Bing datasets the task descriptions are very clear and that they use a single LLM, whereas Liang et al. saw large differences from model to model over a range of tasks.
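For the rank-biased overlap (RBO) comparison of ranked lists mentioned in the notes above, the following is a small sketch of the extrapolated form of RBO for two lists evaluated to the same depth. The function name `rbo_ext`, the persistence value `p=0.9`, and the toy document ids are assumptions; the notes do not record which RBO variant or `p` the paper used.

```python
def rbo_ext(run_a, run_b, p=0.9):
    """Extrapolated rank-biased overlap for two ranked lists.

    Both lists are evaluated to the depth of the shorter one and are
    assumed to contain no duplicate items. Smaller p puts more weight
    on the top ranks.
    """
    k = min(len(run_a), len(run_b))
    if k == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    overlap = 0          # size of the intersection of the two prefixes
    weighted_sum = 0.0   # sum over depths d of p^(d-1) * agreement at d
    agreement = 0.0
    for d in range(1, k + 1):
        x, y = run_a[d - 1], run_b[d - 1]
        if x == y:
            overlap += 1
        else:
            # A new item counts once it has appeared in the other prefix.
            if x in seen_b:
                overlap += 1
            if y in seen_a:
                overlap += 1
        seen_a.add(x)
        seen_b.add(y)
        agreement = overlap / d
        weighted_sum += (p ** (d - 1)) * agreement
    # Truncated sum plus the extrapolation term, which assumes the
    # agreement seen at depth k continues for the unseen tail.
    return (1 - p) * weighted_sum + agreement * (p ** k)


# Toy rankings (hypothetical document ids).
ranking_human = ["d1", "d2", "d3"]
ranking_llm = ["d2", "d1", "d3"]
print(round(rbo_ext(ranking_human, ranking_llm), 3))
```

Identical lists score 1 and disjoint lists score 0 under this form, so it is easy to sanity check before running it on depth-100 lists.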

Code for paper

None, but a paper by folks at Waterloo reimplements the idea.

Lucas Roberts
