LLM-as-a-judge on code search problems, model choice, and representations
Search Preference for code LLM labeling
In this project I built an application to facilitate relevance annotations for text search queries over indexed repository code.
The repository containing the application and some of the artifacts is available.
Process of creating, indexing, searching, and labeling code and queries
Creating structured code for Indexing
I used the tree-sitter parser to generate the structured code for indexing.
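As a rough sketch of this step, the snippet below uses the py-tree-sitter bindings to split a source file into function-level units, each of which becomes one indexable document. The package versions (py-tree-sitter >= 0.22 with the tree_sitter_python grammar package) and the function-level granularity are assumptions for illustration, not necessarily the exact pipeline used in the project.

```python
# Sketch: extract function-level units from Python source with tree-sitter so
# each function body can be indexed as one document.
# Assumes py-tree-sitter >= 0.22 plus the tree_sitter_python grammar package;
# older versions wire up Language/Parser slightly differently.
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def extract_functions(source: str):
    """Return (name, code) pairs for every function definition in the file."""
    tree = parser.parse(bytes(source, "utf8"))
    units = []
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "function_definition":
            name_node = node.child_by_field_name("name")
            name = name_node.text.decode("utf8") if name_node else "<anonymous>"
            units.append((name, node.text.decode("utf8")))
        stack.extend(node.children)
    return units

if __name__ == "__main__":
    sample = "def push(stack, item):\n    stack.append(item)\n"
    print(extract_functions(sample))
```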
The corpora (or corpuses, if you prefer) are GitHub repositories that contain collections of data structure implementations in various languages.
We selected C, Python, Go, JavaScript, and Java as five languages for which it was possible to index the repositories and which maintain a relatively similar collection of data structures.
Table 2 in the manuscript compares the repositories by programming language, showing which data structures are and are not common across them.
Queries and Human Searching
The queries are constructed so that some are specific to a given abstract data type and some apply across abstract data types. Queries are formulated to cover all three types described by Broder: navigational, informational, and transactional. The queries were executed and relevances captured by human users for all 5 repositories.
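To make the taxonomy concrete, here are illustrative examples of the three query types as they might look against a data-structure repository; these are hypothetical examples of mine, not queries from the actual query set.

```python
# Hypothetical examples of Broder's three query types for a data-structure
# repository (illustrative only, not drawn from the paper's query set).
example_queries = {
    "navigational": "singly linked list implementation in the linked_list module",
    "informational": "how does a red-black tree rebalance after insertion",
    "transactional": "code to reverse a singly linked list in place",
}
```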
Queries were executed on the same repositories using 3 different retrieval methods: BM25 (sparse), CodeBERT (semantic), and CodeT5+ (semantic). For the semantic methods we used cosine distance, and for sparse retrieval we used the default hyperparameter settings in the bm25s package. Each retriever executed the same collection of queries, with relevances captured for all 5 repositories to facilitate comparisons.
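A minimal sketch of the two retrieval styles follows, assuming the bm25s package with its defaults on the sparse side and a Hugging Face CodeBERT checkpoint with CLS pooling plus cosine similarity on the semantic side; the checkpoint, pooling choice, and toy corpus are illustrative assumptions rather than the exact experimental setup.

```python
# Sketch: sparse (BM25 via bm25s, default settings) vs. semantic (CodeBERT +
# cosine similarity) retrieval over a toy corpus of code snippets.
import bm25s
import torch
from transformers import AutoTokenizer, AutoModel

corpus = [
    "def push(stack, item): stack.append(item)",
    "def pop(stack): return stack.pop()",
    "def enqueue(queue, item): queue.append(item)",
]

# --- Sparse retrieval: BM25 with bm25s defaults ---
bm25 = bm25s.BM25()
bm25.index(bm25s.tokenize(corpus))

def bm25_search(query: str, k: int = 2):
    results, scores = bm25.retrieve(bm25s.tokenize(query), k=k)
    return [(corpus[int(i)], float(s)) for i, s in zip(results[0], scores[0])]

# --- Semantic retrieval: CodeBERT embeddings ranked by cosine similarity ---
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        vecs = enc(**batch).last_hidden_state[:, 0]  # CLS pooling (one common choice)
    return torch.nn.functional.normalize(vecs, dim=-1)

doc_vecs = embed(corpus)

def semantic_search(query: str, k: int = 2):
    sims = (embed([query]) @ doc_vecs.T).squeeze(0)
    top_idx = sims.topk(k).indices.tolist()
    return [(corpus[i], float(sims[i])) for i in top_idx]

print(bm25_search("push an item onto a stack"))
print(semantic_search("push an item onto a stack"))
```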
Relevances, Human and LLM
The relevances are then also generated by a host of LLM models, both open source and proprietary. We used Nova Lite 1 from Amazon, GPT-4o-mini from OpenAI, Gemini 2.0 from Google, and Llama 4 from Meta.
We compare the results from the LLM models against the human preferences for each programming language (5) and each retriever (3), for a total of 15 comparisons.
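As an example of what such a judging call looks like, here is a minimal sketch using the OpenAI Python client with GPT-4o-mini. The prompt wording, the binary 0/1 relevance scale, and the judge_relevance helper are my own assumptions for illustration, not the paper's exact judging protocol.

```python
# Sketch of an LLM-as-a-judge relevance call (OpenAI client, GPT-4o-mini).
# Prompt wording and the binary relevance scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_relevance(query: str, code_snippet: str) -> int:
    """Ask the model whether a retrieved code snippet is relevant to a query."""
    prompt = (
        "You are judging code search results.\n"
        f"Query: {query}\n"
        f"Code:\n{code_snippet}\n"
        "Answer with a single character: 1 if the code is relevant to the query, "
        "0 if it is not."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0

# Example: label one retrieved snippet for one query.
print(judge_relevance("push an item onto a stack",
                      "def push(stack, item): stack.append(item)"))
```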
Scaling to other Programming Languages
One challenge is that many code datasets, whether they are search related or not, cover only a single programming language.
Adding additional programming languages to your dataset incurs a linear scaling cost, or so it is claimed.
To address this challenge we looked at applying transpilers: source-to-source compilers in which both the source and target languages are high-level languages.
Not all query-code pairs are transpilable, and the fraction that is depends on the transpiler used. We were able to transpile the Python code in the query-result pairs to C for 50% of the records in the CosQA code search benchmark dataset.
The largest class of transpilation failures stems from language differences related to looping constructs, e.g. list comprehensions in Python that cannot be naively transpiled to C code.
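The example below illustrates this failure mode: the list comprehension has no direct C counterpart, while the equivalent explicit loop maps straightforwardly onto a C for loop. The function names are illustrative, not taken from CosQA.

```python
def squares_comprehension(n):
    # A list comprehension builds and fills a list in a single expression;
    # a naive transpiler has no single C construct to map it onto.
    return [i * i for i in range(n)]

def squares_loop(n):
    # The explicit-loop version corresponds almost line for line to C:
    #   int *out = malloc(n * sizeof(int));
    #   for (int i = 0; i < n; i++) { out[i] = i * i; }
    out = []
    for i in range(n):
        out.append(i * i)
    return out
```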
Findings
We find that in some cases the performance of the LLM models matches the performance metrics of the human annotators.
We also find that the choice of sparse vs. semantic retrieval affects how well the LLM-as-a-judge relevances align with human relevances, and which choice is better depends on the programming language being used.
The LLM as a judge was approximately 50% aligned with the relevances from CosQA after transpiling to C, and this value was nearly the same regardless of which judge model was used.
Relevance of Research or Why should I care?
For those training code completion models it is common to use an LLM as a judge to generate relevance data for training. This work suggests that more careful consideration of the relevance pipeline can improve the accuracy of the training data.
For code repository hosting providers that support search features, this suggests that search performance for human users might benefit from language facets and from coupling models to the languages where they perform best, to better satisfy customer needs.
If you’re looking to add a new programming language, or to increase coverage of a programming language in your training data or search benchmarks, transpilation is worth considering as a starting point for high-quality data without the need for time-consuming and costly additional data collection.
Additional experiments since paper acceptance
Subsequent experimentation, not in the paper, suggests that for a large proportion of these failure modes it is possible to apply custom rewrite rules to the Python abstract syntax trees so that the code can successfully be transpiled to C. There also exist other transpilers, e.g. py2many for Python, which support a broad range of target programming languages.
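As a concrete sketch of what such a rewrite rule might look like (my illustration, not the actual rule set from these experiments), the snippet below uses Python's ast module to lower a simple single-generator list comprehension into an explicit loop that a Python-to-C transpiler can then handle.

```python
# Sketch of an AST rewrite rule: lower `name = [expr for x in iterable]` into an
# explicit loop. Handles only single-generator, condition-free comprehensions
# assigned to a simple name; a real rule set would cover more shapes.
import ast

class LowerListComp(ast.NodeTransformer):
    def visit_Assign(self, node):
        self.generic_visit(node)
        if (isinstance(node.value, ast.ListComp)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and len(node.value.generators) == 1
                and not node.value.generators[0].ifs):
            result = node.targets[0].id
            gen = node.value.generators[0]
            # result = []
            init = ast.Assign(
                targets=[ast.Name(id=result, ctx=ast.Store())],
                value=ast.List(elts=[], ctx=ast.Load()),
            )
            # for <target> in <iter>: result.append(<elt>)
            loop = ast.For(
                target=gen.target,
                iter=gen.iter,
                body=[ast.Expr(value=ast.Call(
                    func=ast.Attribute(
                        value=ast.Name(id=result, ctx=ast.Load()),
                        attr="append", ctx=ast.Load()),
                    args=[node.value.elt], keywords=[]))],
                orelse=[],
            )
            return [init, loop]
        return node

src = "squares = [i * i for i in range(n)]"
tree = ast.fix_missing_locations(LowerListComp().visit(ast.parse(src)))
print(ast.unparse(tree))
# squares = []
# for i in range(n):
#     squares.append(i * i)
```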
