Challenges in Reproducing LLM Evaluation Results
Introduction
In my previous post, Creating an Evaluation Framework, I built a simple evaluation framework for LLMs and attempted to reproduce Meta’s Llama 3.1 8B Massive Multitask Language Understanding (MMLU) [1] macro score of 73.0 [2]. TL;DR: I couldn’t quite get there, but I did journey pretty deep into the messy world of LLM evaluation.
The Reproduction Challenge
I initially achieved a micro score of 66.6 and a macro score of 67.0 [3]. That’s a 6.0 point gap from Meta’s published score. That post focused on building out the evaluation framework, so I didn’t spend time investigating, but it left me wondering.
6.0 points in MMLU is a lot. It represents almost half the difference between Meta’s 8B and 70B models (73.0 and 86.0 respectively).
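To make the macro/micro distinction concrete (see footnote 3), here is a minimal sketch of how the two averages differ. The per-subject numbers are made up for illustration and are not my actual results:

from statistics import mean

# Hypothetical per-subject results: subject -> (num_correct, num_questions)
results = {
    "abstract_algebra": (61, 100),
    "anatomy": (98, 135),
    "astronomy": (110, 152),
}

# Micro average: pool every question across subjects, then divide.
total_correct = sum(c for c, n in results.values())
total_questions = sum(n for c, n in results.values())
micro = total_correct / total_questions

# Macro average: score each subject first, then average the subject scores.
macro = mean(c / n for c, n in results.values())

print(f"micro={micro:.3f}, macro={macro:.3f}")

Because subjects have different numbers of questions, the two averages generally disagree, which is why comparing my micro score to Meta’s macro score muddied the water in the first place.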
When thinking about what could have caused this difference, I remembered something that Hamel Husain said in Mastering LLMs: A Conference For Developers & Data Scientists. The gist was that you should never trust what is going into your model. You need to look at it, understand any template that is being applied, and verify that it is what you want. He even wrote a rather provocative blog post on the subject [4].
Looking under the hood
Truth be told, I had Claude handle this part from my previous post. It had seemed to work, and wasn’t the focus of the post, so I hadn’t delved deeper. Now was the time to change that.
So, first things first, let’s look under the hood [5]:

def _format_prompt(self, dev_df, test_df, test_idx):
    # Instruction header, then few-shot examples from the dev set,
    # then the unanswered test question.
    prompt = "Answer the following multiple choice questions. Choose the best answer from A, B, C, or D.\n\n"
    for i in range(len(dev_df)):
        prompt += self._format_example(dev_df, i) + "\n\n"
    prompt += self._format_example(test_df, test_idx, include_answer=False)
    return prompt

def _format_example(self, df, idx, include_answer=True):
    # Column 0 is the question, columns 1-4 are the choices, column 5 is the answer.
    prompt = df.iloc[idx, 0]
    for j, choice in enumerate(self.choices):
        prompt += f"\n{choice}. {df.iloc[idx, j+1]}"
    prompt += "\nAnswer:"
    if include_answer:
        prompt += f" {df.iloc[idx, 5]}"
    return prompt
It is very difficult to visualize the prompt from these functions. It is clear that there are various parts (instructions, examples, test question, etc.) and that they are combined with newlines in some way, but the specifics are hard to grok from the code.
So I refactored the framework so that the prompt is defined in the same config file that specifies the model:
basic_config.json:
{
  "benchmark_name": "mmlu",
  "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "nshot": 0,
  "answer_choices": ["A", "B", "C", "D"],
  "use_chat_template": false,
  "system_prompt": "None",
  "user_prompt_template": {
    "template": "{instructions}\n\n{question}",
    "instructions": "Answer the following multiple choice questions. Choose the best answer from {label_a}, {label_b}, {label_c}, or {label_d}.",
    "question_template": "{question}\n{label_a}. {choice_a}\n{label_b}. {choice_b}\n{label_c}. {choice_c}\n{label_d}. {choice_d}\n\nAnswer: ",
    "question_separator": "\n\n"
  },
  "log_level": "INFO",
  "cap_subjects": false,
  "generation_type": "inference"
}
The basic_config.json results in a prompt with the format:

Answer the following multiple choice questions. Choose the best answer from A, B, C, or D.

{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}

Answer:
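Prompt construction from the config boils down to plain string formatting. The sketch below is a simplified rendering function of my own, not the framework’s actual code; the cfg argument mirrors the user_prompt_template block above:

def render_prompt(cfg, question, choices, labels=("A", "B", "C", "D"), subject=None):
    # Build the {label_*} and {choice_*} fields expected by the templates.
    label_fields = {f"label_{l.lower()}": l for l in labels}
    choice_fields = {f"choice_{l.lower()}": c for l, c in zip(labels, choices)}

    instructions = cfg["instructions"].format(subject=subject, **label_fields)
    question_block = cfg["question_template"].format(
        question=question, **label_fields, **choice_fields
    )
    return cfg["template"].format(instructions=instructions, question=question_block)

With the basic_config values, this produces the prompt format shown above; swapping in a different config swaps the prompt without touching the evaluation code.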
Using this setup I was able to get a micro average of 66.7 [6], which is just a hair better than my score of 66.6 from the original post.
With my baseline set, I went about trying to reproduce Meta’s reported 73.0 [7].
Down the Rabbit Hole of Evaluation Methods
The Many Faces of Evaluation
There are a lot of ways that this evaluation can be performed. A few possibilities include:
1) Let the model complete its text response and extract the answer (a.k.a. open-ended generation)
2) Look at the logits after a single round of inference and pick the most probable answer
3) Constrain the logit choices to valid answers and pick the most probable (a.k.a. constrained decoding)
Each method has its pros and cons, involving trade-offs between computation cost, potential for the model to go off the rails, and the hassle of extracting answers from free-form text.
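To illustrate option 3, here is a minimal sketch of constrained decoding over the answer letters with Hugging Face transformers. It is a simplified illustration, not the framework’s exact implementation, and it assumes each answer letter maps to a single token:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def constrained_answer(prompt, choices=("A", "B", "C", "D")):
    # Single forward pass: score the next token, generate no text.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Compare only the logits of the allowed answer letters.
    # Assumes the prompt ends with "Answer: " so the letters tokenize without a leading space.
    choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
    return choices[int(torch.argmax(next_token_logits[choice_ids]))]

Because the model never free-generates, this is cheap and can never go off the rails, but it also hides whether the model would actually have produced that letter on its own.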
In Meta’s supplemental Llama 3 Evaluation Details they say that “the maximum generation lengths for the 5-shot and 0-shot configs are 10 tokens and 1024 tokens respectively” [8]. This leads me to believe that they used open-ended generation, which contrasts with the constrained decoding I had been doing.
The Quest for the Perfect Prompt
I experimented with N-shot prompting, Chain-of-Thought (CoT), and chat templates, as well as with open-ended generation, looking at logits, and constrained decoding. For example, I tried open-ended generation with the 0-shot chat template used in Sprague et al.’s To CoT or not to CoT (2024) [9]:
chat_template_config.json
{
  "benchmark_name": "mmlu",
  "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "nshot": 0,
  "answer_choices": ["A", "B", "C", "D"],
  "use_chat_template": true,
  "system_prompt": "You answer questions. At the end of the question you always give an answer and nothing else. You must pick an answer. You always give only one answer and that one answer is the one you think is best. You always give the answer in the form of the answer choice letter.",
  "user_prompt_template": {
    "template": "{instructions}\n{question}",
    "question_template": "{question}\n{label_a}. {choice_a}\n{label_b}. {choice_b}\n{label_c}. {choice_c}\n{label_d}. {choice_d}\n\nAnswer: ",
    "question_separator": "\n\n",
    "instructions": "Give your answer in the format \"The answer is therefore <{label_a}, {label_b}, {label_c}, {label_d}>\". Failure to comply with the answer formatting will result in no credit."
  }
}
To create a prompt that looked like:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You answer questions. At the end of the question you always give an answer and nothing else. You must pick an answer. You always give only one answer and that one answer is the one you think is best. You always give the answer in the form of the answer choice letter.<|eot_id|><|start_header_id|>user<|end_header_id|>
Give your answer in the format "The answer is therefore <A, B, C, D>". Failure to comply with the answer formatting will result in no credit.
{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}

Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>
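Rendering that prompt is handled by the tokenizer’s chat template. A minimal sketch (the message contents are abbreviated here; the real run used the full system prompt and question block from the config above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You answer questions. ..."},   # abbreviated
    {"role": "user", "content": "Give your answer in the format ...\n{question}\n..."},  # abbreviated
]

# add_generation_prompt=True appends the assistant header so the model starts answering.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)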
The result [10]? A measly 61.0. Back to the drawing board.
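Since open-ended generation returns free-form text, the answer has to be parsed back out. Here is a rough sketch of the generate-then-extract step; the regex and the max_new_tokens value are my own assumptions for illustration, not Meta’s or the framework’s exact settings:

import re
import torch

def generate_and_extract(model, tokenizer, prompt, max_new_tokens=1024):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for reproducibility
            pad_token_id=tokenizer.eos_token_id,
        )
    # Only decode the newly generated tokens, not the prompt.
    completion = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Look for the instructed format first, then fall back to the first bare letter.
    match = re.search(r"answer is therefore\s*\(?([ABCD])\)?", completion, re.IGNORECASE)
    if match is None:
        match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

The fallback matters: any time the model ignores the requested format, the extraction step, not the model’s knowledge, decides whether the question counts as correct.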
The original MMLU prompt
Turning back to Meta’s documentation, I decided to look at the language used in my prompt. Meta stated that it had used the original MMLU prompt [11], so I sought it out.
According to What’s going on with the Open LLM Leaderboard?, the standard MMLU prompt is [12]:
The following are multiple choice questions (with answers) about {subject}.
{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}
Answer:
To reproduce the prompt I created a new config:
original_mmlu_config.json
{
  "benchmark_name": "mmlu",
  "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "nshot": 0,
  "answer_choices": ["A", "B", "C", "D"],
  "use_chat_template": false,
  "system_prompt": "",
  "user_prompt_template": {
    "template": "{instructions}\n{question}",
    "instructions": "The following are multiple choice questions (with answers) about {subject}.",
    "question_template": "{question}\n{label_a}. {choice_a}\n{label_b}. {choice_b}\n{label_c}. {choice_c}\n{label_d}. {choice_d}\nAnswer: ",
    "question_separator": "\n\n"
  },
  "log_level": "INFO",
  "cap_subjects": false,
  "generation_type": "inference"
}
This simple approach yielded my best score yet: 68.3 [13].
Wrapping My Head Around It All
The Prompt Sensitivity Conundrum
Throughout this journey, I learned that these models are surprisingly sensitive to prompts [14]. Tiny changes can lead to big swings in performance, and what works on one model often doesn’t work on another. It’s definitely more art than science.
The Evaluation Quagmire
The more I dug into this, the messier it got. Reproducibility issues, stochastic weirdness, and an uncomfortable reliance on prompt engineering made me question the whole evaluation game. Are we really measuring model capability, or just our ability to craft the perfect prompt?
Don’t even get me started on training on the test set.
A Ray of Hope
It’s not all doom and gloom, though. Third-party evaluators like Hugging Face’s Open LLM Leaderboard are doing the community a huge service by providing free, consistent, reproducible benchmarks that allow for relative comparisons between models. And ultimately, evaluations are only a tool. The most important evaluations will be the custom ones that show you have the right model for your task.
Concluding Thoughts
This deep dive into MMLU evaluation left me with more questions than answers. But isn’t that how all good scientific inquiries go? We’ve got a long way to go in standardizing LLM evaluation, but finding a way to objectively assess these stochastic machines is incredibly important.
So, the next time you see a flashy headline about some model’s incredible benchmark performance, remember - there’s probably a lot of prompt engineering magic happening behind the curtain. And reproducing those results? Well, that’s a whole other can of worms.
Until next time.
Footnotes
1. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. arXiv. https://arxiv.org/abs/2009.03300
2. Meta. (2024, July 23). Introducing Llama 3.1: Our most capable models to date. Meta AI. https://ai.meta.com/blog/meta-llama-3-1/
3. NOTE:
   The macro average is the average of each subject’s score.
   The micro average is calculated by aggregating all subjects’ responses.
   Meta’s reported score of 73.0 was a macro average score. My previous post reported the micro average. In the rest of this post I will refer to the macro score unless otherwise noted.
4. Husain, H. (2024, February 14). Fuck You, Show Me The Prompt. hamel.dev. https://hamel.dev/blog/posts/prompt/
5. To reproduce:
   git clone https://github.com/chuckfinca/evaluate.git
   cd evaluate
   git checkout 24-09-02-creating-an-evaluation-framework
6. To reproduce:
   git clone https://github.com/chuckfinca/evaluate.git
   cd evaluate
   git checkout 3203ed4
   python evaluate/evaluate/main.py basic_config.json
7. Meta. (2024, July 23). Introducing Llama 3.1: Our most capable models to date. Meta AI. https://ai.meta.com/blog/meta-llama-3-1/
8. Meta Llama. (2024, July 23). Llama 3 Evaluation Details [Computer software]. GitHub. https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md
9. Sprague, Z., Yin, F., Rodriguez, J. D., Jiang, D., Wadhwa, M., Singhal, P., Zhao, X., Ye, X., Mahowald, K., & Durrett, G. (2024). To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. arXiv. https://doi.org/10.48550/arXiv.2409.12183
10. To reproduce:
    git clone https://github.com/chuckfinca/evaluate.git
    cd evaluate
    git checkout 3203ed4
    python evaluate/evaluate/main.py chat_template_config.json
11. Meta Llama. (2024, July 23). Llama 3 Evaluation Details [Computer software]. GitHub. https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md
12. Fourrier, C., Habib, N., Launay, J., & Wolf, T. (2023, June 23). What’s Going on with the Open LLM Leaderboard? Hugging Face Blog. https://huggingface.co/blog/open-llm-leaderboard-mmlu
13. To reproduce:
    git clone https://github.com/chuckfinca/evaluate.git
    cd evaluate
    git checkout 3203ed4
    python evaluate/evaluate/main.py original_mmlu_config.json
14. Schulhoff, S., Ilie, M., Balepur, N., Kahadze, K., Liu, A., Si, C., Li, Y., Gupta, A., Han, H., Schulhoff, S., Dulepet, P. S., Vidyadhara, S., Ki, D., Agrawal, S., Pham, C., Kroiz, G., Li, F., Tao, H., Srivastava, A., Da Costa, H., Gupta, S., Rogers, M. L., Goncearenco, I., Sarli, G., Galynker, I., Peskoff, D., Carpuat, M., White, J., Anadkat, S., Hoyle, A., & Resnik, P. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv. https://arxiv.org/abs/2406.06608v3