Challenges in Reproducing LLM Evaluation Results

A Case Study with MMLU and Llama 3

Categories: evaluation, prompting, mmlu
Author: Charles Feinn
Published: October 3, 2024

Introduction

In my previous post, Creating an Evaluation Framework, I created a simple evaluation framework for LLMs. I attempted to reproduce Meta’s Llama 3.1 8B Massive Multitask Language Understanding (MMLU)[1] macro score of 73.0[2]. TLDR: I couldn’t quite get there, but I did journey pretty deep into the messy world of LLM evaluation.

The Reproduction Challenge

I initially achieved a micro score of 66.6 and a macro score of 67.0[3]. That’s a 6.0-point gap from Meta’s published score. That post focused on building out the evaluation framework, so I didn’t spend time investigating, but the gap left me wondering.

6.0 points in MMLU is a lot. It represents almost half the difference between Meta’s 8B and 70B models (73.0 and 86.0 respectively).

When thinking about what could have caused this difference, I remembered something Hamel Husain said in Mastering LLMs: A Conference For Developers & Data Scientists. The gist was that you should never trust what is going into your model. You need to look at it, understand any template that is being applied, and verify that it is what you want. He even wrote a rather provocative blog post on the subject[4].

Looking under the hood

Truth be told, I had Claude handle this part of my previous post. It had seemed to work, and it wasn’t the focus of the post, so I hadn’t delved deeper. Now was the time to change that.

So, first things first, let’s look under the hood[5]:

def _format_prompt(self, dev_df, test_df, test_idx):
    # Instructions, then any few-shot examples from dev_df, then the test question.
    prompt = "Answer the following multiple choice questions. Choose the best answer from A, B, C, or D.\n\n"
    for i in range(len(dev_df)):
        prompt += self._format_example(dev_df, i) + "\n\n"
    prompt += self._format_example(test_df, test_idx, include_answer=False)
    return prompt

def _format_example(self, df, idx, include_answer=True):
    # Column 0 is the question, columns 1-4 are the choices, column 5 is the answer letter.
    prompt = df.iloc[idx, 0]
    for j, choice in enumerate(self.choices):
        prompt += f"\n{choice}. {df.iloc[idx, j+1]}"
    prompt += "\nAnswer:"
    if include_answer:
        prompt += f" {df.iloc[idx, 5]}"
    return prompt

It is very difficult to visualize the prompt from these functions. It is clear that there are various parts (instructions, examples, the test question, etc.) and that they are combined with newlines in some way, but the specifics are hard to grok from the code.

So I refactored the framework so that the prompt is defined in a config file, which the framework then uses to build the model’s input:

basic_config.json:

{
    "benchmark_name": "mmlu",
    "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "nshot": 0,
    "answer_choices": ["A", "B", "C", "D"],
    "use_chat_template": false,
    "system_prompt": "None",
    "user_prompt_template": {
        "template": "{instructions}\n\n{question}",
        "instructions": "Answer the following multiple choice questions. Choose the best answer from {label_a}, {label_b}, {label_c}, or {label_d}.",
        "question_template": "{question}\n{label_a}. {choice_a}\n{label_b}. {choice_b}\n{label_c}. {choice_c}\n{label_d}. {choice_d}\n\nAnswer: ",
        "question_separator": "\n\n"
    },
    "log_level": "INFO",
    "cap_subjects" : false,
    "generation_type": "inference"
}

basic_config.json results in a prompt with the format:

Answer the following multiple choice questions. Choose the best answer from A, B, C, or D.

{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}

Answer:
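
To make the templating concrete, here is a minimal sketch of how I think about these fields being rendered. The example question is made up, and the repo’s actual rendering code may differ slightly:

# Render basic_config.json's user_prompt_template by hand (illustrative only).
labels = {"label_a": "A", "label_b": "B", "label_c": "C", "label_d": "D"}

instructions = (
    "Answer the following multiple choice questions. "
    "Choose the best answer from {label_a}, {label_b}, {label_c}, or {label_d}."
).format(**labels)

question_block = (
    "{question}\n"
    "{label_a}. {choice_a}\n{label_b}. {choice_b}\n"
    "{label_c}. {choice_c}\n{label_d}. {choice_d}\n\nAnswer: "
).format(
    question="What is the capital of France?",  # made-up example question
    choice_a="Berlin", choice_b="Paris", choice_c="Rome", choice_d="Madrid",
    **labels,
)

# "template": "{instructions}\n\n{question}"
prompt = "{instructions}\n\n{question}".format(instructions=instructions, question=question_block)
print(prompt)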

Using this setup I was able to get a micro average of 66.7[6], which is just a hair better than my score of 66.6 from the original post.

With my baseline set, I went about trying to reproduce Meta’s reported 73.0[7].

Down the Rabbit Hole of Evaluation Methods

The Many Faces of Evaluation

There are a lot of ways that this evaluation can be performed. A few possibilities include:

1) Let the model complete its text response and extract the answer (a.k.a. open-ended generation)
2) Look at the logits after a single round of inference and pick the most probable answer
3) Constrain the logit choices to valid answers and pick the most probable (a.k.a. constrained decoding)

Each method has its pros and cons, involving trade-offs between computation cost, potential for the model to go off the rails, and the hassle of extracting answers from free-form text.
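
Method 3 is what my framework had been doing. A minimal sketch of the idea, assuming a Hugging Face causal LM (this is illustrative, not the exact code from my framework):

# Constrained decoding sketch: one forward pass, then compare the logits of the
# answer-letter tokens only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def pick_answer(prompt, choices=("A", "B", "C", "D")):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Token ids for " A", " B", ...; the leading space matters for Llama tokenizers.
    candidate_ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[-1] for c in choices]
    return choices[int(torch.argmax(next_token_logits[candidate_ids]))]

It is cheap (one forward pass per question) and can never produce an unparseable answer, which is exactly its appeal.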

In Meta’s supplemental Llama 3 Evaluation Details they say that “the maximum generation lengths for the 5-shot and 0-shot configs are 10 tokens and 1024 tokens respectively”[8]. This leads me to believe that they used open-ended generation, which contrasts with the constrained decoding I had been doing.
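
Open-ended generation, by contrast, lets the model write whatever it wants and then parses the answer out of the text. A minimal sketch, reusing the model and tokenizer loaded in the sketch above (the regex is my own illustration, not Meta’s extraction logic):

import re

def generate_and_extract(prompt, max_new_tokens=1024):
    # Reuses `model` and `tokenizer` from the constrained-decoding sketch above.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Grab the first standalone answer letter, e.g. "The answer is therefore B."
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None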

The Quest for the Perfect Prompt

I experimented with N-shot prompting, Chain-of-Thought (CoT), and chat templates, as well as with open-ended generation, reading logits, and constrained decoding. For example, I tried open-ended generation with the 0-shot chat template used in Sprague et al.’s To CoT or not to CoT (2024)[9]:

chat_template_config.json

{
    "benchmark_name": "mmlu",
    "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "nshot": 0,
    "answer_choices": ["A", "B", "C", "D"],
    "use_chat_template": true,
    "system_prompt": "You answer questions. At the end of the question you always give an answer and nothing else. You must pick an answer. You always give only one answer and that one answer is the one you think is best. You always give the answer in the form of the answer choice letter.",
    "user_prompt_template": {
        "template": "{instructions}\n{question}",
        "question_template": "{question}\n{label_a}. {choice_a}\n{label_b}. {choice_b}\n{label_c}. {choice_c}\n{label_d}. {choice_d}\n\nAnswer: ",
        "question_separator": "\n\n",
        "instructions": "Give your answer in the format \"The answer is therefore <{label_a}, {label_b}, {label_c}, {label_d}>\". Failure to comply with the answer formatting will result in no credit."
    }
}

This config produces a prompt that looks like:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You answer questions. At the end of the question you always give an answer and nothing else. You must pick an answer. You always give only one answer and that one answer is the one you think is best. You always give the answer in the form of the answer choice letter.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give your answer in the format "The answer is therefore <A, B, C, D>". Failure to comply with the answer formatting will result in no credit.

{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}

Answer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
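
A prompt in this shape can be produced with Hugging Face’s apply_chat_template; a minimal sketch (my framework may build it differently, and the message contents are abbreviated from chat_template_config.json):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# The contents would come from chat_template_config.json (abbreviated here).
messages = [
    {"role": "system", "content": "You answer questions. ..."},
    {"role": "user", "content": "Give your answer in the format ...\n\n{question}\n..."},
]

# tokenize=False returns the formatted string; add_generation_prompt appends the
# assistant header so the model continues with its answer.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)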

The result[10]? A measly 61.0. Back to the drawing board.

The original MMLU prompt

Turning back to Meta’s documentation, I decided to look at the language used in my prompt. Meta stated that it had used the original MMLU prompt[11], so I sought it out.

According to What’s going on with the Open LLM Leaderboard?, the standard MMLU prompt is[12]:

The following are multiple choice questions (with answers) about {subject}.
{question}
A. {choice_a}
B. {choice_b}
C. {choice_c}
D. {choice_d}
Answer:

To reproduce the prompt I created a new config:

original_mmlu_config.json

{
    "benchmark_name": "mmlu",
    "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "nshot": 0,
    "answer_choices": ["A", "B", "C", "D"],
    "use_chat_template": false,
    "system_prompt": "",
    "user_prompt_template": {
        "template": "{instructions}\n{question}",
        "instructions": "The following are multiple choice questions (with answers) about {subject}.",
        "question_template": "{question}\n{label_a}. {choice_a}\n{label_b}. {choice_b}\n{label_c}. {choice_c}\n{label_d}. {choice_d}\nAnswer: ",
        "question_separator": "\n\n"
    },
    "log_level": "INFO",
    "cap_subjects" : false,
    "generation_type": "inference"
}
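
One detail worth noting is the {subject} placeholder: MMLU subject names come with underscores (e.g. high_school_biology), so they have to be turned into plain words before being dropped into the instructions. A small sketch of the kind of conversion involved (the helper below is my own illustration; the repo may handle this differently):

def format_subject(subject):
    # e.g. "high_school_biology" -> "high school biology"
    return subject.replace("_", " ")

instructions = (
    "The following are multiple choice questions (with answers) about {subject}."
).format(subject=format_subject("high_school_biology"))
print(instructions)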

This simple approach yielded my best score yet: 68.3[13].

Wrapping My Head Around It All

The Prompt Sensitivity Conundrum

Throughout this journey, I learned that these models are surprisingly sensitive to prompts[14]. Tiny changes can lead to big swings in performance, and what works on one model often doesn’t work on another. It’s definitely more art than science.

The Evaluation Quagmire

The more I dug into this, the messier it got. Reproducibility issues, stochastic weirdness, and an uncomfortable reliance on prompt engineering made me question the whole evaluation game. Are we really measuring model capability, or just our ability to craft the perfect prompt?

Don’t even get me started on training on the test set.

A Ray of Hope

It’s not all doom and gloom, though. Third-party evaluators like Hugging Face’s Open LLM Leaderboard are doing the community a huge service by providing free, consistent, reproducible benchmarks that allow for relative comparisons between models. And ultimately, evaluations are only a tool. The most important evaluations will be the custom ones that show you have the right model for your task.

Concluding Thoughts

This deep dive into MMLU evaluation left me with more questions than answers. But isn’t that how all good scientific inquiries go? We’ve got a long way to go in standardizing LLM evaluation, but finding a way to objectively assess these stochastic machines is incredibly important.

So, the next time you see a flashy headline about some model’s incredible benchmark performance, remember - there’s probably a lot of prompt engineering magic happening behind the curtain. And reproducing those results? Well, that’s a whole other can of worms.

Until next time.

Footnotes

  1. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. arXiv. https://arxiv.org/abs/2009.03300

  2. Meta. (2024, July 23). Introducing Llama 3.1: Our most capable models to date. Meta AI. https://ai.meta.com/blog/meta-llama-3-1/

  3. NOTE:

    The macro average is the average of each subject’s score.

    The micro average is calculated by aggregating all subjects’ responses and scoring them as a single pool. For example, with two subjects scoring 90/135 and 40/102, the macro average is (0.667 + 0.392) / 2 ≈ 0.53, while the micro average is 130/237 ≈ 0.55.

    Meta’s reported score of 73.0 was a macro average. My previous post reported the micro average.

    In the rest of this post I will refer to the macro score unless otherwise noted.

  4. Husain, H. (2024, February 14). Fuck You, Show Me The Prompt. hamel.dev. https://hamel.dev/blog/posts/prompt/

  5. To reproduce:

    git clone https://github.com/chuckfinca/evaluate.git
    cd evaluate  
    git checkout 24-09-02-creating-an-evaluation-framework
  6. To reproduce:

    git clone https://github.com/chuckfinca/evaluate.git
    cd evaluate  
    git checkout 3203ed4
    python evaluate/evaluate/main.py basic_config.json
  7. Meta. (2024, July 23). Introducing Llama 3.1: Our most capable models to date. Meta AI. https://ai.meta.com/blog/meta-llama-3-1/

  8. Meta Llama. (2024, July 23). Llama 3 Evaluation Details [Computer software]. GitHub. https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md

  9. Sprague, Z., Yin, F., Rodriguez, J. D., Jiang, D., Wadhwa, M., Singhal, P., Zhao, X., Ye, X., Mahowald, K., & Durrett, G. (2024). To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. arXiv. https://doi.org/10.48550/arXiv.2409.12183

  10. To reproduce:

    git clone https://github.com/chuckfinca/evaluate.git
    cd evaluate  
    git checkout 3203ed4
    python evaluate/evaluate/main.py chat_template_config.json
  11. Meta Llama. (2024, July 23). Llama 3 Evaluation Details [Computer software]. GitHub. https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md

  12. Fourrier, C., Habib, N., Launay, J., & Wolf, T. (2023, June 23). What’s Going on with the Open LLM Leaderboard. Hugging Face Blog. https://huggingface.co/blog/open-llm-leaderboard-mmlu

  13. To reproduce:

    git clone https://github.com/chuckfinca/evaluate.git
    cd evaluate  
    git checkout 3203ed4
    python evaluate/evaluate/main.py original_mmlu_config.json
  14. Schulhoff, S., Ilie, M., Balepur, N., Kahadze, K., Liu, A., Si, C., Li, Y., Gupta, A., Han, H., Schulhoff, S., Dulepet, P. S., Vidyadhara, S., Ki, D., Agrawal, S., Pham, C., Kroiz, G., Li, F., Tao, H., Srivastava, A., Da Costa, H., Gupta, S., Rogers, M. L., Goncearenco, I., Sarli, G., Galynker, I., Peskoff, D., Carpuat, M., White, J., Anadkat, S., Hoyle, A., & Resnik, P. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv. https://arxiv.org/abs/2406.06608v3