Creating an Evaluation Framework

Categories: evaluation, prompting, mmlu
Building out a framework that takes a model and a benchmark as input and returns the score as output.
Author: Charles Feinn

Published: September 2, 2024

Here are links to the notebook and evaluation framework for this post. You can run the project in Google Colab using an L4 runtime.

Preface

I recently participated in the original cohort of Dan Becker and Hamel Husain’s Mastering LLMs: A Conference For Developers & Data Scientists. I eagerly sopped up every video, hungry for lessons from the experts at the cutting edge of the field.

After the course I felt I had gained a huge amount of knowledge, but I hadn’t gotten my hands dirty yet. This blog is my attempt to change that.

Evaluations, which almost every speaker emphasized as one of the most important pieces of the LLM pipeline puzzle (if not the most important), seemed like a good place to start.

For my first project, I decided to create an evaluation framework to run models on benchmarks. To simplify things I focused on Meta’s Llama 3.1 8B Instruct model and the Massive Multitask Language Understanding (MMLU) benchmark [1].

Gear

  • Intel-based MacBook Pro (2019)
  • VSCode
  • Google Colab
  • Terminal (Homebrew theme)
  • Jupyter Notebooks

I also used a Claude 3.5 Sonnet Project in its web interface to assist with my thinking and code.

Setup

To download our model we will use the Hugging Face Hub. This means we will need to create a Hugging Face token and make it available to the project. Instructions for how to do that can be found here.

Once we have our HF_TOKEN saved to our Google Colab secrets we can install and set up the Hugging Face Hub library.

!pip install huggingface_hub
import os

# Import Colab Secrets userdata module
from google.colab import userdata

os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

Why MMLU?

MMLU, developed by Hendrycks et al. in 2020, is a comprehensive benchmark that tests LLMs on a wide range of subjects in order to determine the breadth of their general knowledge. MMLU scores are typically presented, alongside other popular benchmarks, when a new model is being touted.

More recently MMLU’s quality has been called into question: many practitioners believe its test set has ended up in the training data of many models, contaminating them and undermining the benchmark’s overall usefulness.

Given that, why did I choose to use it, you ask?

I reasoned that though MMLU might not be the best evaluation for an LLM, it was one of the original popular and influential benchmarks, and would likely serve as a good exemplar for creating a benchmark-agnostic framework.

The evaluation framework

The framework is located at https://github.com/chuckfinca/evaluate.

The goal here was to create a framework that would take a model and a benchmark as input and return the model’s score as output.

My thinking was that if I could make it easy to run any model on any benchmark then I could use that framework in a pipeline that ran models through sets of existing and custom benchmarks alike.

The framework uses an orchestrator pattern in which a benchmark-specific orchestrator class (e.g., MMLUBenchmarkOrchestrator) facilitates the evaluation of a model on a benchmark.
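To make that concrete, here is a rough, hypothetical sketch of the shape such an orchestrator takes. The class in the repo has more moving parts, and the helper methods shown here are placeholders:

class BenchmarkOrchestrator:
    # Hypothetical interface; _subjects, _load_subject, and _eval_subject
    # stand in for the real implementation's helpers.
    def __init__(self, model, tokenizer, n_shot=0):
        self.model = model
        self.tokenizer = tokenizer
        self.n_shot = n_shot

    def evaluate(self):
        # Evaluate every subject and report the unweighted mean accuracy
        accuracies = []
        for subject in self._subjects():
            dev_df, test_df = self._load_subject(subject)
            _, acc, _ = self._eval_subject(subject, dev_df, test_df)
            accuracies.append(acc)
        return sum(accuracies) / len(accuracies)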

The heart of the MMLUBenchmarkOrchestrator is its subject evaluation function:

import torch
import numpy as np

def _eval_subject(self, subject, dev_df, test_df):
    cors = []
    preds = []
    probs = []

    for i in range(len(test_df)):
        # Build the prompt for question i (plus any few-shot examples from dev_df)
        prompt = self._format_prompt(dev_df, test_df, i)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Logits for the token that would follow the prompt
        logits = outputs.logits[0, -1]
        probs_i = torch.nn.functional.softmax(logits, dim=-1)

        # Probability assigned to the first token of each answer letter (A-D)
        choice_probs = [probs_i[self.tokenizer.encode(choice, add_special_tokens=False)[0]].item() for choice in self.choices]
        pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(choice_probs)]

        probs.append(choice_probs)
        preds.append(pred)
        # Column 5 of the MMLU test data holds the correct answer letter
        cors.append(pred == test_df.iloc[i, 5])

    acc = np.mean(cors)
    print(f"{subject} Accuracy: {acc:.3f}")

    return cors, acc, probs

The function takes dev and test data for a given MMLU subject and evaluates the model using the following steps:

  1. Creates model prompts from evaluation questions and few-shot examples (a sketch of this step follows the list)
  2. Tokenizes the prompts and moves them to the model’s device
  3. Runs inference on the prompts (i.e., asks the model the questions)
  4. Extracts probabilities from model outputs
  5. Predicts answers based on highest probability
  6. Calculates accuracy for the subject
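For step 1, here is a minimal sketch of how an MMLU prompt is conventionally assembled from a question, its four answer choices, and optional few-shot examples taken from the dev split. The framework’s actual _format_prompt may differ in its details:

def format_mmlu_prompt(subject, dev_df, test_df, idx, n_shot=0):
    # A sketch assuming the conventional MMLU prompt layout
    choices = ["A", "B", "C", "D"]

    def format_example(df, i, include_answer=True):
        # MMLU CSVs have no header: question, A, B, C, D, answer
        text = df.iloc[i, 0]
        for j, letter in enumerate(choices):
            text += f"\n{letter}. {df.iloc[i, j + 1]}"
        text += "\nAnswer:"
        if include_answer:
            text += f" {df.iloc[i, 5]}\n\n"
        return text

    prompt = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    for i in range(n_shot):  # few-shot examples come from the dev split
        prompt += format_example(dev_df, i)
    prompt += format_example(test_df, idx, include_answer=False)
    return prompt

With --nshot 0 (as used later in this post) the few-shot loop never runs, so the model sees only the question itself.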

To see all this in action let’s clone the project (at the appropriate commit tag 24-09-02-creating-an-evaluation-framework):

!git clone --branch 24-09-02-creating-an-evaluation-framework https://github.com/chuckfinca/evaluate 
Cloning into 'evaluate'...
remote: Enumerating objects: 261, done.
remote: Counting objects: 100% (261/261), done.
remote: Compressing objects: 100% (149/149), done.
remote: Total 261 (delta 150), reused 205 (delta 94), pack-reused 0 (from 0)
Receiving objects: 100% (261/261), 36.00 KiB | 18.00 MiB/s, done.
Resolving deltas: 100% (150/150), done.

NOTE: At some point I intend to turn my evaluation framework into a Python package in order to simplify some of this, but we’ll save that work for a later date.

The project has one external dependency, so let’s install that:

!pip install python-dotenv
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
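python-dotenv is typically used to load environment variables, such as HF_TOKEN, from a local .env file, which matters mostly when running outside Colab. Exactly how the framework uses it isn’t important here, but the usual pattern is a one-liner:

import os
from dotenv import load_dotenv

# Illustrative; in Colab we already set HF_TOKEN from the notebook secrets above
load_dotenv()  # reads key=value pairs from a .env file into os.environ
hf_token = os.environ.get("HF_TOKEN")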

Evaluating Llama

Now that we’ve got our evaluation framework ready to go we might as well spice things up! Let’s see if we can reproduce Meta’s stated MMLU score of 73.0 for Llama 3.1 8B!

[Image: Meta’s reported benchmark results for Llama 3.1, including the 73.0 MMLU score]

I haven’t yet been able to find a source in Meta’s documentation that states that they used the Instruct version of their models for benchmarking, but I believe this is common practice and so I chose to use the Meta-Llama-3.1-8B-Instruct model in my experiment.

Running our evaluation

To run our script we just need to supply a few things: the benchmark, the model name, and the number of few-shot examples we want to use.

In their blog post, Meta states that they used 0-shot (CoT) to generate their 73.0 score on MMLU. To keep things uniform we will use 0-shot learning. CoT (chain-of-thought) prompting is beyond the scope of this post, so we’ll leave that be for now.

A note about hardware:

I learned in the Mastering LLMs course that a model generally requires roughly 2 to 3 GB of RAM per billion parameters. Our model has 8 billion parameters, so I expected to need between 16 and 24 GB of GPU RAM to run the evaluation.

In practice I used about 18 GB. This meant I was able to run the evaluation on both the Google Colab A100 and L4 GPU runtimes, each of which has 20+ GB of GPU RAM.
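That squares with a quick back-of-the-envelope check: at float16, each parameter takes 2 bytes, so the 8B weights alone account for roughly 16 GB, with the rest going to activations and overhead.

# Weights-only memory estimate, ignoring activations, KV cache, and overhead
params = 8e9          # Llama 3.1 8B
bytes_per_param = 2   # torch.float16
print(f"~{params * bytes_per_param / 1e9:.0f} GB for the weights alone")  # ~16 GB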

The framework is also set up to use the CPU if CUDA is not available. This worked on my local machine, but was too slow to be of any practical use.
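The usual idiom for that kind of fallback looks something like this; I’m showing the generic pattern rather than the framework’s exact logic:

import torch

# Generic device/dtype selection (the framework's exact logic may differ)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32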

Enough of that, let’s see the results!

!python evaluate/evaluate/main.py --benchmark mmlu --model meta-llama/Meta-Llama-3.1-8B-Instruct --nshot 0
Benchmark 'mmlu' has been set up successfully.
Running evaluation 'mmlu' with:
Model: meta-llama/Meta-Llama-3.1-8B-Instruct
Number of training examples: 0
Device: cuda
Using dtype: torch.float16
Loading model from /content/evaluate/evaluate/models/saved/meta-llama/Meta-Llama-3.1-8B-Instruct
Loading checkpoint shards: 100% 4/4 [00:00<00:00, 12.08it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
abstract_algebra Accuracy: 0.340
anatomy Accuracy: 0.607
astronomy Accuracy: 0.757
business_ethics Accuracy: 0.640
clinical_knowledge Accuracy: 0.792
college_biology Accuracy: 0.812
college_chemistry Accuracy: 0.500
college_computer_science Accuracy: 0.480
college_mathematics Accuracy: 0.410
college_medicine Accuracy: 0.699
college_physics Accuracy: 0.412
computer_security Accuracy: 0.740
conceptual_physics Accuracy: 0.604
econometrics Accuracy: 0.491
electrical_engineering Accuracy: 0.628
elementary_mathematics Accuracy: 0.460
formal_logic Accuracy: 0.540
global_facts Accuracy: 0.440
high_school_biology Accuracy: 0.813
high_school_chemistry Accuracy: 0.606
high_school_computer_science Accuracy: 0.650
high_school_european_history Accuracy: 0.739
high_school_geography Accuracy: 0.813
high_school_government_and_politics Accuracy: 0.850
high_school_macroeconomics Accuracy: 0.682
high_school_mathematics Accuracy: 0.344
high_school_microeconomics Accuracy: 0.744
high_school_physics Accuracy: 0.437
high_school_psychology Accuracy: 0.881
high_school_statistics Accuracy: 0.602
high_school_us_history Accuracy: 0.838
high_school_world_history Accuracy: 0.852
human_aging Accuracy: 0.695
human_sexuality Accuracy: 0.779
international_law Accuracy: 0.744
jurisprudence Accuracy: 0.778
logical_fallacies Accuracy: 0.761
machine_learning Accuracy: 0.509
management Accuracy: 0.835
marketing Accuracy: 0.880
medical_genetics Accuracy: 0.740
miscellaneous Accuracy: 0.834
moral_disputes Accuracy: 0.697
moral_scenarios Accuracy: 0.579
nutrition Accuracy: 0.771
philosophy Accuracy: 0.723
prehistory Accuracy: 0.713
professional_accounting Accuracy: 0.507
professional_law Accuracy: 0.462
professional_medicine Accuracy: 0.820
professional_psychology Accuracy: 0.676
public_relations Accuracy: 0.636
security_studies Accuracy: 0.747
sociology Accuracy: 0.841
us_foreign_policy Accuracy: 0.880
virology Accuracy: 0.536
world_religions Accuracy: 0.813
Score saved to: /content/evaluate/evaluate/benchmarks/benchmarks/mmlu/results/meta-llama/Meta-Llama-3.1-8B-Instruct/mmlu_score.txt
Average accuracy: 66.600

Et voilà! We have our score.

Not what Meta reported, but not that far off either. I’ll leave the discrepancy for future posts :)

Here are links to the notebook and evaluation framework for this post. Now go play!

Thanks for following along. If you’ve got a question or comment please feel free to email me at chuckfinca at gmail dot com.

Footnotes

  1. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. arXiv. https://arxiv.org/abs/2009.03300