EleutherAI lm-evaluation-harness - AI Evaluation Tool

Overview

EleutherAI lm-evaluation-harness is an open-source evaluation framework that standardizes benchmarking of language models across a broad set of academic tasks. The harness provides a unified CLI and YAML-configurable task definitions, enabling reproducible few-shot and zero-shot evaluations with a consistent prompt and answer-extraction pipeline. It is widely used for standardized LLM benchmarking and, per the project repository, serves as the evaluation backend for community leaderboards such as the Open LLM Leaderboard.

The project supports multiple execution backends (including Hugging Face Transformers, vLLM, GPT-NeoX/Megatron-style runtimes, and OpenAI-style APIs) and provides Jinja2-based prompt templates, post-processing and answer-extraction utilities, and evaluation of parameter-efficient fine-tuning (PEFT) adapters such as LoRA. The harness ships with implementations of many standard benchmarks (for example MMLU, SuperGLUE, HellaSwag, TruthfulQA, ARC, and Winogrande) and is structured so contributors can add new tasks, formats, and backends via YAML and Python task files. According to the GitHub repository, the project emphasizes reproducibility, configurability, and breadth of supported academic benchmarks.
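
For orientation, a minimal CLI run against a Hugging Face model looks like the sketch below. The model and task are placeholders chosen for illustration; the flags shown (--model, --model_args, --tasks, --num_fewshot, --batch_size) follow the project's documented interface, but consult the repository for the options in your installed version.

# Evaluate a small HF model on HellaSwag with 5 in-context examples
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --num_fewshot 5 \
    --batch_size 8

Setting --num_fewshot 0 runs the same task zero-shot through the same prompt and answer-extraction pipeline.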

Installation

Install from source with pip in editable mode:

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
export OPENAI_API_KEY="<your_api_key>"  # when using OpenAI-style backends
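
To sanity-check the install, the CLI can enumerate the tasks registered in your environment (output varies by version):

# List all task names known to the installed harness
lm_eval --tasks list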

Key Features

  • Unified CLI and YAML task configs for reproducible, scriptable evaluations.
  • Jinja2-based prompt templates to define few-shot and zero-shot prompts programmatically.
  • Built-in postprocessing and answer-extraction utilities for diverse task formats.
  • Multi-backend support: Hugging Face Transformers, vLLM, GPT-NeoX/Megatron, OpenAI-style APIs (see the backend sketch after this list).
  • Out-of-the-box benchmarks: MMLU, SuperGLUE, HellaSwag, TruthfulQA, ARC, Winogrande.
  • PEFT/LoRA evaluation support for adapter-based, parameter-efficient fine-tuning experiments (see the sketch after this list).
  • Prototype multimodal tasks and extensible task definitions for custom datasets (a task YAML sketch follows this list).
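
As a sketch of the multi-backend and PEFT support listed above, the same task can be pointed at a vLLM runtime or at a LoRA adapter by switching --model and --model_args. The model name, adapter path, and tuning values below are illustrative placeholders; the vllm backend and the peft= argument appear in the project's documented examples.

# vLLM backend (assumes the vllm package is installed); values are examples only
lm_eval --model vllm \
    --model_args pretrained=EleutherAI/pythia-160m,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks hellaswag \
    --batch_size auto

# Hugging Face backend with a LoRA adapter via peft=; the adapter path is hypothetical
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,peft=path/to/lora_adapter \
    --tasks hellaswag \
    --batch_size 8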

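Custom tasks are defined as YAML files. The following is a minimal sketch of one, written out via a shell heredoc; the task name, dataset identifier, and column names are hypothetical, and the keys follow the task schema described in the project's task guide (check the documentation for your version's exact schema).

cat > my_task.yaml <<'EOF'
# Hypothetical task definition; dataset and column names are placeholders
task: my_task
dataset_path: my_org/my_dataset     # a Hugging Face datasets identifier
output_type: multiple_choice
test_split: test
doc_to_text: "{{question}}"         # Jinja2 template rendering the prompt
doc_to_choice: "{{choices}}"        # template yielding the answer options
doc_to_target: "{{label}}"          # index or text of the correct answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
EOF

# Point the harness at the directory containing the YAML
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks my_task \
    --include_path .
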
Community

The project is community-driven and maintained under the EleutherAI organization on GitHub. Development activity commonly takes the form of new tasks, backends, bug fixes, and benchmarking additions. Users coordinate through the repository’s Issues and, where enabled, Discussions; EleutherAI’s broader community channels host wider conversation. The harness is frequently referenced in community leaderboards and research workflows, and contributors submit PRs to extend task coverage, add backend integrations, and improve reproducibility. According to the GitHub repository, ongoing maintenance focuses on expanding backend support, PEFT evaluation, and keeping benchmark implementations up to date.

Last Refreshed: 2026-01-09

Key Information

  • Category: Evaluation Tools
  • Type: AI Evaluation Tool