EleutherAI lm-evaluation-harness

An open-source framework for evaluating language models across dozens of academic benchmarks. It provides a unified CLI and YAML-configurable tasks for zero- and few-shot evaluation, supports multiple backends (Hugging Face Transformers, vLLM, SGLang, GPT‑NeoX/Megatron, and OpenAI‑style APIs), and includes features such as Jinja2 prompt templating, post‑processing/answer extraction, PEFT/LoRA adapter evaluation, and prototype multimodal tasks. It is widely used for standardized LLM benchmarking and serves as the evaluation backend for the Open LLM Leaderboard tasks.
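For a quick sense of the workflow, a minimal programmatic run might look like the sketch below. It uses the `lm_eval.simple_evaluate` entry point shown in the project README; the model and task names are illustrative placeholders, and exact arguments can vary across harness versions.

```python
# Minimal sketch, assuming lm-evaluation-harness v0.4+; model and task
# names below are illustrative, not prescriptive.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face Transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF model ID; per the README,
                                                     # a peft=... entry attaches a LoRA adapter
    tasks=["hellaswag"],
    num_fewshot=0,
)

# Per-task metrics are returned under results["results"]; exact metric
# keys (e.g. "acc,none") vary by task and harness version.
print(results["results"]["hellaswag"])
```

New tasks are defined declaratively in YAML. The hypothetical task below uses field names from the harness's task-authoring documentation; the dataset, prompt, and choices are made up for illustration.

```yaml
# Hypothetical task config; schema keys follow the harness docs,
# dataset/prompt details are invented for this example.
task: demo_sentiment
dataset_path: glue          # Hugging Face datasets path
dataset_name: sst2
output_type: multiple_choice
validation_split: validation
doc_to_text: "Sentence: {{sentence}}\nSentiment:"   # Jinja2 prompt template
doc_to_choice: ["negative", "positive"]
doc_to_target: label
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```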

Key Information

  • Category: Evaluation and Monitoring
  • Source: GitHub
  • Last updated: January 09, 2026

Structured Metrics

No structured metrics captured yet.

Links

Canonical source: https://github.com/EleutherAI/lm-evaluation-harness