EleutherAI lm-evaluation-harness
An open-source framework for evaluating language models across dozens of academic benchmarks. It provides a unified CLI and YAML-configurable tasks for zero- and few-shot evaluation, supports multiple backends (Hugging Face Transformers, vLLM, SGLang, GPT‑NeoX/Megatron, and OpenAI‑style APIs), and includes features such as Jinja2 prompt templating, post‑processing and answer extraction, PEFT/LoRA adapter evaluation, and prototype multimodal tasks. It is widely used for standardized LLM benchmarking, including the Open LLM Leaderboard tasks.
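For context, a minimal sketch of invoking an evaluation programmatically, assuming the `lm_eval` package is installed and exposes `simple_evaluate` as described in the repository's README; the model id and task name here are illustrative placeholders, so check the repo for the current API surface.

```python
# Hedged sketch: assumes `pip install lm-eval` and that simple_evaluate
# accepts these arguments in the installed version (see the repo README).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face Transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder HF model id
    tasks=["hellaswag"],                             # one of the bundled benchmark tasks
    num_fewshot=0,                                   # zero-shot evaluation
)

# Per-task metrics are keyed by task name in the returned dict.
print(results["results"]["hellaswag"])
```

The same evaluation can typically be run from the unified CLI with equivalent flags; the Python entry point is convenient when embedding benchmark runs in a larger training or CI pipeline.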
Key Information
- Category: Evaluation and Monitoring
- Source: GitHub
- Last updated: January 09, 2026
Structured Metrics
No structured metrics captured yet.
Links
Canonical source: https://github.com/EleutherAI/lm-evaluation-harness