WizardLM - AI Language Models Tool

Overview

WizardLM is an open-source family of LLaMA-derived instruction-following models engineered for complex chat, multilingual reasoning, coding, and mathematics tasks. The project centers on an automated instruction-generation pipeline called Auto Evol-Instruct: an iterative, LLM-driven optimizer that analyzes evolution trajectories and automatically synthesizes higher-difficulty, more diverse instruction data for instruction tuning. The team reports that Auto Evol-Instruct, combined with an Arena Learning training loop, substantially expanded the dataset size and task diversity used to build WizardLM-2 and related variants. ([huggingface.co](https://huggingface.co/posts/WizardLM/574698793995338))

The ecosystem includes general instruction models (WizardLM), code-specialized variants (WizardCoder), and math-focused models (WizardMath) across multiple parameter scales (for example 7B, 13B, 33B/34B, 70B, and MoE 8x22B-style releases). Individual model cards and release notes on Hugging Face report competitive benchmark numbers (HumanEval pass@1 for code models; GSM8K and MATH for math models) and publish reproducible evaluation scripts and prompts for researchers.

Community-maintained quantized builds (GGUF/AWQ/GPTQ) and inference recipes for vLLM, transformers, and llama.cpp are widely available. For authoritative details, consult the WizardLM Hugging Face pages and the Auto Evol-Instruct announcement/paper. ([huggingface.co](https://huggingface.co/WizardLM/WizardMath-70B-V1.0))
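The core idea of Auto Evol-Instruct is a loop in which one LLM rewrites ("evolves") each instruction into a harder variant and another LLM judges whether the evolution improved it. The sketch below illustrates that loop with mock stand-ins for the LLM calls; the function names (`evolve`, `difficulty`, `auto_evol`) and the length-based difficulty proxy are illustrative assumptions, not part of the WizardLM codebase.

```python
# Minimal sketch of an Evol-Instruct-style loop with mocked LLM calls.

def evolve(instruction: str) -> str:
    """Mock evolver: a real system would prompt an LLM to add constraints,
    deepen required reasoning, or broaden the topic."""
    return instruction + " Explain your reasoning step by step."

def difficulty(instruction: str) -> int:
    """Mock analyzer: a real system would use an LLM judge; here word
    count serves as a crude difficulty proxy."""
    return len(instruction.split())

def auto_evol(seeds, rounds=2):
    """Evolve each seed for several rounds, keeping a candidate only if
    the analyzer rates it harder than its parent."""
    pool = list(seeds)
    for _ in range(rounds):
        pool = [
            cand if difficulty(cand := evolve(inst)) > difficulty(inst) else inst
            for inst in pool
        ]
    return pool

evolved = auto_evol(["Summarize this article.", "Write a sorting function."])
print(evolved[0])
```

In the real pipeline both roles are played by strong LLMs and the optimizer also analyzes whole evolution trajectories to refine the evolving prompt itself; the loop structure above is the part that carries over.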

Key Features

  • Auto Evol-Instruct: iterative LLM optimizer that evolves instruction datasets automatically.
  • Arena Learning: a simulated chatbot-arena post-training loop in which candidate models answer large pools of hard instructions and judge-model preferences select the data used for further training.
  • Multiple specialized lines: WizardLM (general), WizardCoder (code), WizardMath (math reasoning).
  • Published checkpoints across scales: community-available 7B, 13B, 33/34B, 70B, and MoE variants.
  • Open model cards with reproducible eval scripts and recommended prompt formats on Hugging Face.
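The Arena Learning feature above can be pictured as a battle-and-select loop: two models answer the same instruction, a judge picks the winner, and winning responses become training data. This is a hedged sketch with mocked models and a toy length-based judge; none of these functions come from the WizardLM release.

```python
# Illustrative Arena Learning-style data loop with mocked model and judge calls.

def judge(instruction, answer_a, answer_b):
    """Mock judge: prefer the longer, more detailed answer. A real system
    would use a strong LLM judge with a rubric."""
    return "a" if len(answer_a) >= len(answer_b) else "b"

def arena_round(instructions, model_a, model_b):
    """Run one round of battles; collect (instruction, winning answer)
    pairs for subsequent fine-tuning."""
    training_pairs = []
    for inst in instructions:
        ans_a, ans_b = model_a(inst), model_b(inst)
        winner = ans_a if judge(inst, ans_a, ans_b) == "a" else ans_b
        training_pairs.append((inst, winner))
    return training_pairs

short_model = lambda q: "Brief answer."
long_model = lambda q: "A detailed, step-by-step answer with justification."
pairs = arena_round(["What is 2 + 2?"], short_model, long_model)
print(pairs[0][1])
```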

Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Example: single-GPU inference (adjust model name per required checkpoint)
model_name = "WizardLM/WizardLM-13B-V1.2"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
model.eval()

# Prompt format: a short conversation-style instruction (model cards provide recommended prompts)
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers.\n"
    "USER: Summarize the main differences between Auto Evol-Instruct and manual instruction tuning.\n"
    "ASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # do_sample=True is required for temperature to take effect in transformers
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)

generated = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(generated)

# Notes: for large checkpoints consider vLLM, text-generation-inference, or quantized GGUF/AWQ builds for cost-efficient serving.
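The hand-built prompt string in the example follows the Vicuna-style format shown on the WizardLM model cards; for multi-turn use it helps to assemble it with a small helper. The helper below is illustrative (not part of any official API), but the system line and `USER:`/`ASSISTANT:` role tags match the format used above.

```python
# Assemble a Vicuna-style conversation prompt from (role, text) turns.

SYSTEM = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite answers.")

def build_prompt(turns):
    """turns: list of (role, text) with role in {"USER", "ASSISTANT"}.
    Returns a prompt ending with 'ASSISTANT:' so the model completes it."""
    lines = [SYSTEM]
    for role, text in turns:
        lines.append(f"{role}: {text}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)

prompt = build_prompt([("USER", "What is Evol-Instruct?")])
print(prompt)
```

Prompt formats differ between WizardLM versions (and between WizardCoder/WizardMath), so always check the target checkpoint's model card before reusing a template.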

Benchmarks

WizardMath-7B — GSM8K pass@1: 83.2% (Source: https://huggingface.co/WizardLM/WizardMath-7B-V1.1)

WizardCoder-33B — HumanEval pass@1: 79.9% (Source: https://huggingface.co/WizardLMTeam/WizardCoder-33B-V1.1)

WizardLM-70B — MT-Bench (GPT-4-judged aggregate): 7.78 (Source: https://huggingface.co/WizardLM/WizardLM-70B-V1.0)

WizardLM-13B — MT-Bench (GPT-4-judged aggregate): 7.06 (Source: https://huggingface.co/WizardLM/WizardLM-13B-V1.2)

WizardMath-70B — GSM8K pass@1: 81.6% (Source: https://huggingface.co/WizardLM/WizardMath-70B-V1.0)
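The pass@1 figures above use the standard unbiased pass@k estimator: given n generated samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal implementation (the function name is ours; the formula is the standard one):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.
    n = total samples generated, c = samples that pass, k = sample budget.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n = k = 1), pass@1 reduces to the plain
# fraction of problems whose single sample passes.
print(pass_at_k(1, 1, 1))               # → 1.0
print(round(pass_at_k(10, 3, 1), 3))    # → 0.3
```

Note that greedy-decoding pass@1 (one sample per problem, as typically reported on these model cards) and sampled pass@1 estimated from many samples are not directly comparable; check each model card for its evaluation protocol.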

Last Refreshed: 2026-03-03

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool