Mistral-7B-v0.1 - AI Language Models Tool
Overview
Mistral-7B-v0.1 is a 7-billion-parameter pretrained generative language model released by Mistral AI on September 27, 2023. It is designed for high inference efficiency through architectural choices such as grouped-query attention (GQA) and sliding-window attention (SWA), and it uses a byte-fallback BPE tokenizer. The original paper and model card report that Mistral‑7B outperforms Llama 2 13B across the benchmark suite the authors evaluated, while remaining compact enough for local as well as cloud deployment. ([ar5iv.org](https://ar5iv.org/pdf/2310.06825))

The model is published under the Apache‑2.0 license and distributed on Hugging Face, with both safetensors and PyTorch weights available. The repository includes the config and tokenizer files, and the authors provide reference code and guidance for running Mistral 7B. The model card also warns that this is a base model with no built-in moderation.

Note that Mistral has since moved the original Mistral 7B releases into its model lifecycle, listing them as retired with a retirement date of March 30, 2025, so check Mistral's governance/lifecycle page before selecting a version for production. ([huggingface.co](https://huggingface.co/mistralai/Mistral-7B-v0.1/tree/main))
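The efficiency claim around GQA can be made concrete with a little arithmetic: only the 8 key/value heads contribute to the decoding KV cache, not all 32 query heads. A rough sketch, assuming bf16 cache entries (2 bytes each) and the config values from the Hugging Face repo; the 32-KV-head figure is a hypothetical vanilla multi-head-attention baseline, not a released model:

```python
# Back-of-the-envelope KV-cache size per token for Mistral-7B-v0.1 (bf16).
n_layers, head_dim, bytes_per_value = 32, 128, 2

def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    # K and V caches, across all layers, for a single token position
    return 2 * n_kv_heads * head_dim * bytes_per_value * n_layers

mha_bytes = kv_cache_bytes_per_token(32)  # hypothetical: one KV head per query head
gqa_bytes = kv_cache_bytes_per_token(8)   # Mistral's grouped-query attention
print(f"MHA: {mha_bytes / 1024:.0f} KiB/token, GQA: {gqa_bytes / 1024:.0f} KiB/token")
# GQA shrinks the per-token cache 4x (32 query heads share 8 KV heads)
```

Under these assumptions the cache drops from roughly 512 KiB to 128 KiB per token, which is what enables the larger batch sizes and higher decoding throughput the authors emphasize.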
Model Statistics
- Downloads: 337,727
- Likes: 4042
- Pipeline: text-generation
- License: apache-2.0
Model Details
Architecture and core specs: Mistral‑7B is a dense transformer‑based causal LM with design optimizations for inference and long‑context handling. The published config lists:
- hidden_size 4096, intermediate_size 14336
- 32 transformer layers, 32 attention heads, 8 key/value heads
- vocabulary size 32,000
- max position embeddings (context) of 32,768 tokens, with a default sliding_window of 4096
The repository ships bfloat16 weights for inference. ([huggingface.co](https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json))

Key architectural features: Grouped‑Query Attention (GQA) reduces KV projection cost and decoding memory, enabling higher throughput and a smaller cache; Sliding‑Window Attention (SWA) with a rolling buffer cache handles very long contexts with fixed cache memory; and a byte‑fallback BPE tokenizer increases robustness on out‑of‑vocabulary inputs. The design trades toward throughput and practical deployment latency while keeping competitive accuracy.

Implementation and recommended tooling: the authors link to a reference implementation and suggest a recent Transformers release (>= 4.34.0) and common inference servers (vLLM / SkyPilot examples) for efficient deployment. The original paper includes the full evaluation methodology and reproduction details. ([ar5iv.org](https://ar5iv.org/pdf/2310.06825))
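As a sanity check, the total parameter count can be reproduced from those config values. The sketch below assumes the standard Llama-style decoder layout (q/k/v/o projections, a gated MLP, two RMSNorm weights per layer, untied embedding and LM head), which matches the released checkpoint:

```python
# Reconstruct Mistral-7B-v0.1's parameter count from its published config values.
hidden, inter, layers, heads, kv_heads, vocab = 4096, 14336, 32, 32, 8, 32000
head_dim = hidden // heads            # 128
kv_dim = kv_heads * head_dim          # 1024 (only 8 KV heads, thanks to GQA)

attn = hidden * hidden * 2 + hidden * kv_dim * 2   # q, o + k, v projections
mlp = hidden * inter * 3                           # gate, up, down projections
norms = 2 * hidden                                 # two RMSNorm weights per layer
per_layer = attn + mlp + norms

total = (vocab * hidden          # token embeddings
         + layers * per_layer
         + hidden                # final RMSNorm
         + vocab * hidden)       # LM head (not tied to the embeddings)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")  # 7,241,732,096 (~7.24B)
```

The result, about 7.24B parameters, is consistent with the "7B" naming and shows how GQA keeps the k/v projections at a quarter of the size of the q/o projections.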
Key Features
- Grouped‑Query Attention (GQA) for reduced KV memory and faster decoding.
- Sliding‑Window Attention (SWA) and rolling buffer cache for long contexts.
- Byte‑fallback BPE tokenizer to handle out‑of‑vocabulary and mixed inputs.
- Compact 7B model with performance comparable to larger open 13B/34B models on many benchmarks.
- Pretrained base weights under Apache‑2.0, with safetensors and PyTorch formats on Hugging Face.
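To illustrate what sliding-window attention means in practice, here is a minimal mask sketch in plain Python (illustrative only; real implementations fuse the window into the attention kernel): each query position attends to at most the previous W-1 positions plus itself, so attended keys, and hence cache memory, stay bounded regardless of sequence length, while information still propagates past the window across stacked layers.

```python
# Minimal sliding-window causal mask: True where query position q may attend to key k.
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    return [[q - window < k <= q for k in range(seq_len)] for q in range(seq_len)]

mask = sliding_window_mask(seq_len=10, window=4)
# Every row attends to at most `window` keys, so the rolling cache stays fixed-size
assert all(sum(row) <= 4 for row in mask)
assert mask[9][9] and mask[9][6] and not mask[9][5]  # position 9 sees keys 6..9 only
```

With Mistral's default window of 4096, a layer-`n` query can still be influenced by tokens up to roughly `n * 4096` positions back via the layers below it, which is how the model covers its 32,768-token context with a fixed-size cache.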
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
model_id = 'mistralai/Mistral-7B-v0.1'
# Transformers >= 4.34.0 supports Mistral natively, so trust_remote_code is not required.
# device_map='auto' places layers on available GPUs/CPU automatically (requires accelerate).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
# Do not also pass a device to the pipeline: placement is already handled by device_map.
gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
prompt = 'Write a short summary of the benefits of grouped‑query attention in one paragraph.'
outputs = gen(prompt, max_new_tokens=120, do_sample=False)
print(outputs[0]['generated_text'])
# Notes: the Hugging Face model card and the paper recommend Transformers 4.34.0+ and show troubleshooting tips.
# See the model card and paper for details on tokenizer files, config, and deployment recommendations. ([huggingface.co](https://huggingface.co/mistralai/Mistral-7B-v0.1/tree/main))
Benchmarks
- MMLU (5-shot): 60.1% (Source: https://arxiv.org/abs/2310.06825)
- HellaSwag (0-shot): 81.3% (Source: https://arxiv.org/abs/2310.06825)
- Winogrande (0-shot): 75.3% (Source: https://arxiv.org/abs/2310.06825)
- PIQA (0-shot): 83.0% (Source: https://arxiv.org/abs/2310.06825)
- HumanEval (0-shot): 30.5% (Source: https://arxiv.org/abs/2310.06825)
- MBPP (3-shot): 47.5% (Source: https://arxiv.org/abs/2310.06825)
- GSM8K (8-shot): 52.2% (Source: https://arxiv.org/abs/2310.06825)
- MATH (4-shot): 13.1% (Source: https://arxiv.org/abs/2310.06825)
- ARC (Easy): 80.0% (Source: https://arxiv.org/abs/2310.06825)
- NaturalQuestions (5-shot): 28.8% (Source: https://arxiv.org/abs/2310.06825)
Key Information
- Category: Language Models
- Type: AI Language Models Tool