BLOOM - AI Language Models Tool
Overview
BLOOM is an open-access, decoder-only Transformer language model with 176 billion parameters, developed by the BigScience collaborative workshop. It was trained on the ROOTS corpus (≈341.6 billion tokens) and can generate text in 46 natural languages and 13 programming languages. The project deliberately released the model weights, many intermediary checkpoints, and optimizer states to support reproducible research and downstream fine-tuning. ([arxiv.org](https://arxiv.org/abs/2211.05100)) Designed for research and responsible deployment, BLOOM is distributed under the BigScience Responsible AI License (RAIL), which permits broad reuse while imposing use-based restrictions intended to reduce high-risk misuse. The release is tightly integrated with the Hugging Face ecosystem (model cards, tokenizers, datasets, and evaluation artifacts) and includes training logs and evaluation results so researchers can inspect and reproduce training and evaluation behaviour. BLOOM has also spawned instruction-tuned and community variants (e.g., BLOOMZ and downstream chat/instruction models) built on the open checkpoints. ([bigscience.huggingface.co](https://bigscience.huggingface.co/blog/the-bigscience-rail-license))
Key Features
- 176B-parameter open-access decoder-only Transformer (GPT-like) foundation model.
- Pretrained on ROOTS: ~341.6B tokens spanning 46 natural languages and 13 programming languages.
- Released under the Responsible AI License (RAIL) with use-based restrictions.
- Public intermediary checkpoints, full optimizer states, and training logs available.
- Integrated model card, datasets, and evaluation artifacts on the Hugging Face Hub.
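The multilingual tokenizer can be exercised without downloading the 176B weights: the tokenizer files alone are only a few megabytes. This is an illustrative sketch, assuming `transformers` is installed and network access to the Hugging Face Hub is available.

```python
from transformers import AutoTokenizer

# Tokenizer files are tiny compared with the 176B model weights,
# so this is a cheap way to inspect BLOOM's shared vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# BLOOM's byte-level BPE covers its 46 natural + 13 programming
# languages with a single vocabulary; decoding round-trips the input.
for text in ["Hello, world!", "Bonjour le monde", "print('hi')"]:
    ids = tokenizer(text)["input_ids"]
    print(len(ids), tokenizer.decode(ids))
```

Because the vocabulary is shared across languages, the same tokenizer instance handles English, French, and Python source in the loop above.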
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# NOTE: bigscience/bloom is a 176B-parameter model. Loading requires high-memory GPUs
# or device sharding (Accelerate / Transformers device_map) and may not fit local machines.
model_name = "bigscience/bloom"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# Load model (this will attempt to download large weights). Use device_map='auto' for HF accelerate.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # requires accelerate + supported hardware
    torch_dtype=torch.bfloat16,  # recommended when supported
    trust_remote_code=False,     # set True only if the model card requires it
)
# Simple generation
prompt = "Write a brief description of BLOOM in one paragraph:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Benchmarks
- Parameter count: 176 billion (Source: https://arxiv.org/abs/2211.05100)
- Training tokens (ROOTS corpus): ≈341.6 billion (Source: https://huggingface.co/bigscience/tr11-176B-logs)
- Languages covered (pretraining): 46 natural languages + 13 programming languages (Source: https://arxiv.org/abs/2211.05100)
- HumanEval pass@1 (code generation, zero-shot): 0.155 for BLOOM-176B, as reported in the model README/eval artifacts (Source: https://huggingface.co/bigscience/bloom)
- Final checkpoint: training loss 1.939, validation loss 2.061, perplexity 7.045 (Source: https://huggingface.co/bigscience/bloom)
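The parameter count above translates directly into a raw weight footprint, which is why the example's note about high-memory GPUs and device sharding matters. This back-of-the-envelope arithmetic is an illustration (it ignores activations, the KV cache, and optimizer state, and uses 1 GB = 1e9 bytes), not an official figure.

```python
# Estimate raw weight storage for BLOOM-176B at common precisions.
# Excludes activations, KV cache, and optimizer state.

def weight_gb(n_params: float, bytes_per_param: int) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 176e9  # parameter count reported in the BLOOM paper

for name, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{weight_gb(N_PARAMS, nbytes):.0f} GB")
```

Even at bfloat16 the weights alone occupy roughly 352 GB, far beyond a single consumer GPU, hence the `device_map="auto"` sharding in the example and the existence of smaller community checkpoints for local experimentation.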
Key Information
- Category: Language Models
- Type: AI Language Models Tool