Gemma 3 27B Instruct (google/gemma-3-27b-it) - AI Language Models Tool

Overview

Gemma 3 27B Instruct (google/gemma-3-27b-it) is Google DeepMind’s instruction‑tuned, multimodal 27‑billion‑parameter member of the Gemma 3 family. It accepts text and images (image‑text‑to‑text pipeline), supports a very long context window (128K tokens for the 27B/12B/4B sizes), and produces text outputs for tasks such as question answering, document summarization, reasoning, code generation, and image understanding. The instruction‑tuned variant provides chat‑style templates and generation utilities for structured outputs and function‑calling workflows. ([huggingface.co](https://huggingface.co/google/gemma-3-27b-it))

Designed for developer accessibility, Gemma 3 emphasizes a small‑footprint, high‑capability tradeoff: the family ships as open weights under Google’s Gemma license, can be run via Hugging Face Transformers (supported from transformers>=4.50), and is optimized for both single‑GPU and multi‑GPU inference. Safety and filtering pipelines are applied during pretraining and in post‑training alignment, and the release is accompanied by a technical report documenting architectural changes (local/global attention interleaving), vision encoder design (SigLIP + Pan & Scan), and evaluation across language, STEM, code, and multimodal benchmarks. ([huggingface.co](https://huggingface.co/google/gemma-3-27b-it))

Model Statistics

  • Downloads: 1,684,676
  • Likes: 1,896
  • Pipeline: image-text-to-text

License: gemma

Model Details

Architecture and modalities: Gemma 3 follows a decoder‑only transformer design, augmented for vision by a SigLIP encoder that converts images into a sequence of 256 soft tokens (images normalized to 896×896 by default), so the language decoder can attend to visual content while still producing text outputs. The instruction‑tuned 27B model is produced from pre‑trained checkpoints using distillation, RLHF/RLMF/RLEF‑style alignment, and other post‑training recipes described in the team’s technical report. ([hyper.ai](https://hyper.ai/en/papers/2503.19786?utm_source=openai))

Long context and attention: To enable a 128K‑token context without prohibitive KV‑cache growth, Gemma 3 interleaves many local attention layers (short span, e.g., 1024 tokens) with sparser global attention layers (the paper reports a 5:1 local:global interleave for memory efficiency). This hybrid design reduces KV memory while preserving long‑range reasoning and retrieval over book‑length inputs. ([hyper.ai](https://hyper.ai/en/papers/2503.19786?utm_source=openai))

Tokenization, training budget, and sizes: The family uses an improved multilingual tokenizer with a large vocabulary, and the 27B variant was trained on ~14 trillion tokens; other sizes include 1B, 4B, and 12B alongside the 27B instruction‑tuned model discussed here. Vision encoder details and parameter splits are described in the technical report. Gemma 3 models were trained on Google TPUs using JAX/ML Pathways. ([huggingface.co](https://huggingface.co/google/gemma-3-27b-it))

Inference and integration: The model is exposed on Hugging Face as an image-text-to-text pipeline and provides AutoProcessor + Gemma3ForConditionalGeneration primitives for single/multi‑GPU inference (device_map="auto" and bfloat16 are recommended for performance). Instruction‑tuned usage requires chat templates for role/content formatting. ([huggingface.co](https://huggingface.co/google/gemma-3-27b-it))
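
The memory benefit of the local/global interleave can be sketched with back-of-envelope arithmetic. The layer count, KV head count, and head dimension below are illustrative assumptions, not published Gemma 3 figures; only the 5:1 interleave, the 1024-token local span, and the 128K context come from the report.

```python
# Back-of-envelope KV-cache estimate for a 5:1 local:global attention
# interleave. All model dimensions below are illustrative assumptions.

def kv_cache_bytes(num_layers, context_len, local_span, local_per_global,
                   num_kv_heads=16, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size in bytes (keys + values, bf16)."""
    total = 0
    for layer in range(num_layers):
        # Every (local_per_global + 1)-th layer is global; the rest are
        # local layers whose cache is capped at `local_span` tokens.
        is_global = layer % (local_per_global + 1) == 0
        span = context_len if is_global else min(local_span, context_len)
        total += 2 * span * num_kv_heads * head_dim * bytes_per_elem
    return total

ctx = 128_000
all_global = kv_cache_bytes(48, ctx, ctx, 0)    # every layer global
interleaved = kv_cache_bytes(48, ctx, 1024, 5)  # 5 local : 1 global

print(f"all-global : {all_global / 2**30:.1f} GiB")
print(f"5:1 mix    : {interleaved / 2**30:.1f} GiB")
```

Under these assumed dimensions the interleave cuts the 128K-token KV cache by roughly a factor of five, since only the sparse global layers pay the full-context cost.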

Key Features

  • Multimodal input: accepts images and text, outputs text for multimodal QA and captioning.
  • 128K token context window for 4B/12B/27B sizes—book‑length context handling.
  • SigLIP vision encoder converts images into 256 soft tokens for integrated vision reasoning.
  • Instruction‑tuned chat templates and function‑calling support for structured outputs.
  • Optimized local/global attention interleave to reduce KV‑cache memory for long context.
  • Available as open weights (Gemma license) and runnable via Hugging Face Transformers.
  • Training and evaluation show strong multilingual and STEM/code performance improvements.
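
The fixed 256-soft-token image cost above makes multimodal prompt budgeting simple arithmetic. A minimal sketch, assuming each Pan & Scan crop is encoded separately at the same fixed cost (the crop counts are illustrative):

```python
# Token budget for multimodal prompts: each encoded image contributes a
# fixed 256 soft tokens after SigLIP encoding at 896x896. With Pan & Scan,
# a high-resolution image may be split into several crops, each encoded
# separately (the crop counts below are illustrative assumptions).

SOFT_TOKENS_PER_IMAGE = 256

def image_token_cost(num_crops=1):
    """Soft tokens an image adds to the context window."""
    return SOFT_TOKENS_PER_IMAGE * num_crops

def remaining_context(context_len, text_tokens, crop_counts):
    """Tokens left for generation after text and image inputs."""
    used = text_tokens + sum(image_token_cost(c) for c in crop_counts)
    return context_len - used

# One plain image plus one Pan & Scan image split into 4 crops:
left = remaining_context(128_000, text_tokens=500, crop_counts=[1, 4])
print(left)  # 128_000 - 500 - 256 - 1024 = 126_220
```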

Example Usage

Example (python):

from transformers import pipeline

# Example: image-text-to-text pipeline (instruction-tuned chat template)
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-27b-it",
    device="cuda",
    torch_dtype="bfloat16"
)

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
        {"type": "text", "text": "What animal is on the candy?"}
    ]}
]

# Run pipeline (instruction-tuned models require chat templates)
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
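
The nested role/content structure above is easy to get wrong by hand. A small local helper (hypothetical convenience code, not part of the transformers API) can build the same messages list:

```python
# Helper for building Gemma 3 chat messages in the nested role/content
# format the instruction-tuned pipeline expects. This is local convenience
# code, not part of the transformers API.

def make_message(role, text=None, image_url=None):
    """Build one chat turn; image parts precede text parts."""
    content = []
    if image_url is not None:
        content.append({"type": "image", "url": image_url})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": role, "content": content}

messages = [
    make_message("system", text="You are a helpful assistant."),
    make_message("user",
                 text="What animal is on the candy?",
                 image_url="https://huggingface.co/datasets/huggingface/"
                           "documentation-images/resolve/main/p-blog/candy.JPG"),
]
```

The resulting list is identical to the hand-written one above and can be passed straight to `pipe(text=messages, ...)`.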

Benchmarks

  • HellaSwag (10-shot): 85.6
  • BoolQ (0-shot): 82.4
  • MMLU (5-shot): 78.6
  • HumanEval (0-shot): 48.8
  • DocVQA (val): 85.6
  • COCO captioning (COCOcap): 116

Source: https://huggingface.co/google/gemma-3-27b-it

Gemma 3 technical report (architectural claims): 128K context, SigLIP vision encoder, local:global attention interleave (Source: https://arxiv.org/abs/2503.19786)

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool