STILL-3-Tool-32B - AI Language Models Tool

Overview

STILL-3-Tool-32B is an open-source, code-assisted text-generation model (32.8B parameters) designed to improve multi-step reasoning by integrating executable Python into its chain-of-thought. The authors frame the approach as “tool manipulation”: the model is trained and prompted to emit small, verifiable Python snippets (with accompanying code-output blocks) inside its reasoning traces, so that numeric work and empirical checks are executed as part of the solution process. The release includes the model weights, training data, and supporting code to reproduce the code-integrated reasoning pipeline. ([huggingface.co](https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B))

On math reasoning, the model reports 81.70% accuracy on the AIME 2024 benchmark; the authors note this matches reported o3-mini results on that test and outperforms o1 and DeepSeek-R1 baselines in their comparisons. STILL-3-Tool-32B is distributed as safetensors (BF16) on Hugging Face and is accompanied by a GitHub repository and a detailed Notion report documenting data construction, prompt formats, and training recipes. ([huggingface.co](https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B))
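
For illustration, a code-integrated reasoning trace in this style might look like the hypothetical sketch below (the exact prompt template and trace format are documented in the authors' Notion report):

We need x with 3x + 5 = 20, so x = (20 - 5) / 3.

```python
x = (20 - 5) / 3
print(x)
```
```output
5.0
```

The executed check confirms x = 5.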

Model Statistics

  • Downloads: 2
  • Likes: 5
  • Pipeline: text-generation
  • Parameters: 32.8B

Model Details

Architecture and format: STILL-3-Tool-32B is a dense causal Transformer with roughly 32.8 billion parameters, published on Hugging Face in safetensors BF16 format, with quantized variants also provided. The model card indicates the model is intended for transformers-style text-generation pipelines. ([huggingface.co](https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B))

Code-integrated reasoning: The key design difference is explicit training and prompting for Python code emission inside the chain-of-thought: each non-trivial computational step is followed by a ```python``` block and an ```output``` block that records the executed result. The team distilled and synthesized training examples (including distillation from DeepSeek-R1 and a Distill-Qwen variant) to produce long-chain solutions that include runnable code. Their Notion report and repository describe the prompt template and the data-filtering rules used to build the dataset. ([lake-bayberry-173.notion.site](https://lake-bayberry-173.notion.site/Empowering-Reasoning-Models-with-Wings-Tool-Manipulation-Significantly-Enhances-the-Reasoning-Abili-1a6ab1cf72428023a105c16eec90968e))

Training / data notes: The authors used a mixture of distilled long-CoT examples, synthesized data, and RL/finetuning experiments across the STILL project to encourage tool-use behaviors (details and datasets are linked from the repo and Notion page). The repo and model card link to the training data and experimental reports for reproducibility. ([github.com](https://github.com/RUCAIBox/Slow_Thinking_with_LLMs))
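
The report centers on the trace format rather than on a specific executor, so the snippet below is only a minimal sketch of a driver that runs each emitted ```python``` block and splices in the matching ```output``` block (assumptions: blocks are fenced exactly as described above, and exec of model-generated code must be sandboxed in any real deployment).

Example (python):

import contextlib
import io
import re

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def append_outputs(trace: str) -> str:
    """Run each ```python``` block in a trace and append a matching
    ```output``` block holding the captured stdout."""
    pieces, last = [], 0
    for match in CODE_BLOCK.finditer(trace):
        pieces.append(trace[last:match.end()])
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            # WARNING: executes model-generated code; sandbox this in practice.
            exec(match.group(1), {})
        pieces.append("\n```output\n" + buf.getvalue() + "```")
        last = match.end()
    pieces.append(trace[last:])
    return "".join(pieces)

print(append_outputs("```python\nprint((20 - 5) / 3)\n```"))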

Key Features

  • Executable Python integrated into chain-of-thought for verifiable numeric steps.
  • 32.8B-parameter dense causal Transformer in safetensors BF16 format.
  • Training data distilled from DeepSeek-R1 and a Distill-Qwen variant, plus synthesized examples.
  • Open-source release: model, dataset, and training code published on Hugging Face and GitHub.
  • Quantized variants available for more efficient local inference (see the 4-bit loading sketch after this list).

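As a sketch of the quantization route (assuming the bitsandbytes and accelerate packages are installed; the pre-quantized repositories themselves are linked from the model card), the BF16 weights can also be loaded in 4-bit on the fly:

Example (python):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "RUC-AIBOX/STILL-3-TOOL-32B"

# 4-bit NF4 weights with BF16 compute: a large memory saving at some quality cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
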
Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# NOTE: this model is large; run on a machine with adequate GPU memory or use device_map="auto" (requires the accelerate package).
model_id = "RUC-AIBOX/STILL-3-TOOL-32B"

# Load tokenizer and model (trust_remote_code may be required for model-backed tokenizers/configs)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = (
    "Solve the problem. Use Python blocks for calculations as needed and show output.\n"
    "Question: If 3x+5=20, what is x?"
)
response = gen(prompt, max_new_tokens=256, do_sample=False)
print(response[0]["generated_text"])
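
If the checkpoint ships a chat template (plausible for an instruction-tuned release, but an assumption; check the repo's tokenizer_config.json), formatting the request through apply_chat_template, reusing the tokenizer and model loaded above, is usually preferable to a raw-text prompt:

# Hypothetical chat-formatted variant; verify the template exists first.
messages = [
    {"role": "user",
     "content": "Solve with Python blocks and show output: if 3x+5=20, what is x?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))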

Benchmarks

  • AIME 2024 accuracy: 81.70% (Source: https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B model card and the authors' Notion project report.)
  • Relative comparison reported by the authors: matches o3-mini and outperforms o1 and DeepSeek-R1 on AIME 2024 (Source: authors' Notion report and Hugging Face model card.)

Last Refreshed: 2026-01-16

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool