STILL-3-Tool-32B - AI Language Models Tool

Overview

STILL-3-Tool-32B is an open-source, code-assisted text-generation model (32.8B parameters) designed to improve multi-step reasoning by integrating executable Python into its chain-of-thought. The authors frame the approach as “tool manipulation”: the model is trained and prompted to emit small, verifiable Python snippets (with accompanying code-output blocks) inside its reasoning traces, so that numeric work and empirical checks are executed as part of the solution process. The release includes the model weights, training data, and supporting code to reproduce the code-integrated reasoning pipeline. ([huggingface.co](https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B))

On math reasoning, the model reports 81.70% accuracy on the AIME 2024 benchmark; the authors note this matches reported o3-mini results on that test and outperforms o1 and DeepSeek-R1 baselines in their comparisons. STILL-3-Tool-32B is distributed as safetensors (BF16) on Hugging Face and is accompanied by a GitHub repository and a detailed Notion report documenting data construction, prompt formats, and training recipes. ([huggingface.co](https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B))
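
For illustration, a code-integrated reasoning trace in this style might look like the hypothetical sketch below (the exact prompt template and trace format are documented in the authors' Notion report):

We need x with 3x + 5 = 20, so x = (20 - 5) / 3.

```python
x = (20 - 5) / 3
print(x)
```
```output
5.0
```

The executed check confirms x = 5.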

Model Statistics

  • Downloads: 2
  • Likes: 5
  • Pipeline: text-generation
  • Parameters: 32.8B

Model Details

Architecture and format: STILL-3-Tool-32B is a dense causal Transformer with roughly 32.8 billion parameters, published on Hugging Face in safetensors BF16 format, with quantized variants also provided. The model card indicates the model is intended for transformers-style text-generation pipelines. ([huggingface.co](https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B))

Code-integrated reasoning: The key design difference is explicit training and prompting for Python code emission inside the chain-of-thought: each non-trivial computational step is followed by a ```python``` block and an ```output``` block that records the executed result. The team distilled and synthesized training examples (including distillation from DeepSeek-R1 and a Distill-Qwen variant) to produce long-chain solutions that include runnable code. Their Notion report and repository describe the prompt template and the data-filtering rules used to build the dataset. ([lake-bayberry-173.notion.site](https://lake-bayberry-173.notion.site/Empowering-Reasoning-Models-with-Wings-Tool-Manipulation-Significantly-Enhances-the-Reasoning-Abili-1a6ab1cf72428023a105c16eec90968e))

Training / data notes: The authors used a mixture of distilled long-CoT examples, synthesized data, and RL/finetuning experiments across the STILL project to encourage tool-use behaviors (details and datasets are linked from the repo and Notion page). The repo and model card link to the training data and experimental reports for reproducibility. ([github.com](https://github.com/RUCAIBox/Slow_Thinking_with_LLMs))
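
The report centers on the trace format rather than on a specific executor, so the snippet below is only a minimal sketch of a driver that runs each emitted ```python``` block and splices in the matching ```output``` block (assumptions: blocks are fenced exactly as described above, and exec of model-generated code must be sandboxed in any real deployment).

Example (python):

import contextlib
import io
import re

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def append_outputs(trace: str) -> str:
    """Run each ```python``` block in a trace and append a matching
    ```output``` block holding the captured stdout."""
    pieces, last = [], 0
    for match in CODE_BLOCK.finditer(trace):
        pieces.append(trace[last:match.end()])
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            # WARNING: executes model-generated code; sandbox this in practice.
            exec(match.group(1), {})
        pieces.append("\n```output\n" + buf.getvalue() + "```")
        last = match.end()
    pieces.append(trace[last:])
    return "".join(pieces)

print(append_outputs("```python\nprint((20 - 5) / 3)\n```"))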

Key Features

  • Executable Python integrated into chain-of-thought for verifiable numeric steps.
  • 32.8B-parameter dense causal Transformer in safetensors BF16 format.
  • Training data distilled from DeepSeek-R1 and a Distill-Qwen variant, plus synthesized examples.
  • Open-source release: model, dataset, and training code published on Hugging Face and GitHub.
  • Quantized variants available for more efficient local inference (see the 4-bit loading sketch after this list).

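As a sketch of the quantization route (assuming the bitsandbytes and accelerate packages are installed; the pre-quantized repositories themselves are linked from the model card), the BF16 weights can also be loaded in 4-bit on the fly:

Example (python):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "RUC-AIBOX/STILL-3-TOOL-32B"

# 4-bit NF4 weights with BF16 compute: a large memory saving at some quality cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
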
Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# NOTE: this model is large; run on a machine with adequate GPU memory or use device_map="auto" (requires the accelerate package).
model_id = "RUC-AIBOX/STILL-3-TOOL-32B"

# Load tokenizer and model (trust_remote_code may be required for model-backed tokenizers/configs)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = (
    "Solve the problem. Use Python blocks for calculations as needed and show output.\n"
    "Question: If 3x+5=20, what is x?"
)
response = gen(prompt, max_new_tokens=256, do_sample=False)
print(response[0]["generated_text"])
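
If the checkpoint ships a chat template (plausible for an instruction-tuned release, but an assumption; check the repo's tokenizer_config.json), formatting the request through apply_chat_template, reusing the tokenizer and model loaded above, is usually preferable to a raw-text prompt:

# Hypothetical chat-formatted variant; verify the template exists first.
messages = [
    {"role": "user",
     "content": "Solve with Python blocks and show output: if 3x+5=20, what is x?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))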

Benchmarks

  • AIME 2024 accuracy: 81.70% (Source: https://huggingface.co/RUC-AIBOX/STILL-3-TOOL-32B model card and the authors' Notion project report.)
  • Relative comparison reported by the authors: matches o3-mini and outperforms o1 and DeepSeek-R1 on AIME 2024 (Source: authors' Notion report and Hugging Face model card.)

Last Refreshed: 2026-01-16

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool