Reka Flash 3 - AI Language Models Tool
Overview
Reka Flash 3 is an open-source, 21-billion-parameter general-purpose reasoning model released by Reka AI as a research preview. Trained from scratch on a mixture of public and synthetic corpora, the model was instruction-tuned and then improved with a reinforcement-learning stage, RLOO (REINFORCE Leave-One-Out), that used both model-based and rule-based rewards. Reka positions Flash 3 as a compact, low-latency foundation for chat, coding, instruction following, and function calling; the model is distributed in a Llama-compatible format for easy integration into existing toolchains. (Sources: https://huggingface.co/RekaAI/reka-flash-3, https://reka.ai/news/introducing-reka-flash)

Reka Flash 3 also powers Reka's commercial products (for example, the Nexus platform) and is the basis for multimodal and agentic variants used internally (Reka Research, Reka Vision).

The model supports long contexts (32k tokens in the released research preview), an explicit "thinking" mechanism with reasoning tags that can be budget-forced (stopped early), and several quantized distributions for on-device or lower-cost inference. The weights are released under the Apache-2.0 license, and Reka publishes supporting tools such as a quantization library and GGUF/llama.cpp-compatible quants. (Sources: https://reka.ai/news/introducing-reka-flash, https://reka.ai/news/reka-quantization-technology)
Model Statistics
- Downloads: 127
- Likes: 387
- Parameters: 20.9B
- License: apache-2.0
Model Details
Architecture & size: Reka Flash 3 is a transformer-based causal LLM released as a 21B-parameter model (the Hugging Face model card lists 21B / ~20.9B depending on packaging). The model uses the cl100k_base tokenizer and adds no additional special tokens; prompts follow a simple chat format (human: ... <sep> assistant: ... <sep>), and generation should stop on the string "<sep>" or the special token <|endoftext|>. (Source: https://huggingface.co/RekaAI/reka-flash-3)

Training & alignment: The model was pretrained from scratch on diverse public and synthetic datasets, then instruction-finetuned on curated data and improved via RLOO (REINFORCE Leave-One-Out) using a mix of model-based and rule-based rewards. Reka emphasizes general improvements in reasoning and instruction following rather than heavy persona or alignment tuning for this release. (Source: https://reka.ai/news/introducing-reka-flash)

Context & runtime: The released research preview supports up to a 32k-token context length. Full-precision artifacts are distributed in BF16 / FP16, and Reka provides multiple quantized variants (including 4-bit and 3.5-bit GGUF formats) for low-memory or on-device inference. Reka reports an FP16 full-precision footprint of ~39 GB and that 4-bit quantization compresses the model to ~11 GB. (Sources: https://reka.ai/news/introducing-reka-flash, https://huggingface.co/RekaAI/reka-flash-3)

Reasoning & budget forcing: The model emits explicit reasoning tags (e.g., <reasoning> ... </reasoning>) while it thinks. Users can apply a budget-forcing mechanism to terminate the thinking trace early and force an answer; Reka reports that shorter budgets still yield reasonable answers on many reasoning benchmarks. (Source: https://huggingface.co/RekaAI/reka-flash-3)

Quantization & deployment: Reka released Reka Quant, a quantization library, along with pre-quantized GGUF artifacts (Q3_K_S / Q3 formats). The quantization approach uses calibrated error reduction and self-distillation to achieve near-lossless 3.5-bit quantization for Flash 3.1 variants, enabling llama.cpp and on-device workflows. (Source: https://reka.ai/news/reka-quantization-technology)

Limitations & language support: Flash 3 is primarily optimized for English. Reka notes it is not the top choice for knowledge-intensive tasks (Reka recommends pairing it with web search) and that the model has not undergone extensive persona alignment. (Source: https://huggingface.co/RekaAI/reka-flash-3)
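The raw chat format and the budget-forcing mechanism can be combined with ordinary Transformers generate calls: cap the number of "thinking" tokens and, if the reasoning tag was not closed within the budget, close it yourself and let the model answer. The following is a minimal sketch; the 512-token budget, greedy decoding, and the re-tokenization of the partial trace are illustrative choices, not values prescribed by the model card.
Example (python):
# Hedged sketch of budget forcing with plain Transformers generate calls.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RekaAI/reka-flash-3")
model = AutoModelForCausalLM.from_pretrained(
    "RekaAI/reka-flash-3", torch_dtype="auto", device_map="auto"
)

# Raw chat format from the model card: human/assistant turns separated by <sep>.
prompt = "human: What is 17 * 23? <sep> assistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Phase 1: let the model "think" for at most THINK_BUDGET new tokens (assumed budget).
THINK_BUDGET = 512
thinking = model.generate(**inputs, max_new_tokens=THINK_BUDGET, do_sample=False)
text = tokenizer.decode(thinking[0], skip_special_tokens=True)

# Phase 2: if the reasoning trace was cut off, close the tag and force a final answer.
if "</reasoning>" not in text:
    text += " </reasoning>"
    forced = tokenizer(text, return_tensors="pt").to(model.device)
    answer = model.generate(**forced, max_new_tokens=256, do_sample=False)
    text = tokenizer.decode(answer[0], skip_special_tokens=True)

print(text)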
Key Features
- 21B-parameter reasoning model optimized for low-latency and on-device use.
- Supports long context windows (32,000 tokens) for long-form reasoning and documents.
- Budget-forcing: explicit reasoning tags you can truncate to control thinking time.
- Distributed in a Llama-compatible format (usable with Transformers and vLLM) plus GGUF quantizations for llama.cpp.
- Near-lossless low-bit quantization via Reka Quant (3.5-bit / 4-bit artifacts available).
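For the GGUF quantizations, a local llama.cpp-based deployment can be sketched with the llama-cpp-python bindings. The file name below is a placeholder for whichever quantized artifact (e.g., a Q3_K_S file) you downloaded; the context and GPU-offload settings are illustrative assumptions.
Example (python):
# Hedged sketch: running a GGUF quantization of Reka Flash 3 via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./reka-flash-3-q3_k_s.gguf",  # placeholder file name (assumption)
    n_ctx=32768,      # the research preview supports up to 32k tokens
    n_gpu_layers=-1,  # offload all layers to GPU when one is available
)

# Raw chat format from the model card; stop generation on "<sep>".
prompt = "human: Summarize why low-bit quantization matters for on-device use. <sep> assistant: "
out = llm(prompt, max_tokens=512, stop=["<sep>"])
print(out["choices"][0]["text"])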
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
# Quickstart (from the model card)
tokenizer = AutoTokenizer.from_pretrained("RekaAI/reka-flash-3")
model = AutoModelForCausalLM.from_pretrained("RekaAI/reka-flash-3", torch_dtype='auto', device_map='auto')
prompt = {"role": "human", "content": "Write a short poem about large language models."}
# use the chat template helper (model card shows apply_chat_template usage)
text = tokenizer.apply_chat_template([prompt], tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# decode only the newly generated tokens (the prompt is echoed at the start of outputs[0])
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
# Notes: the model card documents using the 'human: ... <sep> assistant: ... <sep>' chat format
# and supports vLLM / llama.cpp / GGUF quantized distributions for low-memory deployment.
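Because the weights ship in a Llama-compatible layout, the same model can also be served with vLLM. The sketch below uses offline inference with the raw chat format; the sampling values are illustrative assumptions, not settings from the model card.
Example (python):
# Hedged sketch: offline inference with vLLM; sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="RekaAI/reka-flash-3", dtype="auto")
params = SamplingParams(temperature=0.7, max_tokens=512, stop=["<sep>"])

# Raw chat format documented on the model card; generation stops on "<sep>".
prompt = "human: List three use cases for a compact reasoning model. <sep> assistant: "
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)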
Benchmarks
- Model parameters: 21B (Source: https://huggingface.co/RekaAI/reka-flash-3)
- Full-precision footprint (FP16): ≈39 GB (Source: https://reka.ai/news/introducing-reka-flash)
- Quantized footprint (4-bit): ≈11 GB; see the back-of-the-envelope estimate below (Source: https://reka.ai/news/introducing-reka-flash)
- MMLU-Pro (reported): 65.0 (Source: https://reka.ai/news/introducing-reka-flash)
- WMT'23 (reported): 83.2 COMET (Source: https://reka.ai/news/introducing-reka-flash)
- Hugging Face popularity: 387 likes; 128 downloads last month (Source: https://huggingface.co/RekaAI/reka-flash-3)
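The reported footprints line up with a simple bytes-per-parameter estimate. The sketch below ignores tokenizer files, metadata, and any layers kept in higher precision, which is why it lands a few GB away from Reka's reported numbers.
Example (python):
# Back-of-the-envelope footprint estimate for a ~20.9B-parameter model.
PARAMS = 20.9e9

def footprint_gb(bits_per_param: float) -> float:
    """Size in decimal gigabytes if every parameter used bits_per_param bits."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"fp16   : ~{footprint_gb(16):.0f} GB")   # ~42 GB vs. reported ~39 GB
print(f"4-bit  : ~{footprint_gb(4):.0f} GB")    # ~10 GB vs. reported ~11 GB
print(f"3.5-bit: ~{footprint_gb(3.5):.0f} GB")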
Key Information
- Category: Language Models
- Type: AI Language Models Tool