DeepSeek-OCR - AI Vision Models Tool
Overview
DeepSeek-OCR is an open-weight, multilingual vision-language OCR model from DeepSeek that converts images and documents into structured text (for example, Markdown). The model is distributed as BF16 safetensors at approximately 3.3B parameters and is built to run with the Hugging Face Transformers stack and vLLM; it also supports inference optimizations such as FlashAttention. The project page includes example inference code and usage recommendations for running on GPUs with bfloat16 support. The model emphasizes a "context-aware optical compression" approach that preserves layout and context while extracting text across languages. Community adoption on Hugging Face is significant: the model card reports millions of downloads and thousands of likes. The model is released under the MIT license, which permits integration into both research and commercial workflows, and it ships with example code and ready-to-run artifacts to speed evaluation and deployment (source: Hugging Face model page).
Model Statistics
- Downloads: 3,127,119
- Likes: 3,080
- Pipeline: image-text-to-text
- Parameters: ~3.3B
- License: MIT
Model Details
DeepSeek-OCR is described on its Hugging Face model card as a multilingual vision-language OCR system that converts images and documents to text outputs (commonly Markdown). The distribution is provided as BF16 safetensors (approximately 3.3B parameters), which enables more efficient memory use on hardware that supports bfloat16. The model is packaged for the Transformers image-text-to-text pipeline and can also be served with vLLM for lower-latency, production-scale inference. Runtime optimizations called out on the model page include FlashAttention support to accelerate attention operations. The model card lists no explicit base model, and the repository provides example inference scripts and recommended settings for GPU-backed environments (source: Hugging Face model page).
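The practical impact of the BF16 distribution can be sketched with quick arithmetic: each of the ~3.3B parameters occupies 2 bytes in bfloat16 versus 4 bytes in float32. This is a minimal sketch using only the parameter count stated on the model card; real memory use also includes activations, the KV cache, and framework overhead.

```python
# Rough weight-storage estimate for a ~3.3B-parameter model.
# Only covers the weights themselves; activations, KV cache, and
# framework overhead add to the real footprint.
PARAMS = 3.3e9

def weight_gb(num_params: float, bytes_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(f"float32:  {weight_gb(PARAMS, 4):.1f} GB")  # ~13.2 GB
print(f"bfloat16: {weight_gb(PARAMS, 2):.1f} GB")  # ~6.6 GB
```

Halving the weight footprint is what makes a ~3.3B model comfortable on a single consumer-class GPU, which is consistent with the model page's emphasis on bfloat16-capable hardware.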
Key Features
- Open-weight, multilingual OCR model distributed as BF16 safetensors (~3.3B parameters)
- Converts images/documents to structured text (examples target Markdown outputs)
- Runs via Hugging Face Transformers pipeline (image-text-to-text) and supports vLLM
- Inference optimizations include FlashAttention for faster attention computation
- Provided under an MIT license with example inference code on the model page
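As background on why the BF16 format listed above halves storage while staying friendly to float32 pipelines: bfloat16 is the upper 16 bits of an IEEE-754 float32 (same sign bit and 8-bit exponent, with the mantissa truncated from 23 to 7 bits). A minimal pure-Python sketch of that truncation, using simple chopping rather than the round-to-nearest-even that real converters typically apply:

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits.
    (Real converters usually round-to-nearest-even; truncation keeps
    this sketch simple.)"""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float(b: int) -> float:
    """Expand bfloat16 bits back to a float value (low 16 bits zeroed)."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

value = 3.14159
bits = float32_to_bfloat16_bits(value)
print(f"{value} -> bfloat16 bits {bits:#06x} -> {bfloat16_bits_to_float(bits)}")
```

Because the exponent width matches float32, bfloat16 keeps the full float32 dynamic range and only sacrifices precision, which is why it is a common distribution format for model weights.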
Example Usage
Example (python):
import torch
from transformers import pipeline
# Replace with the Hugging Face model ID
MODEL_ID = "deepseek-ai/DeepSeek-OCR"
# Create an image-text-to-text pipeline. Use a GPU with bfloat16 if available.
pipe = pipeline(
    task="image-text-to-text",
    model=MODEL_ID,
    device=0,  # set to -1 for CPU
    torch_dtype=torch.bfloat16,  # requires hardware / torch support for bfloat16
    trust_remote_code=True,  # the repository ships custom model code
)
image_path = "./document.jpg" # path to your image or document image
# Run inference. Adjust generation parameters as needed.
outputs = pipe(image_path, max_new_tokens=512)
# The pipeline typically returns a list of result dicts containing the generated text
print(outputs[0].get("generated_text", outputs[0]))
Benchmarks
Hugging Face downloads: 3,127,119 downloads (Source: https://huggingface.co/deepseek-ai/DeepSeek-OCR)
Hugging Face likes: 3,080 likes (Source: https://huggingface.co/deepseek-ai/DeepSeek-OCR)
Parameters: ≈3.3B parameters (Source: https://huggingface.co/deepseek-ai/DeepSeek-OCR)
Pipeline type: image-text-to-text (Source: https://huggingface.co/deepseek-ai/DeepSeek-OCR)
License: MIT (Source: https://huggingface.co/deepseek-ai/DeepSeek-OCR)
Key Information
- Category: Vision Models
- Type: AI Vision Models Tool