Florence-2-large - AI Vision Models Tool

Overview

Florence-2-large is a vision foundation model from Microsoft that handles a broad set of vision and vision-language tasks with a single set of weights. It is a prompt-based, sequence-to-sequence transformer pretrained on the FLD-5B dataset, and it is intended for both zero-shot use and fine-tuning. Supported tasks include image captioning, object detection, OCR, and segmentation. On Hugging Face the model is listed under the image-text-to-text pipeline.

Because every task is expressed as text generation, the prompt selects the task rather than a task-specific output head. The model card's examples use fixed task tokens such as <CAPTION>, <OD> (object detection), and <OCR> instead of free-form instructions, so a prompt like "Describe this image" may not map to a supported task. According to the Hugging Face model card, Florence-2-large is distributed under the MIT license. Its listed size of roughly 776.7M parameters is small next to multi-billion-parameter vision-language models, which helps when memory or latency is constrained. The download and like counts below indicate active community use, which can make it easier to find examples and community-contributed wrappers.

Model Statistics

Downloads: 681,070
Likes: 1,734
Pipeline: image-text-to-text
Parameters: 776.7M
License: MIT
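
As a rough guide to what the listed parameter count means for memory, the weights alone take about (number of parameters) x (bytes per parameter). The sketch below assumes the 776.7M figure listed above and the standard widths of 4 bytes for float32 and 2 bytes for float16 or bfloat16. It ignores activations, generation caches, and framework overhead, so real usage will be higher. The function name is illustrative.

```python
# Back-of-the-envelope estimate of weight memory from the listed parameter count.
PARAMS = 776.7e6  # parameter count as listed above

# Standard storage widths per parameter, in bytes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights come to roughly 3.1 GB in float32 and 1.6 GB in float16.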

Model Details

Architecture and training: Florence-2-large is a prompt-based, sequence-to-sequence transformer pretrained on the FLD-5B dataset. It is a multimodal encoder-decoder that takes an image plus a text prompt and produces text output. On Hugging Face it is listed with the image-text-to-text pipeline type and a parameter count of approximately 776.7 million. The model card lists the license as MIT and does not indicate a separate base model dependency.

Capabilities: Florence-2-large supports zero-shot inference for many vision-language tasks and can be fine-tuned for domain-specific objectives. Supported task families include image captioning, scene text recognition (OCR), object detection with bounding boxes, and segmentation. Because all tasks share one text-generation interface, developers switch tasks by changing the prompt, without changing model weights. The model card's examples do this with fixed task tokens such as <CAPTION>, <OD>, and <OCR>, and structured outputs such as boxes are encoded in the generated text and parsed afterwards.

Deployment: The weights are hosted on the Hugging Face Hub and can be loaded locally with the transformers library; the model card's examples pass trust_remote_code=True when loading. Hosted inference availability can vary by model, so check the model page before relying on a hosted endpoint.

Source: Hugging Face model card for microsoft/Florence-2-large (https://huggingface.co/microsoft/Florence-2-large).
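
To illustrate prompt-based task selection, here is a minimal sketch of a lookup from friendly task names to prompt strings. The task tokens are the ones shown in the model card's examples as best recalled here, so verify them against the card before relying on them. The names TASK_PROMPTS, TASKS_WITH_TEXT, and build_prompt are hypothetical helpers, not part of any library.

```python
# Hypothetical helper: map friendly task names to Florence-2 task tokens.
# Tokens are taken from the model card's examples; verify before use.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Tasks whose prompt is the token followed by extra text (a caption or phrase).
TASKS_WITH_TEXT = {"phrase_grounding", "referring_segmentation"}


def build_prompt(task: str, text: str = "") -> str:
    """Return the prompt string for a task, appending text only where it applies."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    token = TASK_PROMPTS[task]
    if task in TASKS_WITH_TEXT:
        if not text:
            raise ValueError(f"Task {task!r} needs accompanying text")
        return token + text
    if text:
        raise ValueError(f"Task {task!r} takes the task token only")
    return token


print(build_prompt("object_detection"))
print(build_prompt("phrase_grounding", "A green car parked on the street"))
```

Keeping the mapping in one place means a typo in a task token fails early with a clear error instead of producing unexpected model output.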

Key Features

Promptable sequence-to-sequence transformer for flexible image-to-text outputs.
Pretrained on the large FLD-5B multimodal dataset for broad visual knowledge.
Supports zero-shot inference and finetuning for task specialization.
Exposed as an image-text-to-text pipeline on Hugging Face for easy integration.
MIT license enabling permissive reuse in commercial and research projects.

Example Usage

Example (python):

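import io

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Local inference with transformers, following the pattern in the model card.
# trust_remote_code=True runs modeling code shipped in the model repository;
# depending on your transformers version, loading may need adjusting.
model_id = "microsoft/Florence-2-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example: caption an image fetched from a URL
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/image_captioning.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image = Image.open(io.BytesIO(response.content)).convert("RGB")

# The task is selected with a fixed prompt token, e.g. "<CAPTION>", "<OD>", "<OCR>"
task = "<CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Parse the raw generated text into a dict keyed by the task token
result = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(result)  # for "<CAPTION>", a dict of the form {"<CAPTION>": "..."}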
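
For detection-style tasks such as <OD>, the model card shows the post-processed result as a dictionary keyed by the task token, holding parallel "bboxes" ([x1, y1, x2, y2] in pixels) and "labels" lists. The following is a minimal sketch of flattening that structure into one record per object, assuming that shape. The function name detections_to_records is hypothetical, and the sample data is hand-written for illustration, not real model output.

```python
from typing import Any


def detections_to_records(
    parsed: dict[str, Any], task: str = "<OD>"
) -> list[dict[str, Any]]:
    """Flatten {task: {"bboxes": [...], "labels": [...]}} into per-object records."""
    result = parsed[task]
    bboxes, labels = result["bboxes"], result["labels"]
    if len(bboxes) != len(labels):
        raise ValueError("bboxes and labels must have the same length")
    records = []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        records.append(
            {
                "label": label,
                "box": (x1, y1, x2, y2),
                "width": x2 - x1,
                "height": y2 - y1,
            }
        )
    return records


# Hand-written sample in the assumed shape (not real model output).
sample = {
    "<OD>": {
        "bboxes": [[10.0, 20.0, 110.0, 220.0], [5.0, 5.0, 55.0, 30.0]],
        "labels": ["car", "wheel"],
    }
}

for record in detections_to_records(sample):
    print(record)
```

Per-object records are easier to filter, sort, or serialize than the parallel lists, and the length check catches a malformed result early.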