Florence-2-large
A vision foundation model from Microsoft for a wide range of vision and vision-language tasks, including captioning, object detection, OCR, and segmentation. It uses a prompt-based, sequence-to-sequence transformer architecture in which a text prompt selects the task, is pretrained on the FLD-5B dataset, and supports both zero-shot and fine-tuned settings.
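To illustrate what "prompt-based" means in practice, here is a minimal sketch of selecting a task by prompt token. The task labels (`caption`, `object_detection`, and so on) and the helper names `build_prompt` and `run_task` are hypothetical, chosen for this example. The prompt tokens and the `transformers` calls follow the usage shown on the model's Hugging Face page, but treat the loading and generation settings as assumptions that may need adjusting for your installed library version.

```python
# Hypothetical task labels (left) mapped to Florence-2 prompt tokens (right).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return the prompt string for a task, with optional extra text appended."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PROMPTS[task] + text_input


def run_task(image, task: str, text_input: str = ""):
    """Run one task on a PIL image (sketch; downloads the model on first call).

    Assumes the `transformers` package (with a PyTorch backend) is installed.
    The imports are kept inside the function so that the prompt helper above
    can be used without these dependencies.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-large"
    # The model ships custom code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    prompt = build_prompt(task, text_input)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,  # assumed budget; lower it for short captions
        num_beams=3,
    )
    # Keep special tokens: the post-processor uses them to parse locations.
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw_text,
        task=TASK_PROMPTS[task],
        image_size=(image.width, image.height),
    )
```

For example, `run_task(image, "object_detection")` would send the `<OD>` prompt with a PIL image. The same model weights serve every task, and only the prompt changes.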
Key Information
- Category: Vision Models
- Source: Hugging Face
- Tags: image-text-to-text
- Last updated: January 09, 2026
Structured Metrics
No structured metrics captured yet.
Links
Canonical source: https://huggingface.co/microsoft/Florence-2-large