Florence-2-large

Florence-2-large is a vision foundation model from Microsoft that handles a range of vision and vision-language tasks, including captioning, object detection, OCR, and segmentation. It uses a prompt-based, sequence-to-sequence transformer architecture in which a text prompt selects the task; it is pretrained on the FLD-5B dataset and supports both zero-shot and fine-tuned settings.
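As a rough illustration of the prompt-based interface, the sketch below maps a few tasks to prompt tokens and passes one to the model. It assumes the usage pattern published on the model's Hugging Face page (loading through transformers with trust_remote_code=True, task tokens such as <CAPTION> and <OD>, and the processor's post_process_generation helper); these details may differ between library versions. The names TASK_PROMPTS, build_prompt, and run_task are hypothetical helpers introduced here, and the generation settings are illustrative.

```python
# Task tokens as listed on the Florence-2 model card (assumed current).
# The dictionary keys are hypothetical, human-readable names for this sketch.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return the prompt for a task, appending any extra text input."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PROMPTS[task] + text_input


def run_task(image_path: str, task: str, text_input: str = ""):
    """Run one task on one image and return the parsed result."""
    # Heavy imports are deferred so the prompt helper works without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-large"
    # The model ships custom code, hence trust_remote_code=True (assumption
    # based on the model card; review the remote code before enabling it).
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True
    ).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(task, text_input)
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,  # illustrative setting
            num_beams=3,          # illustrative setting
        )

    # Special tokens are kept because the post-processor parses them
    # (for example, location tokens for bounding boxes).
    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]
    return processor.post_process_generation(
        generated_text,
        task=TASK_PROMPTS[task],
        image_size=(image.width, image.height),
    )
```

Under these assumptions, run_task("photo.jpg", "object_detection") would download the model weights on first use and return the parsed result for the <OD> task; "photo.jpg" is a placeholder path. Only the prompt changes between tasks, which is the practical effect of the sequence-to-sequence design.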

Key Information

  • Category: Vision Models
  • Source: Hugging Face
  • Tags: image-text-to-text
  • Last updated: January 09, 2026

Structured Metrics

No structured metrics captured yet.

Links

Canonical source: https://huggingface.co/microsoft/Florence-2-large