glee-vision.github.io - GLEE: General Object Foundation Model for Images and Videos at Scale

Description: General Object Foundation Model for Images and Videos at Scale

Example domain paragraphs

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot trans

1. We present GLEE, a general object-centric foundation model for images and videos. GLEE is capable of addressing a wide range of object-centric tasks simultaneously while maintaining state-of-the-art performance.

2. We develop a multi-granularity joint supervision framework and a scalable training paradigm. The unified approach of GLEE supports multi-source data and enables joint training on over five million images from various benchmarks with diverse supervision levels. This makes it much easier to incorporate additional manually or automatically annotated data and simplifies scaling up the dataset.
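The excerpt above does not say how joint training across supervision levels is implemented. As a rough illustration only, one common way to train on sources with differing annotations is to skip the loss terms a given sample has no labels for. The sketch below assumes that approach; the task names, the weights, and the `joint_loss` helper are all hypothetical and are not taken from GLEE.

```python
from typing import Dict, Optional

# Hypothetical per-task weights, chosen for illustration (not from GLEE).
LOSS_WEIGHTS: Dict[str, float] = {"box": 1.0, "mask": 1.0, "category": 1.0}


def joint_loss(per_task_losses: Dict[str, Optional[float]]) -> float:
    """Sum the weighted loss terms for which a sample has annotations.

    A value of None means the source dataset provides no annotation of
    that kind for this sample, so the corresponding term is skipped.
    """
    total = 0.0
    for task, loss in per_task_losses.items():
        if loss is None:
            continue  # no supervision of this kind for this sample
        total += LOSS_WEIGHTS[task] * loss
    return total


# A sample from a detection-only source: boxes and categories, no masks.
detection_sample = {"box": 0.5, "mask": None, "category": 0.25}
# A sample from a segmentation source: all three kinds of annotation.
segmentation_sample = {"box": 0.5, "mask": 0.75, "category": 0.25}

detection_total = joint_loss(detection_sample)
segmentation_total = joint_loss(segmentation_sample)
```

Under this assumption, adding a new data source only requires declaring which annotation types it provides. That fits the scaling claim in item 2, though the actual mechanism may differ.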
