Description: Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
Keywords: Screen Point-and-Read (ScreenPR), Tree-of-Lens (ToL), screen agent
Our ToL agent describes the region on the screenshot indicated by a point from the user.
The generated descriptions include important layout information, which is critical: without it, one cannot distinguish the two identical "Tumbler pack" items shown.
Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (ScreenPR) task. This task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Languag