Advances in computer vision, multimodal learning, AR/XR technologies, and smart glasses are converging toward Virtual Intelligent Task Assistants (VITAs): systems that observe, interpret, and guide humans in complex real-world activities. This workshop discusses recent research on enabling long-term task understanding and assistance. Topics include learning from long streaming videos, egocentric and exocentric video understanding, vision-language models, multimodal reasoning, task/step/procedure prediction, planning and correction, human-AI collaboration and coaching, and new datasets and benchmarks. By fostering dialogue across disciplines, the workshop aims to define the core challenges and opportunities for building practical and generalizable VITAs.
June 3, 2026
1:00 PM
Room 108
Colorado Convention Center
Denver, CO
[in Denver local time · MDT / UTC−6]