A general benchmark for visual tracking and visual-language tracking intelligence evaluation

VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT is a high-quality and multi-modal benchmark based on VideoCube-Tiny to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content.


The Global Instance Tracking (GIT) task aims to model the fundamental human visual function of motion perception, without any assumptions about camera or motion consistency.

Key Features


VideoCube contains 500 video segments of real-world moving objects and over 7.4 million labeled bounding boxes. Each video contains at least 4,008 frames, and the average video length is around 14,920 frames.
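With millions of per-frame boxes, most experiments start by parsing a sequence's ground-truth file. The sketch below is hypothetical: it assumes the common tracking-benchmark convention of one `x,y,w,h` line per frame (the function name `load_groundtruth` and the exact file layout are assumptions, not part of the official VideoCube toolkit).

```python
def load_groundtruth(path):
    """Load per-frame bounding boxes from a plain-text annotation file.

    Assumes one box per line in "x,y,w,h" (or whitespace-separated)
    order -- a common convention in tracking benchmarks; the actual
    VideoCube file format may differ.
    """
    boxes = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            x, y, w, h = (float(v) for v in line.replace(",", " ").split())
            boxes.append((x, y, w, h))
    return boxes
```

A loader like this returns one `(x, y, w, h)` tuple per frame, so its length should match the sequence's frame count.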

Multiple Collection Dimensions

The collection of VideoCube is based on six dimensions that describe the spatio-temporal and causal relationships of film narrative, providing an extensive dataset for the novel GIT task.

Comprehensive Attribute Selection

VideoCube provides 12 attributes for each frame to reflect the challenging situations in real applications and to support a more detailed performance analysis.

Scientific Evaluation

VideoCube provides both classical and novel metrics to evaluate algorithms. In addition, this benchmark provides a human baseline to measure the intelligence level of existing methods.
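As an illustration of the classical metrics mentioned above, a minimal sketch of frame-wise intersection-over-union and the resulting success rate is shown below. This is a generic formulation of these standard tracking measures, not the official VideoCube evaluation code; the function names and the default threshold of 0.5 are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (zero if the boxes are disjoint).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose predicted/ground-truth overlap
    exceeds the threshold (one point on a classical success plot)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(o > threshold for o in overlaps) / len(overlaps)
```

Sweeping the threshold from 0 to 1 and averaging the resulting success rates yields the area-under-curve score commonly reported in tracking benchmarks.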

Multi-granularity Semantic Annotation

MGIT designs a hierarchical multi-granular semantic annotation strategy to provide scientific natural language information. Video content is annotated at three granularities (i.e., action, activity, and story).

Evaluation Mechanism for Multi-modal Tracking

MGIT expands the evaluation mechanism by conducting experiments under both traditional evaluation mechanisms (multi-modal single-granularity and single visual modality) and an evaluation mechanism adapted to MGIT (multi-modal multi-granularity).

Latest News



Global Instance Tracking: Locating Target More Like Humans.
S. Hu, X. Zhao*, L. Huang and K. Huang (*corresponding author)
IEEE Transactions on Pattern Analysis and Machine Intelligence
[PDF] [arXiv] [BibTex]

Please cite our IEEE TPAMI paper if VideoCube helps your research.


A Multi-modal Global Instance Tracking Benchmark (MGIT):
Better Locating Target in Complex Spatio-temporal and Causal Relationship.
S. Hu, D. Zhang, M. Wu, X. Feng, X. Li, X. Zhao and K. Huang
Conference on Neural Information Processing Systems
[PDF] [BibTex]

Please cite our NeurIPS paper if MGIT helps your research.


VideoCube Benchmark

MGIT Benchmark


  • Shiyu Hu, Center for Research on Intelligent System and Engineering (CRISE), CASIA.
  • Xin Zhao, Center for Research on Intelligent System and Engineering (CRISE), CASIA.
  • Lianghua Huang, Center for Research on Intelligent System and Engineering (CRISE), CASIA.
  • Kaiqi Huang, Center for Research on Intelligent System and Engineering (CRISE), CASIA.


  • Xuchen Li, Center for Research on Intelligent System and Engineering (CRISE), CASIA.


Please contact us if you have any problems or suggestions.

© 2022-2023 Copyright.