VideoCube: A general benchmark for visual tracking intelligence evaluation

We use an eye tracker to record and quantify human visual tracking ability. The intelligence level of trackers can then be measured by comparing human performance with the algorithms' tracking results.

Ten videos (A-J) that differ in difficulty, duration, instance types, space classes, and motion modes are played to the subjects at three speeds (15 FPS, 20 FPS, and 30 FPS). Fifteen subjects track the test videos at the three speeds.

The experiment consists of three steps:

The subject calibrates the eye tracker (Tobii Eye Tracker) to ensure that the instrument can accurately detect the subject's line of sight.
A TEST video appears at the center of the screen. The subject should focus on the target in the first frame, press the play button, and concentrate on maintaining tracking accuracy. The TEST video is intended to help subjects become familiar with the test process.
The subject begins the formal experiment with six different videos. To ensure the effectiveness of the experiment, the subject needs to take a break between videos.

Demo

Eye tracking experiment process

Comparison of human and SOTA algorithm

Experiment Result

The figure above presents the precision plots of humans and 20 trackers under the one-pass evaluation (OPE) mechanism. Turing_15, Turing_20, and Turing_30 represent human scores at 15 FPS, 20 FPS, and 30 FPS, respectively.
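A precision plot of this kind can be sketched as follows. This is a minimal illustration, not the VideoCube toolkit: the function names, the `(x, y, w, h)` box layout with a top-left origin, the 0 to 50 pixel threshold range, and the sample values are all assumptions made for the example. Each gaze point (or predicted box center) is compared with the ground-truth box center, and the curve reports the fraction of frames whose center error is within each pixel threshold.

```python
import math


def center_errors(points, gt_boxes):
    """Pixel distance between each tracked point and the ground-truth box center.

    Assumption: boxes are (x, y, w, h) with (x, y) the top-left corner.
    """
    errors = []
    for (px, py), (x, y, w, h) in zip(points, gt_boxes):
        cx, cy = x + w / 2.0, y + h / 2.0
        errors.append(math.hypot(px - cx, py - cy))
    return errors


def precision_curve(errors, max_threshold=50):
    """Fraction of frames with center error <= t, for t = 0..max_threshold pixels."""
    if not errors:
        raise ValueError("at least one frame is required")
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in range(max_threshold + 1)]


# Hypothetical gaze samples and ground-truth boxes for four frames.
gaze = [(105, 100), (130, 100), (100, 140), (100, 100)]
boxes = [(80, 80, 40, 40)] * 4  # every box is centered at (100, 100)

errors = center_errors(gaze, boxes)
curve = precision_curve(errors)
pre_at_20 = curve[20]  # value at the conventional 20-pixel threshold
```

In this toy example only two of the four frames fall within 20 pixels of the target center, so the score at the 20-pixel threshold is 0.5, while the curve reaches 1.0 once the threshold covers the largest error.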

We can draw the following conclusions through comparison:

The calculation method and ranking principle of the traditional precision (PRE) score have several problems. PRE measures the center distance between the predicted result and the ground truth in pixels, but ignores the impact of target size and video resolution (for a detailed analysis, please refer to the methods chapter). This makes the conventional 20-pixel ranking threshold unreasonable. In the precision plot, human performance is far lower than that of the algorithms, which is contrary to common sense. Because the deviation of the eye tracker may exceed 20 pixels in some situations, a 20-pixel threshold is too strict relative to the image resolution and target size of the videos in VideoCube.
The normalized precision plot shows that human visual tracking ability is worse than that of the tracking algorithms under strict precision requirements. Possible reasons are the deviation of the eye tracker and the way human attention is distributed (for a person target, subjects tend to focus on the head instead of the torso). When the accuracy requirement is moderately relaxed, human visual tracking ability quickly exceeds that of the algorithms and remains stable.
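The effect of target size on a fixed pixel threshold can be illustrated with a small sketch. The normalization used here (dividing the center error by the target box diagonal) is an assumption chosen for illustration only; it is not necessarily the normalized precision defined in the methods chapter, and the function name and sample sizes are hypothetical.

```python
import math


def normalized_error(pixel_error, box_w, box_h):
    """Center error divided by the target box diagonal (illustrative normalization)."""
    return pixel_error / math.hypot(box_w, box_h)


# The same 30-pixel center error on a small and on a large target.
small_target = normalized_error(30.0, 30, 40)    # box diagonal is 50 pixels
large_target = normalized_error(30.0, 300, 400)  # box diagonal is 500 pixels
```

Both cases fail a fixed 20-pixel threshold, yet the error is more than half the diagonal of the small target and only a small fraction of the large one. A size-aware measure separates these two cases, which a raw pixel threshold cannot do.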
