This paper presents TimeChat-Online, a novel online VideoLLM for efficient Streaming Video Understanding.
At its core is the innovative Differential Token Dropping (DTD) module that selectively preserves only significant temporal changes across continuous video streams.
In the teaser figure, the yellow-highlighted frames (those with few tokens dropped) mark significant video scene transitions.
The DTD module eliminates 82.8% of redundant video tokens while achieving a 1.76x speedup in response latency and maintaining over 98% of the original accuracy, revealing that over 80% of streaming video content is naturally redundant without any user-query guidance. Furthermore, DTD naturally monitors video scene transitions, facilitating online Proactive Responding.
When directly integrated with Qwen2.5-VL-7B without additional training, DTD achieves a 5.7-point accuracy improvement while reducing video tokens by 84.6% on the challenging VideoMME long subset, whose videos span 30-60 minutes. Furthermore, longer videos permit higher rates of redundant visual token dropping without performance degradation (up to 97.5% for the long subset).
The core of TimeChat-Online lies in its Differential Token Dropping (DTD) module, designed to efficiently eliminate visual redundancy in streaming videos by preserving only significant temporal changes.
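The project page does not include an implementation here, but the intuition behind DTD can be sketched compactly. Below is a minimal, illustrative pixel-level variant: each frame is split into patches, and a patch (token) is kept only if it differs sufficiently from the co-located patch of the previous frame. The patch size, the mean-absolute-difference metric, and the threshold `tau_pixel` are assumptions made for the sketch, not the exact formulation used by TimeChat-Online.

```python
import torch

def pixel_level_dtd(prev_frame: torch.Tensor,
                    cur_frame: torch.Tensor,
                    patch: int = 14,
                    tau_pixel: float = 0.1) -> torch.Tensor:
    """Return a boolean keep-mask over the patches of `cur_frame`.

    A patch is kept only if its mean absolute pixel difference from the
    co-located patch in `prev_frame` exceeds `tau_pixel`.  Frames are
    (C, H, W) tensors with values in [0, 1]; H and W must be divisible
    by `patch`.
    """
    diff = (cur_frame - prev_frame).abs()                      # (C, H, W)
    c, h, w = diff.shape
    # Average the difference inside each non-overlapping patch.
    diff = diff.reshape(c, h // patch, patch, w // patch, patch)
    patch_diff = diff.mean(dim=(0, 2, 4))                      # (H/p, W/p)
    return (patch_diff > tau_pixel).flatten()                  # keep-mask per token


# Toy usage: two nearly identical 224x224 frames -> most tokens are dropped.
prev = torch.rand(3, 224, 224)
cur = prev.clone()
cur[:, :28, :28] += 0.5                                        # change only one corner
keep = pixel_level_dtd(prev, cur)
print(f"drop ratio: {1 - keep.float().mean():.1%}")
```

Because the comparison is purely local to consecutive frame pairs, the cost grows linearly with the stream and no query information is needed, which matches the query-agnostic redundancy claim above.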
The drop ratio curve across the timeline naturally reveals video scene transitions, as frames with significant changes from previous frames have fewer dropped tokens, visualized as valleys (low drop ratio) in the curve. These transition points serve as natural "trigger times" for proactive responses, enabling the model to detect when meaningful new visual information becomes available without requiring additional perception modules.
This mechanism allows TimeChat-Online to achieve Proactive Response by autonomously identifying critical moments in streaming content and responding accordingly.
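As a rough illustration of how valleys in the drop-ratio curve could be turned into trigger times, the sketch below flags any frame whose drop ratio falls below a fixed threshold (mirroring the 0.85 value shown in the visualizations later on). The threshold value, the minimum gap between triggers, and the function name are illustrative assumptions rather than the exact triggering rule of TimeChat-Online.

```python
from typing import List

def find_trigger_times(drop_ratios: List[float],
                       timestamps: List[float],
                       valley_threshold: float = 0.85,
                       min_gap: float = 5.0) -> List[float]:
    """Return timestamps where the drop ratio dips below `valley_threshold`.

    A low drop ratio means many tokens were retained, i.e. the frame differs
    strongly from its predecessor -- a likely scene transition.  `min_gap`
    (seconds) suppresses triggers that fire too close to a previous one.
    """
    triggers: List[float] = []
    for t, r in zip(timestamps, drop_ratios):
        if r < valley_threshold and (not triggers or t - triggers[-1] >= min_gap):
            triggers.append(t)
    return triggers


# Toy drop-ratio curve sampled at 1 FPS: valleys at 3 s and 9 s.
ratios = [0.95, 0.93, 0.96, 0.40, 0.92, 0.94, 0.95, 0.91, 0.96, 0.55]
times = [float(t) for t in range(len(ratios))]
print(find_trigger_times(ratios, times))   # -> [3.0, 9.0]
```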
DTD adaptively reduces video tokens from a holistic video perspective, making it well-suited to both high-speed and slow-motion videos.
DTD maintains the fine-grained spatial-temporal positions of retained tokens, ensuring precise spatial localization and temporal understanding capabilities.
DTD efficiently processes video streams by computing redundancy only for newly arriving frames, without re-processing historical video content.
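The three properties above can be combined into a single streaming wrapper. The sketch below caches only the previous frame, so each newly arriving frame is compared once and historical content is never re-processed, and it returns the (t, h, w) indices of the retained patches so their spatio-temporal positions stay intact. It reuses the `pixel_level_dtd` helper from the earlier sketch and is again an illustrative simplification, not the released implementation.

```python
import torch

class StreamingDTD:
    """Incrementally drop redundant tokens as frames arrive, one at a time."""

    def __init__(self, patch: int = 14, tau_pixel: float = 0.1):
        self.patch = patch
        self.tau_pixel = tau_pixel
        self.prev_frame = None   # only the last frame is cached
        self.t = 0               # frame index in the stream

    def step(self, frame: torch.Tensor) -> torch.Tensor:
        """Process one new (C, H, W) frame; return kept (t, h, w) positions."""
        _, h, w = frame.shape
        grid_h, grid_w = h // self.patch, w // self.patch
        if self.prev_frame is None:
            # The very first frame has nothing to diff against: keep everything.
            keep = torch.ones(grid_h * grid_w, dtype=torch.bool)
        else:
            keep = pixel_level_dtd(self.prev_frame, frame, self.patch, self.tau_pixel)
        self.prev_frame = frame
        # Recover the 3-D position of every retained token.
        ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
        positions = torch.stack([torch.full_like(ys, self.t), ys, xs], dim=-1).reshape(-1, 3)
        self.t += 1
        return positions[keep]            # shape (num_kept, 3): (t, h, w)
```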
To enable more flexible real-time interactions, we present TimeChat-Online-139K, a comprehensive streaming video dataset that encompasses backward-tracing, current-perception, and future-responding tasks across diverse online video scenarios.
Our dataset creation involved four key steps: (1) Collecting visually informative videos with diverse scene changes, (2) Generating scene-oriented dense captions using GPT-4o, (3) Creating streaming VideoQA samples based on these captions, and (4) Constructing negative samples for future-response training.
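To make the resulting data format concrete, here is a hypothetical example of what a single future-responding sample might look like. Every field name and value below is an illustrative assumption, not the actual schema of TimeChat-Online-139K.

```python
# Hypothetical streaming VideoQA sample (illustrative only; the released
# TimeChat-Online-139K schema may differ in field names and structure).
sample = {
    "video": "cooking_demo_001.mp4",
    "task": "future-responding",         # or "backward-tracing" / "current-perception"
    "query_time": 95.0,                   # second at which the user asks the question
    "question": "Let me know when the chef starts plating the dish.",
    "trigger_time": 142.5,                # second at which the answer becomes available
    "answer": "The chef has started plating the dish.",
    # Negative sample: before trigger_time the correct behaviour is to stay silent.
    "negative_response": "Not answerable yet.",
}
```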
We conduct comprehensive experiments on both streaming video benchmarks (StreamingBench and OVO-Bench) and offline long-form video benchmarks (MLVU, LongVideoBench, VideoMME) to validate the effectiveness of TimeChat-Online.
On StreamingBench, TimeChat-Online achieves 56.56% accuracy with 82.6% token reduction, demonstrating state-of-the-art performance among online and offline VideoLLMs. This token reduction of over 80% while maintaining high accuracy confirms that streaming videos contain substantial natural redundancy that can be effectively filtered.
Table 1: Performance comparison on the StreamingBench full set, including three categories: Real-Time Visual Understanding, Omni-Source Understanding, and Contextual Understanding.
TimeChat-Online significantly outperforms existing online VideoLLMs across real-time perception, backward tracing, and forward responding tasks on OVO-Bench.
Table 2: Evaluation results on OVO-Bench comprising three categories: i) Real-Time Visual Perception (OCR: Optical Character Recognition, ACR: Action Recognition, ATR: Attribute Recognition, STU: Spatial Understanding, FPD: Future Prediction, OJR: Object Recognition), ii) Backward Tracing (EPM: Episodic Memory, ASI: Action Sequence Identification, HLD: Hallucination Detection), and iii) Forward Active Responding (REC: Repetition Event Count, SSR: Sequential Steps Recognition, CRR: Clues Reveal Responding).
Compared with existing online VideoLLMs, TimeChat-Online achieves superior performance on all long video benchmarks. It achieves up to 85.0% reduction in video tokens while maintaining or even improving performance across long-form video benchmarks. This demonstrates the effectiveness of our DTD approach for both streaming and offline video tasks.
When integrated with Qwen2.5-VL-7B without training, our DTD module improves VideoMME (long subset) accuracy by 5.7 points while reducing video tokens by 84.6%. Notably, higher token drop ratios consistently enhance performance, indicating that substantial visual redundancy in long videos can be eliminated to improve both efficiency and understanding.
Table 3: Results on offline long video benchmarks. We report accuracy on MLVU, LongVideoBench, and VideoMME (w/o subtitles). † indicates reproduced results.
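The drop ratios reported above (e.g., 84.6% on the long subset) are a by-product of a fixed threshold rather than a pre-set target. If one instead wanted to hit a specific token budget when plugging DTD into an off-the-shelf model, a simple option is to binary-search the threshold on the video at hand. This calibration loop is our own illustrative addition (reusing the hypothetical `pixel_level_dtd` helper from the earlier sketch) and is not something the paper describes.

```python
def calibrate_threshold(frames, target_drop: float = 0.846,
                        lo: float = 0.0, hi: float = 1.0, iters: int = 20) -> float:
    """Binary-search a pixel-difference threshold yielding ~`target_drop`.

    `frames` is a list of at least two (C, H, W) torch tensors; the achieved
    drop ratio is measured with the `pixel_level_dtd` helper defined earlier.
    """
    def drop_ratio(tau: float) -> float:
        kept = total = 0
        for prev, cur in zip(frames[:-1], frames[1:]):
            keep = pixel_level_dtd(prev, cur, tau_pixel=tau)
            kept += int(keep.sum())
            total += keep.numel()
        return 1.0 - kept / total

    for _ in range(iters):
        mid = (lo + hi) / 2
        # A larger threshold drops more tokens, so move toward the target.
        if drop_ratio(mid) < target_drop:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```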
We present visualized examples demonstrating how TimeChat-Online processes streaming video content in real-time, highlighting the effectiveness of our Differential Token Dropping module and Proactive Response capability.
TimeChat-Online autonomously identifies significant scene transitions in streaming videos and generates proactive responses without requiring explicit user queries. The model precisely detects when meaningful new visual information becomes available, as illustrated by the drop ratio timeline, where valleys (aligned frames with yellow lightbulb icons) indicate substantial visual changes that trigger intelligent autonomous responses.
Comparison between feature-level (left) and pixel-level (right) token dropping. The feature-level approach with τ_feat = 0.4 achieves a 58.3% drop ratio while effectively preserving visually important elements.
DTD dynamically adapts to different video content types. This example demonstrates that with the same threshold τ_feat = 0.4, DTD achieves an impressive 89.5% drop ratio (compared to 58.3% in Case 2) for highly redundant drawing scenarios, highlighting its efficiency in content-adaptive token preservation.
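For completeness, here is a feature-level counterpart of the earlier pixel-level sketch: tokens are compared in the vision encoder's feature space rather than pixel space, and a token is dropped when its distance to the co-located token in the previous frame stays below τ_feat. The use of cosine distance is an assumption; the visualizations above only specify the threshold value (τ_feat = 0.4), not the exact metric.

```python
import torch
import torch.nn.functional as F

def feature_level_dtd(prev_tokens: torch.Tensor,
                      cur_tokens: torch.Tensor,
                      tau_feat: float = 0.4) -> torch.Tensor:
    """Keep-mask over `cur_tokens` based on feature-space change.

    `prev_tokens` and `cur_tokens` are (num_patches, dim) patch features of
    two consecutive frames with the same spatial layout.  A token is kept
    only if its cosine distance to the co-located previous token exceeds
    `tau_feat`.
    """
    cos_sim = F.cosine_similarity(prev_tokens, cur_tokens, dim=-1)   # (num_patches,)
    return (1.0 - cos_sim) > tau_feat


# Toy usage with random 256-token frames from a 1024-d encoder.
prev = torch.randn(256, 1024)
cur = prev.clone()
cur[:16] = torch.randn(16, 1024)          # only 16 patches actually change
keep = feature_level_dtd(prev, cur)
print(f"drop ratio: {1 - keep.float().mean():.1%}")
```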
Drop ratio timeline acts as a natural scene transition detector. Significant visual changes create valleys in the curve, identifying key video moments. Left: Scene transition with trigger time. Right: Drop ratio curve showing valleys at scene changes, with a 0.85 threshold below which frames retain noticeably more tokens.
@misc{timechatonline,
title={TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos},
author={Linli Yao and Yicheng Li and Yuancheng Wei and Lei Li and Shuhuai Ren and Yuanxin Liu and Kun Ouyang and Lean Wang and Shicheng Li and Sida Li and Lingpeng Kong and Qi Liu and Yuanxing Zhang and Xu Sun},
year={2025},
eprint={2504.17343},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.17343},
}
Usage and License Notices: The data, code and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of the respective datasets and models used in this work.
Related Projects: TimeChat, Qwen2.5VL, RLT, VideoLLM-online, OVOBench, StreamingBench