Video Understanding
A video can be viewed as a 4D tensor of shape $T \times 3 \times H \times W$: $T$ frames, each with 3 color channels, height $H$, and width $W$.
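Raw video takes a lot of memory: at 30 frames/s, each frame has $640 \times 480$ pixels, each pixel has three channels, and each channel takes 1 byte.
So uncompressed video needs $3 \times 640 \times 480 \times 30 = 27{,}648{,}000$ bytes per second, which is about 26 MiB/s (27.6 MB/s).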
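The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the constant names are made up for illustration, and the per-minute figure is simply the per-second rate multiplied by 60.

```python
# Raw (uncompressed) video data rate, using the numbers from the text.
CHANNELS = 3
WIDTH, HEIGHT = 640, 480
FPS = 30
BYTES_PER_CHANNEL = 1

bytes_per_frame = CHANNELS * WIDTH * HEIGHT * BYTES_PER_CHANNEL
bytes_per_second = bytes_per_frame * FPS

print(bytes_per_frame)              # 921600
print(bytes_per_second)             # 27648000
print(bytes_per_second / 2**20)     # about 26.4 (MiB per second)
print(bytes_per_second * 60 / 1e9)  # about 1.66 (GB per minute)
```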
Idea1: Train on short clips at low resolution and low frame rate, i.e. downsample the video both spatially and temporally.
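A minimal sketch of what this downsampling might look like, in plain Python on nested lists. The function names and the specific settings (5 fps, 16 frames, $112 \times 112$) are assumptions chosen for illustration, not values from this post, and the spatial step is naive subsampling without any anti-aliasing filter.

```python
def sample_clip_indices(start, clip_len, src_fps, dst_fps):
    """Source-frame indices for one clip sampled at a lower frame rate."""
    if src_fps % dst_fps != 0:
        raise ValueError("this sketch assumes src_fps is a multiple of dst_fps")
    step = src_fps // dst_fps  # keep every step-th frame
    return [start + i * step for i in range(clip_len)]


def downsample_frame(frame, stride):
    """Keep every stride-th row and column of an H x W frame (no filtering)."""
    return [row[::stride] for row in frame[::stride]]


def clip_bytes(clip_len, height, width, channels=3, bytes_per_value=1):
    """Raw size in bytes of a clip_len x channels x height x width clip."""
    return clip_len * channels * height * width * bytes_per_value


# Hypothetical settings: 30 fps source, 5 fps clips of 16 frames at 112x112.
indices = sample_clip_indices(start=0, clip_len=16, src_fps=30, dst_fps=5)
print(indices[:4], indices[-1])   # [0, 6, 12, 18] 90
print(clip_bytes(16, 112, 112))   # 602112

# Spatial subsampling of a tiny 4x4 single-channel frame with stride 2.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
print(downsample_frame(frame, 2))  # [[0, 2], [8, 10]]
```

With these assumed settings, a 3.2-second clip takes about 0.6 MB instead of the roughly 88 MB the same time span would need at the full rate computed earlier.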
Idea2: Train a normal 2D CNN to classify video frames independently (often a very strong baseline for video classification).
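One common way to turn per-frame predictions into a single video-level prediction is to average the class probabilities over frames. The sketch below assumes that averaging step; this post does not say which aggregation is used. `toy_frame_classifier` is a hypothetical stand-in for a trained 2D CNN.

```python
def classify_video(frames, classify_frame):
    """Average per-frame class probabilities into one video-level prediction."""
    per_frame = [classify_frame(f) for f in frames]
    num_classes = len(per_frame[0])
    avg = [sum(p[c] for p in per_frame) / len(per_frame)
           for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: avg[c])
    return best, avg


def toy_frame_classifier(frame):
    # Stand-in for a 2D CNN: each toy "frame" already holds class probabilities.
    return frame


frames = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
label, probs = classify_video(frames, toy_frame_classifier)
print(label)  # 0
```

Note that the second frame on its own would be classified as class 1; averaging lets the other frames outvote it.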
Idea3: Late Fusion (run a 2D CNN on each frame, then combine the per-frame features at the end of the network)
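As a rough sketch of late fusion, assume each frame has already been turned into a feature vector by a 2D CNN. The per-frame features are concatenated and passed through one linear layer. The function name and all numbers below are made up for illustration; real models typically use an MLP or average pooling over much larger features.

```python
def late_fusion_scores(frame_features, weights, bias):
    """Concatenate per-frame features, then apply one linear layer."""
    fused = [x for feat in frame_features for x in feat]  # length T * D
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]


# T = 2 frames, D = 2 features per frame, 2 classes; all values are made up.
frame_features = [[1.0, 0.0], [0.0, 2.0]]
weights = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.5]
print(late_fusion_scores(frame_features, weights, bias))  # [3.0, 0.5]
```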
Idea4: Early Fusion (stack the frames along the channel dimension and fuse them in the first 2D convolution)
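The reshaping step behind early fusion can be sketched in plain Python: a $T \times 3 \times H \times W$ video becomes a $3T \times H \times W$ input that an ordinary 2D CNN can consume. The helper names are hypothetical, and nested lists stand in for real tensors.

```python
def early_fusion_input(video):
    """Reshape a T x 3 x H x W video (nested lists) into 3T x H x W."""
    return [channel for frame in video for channel in frame]


def shape(x):
    """Shape of a regular nested list, read off the first element per level."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


T, C, H, W = 4, 3, 2, 2
video = [[[[0] * W for _ in range(H)] for _ in range(C)] for _ in range(T)]
stacked = early_fusion_input(video)
print(shape(video))    # (4, 3, 2, 2)
print(shape(stacked))  # (12, 2, 2)
```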
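Idea5: 3D CNN Feature Extraction (use 3D convolutions that slide over time as well as space, so temporal information is fused gradually through the network)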
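To make the 3D convolution concrete, here is a naive single-channel version in plain Python with stride 1 and no padding. It is only an illustration of how the kernel slides over time as well as space; the function name is hypothetical, and real 3D CNNs use multi-channel, optimized library kernels.

```python
def conv3d_single_channel(volume, kernel):
    """Naive 3D cross-correlation on a T x H x W volume (stride 1, no padding)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # slide over time
        plane = []
        for y in range(H - kh + 1):      # slide over height
            row = []
            for x in range(W - kw + 1):  # slide over width
                acc = 0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += (volume[t + dt][y + dy][x + dx]
                                    * kernel[dt][dy][dx])
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# A 4x4x4 volume of ones convolved with a 2x2x2 kernel of ones.
volume = [[[1] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1] * 2 for _ in range(2)] for _ in range(2)]
out = conv3d_single_channel(volume, kernel)
print(len(out), len(out[0]), len(out[0][0]))  # 3 3 3
print(out[0][0][0])                           # 8
```

The output keeps a (shorter) time dimension, which is what allows later layers to keep fusing temporal information instead of collapsing it in a single step.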
This post is licensed under CC BY 4.0 by the author.



