
Video Understanding

A video can be viewed as a 4D tensor: $T\times 3 \times H \times W$

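Raw video takes a lot of memory: at 30 frames/s, each frame has 640*480 pixels, each pixel has three channels, and each channel takes 1 byte.

So one second of video is $3\times 640 \times 480 \times 30 = 27{,}648{,}000$ bytes, roughly 26 MB/s (taking 1 MB as $2^{20}$ bytes).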
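The arithmetic can be checked in a few lines of Python. The variable names are only illustrative:

```python
# Raw (uncompressed) size of one second of video, using the figures above.
fps = 30                   # frames per second
width, height = 640, 480   # pixels per frame
channels = 3               # one byte per channel

bytes_per_second = channels * width * height * fps

print(bytes_per_second)            # 27648000
print(bytes_per_second / 2**20)    # 26.3671875 (MiB per second)
```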

Idea1: Train on short video clips with low resolution and low frame rate, i.e. downsample the video.
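A minimal NumPy sketch of this kind of downsampling. The function name `downsample_clip`, the strides, and the clip length are assumptions chosen for illustration, not values from the original notes:

```python
import numpy as np

def downsample_clip(video, frame_stride=6, spatial_stride=4, clip_len=8):
    """video: uint8 array of shape (T, 3, H, W). Returns a short, low-res clip."""
    # Temporal subsampling: keep every `frame_stride`-th frame (30 fps -> 5 fps).
    video = video[::frame_stride]
    # Spatial subsampling by plain striding (no anti-aliasing filter, for brevity).
    video = video[:, :, ::spatial_stride, ::spatial_stride]
    # Keep only the first `clip_len` frames as one training clip.
    return video[:clip_len]

video = np.zeros((60, 3, 480, 640), dtype=np.uint8)  # 2 s of 640x480 video at 30 fps
clip = downsample_clip(video)
print(clip.shape)    # (8, 3, 120, 160)
print(clip.nbytes)   # 460800 bytes, versus 55296000 for the raw 2 s
```

A real pipeline would usually resize with proper interpolation and sample clips at random positions. The striding here only shows how much the tensor shrinks.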

Idea2: Train a normal 2D CNN to classify video frames independently (often a very strong baseline for video classification).
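A rough sketch of the single-frame baseline. `frame_classifier` is a hypothetical stand-in for a trained 2D CNN (here just global average pooling plus a random linear layer). Averaging the per-frame probabilities at test time is one common way to get a video-level prediction, and is assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 5

# Hypothetical stand-in weights for a trained 2D CNN.
W = rng.normal(size=(3, NUM_CLASSES))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frame_classifier(frames):
    """frames: (T, 3, H, W) -> (T, NUM_CLASSES) class probabilities."""
    pooled = frames.mean(axis=(2, 3))        # (T, 3): average over H and W
    return softmax(pooled @ W)               # (T, NUM_CLASSES)

def classify_video(frames):
    per_frame = frame_classifier(frames)     # every frame is scored independently
    return per_frame.mean(axis=0)            # average the per-frame predictions

video = rng.random((8, 3, 32, 32))
probs = classify_video(video)
print(probs.shape)   # (5,)
```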

Idea3: Late Fusion

figure1

figure1
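In late fusion, each frame goes through a 2D CNN independently and the per-frame features are only combined at the end. The sketch below shows two common variants: concatenating the features before a classifier, and averaging them over time. `frame_features` and all weights are hypothetical stand-ins for trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, NUM_CLASSES = 8, 16, 5

# Hypothetical stand-in for a 2D CNN backbone: one feature vector per frame.
W_feat = rng.normal(size=(3, D))

def frame_features(frames):
    """frames: (T, 3, H, W) -> (T, D), each frame processed independently."""
    pooled = frames.mean(axis=(2, 3))            # (T, 3)
    return np.maximum(pooled @ W_feat, 0.0)      # (T, D) after ReLU

def late_fusion_concat(frames, W_cls):
    """Flatten all per-frame features into one vector, then classify."""
    feats = frame_features(frames).reshape(-1)   # (T * D,)
    return feats @ W_cls                         # (NUM_CLASSES,)

def late_fusion_pool(frames, W_cls):
    """Average per-frame features over time, then classify."""
    feats = frame_features(frames).mean(axis=0)  # (D,)
    return feats @ W_cls                         # (NUM_CLASSES,)

video = rng.random((T, 3, 32, 32))
scores_concat = late_fusion_concat(video, rng.normal(size=(T * D, NUM_CLASSES)))
scores_pool = late_fusion_pool(video, rng.normal(size=(D, NUM_CLASSES)))
print(scores_concat.shape, scores_pool.shape)    # (5,) (5,)
```

The pooled variant needs a much smaller classifier ($D$ inputs instead of $T \times D$). In both variants the frames only interact at the very end of the network.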

Idea4: Early Fusion

figure1
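Early fusion goes the other way: the frames are stacked along the channel axis, so $T\times 3 \times H \times W$ becomes $3T \times H \times W$, and the first 2D convolution already mixes all frames. A naive NumPy sketch, with `conv2d` written out by hand and random weights as placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid 2D cross-correlation. x: (C_in, H, W), w: (C_out, C_in, kH, kW)."""
    c_out, c_in, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]     # (C_in, kH, kW)
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

T, H, W = 4, 16, 16
video = rng.random((T, 3, H, W))

# Early fusion: fold time into the channel axis, (T, 3, H, W) -> (3T, H, W).
stacked = video.reshape(T * 3, H, W)

# The first conv layer takes 3T input channels; time is gone after this layer.
w_first = rng.normal(size=(8, T * 3, 3, 3))
features = conv2d(stacked, w_first)
print(features.shape)   # (8, 14, 14)
```

After this first layer, the rest of the network is an ordinary 2D CNN.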

Idea5: 3D CNN Feature Extraction

figure1
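A 3D CNN slides its filters over time as well as space, so temporal information is fused gradually across layers instead of all at once. The naive `conv3d` below is only an illustrative sketch with random weights. It assumes the video is first transposed to channels-first layout $3 \times T \times H \times W$:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3d(x, w):
    """Valid 3D cross-correlation. x: (C_in, T, H, W), w: (C_out, C_in, kT, kH, kW)."""
    c_out, c_in, kt, kh, kw = w.shape
    _, t, h, wd = x.shape
    out = np.zeros((c_out, t - kt + 1, h - kh + 1, wd - kw + 1))
    for a in range(out.shape[1]):
        for i in range(out.shape[2]):
            for j in range(out.shape[3]):
                patch = x[:, a:a + kt, i:i + kh, j:j + kw]   # (C_in, kT, kH, kW)
                out[:, a, i, j] = np.tensordot(
                    w, patch, axes=([1, 2, 3, 4], [0, 1, 2, 3])
                )
    return out

video = rng.random((8, 3, 12, 12))        # (T, 3, H, W)
x = video.transpose(1, 0, 2, 3)           # (3, T, H, W)
w = rng.normal(size=(4, 3, 3, 3, 3))      # 4 filters, each 3 x 3 x 3 x 3
features = conv3d(x, w)
print(features.shape)   # (4, 6, 10, 10)
```

The output still has a time axis (8 frames shrink to 6), so later 3D layers can keep combining information across frames.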

This post is licensed under CC BY 4.0 by the author.