BMVC ’21 statistics: 40 (3.33%) oral / 437 (36.23%) accept / 1206 submissions
Humans typically perceive that an action is taking place in a video through the interaction between an actor and the surrounding environment. Despite great progress in temporal action proposal generation, most existing works ignore this fact and leave the model to learn to propose actions as a black box. In this paper, we attempt to simulate this human ability by proposing the Actor Environment Interaction (AEI) network, which learns a video visual representation for temporal action proposal generation. AEI contains two modules: a perception-based visual representation (PVR) and a boundary matching module (BMM). PVR represents each video snippet by taking human-human relations and human-environment relations into account through the proposed adaptive attention mechanism. BMM then takes the resulting video representation and generates action proposals. AEI is comprehensively evaluated on the ActivityNet-1.3 and THUMOS-14 datasets, on both temporal action proposal and detection tasks, with two boundary matching architectures (CNN-based and GCN-based) and two classifiers (Unet and P-GCN). AEI shows significant improvement when the spatio-temporal visual representation is extracted in a way that follows human logical thinking. It robustly outperforms SOTA methods, with remarkable performance and generalization on both temporal action proposal generation and temporal action detection.
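To make the PVR idea more concrete, the following is a minimal sketch of how one snippet representation could combine actor and environment features. It is not the paper's adaptive attention mechanism, which the abstract does not specify: plain scaled dot-product attention is swapped in as a stand-in. The function names (`attend`, `snippet_representation`) and the toy feature vectors are hypothetical, and in practice the features would come from a video backbone and a human detector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention of one query over a list of vectors.

    Returns the attention-weighted sum of `keys` and the weights.
    This is a generic stand-in, not the paper's adaptive attention.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    pooled = [
        sum(w * k[i] for w, k in zip(weights, keys))
        for i in range(len(query))
    ]
    return pooled, weights


def snippet_representation(env_feat, actor_feats):
    """Hypothetical fusion of environment and actor features for one snippet."""
    if not actor_feats:
        # Assumption: with no detected actors, use the environment alone.
        return list(env_feat)
    # Human-human relations: each actor attends over all actors.
    related = [attend(a, actor_feats)[0] for a in actor_feats]
    # Human-environment relations: the environment attends over the actors.
    actor_summary, _ = attend(env_feat, related)
    # Final fusion: weigh the environment against the actor summary.
    fused, _ = attend(env_feat, [env_feat, actor_summary])
    return fused


if __name__ == "__main__":
    env = [1.0, 0.0, 0.5]                          # toy environment feature
    actors = [[0.2, 0.9, 0.1], [0.8, 0.1, 0.4]]    # toy per-actor features
    print(snippet_representation(env, actors))
```

Under these assumptions, the per-snippet vectors produced this way would be stacked along time and handed to a boundary matching module, which scores candidate start and end positions to form proposals.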