No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

Tanmay Gupta, Alexander Schwing, Derek Hoiem
International Conference on Computer Vision (ICCV), 2019

Outputs of pretrained object and human-pose detectors provide strong cues for predicting interactions. Top: human and object boxes and the object label predicted by Faster-RCNN, and human pose predicted by OpenPose. We encode appearance and layout using these predictions (and Faster-RCNN features) and use a factored model to detect human-object interactions. Bottom: boxes and pose overlaid on the input image.

Abstract

We show that for human-object interaction detection, a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, coarse layout (box-pair configuration), and, optionally, fine-grained layout (human pose).
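As a rough illustration of what such a factored model can look like, the sketch below combines detector confidences with per-factor interaction scores. It assumes that each factor (human appearance, object appearance, box-pair layout, and optionally pose) produces a logit for a given interaction, that the logits are summed and passed through a sigmoid, and that the result is multiplied by the human and object detection scores. The function name, argument names, and the exact combination rule are assumptions made for illustration; they are not specified in the abstract.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def hoi_score(human_det_score: float,
              object_det_score: float,
              factor_logits: dict) -> float:
    """Hypothetical factored score for one (human, object, interaction) triple.

    factor_logits maps a factor name to that factor's logit. The pose
    factor can simply be left out, mirroring the "optional" fine-grained
    layout term.
    """
    # Assumed combination rule: sum the factor logits, then squash.
    interaction_prob = sigmoid(sum(factor_logits.values()))
    # Detection scores act as multiplicative factors.
    return human_det_score * object_det_score * interaction_prob


with_pose = hoi_score(
    0.9, 0.8,
    {"human_appearance": 1.2, "object_appearance": 0.4,
     "box_layout": 0.7, "pose_layout": -0.3},
)
without_pose = hoi_score(
    0.9, 0.8,
    {"human_appearance": 1.2, "object_appearance": 0.4, "box_layout": 0.7},
)
```

Because the detection scores multiply the interaction probability, a low-confidence human or object box suppresses the final score regardless of how strongly the appearance and layout factors fire.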

We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset.
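To make techniques (2) and (3) concrete, here is a minimal sketch of mini-batch construction that drops easy negatives and then samples the remaining negatives at a configurable ratio to positives. The candidate format, the `is_easy_negative` flag (assumed to be precomputed by some rule not described here), and the function name are all hypothetical, and the ratio in the example is a toy value chosen only to keep the example small.

```python
import random


def build_minibatch(candidates, neg_to_pos_ratio, seed=0):
    """Hypothetical mini-batch builder for human-object candidate pairs.

    candidates: list of dicts with keys
        'label'            -> 1 for a positive pair, 0 for a negative pair
        'is_easy_negative' -> True if the negative should be rejected
    neg_to_pos_ratio: maximum number of negatives kept per positive.
    """
    rng = random.Random(seed)
    positives = [c for c in candidates if c["label"] == 1]
    # Reject easy negatives before sampling.
    hard_negatives = [
        c for c in candidates
        if c["label"] == 0 and not c["is_easy_negative"]
    ]
    # Keep at most neg_to_pos_ratio negatives per positive.
    num_negatives = min(len(hard_negatives),
                        neg_to_pos_ratio * len(positives))
    return positives + rng.sample(hard_negatives, num_negatives)


candidates = (
    [{"label": 1, "is_easy_negative": False} for _ in range(2)]
    + [{"label": 0, "is_easy_negative": False} for _ in range(10)]
    + [{"label": 0, "is_easy_negative": True} for _ in range(5)]
)
batch = build_minibatch(candidates, neg_to_pos_ratio=3)
```

With 2 positives and a ratio of 3, the batch holds 2 positives and 6 sampled hard negatives; the 5 easy negatives never enter the batch.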

Why do we call it a no-frills model?

We make several simplifications over existing approaches while achieving better performance owing to our choice of factorization, direct encoding and scoring of layout, and improved training techniques.

Key Simplifications
Our model encodes appearance using only features extracted by an off-the-shelf object detector (Faster-RCNN pretrained on MS-COCO)
We use only simple hand-coded layout encodings constructed from detected bounding boxes and human pose keypoints (pretrained OpenPose)
We use a fairly modest network architecture with lightweight multi-layer perceptrons (2-3 fully-connected layers) operating on the appearance and layout features mentioned above
No ~~Mixture-Density Network~~ [1] or CNN for encoding ~~Interaction Patterns~~ [2]
No ~~multi-task learning~~ [1]
No ~~fine-tuning~~ object/pose detector [1]
No ~~attention mechanisms~~ for modeling context [3]
No ~~message-passing~~ over graphs [4]
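A hand-coded box-pair layout encoding of the kind mentioned in the list above might look like the following sketch. The specific features (normalized box coordinates, center offset, log area ratio, and intersection-over-union) are assumptions chosen for illustration and are not necessarily the features used in the paper; the function name and argument names are hypothetical as well.

```python
import math


def box_pair_layout(human_box, object_box, image_width, image_height):
    """Hypothetical layout feature vector for a human-object box pair.

    Boxes are (x1, y1, x2, y2) in pixels and are assumed to be
    non-degenerate (positive width and height).
    """
    def normalized(box):
        x1, y1, x2, y2 = box
        return [x1 / image_width, y1 / image_height,
                x2 / image_width, y2 / image_height]

    def area(box):
        return (box[2] - box[0]) * (box[3] - box[1])

    def center(box):
        return (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0

    # Intersection-over-union of the two boxes.
    inter_w = max(0.0, min(human_box[2], object_box[2])
                  - max(human_box[0], object_box[0]))
    inter_h = max(0.0, min(human_box[3], object_box[3])
                  - max(human_box[1], object_box[1]))
    intersection = inter_w * inter_h
    union = area(human_box) + area(object_box) - intersection
    iou = intersection / union

    # Object center offset, measured in units of human box size.
    hcx, hcy = center(human_box)
    ocx, ocy = center(object_box)
    human_w = human_box[2] - human_box[0]
    human_h = human_box[3] - human_box[1]
    offset = [(ocx - hcx) / human_w, (ocy - hcy) / human_h]

    log_area_ratio = math.log(area(object_box) / area(human_box))

    return (normalized(human_box) + normalized(object_box)
            + offset + [log_area_ratio, iou])


features = box_pair_layout(
    (0, 0, 100, 200), (50, 100, 150, 200),
    image_width=200, image_height=200,
)
```

A vector like this could then be fed to one of the small multi-layer perceptrons described above to produce the coarse layout factor.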

References
[1] Detecting and Recognizing Human-Object Interactions. CVPR 2018.
[2] Learning to Detect Human-Object Interactions. WACV 2018.
[3] iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection. BMVC 2018.
[4] Learning Human-Object Interactions by Graph Parsing Neural Networks. ECCV 2018.

Qualitative Results

Qualitative results showing top-ranking true and false positives for different HOI categories, along with their predicted probabilities. The blue and red boxes correspond to humans and objects, respectively, detected by the pretrained Faster-RCNN detector. The pose skeleton consists of 18 keypoints predicted by the pretrained OpenPose detector and assigned to the human box.

Acknowledgment

This work was partly supported by the following grants and funding agencies. Many thanks!

NSF 1718221
ONR MURI N00014-16-1-2007
Samsung
3M
