No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

Abstract
We show that for human-object interaction detection, a relatively simple factorized model, with appearance and layout encodings constructed from pre-trained object detectors, outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, and coarse layout (box-pair configuration) and, optionally, fine-grained layout (human pose).
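To make the factorization concrete, below is a minimal PyTorch-style sketch of one way such per-factor scores could be combined. The module names, feature dimensions, and the particular combination rule (summing factor logits, applying a sigmoid, and gating by detector confidences) are illustrative assumptions for this sketch, not the exact implementation.

```python
import torch
import torch.nn as nn

class FactorMLP(nn.Module):
    """Small MLP mapping a feature vector to per-interaction logits."""
    def __init__(self, in_dim, num_interactions, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_interactions),
        )

    def forward(self, x):
        return self.net(x)

class FactoredHOIScorer(nn.Module):
    """Combines appearance and layout factors with detector confidences."""
    def __init__(self, app_dim, layout_dim, pose_dim, num_interactions):
        super().__init__()
        self.human_app = FactorMLP(app_dim, num_interactions)
        self.object_app = FactorMLP(app_dim, num_interactions)
        self.box_layout = FactorMLP(layout_dim, num_interactions)
        self.pose_layout = FactorMLP(pose_dim, num_interactions)  # optional fine-grained factor

    def forward(self, human_feat, object_feat, layout_feat, pose_feat,
                human_det_score, object_det_score):
        # Sum the factor logits, squash to an interaction probability,
        # and gate by the pre-trained detector confidences.
        logits = (self.human_app(human_feat)
                  + self.object_app(object_feat)
                  + self.box_layout(layout_feat)
                  + self.pose_layout(pose_feat))
        interaction_prob = torch.sigmoid(logits)
        return human_det_score * object_det_score * interaction_prob
```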
We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a ratio of negatives to positives that is two orders of magnitude larger than in existing approaches. We conduct a thorough ablation study to understand the importance of the different factors and training techniques using the challenging HICO-Det dataset.
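As an illustration of points (2) and (3), the sketch below assembles a training mini-batch by discarding negatives that can be ruled out from the detected object category alone and then keeping a very large number of the remaining negatives per positive. The dictionary keys, the easy-negative criterion, and the default ratio are assumptions for this sketch rather than the exact procedure.

```python
import random

def sample_minibatch(candidate_pairs, positives, neg_to_pos_ratio=1000):
    """Build a mini-batch of box pairs with a large negative:positive ratio.

    candidate_pairs: list of dicts with keys 'is_positive', 'object_class',
    and 'interaction_object_class' (hypothetical field names).
    """
    # Reject easy negatives: pairs whose detected object class cannot
    # participate in the interaction being scored.
    hard_negatives = [
        pair for pair in candidate_pairs
        if not pair["is_positive"]
        and pair["object_class"] == pair["interaction_object_class"]
    ]
    # Keep far more negatives per positive than is typical.
    num_neg = min(len(hard_negatives),
                  neg_to_pos_ratio * max(len(positives), 1))
    return positives + random.sample(hard_negatives, num_neg)
```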
Why do we call it a no-frills model?
We make several simplifications over existing approaches while achieving better performance owing to our choice of factorization, direct encoding and scoring of layout, and improved training techniques.
| Key Simplifications |
|---|
| Our model encodes appearance only using features extracted by an off-the-shelf object detector (Faster R-CNN pre-trained on MS-COCO). |
| We use only simple hand-coded layout encodings constructed from detected bounding boxes and human pose keypoints (pre-trained OpenPose); a minimal sketch of such an encoding follows the references below. |
| We use a fairly modest network architecture with lightweight multi-layer perceptrons (2-3 fully-connected layers) operating on the appearance and layout features above. |
Unlike the prior approaches we compare against [1]-[4], our model uses none of the additional components ("frills") those methods rely on.
[1] Detecting and Recognizing Human-Object Interactions. CVPR 2018.
[2] Learning to Detect Human-Object Interactions. WACV 2018.
[3] iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection. BMVC 2018.
[4] Learning Human-Object Interactions by Graph Parsing Neural Networks. ECCV 2018.
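For illustration, here is one plausible form of the hand-coded box-pair layout encoding mentioned in the table above; the specific features (normalized center offsets, log size ratios, relative areas, aspect ratios) are assumptions for this sketch, not the paper's exact feature list.

```python
import numpy as np

def box_pair_layout_features(human_box, object_box, im_w, im_h):
    """Coarse layout encoding for a (human, object) box pair.

    Boxes are (x1, y1, x2, y2) in pixels; im_w, im_h give the image size.
    """
    def center_wh(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0,
                max(x2 - x1, 1e-6), max(y2 - y1, 1e-6))

    hx, hy, hw, hh = center_wh(human_box)
    ox, oy, ow, oh = center_wh(object_box)

    return np.array([
        (ox - hx) / hw,           # horizontal offset, normalized by human width
        (oy - hy) / hh,           # vertical offset, normalized by human height
        np.log(ow / hw),          # relative width of object vs. human
        np.log(oh / hh),          # relative height of object vs. human
        hw * hh / (im_w * im_h),  # human area relative to image
        ow * oh / (im_w * im_h),  # object area relative to image
        hw / hh,                  # human box aspect ratio
        ow / oh,                  # object box aspect ratio
    ], dtype=np.float32)
```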
Qualitative Results

Acknowledgment
This work was partly supported by the following grants and funding agencies. Many thanks!
- NSF 1718221
- ONR MURI N00014-16-1-2007
- Samsung
- 3M
