High-Resolution Representation Learning for ImageNet Classification |
Ke Sun     Yang Zhao     Borui Jiang     Tianheng Cheng     Bin Xiao     Dong Liu     Yadong Mu     Xinggang Wang     Wenyu Liu     Jingdong Wang |
We augment the HRNet with a classification head shown in the figure below. First, the four-resolution feature maps are fed into a bottleneck and the number of output channels are increased to 128, 256, 512, and 1024, respectively. Then, we downsample the high-resolution representations by a 2-strided 3x3 convolution outputting 256 channels and add them to the representations of the second-high-resolution representations. This process is repeated two times to get 1024 channels over the small resolution. Last, we transform 1024 channels to 2048 channels through a 1x1 convolution, followed by a global average pooling operation. The output 2048-dimensional representation is fed into the classifier. |
The code and models are publicly available at GitHub. |
paper |
We released the training and testing code and the pretrained model at GitHub |
more ... | |||||
Pose estimation | Semantic segmentation | Face alignment | Image classification | Object detection | |
|
[1]  | Deep High-Resolution Representation Learning for Human Pose Estimation. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. CVPR 2019. |