
OpenPose Basic Philosophy

2022-08-02 16:02:00 【zhangyu】

Introduction

OpenPose is an open-source human pose estimation library developed at Carnegie Mellon University (CMU). It is built on convolutional neural networks and supervised learning, and uses Caffe as its framework. It can estimate body motion, facial expression, and finger movement, and it is robust for both single-person and multi-person scenes. It was the world's first real-time multi-person 2D pose estimation application based on deep learning, and many projects built on it have appeared since. Human pose estimation has broad application prospects in areas such as sports and fitness, motion capture, 3D virtual fitting, and public opinion monitoring. The application most people will recognize is Douyin's "awkward dance machine".

Highlights

Part Affinity Fields (PAFs) are proposed: each pixel holds a 2D vector that encodes the position and orientation of a limb. Given the detected joints and the limb regions that connect them, a greedy inference algorithm quickly assigns the joints to different individuals.
Built on convolutional neural networks, supervised learning, and Caffe, OpenPose tracks facial expressions, the torso, the limbs, and even the fingers, for a single person or for several people at once, with good robustness. As the first real-time, deep-learning-based multi-person 2D pose estimator, it is a milestone in human-computer interaction and gives machines a high-quality source of information for understanding people.
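The greedy assignment mentioned above can be illustrated with a small sketch. It assumes that every candidate pair of joints (for example, elbow i and wrist j) already has an association score; the function name `greedy_assign`, the `min_score` threshold, and the sample scores are hypothetical and not taken from the OpenPose code base.

```python
from typing import Dict, List, Tuple


def greedy_assign(scores: Dict[Tuple[int, int], float],
                  min_score: float = 0.0) -> List[Tuple[int, int]]:
    """Greedily pair candidates: best score first, each candidate used at most once."""
    used_a, used_b = set(), set()
    pairs: List[Tuple[int, int]] = []
    # Visit candidate pairs from the highest score to the lowest.
    for (i, j), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score <= min_score or i in used_a or j in used_b:
            continue  # too weak, or one endpoint is already taken
        pairs.append((i, j))
        used_a.add(i)
        used_b.add(j)
    return pairs


# Hypothetical scores between 2 elbow candidates and 2 wrist candidates.
example_scores = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 0.8}
limbs = greedy_assign(example_scores)
print(limbs)
```

Sorting once and accepting pairs in order keeps the cost low. The trade-off is that a greedy choice is not guaranteed to maximize the total score.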

Process

An image is fed into a convolutional network, which extracts a set of feature maps. The network then splits into two branches, which predict the Part Confidence Maps and the Part Affinity Fields respectively.
With these two outputs, bipartite matching from graph theory is used to solve the part association problem and connect the joints that belong to the same person. Because PAFs are vector fields, the resulting matches are highly accurate, and the matched limbs are finally merged into a complete skeleton for each person.
Finally, multi-person parsing is performed on the basis of the PAFs: the multi-person parsing problem is converted into a graph problem, which is solved with the Hungarian algorithm.
(The Hungarian algorithm is a bipartite graph matching algorithm. Its core idea is to find augmenting paths, which it uses to compute a maximum matching of the bipartite graph.)
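To make the augmenting-path idea concrete, here is a minimal sketch of maximum bipartite matching on an unweighted graph (often called Kuhn's algorithm). It is an illustration only: in pose estimation the edges carry PAF-based weights, which this sketch ignores, and the function name and the sample graph are assumptions made for the example.

```python
from typing import Dict, List, Optional


def max_bipartite_matching(adj: List[List[int]], n_right: int) -> Dict[int, int]:
    """Maximum matching via augmenting paths.

    adj[u] lists the right-side vertices that left vertex u may be paired with.
    Returns a dict mapping each matched left vertex to its right partner.
    """
    match_right: List[Optional[int]] = [None] * n_right  # right vertex -> left vertex

    def try_augment(u: int, visited: List[bool]) -> bool:
        for v in adj[u]:
            if visited[v]:
                continue
            visited[v] = True
            # Take v if it is free, or if its current partner can move elsewhere.
            if match_right[v] is None or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in range(len(adj)):
        try_augment(u, [False] * n_right)

    return {u: v for v, u in enumerate(match_right) if u is not None}


# Hypothetical graph: 3 candidates of one joint type (left side)
# and 3 candidates of another joint type (right side).
candidate_links = [[0, 1], [0], [1, 2]]
matching = max_bipartite_matching(candidate_links, n_right=3)
print(matching)
```

In the example, left vertex 0 first takes right vertex 0, then gives it up to left vertex 1 and moves to right vertex 1. That re-routing is exactly one augmenting path.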

Convolutional Neural Networks

(Figure: network architecture)
The network has two branches, upper and lower, which predict the keypoint heatmaps and the PAF maps respectively. Each branch has t stages that refine the prediction step by step, and each stage fuses the feature maps. Here ρ and φ denote the networks of the heatmap branch and the PAF branch. During training, a loss is computed at every stage to avoid vanishing gradients; at prediction time, only the output of the last stage is used.
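The per-stage loss can be sketched as follows. This is a simplified illustration, not the training code: it treats each output as a single-channel map, uses a plain sum of squared errors, and omits any per-pixel weighting. The names `l2_loss` and `total_loss` and the toy values are assumptions made for the example.

```python
from typing import List

Map2D = List[List[float]]  # one single-channel map, indexed as map[y][x]


def l2_loss(pred: Map2D, target: Map2D) -> float:
    """Sum of squared differences over all pixels of one map."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def total_loss(stage_heatmaps: List[Map2D], stage_pafs: List[Map2D],
               gt_heatmap: Map2D, gt_paf: Map2D) -> float:
    """Add the heatmap loss and the PAF loss of every stage (intermediate supervision)."""
    return sum(
        l2_loss(heatmap, gt_heatmap) + l2_loss(paf, gt_paf)
        for heatmap, paf in zip(stage_heatmaps, stage_pafs)
    )


# Toy 2x2 example with two stages; stage 2 matches the ground truth exactly.
gt_heatmap = [[0.0, 1.0], [0.0, 0.0]]
gt_paf = [[1.0, 0.0], [0.0, 0.0]]
stage_heatmaps = [[[0.5, 0.5], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
stage_pafs = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]
loss = total_loss(stage_heatmaps, stage_pafs, gt_heatmap, gt_paf)
print(loss)
```

Because every stage contributes its own term, the early stages receive a gradient signal directly instead of only through the later ones.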

Highlight 1: PAF (Part Affinity Fields)
A Part Affinity Field (PAF) is a 2D vector field that encodes the position and orientation of a limb in the image domain. Alongside it, CMP (Part Detection Confidence Maps, the so-called "heat maps") mark the confidence of each keypoint. Through the two branches, the keypoint locations and the connections between them are learned jointly. These bottom-up detections and associations are then inferred together by a greedy parsing algorithm, which encodes enough global context to obtain high-quality results at a fraction of the computational cost. With parallel execution it runs at close to real time, and the runtime depends only weakly on the number of people in the image.
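One way to turn a PAF into an association score for two candidate joints is to sample the field along the segment between them and average its alignment with the segment direction. The sketch below assumes the PAF is stored as `paf[y][x] = (vx, vy)`; the function name `paf_score`, the nearest-pixel sampling, and the 5x5 toy field are assumptions made for the example.

```python
import math
from typing import List, Tuple

Field = List[List[Tuple[float, float]]]  # paf[y][x] = (vx, vy)


def paf_score(paf: Field, a: Tuple[float, float], b: Tuple[float, float],
              n_samples: int = 5) -> float:
    """Average dot product between the PAF and the unit vector from a to b.

    a and b are (x, y) positions; n_samples must be at least 2.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(n_samples):
        t = i / (n_samples - 1)
        # Nearest-pixel sampling along the segment from a to b.
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy
    return total / n_samples


# Toy 5x5 field: a horizontal limb along row y = 2, pointing in +x.
toy_paf = [[(1.0, 0.0) if y == 2 else (0.0, 0.0) for x in range(5)] for y in range(5)]
score_along_limb = paf_score(toy_paf, (0, 2), (4, 2))   # follows the limb
score_off_limb = paf_score(toy_paf, (0, 0), (4, 0))     # misses the limb
score_across = paf_score(toy_paf, (2, 0), (2, 4))       # crosses it at a right angle
print(score_along_limb, score_off_limb, score_across)
```

A pair of joints scores highly only when the field along the whole segment points the same way as the segment, which is why the orientation information helps separate people who stand close together.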
Highlight 2: High Robustness
CMU's data acquisition rig is an enclosed sphere that can capture a person from any angle. The sphere is fitted with 480 VGA cameras, 31 HD cameras, 10 Kinect II sensors, and 5 DLP projectors, all synchronized in hardware. This large volume of high-quality data is what makes robust human pose detection possible from 2D images alone.
Highlight 3: Unifying three kinds of landmarks
Originally, human skeleton joints were handled by teams working on action recognition and behavior analysis, facial landmark extraction by teams developing face recognition or beautification algorithms, and hand joints by teams working on gesture recognition and human-computer interaction; these were separate sub-fields. Having achieved good results on skeleton joints, the CMU team integrated the face and the hands into one unified graph, which also works well. Face alignment and pose alignment are linked together, and, based on the rigid nature of the head and the non-rigid nature of the limbs, a set of Caffe-based point estimation and diffusion models was designed, together with a tree-structured decision acceleration scheme, on top of which 3D background segmentation is added.

Single-Person Pose Estimation (The Idea Behind CPM)

The CPM model uses large convolution kernels to obtain a large receptive field, which is very effective for inferring occluded joints. The network structure is as follows: (Figure: CPM network structure)

The flow of the entire algorithm is:

a) First, regress over all the people appearing in the image and obtain the joint points of each person.
b) Then, use the center map to remove the responses that belong to other people.
c) Finally, refine the predicted heatmaps repeatedly to obtain the final result. During refinement, losses on the intermediate layers are introduced so that the deeper network can still be trained without vanishing or exploding gradients. The regression accuracy improves gradually, from coarse to fine.
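Once the refined heatmap of a joint is available, the joint position for a single person can be read off as the location of the strongest response. The sketch below shows only that final read-out step, under the assumption that the heatmap is a plain 2D list; the function name `heatmap_argmax` and the sample values are hypothetical.

```python
from typing import List, Tuple


def heatmap_argmax(heatmap: List[List[float]]) -> Tuple[int, int, float]:
    """Return (x, y, confidence) of the strongest response in a heatmap[y][x]."""
    best_x, best_y, best_val = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_x, best_y, best_val = x, y, val
    return best_x, best_y, best_val


# Hypothetical 3x3 final-stage heatmap for one joint.
final_heatmap = [
    [0.0, 0.1, 0.0],
    [0.1, 0.3, 0.9],
    [0.0, 0.2, 0.1],
]
joint = heatmap_argmax(final_heatmap)
print(joint)
```

A single argmax is enough here because only one person is assumed; with several people in the image, each local maximum would have to be kept as a separate candidate.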

Shortcomings

Memory consumption

The computational load is very high. To reach real-time speed, a highly parallel strategy is used. Because it relies on CUDA acceleration, it is very memory-intensive, and it is essentially out of reach for machines with less than 4 GB of video memory (a GTX 980 Ti or better is needed).