
Reading notes: You Only Look Once: Unified, Real-Time Object Detection

2022-07-24 19:17:00 【to be__】

1. Abstract

YOLO treats object detection as a regression problem: a single network predicts bounding boxes and class probabilities directly from an image. YOLO makes more localization errors than other detectors, but it is very fast.

2. Introduction

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems such as DPM use a sliding-window approach, in which the classifier is run at evenly spaced locations over the entire image.

More recent approaches such as R-CNN use region proposal methods: they first generate potential bounding boxes and then run a classifier on these proposed boxes. Post-processing then refines the bounding boxes and removes duplicate detections. These complex pipelines are slow and hard to optimize, because each individual component must be trained separately.

YOLO reframes object detection as a single regression problem, going directly from image pixels to bounding box coordinates and class probabilities.

A single convolutional neural network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance.

Unlike sliding-window and region-proposal-based techniques, YOLO sees the entire image during both training and testing, so it implicitly encodes contextual information about the classes as well as their appearance. YOLO also learns a more generalizable representation of objects.

YOLO still lags behind state-of-the-art detection systems in accuracy.

3. Unified Detection

The input image is divided into an S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
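As a rough illustration of the assignment rule, the sketch below maps an object's center (in pixels) to the grid cell that would be responsible for it. The function name `responsible_cell` and the 448×448 image size in the example are assumptions made for illustration, not something stated in these notes.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the center (cx, cy).

    cx, cy are pixel coordinates; the image is split into S x S equal cells.
    Hypothetical helper for illustration only.
    """
    # Scale the center into grid units, then truncate to a cell index.
    # min(...) keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# An object centered at (224, 100) in an assumed 448x448 image
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```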

Each grid cell predicts B bounding boxes and a confidence score for each of those boxes. The confidence score reflects how confident the model is that the box contains an object and how accurate it thinks the box is.

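Each bbox consists of 5 predicted values: x, y, w, h, c, where c is the confidence. (x, y, w, h are all normalized: x and y are relative to the bounds of the grid cell, and w and h are relative to the whole image.)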
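One possible way to normalize a ground-truth box is sketched below, assuming the scheme used in the YOLO paper: the center is expressed as an offset within its grid cell, and the width and height as fractions of the image size. The helper name `normalize_box` and the example numbers are hypothetical.

```python
def normalize_box(cx, cy, w, h, row, col, img_w, img_h, S=7):
    """Normalize a pixel-space box (center cx, cy; size w, h) for grid cell (row, col).

    Returns (x, y, w_norm, h_norm), each expected to lie in [0, 1].
    Hypothetical helper; assumes (row, col) is the cell containing the center.
    """
    cell_w = img_w / S
    cell_h = img_h / S
    x = cx / cell_w - col   # horizontal offset inside the cell
    y = cy / cell_h - row   # vertical offset inside the cell
    return x, y, w / img_w, h / img_h


# Box centered at (224, 100), 112 px wide and 224 px tall, in an assumed 448x448 image.
# Its center lies in cell (row=1, col=3) of a 7x7 grid.
print(normalize_box(224, 100, 112, 224, row=1, col=3, img_w=448, img_h=448))
# (0.5, 0.5625, 0.25, 0.5)
```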

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object), which are conditioned on the grid cell containing an object. Each grid cell predicts only one set of C class probabilities, regardless of the number of boxes B. The network output is a 7×7×30 tensor: of the 30 values per cell, 10 hold x, y, w, h, c for the two bboxes, and the remaining 20 hold one set of class probabilities (C = 20).
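To make the 30-value layout concrete, here is a minimal sketch that splits one cell's output vector into its bboxes and class probabilities. The ordering (all bbox values first, then the class probabilities) is an assumption for illustration; the notes only state what the 30 values contain, not how they are ordered. `split_cell_output` is a hypothetical helper.

```python
def split_cell_output(vec, B=2, C=20):
    """Split one grid cell's output into B boxes (x, y, w, h, c) and C class probabilities.

    Assumes the layout [box_0 (5 values), ..., box_{B-1} (5 values), class probs (C values)].
    """
    if len(vec) != 5 * B + C:
        raise ValueError("expected %d values, got %d" % (5 * B + C, len(vec)))
    boxes = [tuple(vec[5 * b:5 * b + 5]) for b in range(B)]
    class_probs = list(vec[5 * B:])
    return boxes, class_probs


# Dummy vector 0.0 .. 29.0, just to show which positions end up where
vec = [float(k) for k in range(30)]
boxes, class_probs = split_cell_output(vec)
print(boxes[0])          # (0.0, 1.0, 2.0, 3.0, 4.0)
print(boxes[1])          # (5.0, 6.0, 7.0, 8.0, 9.0)
print(len(class_probs))  # 20
```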

At test time, the conditional class probabilities of each grid cell are multiplied by the confidence prediction of each bbox, which gives a class-specific confidence score for each bbox. These scores encode both the probability that the object in the bbox belongs to each class and how well the bbox fits the object.
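The multiplication described above can be sketched in a few lines. The function name `class_specific_scores` and the example numbers are made up for illustration; a real detector would apply this to every cell of the output tensor.

```python
def class_specific_scores(box_confidences, class_probs):
    """Return scores[b][k] = confidence of box b * conditional probability of class k.

    box_confidences: one confidence per bbox of a grid cell.
    class_probs: the cell's conditional class probabilities Pr(Class_k | Object).
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


# One cell with B = 2 boxes and (for brevity) only 3 classes
scores = class_specific_scores([0.8, 0.1], [0.5, 0.25, 0.25])
for row in scores:
    print(row)
```

Each row holds the scores of one bbox, so with B = 2 and C = 20 a single cell produces 2×20 scores.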

Confidence: confidence = Pr(Object) × IOU(pred, truth), where Pr(Object) is 1 if the grid cell contains an object and 0 otherwise.

Each grid cell predicts B bboxes, together with a confidence score for each bbox.

(Confidence combines two things: the probability that the bbox contains an object, and the accuracy of the bbox. The confidence is the product of the two.)

(A grid cell predicts B bboxes. The IOU between each of the B bboxes and the object's ground truth is computed, and the bbox with the largest IOU is the one selected for that object.)
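The selection by IOU can be illustrated with a small sketch. It assumes boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`, which is a convenience chosen here rather than the (x, y, w, h) form used in the notes; `iou` and `responsible_box` are hypothetical helper names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_box(pred_boxes, truth):
    """Return (index, iou) of the predicted box with the highest IOU against truth."""
    ious = [iou(p, truth) for p in pred_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    return best, ious[best]


truth = (0, 0, 2, 2)
preds = [(1, 1, 3, 3), (0, 0, 2, 1)]  # IOUs: 1/7 and 1/2
print(responsible_box(preds, truth))  # (1, 0.5)
```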

4. Loss function

The loss has five terms. For every bbox that contains an object, it includes the localization error (terms 1 and 2) and the confidence error (term 3). For bboxes that do not contain an object, it includes the confidence error (term 4). For grid cells that contain an object, it includes the error of the predicted class probabilities (term 5).

Terms 1 and 2 of the loss function are the coordinate predictions of each bbox. Term 3 is the confidence prediction for bboxes that contain an object, term 4 is the confidence prediction for bboxes that contain no object, and term 5 is the class prediction for each grid cell.
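The formula itself does not appear in these notes. For reference, the loss as given in the YOLO paper is reproduced below, with the five terms in the order described above. Here hatted symbols are the values paired with the predictions, $C_i$ is the confidence, $p_i(c)$ is the class probability, $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when the $j$-th bbox of cell $i$ is responsible for an object, $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object appears in cell $i$, and $\lambda_{\text{coord}}$, $\lambda_{\text{noobj}}$ are weighting factors.

```latex
\[
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]
```

The square roots in term 2 make a given absolute error in width or height count more for small boxes than for large ones.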

The image is divided into an S×S grid (S = 7 in the paper), and each grid cell predicts B bboxes (B = 2 in the paper). The whole image is therefore divided into 7×7 = 49 grid cells and produces 7×7×2 = 98 bboxes. Each grid cell predicts 5×B+C values, so one image yields S×S×(5×B+C) values.
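The counts above are easy to check with a few lines of arithmetic. The sketch assumes C = 20 classes, as stated earlier in these notes; `output_sizes` is a hypothetical helper name.

```python
def output_sizes(S=7, B=2, C=20):
    """Return the number of cells, boxes, values per cell, and total values per image."""
    return {
        "cells": S * S,
        "boxes": S * S * B,
        "values_per_cell": 5 * B + C,
        "total_values": S * S * (5 * B + C),
    }


print(output_sizes())
# {'cells': 49, 'boxes': 98, 'values_per_cell': 30, 'total_values': 1470}
```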

In the loss function, i denotes the i-th grid cell and j denotes the j-th bbox of the i-th grid cell. Terms that carry only the index i, such as the class prediction term, are computed once per grid cell.

The two kinds of bbox are penalized with different weights. The coordinate error of bboxes that contain an object contributes more, so it gets a large weight (λ_coord = 5). The confidence error of bboxes that contain no object contributes less, so it gets a small weight (λ_noobj = 0.5).
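To show how the two weights act, here is a minimal sketch of the loss for a single grid cell, following the five-term structure described in this section. It is a simplified illustration under stated assumptions, not the paper's training code: the function name `cell_loss` is hypothetical, the caller supplies the index of the responsible bbox and the target confidence, and w, h are assumed to be non-negative.

```python
import math


def cell_loss(preds, target=None, responsible=None,
              pred_probs=(), true_probs=(),
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss for one grid cell.

    preds: list of predicted (x, y, w, h, c), one tuple per bbox.
    target: ground-truth (x, y, w, h, c) if the cell contains an object, else None.
    responsible: index of the bbox assigned to the object (ignored if target is None).
    """
    loss = 0.0
    for j, (x, y, w, h, c) in enumerate(preds):
        if target is not None and j == responsible:
            tx, ty, tw, th, tc = target
            # Terms 1 and 2: weighted localization error
            loss += lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += lambda_coord * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            # Term 3: confidence error of the bbox that contains the object
            loss += (c - tc) ** 2
        else:
            # Term 4: down-weighted confidence error, target confidence is 0
            loss += lambda_noobj * c ** 2
    if target is not None:
        # Term 5: class probability error, only for cells that contain an object
        loss += sum((p - t) ** 2 for p, t in zip(pred_probs, true_probs))
    return loss


# Empty cell: only the down-weighted confidence term applies
empty = cell_loss([(0.5, 0.5, 0.25, 0.25, 0.2), (0.5, 0.5, 0.25, 0.25, 0.4)])
print(round(empty, 6))  # 0.1
```

With the default weights, a confidence of 0.4 on an empty bbox costs 0.5 × 0.16 = 0.08, while an x offset of only 0.1 on a responsible bbox already costs 5 × 0.01 = 0.05.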