YOLO (Part 1)

The Core Idea of YOLO

  • The whole image is fed to the network as input, and the positions of the bounding boxes and their classes are obtained directly by regression at the output layer.
  • Earlier approaches in the R-CNN family all follow a proposal + classification scheme instead.

How YOLO Works

  • First divide the image into an S×S grid of cells (grid cells). If the center of an object falls inside a cell, that cell is responsible for predicting the object (see the sketch after this list).
    [figure: YOLO_1]
  • Each grid cell predicts B bounding boxes, and each bounding box regresses a confidence value in addition to its own position (x,y,w,h).
    This confidence carries two pieces of information at once: how likely the predicted box is to contain an object, and how accurate the prediction is. It is computed as
    $$\text{confidence} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$
    where the first factor is 1 if an object falls into the grid cell and 0 otherwise, and the second factor is the IOU between the predicted bounding box and the ground truth.
  • Each bounding box thus predicts 5 values: (x,y,w,h) and confidence. Each grid cell additionally predicts one set of class probabilities over C classes. So with an S×S grid, each cell predicting B bounding boxes plus C class probabilities, one input image produces an output tensor of shape S×S×(5×B+C).
    Note: the class probabilities belong to the grid cell, while the confidence belongs to each bounding box.
    [figures: YOLO_4, YOLO_6, YOLO_7]
    After computing the class-specific confidence score of every bounding box, set a threshold, filter out the low-scoring boxes, and run NMS on the ones that remain to obtain the final predictions.
    [figure: YOLO_8]
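
A minimal sketch of the grid assignment in the first bullet, assuming the box center is given in coordinates normalized to [0, 1] (the helper name is ours, not from any library):

```python
def responsible_cell(cx, cy, S=7):
    """Return the (row, col) of the grid cell responsible for a box center."""
    col = min(int(cx * S), S - 1)  # x indexes columns
    row = min(int(cy * S), S - 1)  # y indexes rows
    return row, col

# a box centered at (0.52, 0.31) lands in cell (2, 3) of a 7x7 grid
print(responsible_cell(0.52, 0.31))  # -> (2, 3)
```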

Example

Network pipeline diagram:

[figure: YOLO_5]

On PASCAL VOC the input image is 448×448, S=7, B=2, and there are 20 classes, so the output tensor is 7×7×30.
The full network architecture:
[figure: YOLO_2]

Each grid cell therefore has a 30-dimensional vector: 8 dimensions for the coordinates of the two boxes, 2 for the two box confidences, and 20 for the class probabilities.
At test time, the class probabilities predicted by each grid cell are multiplied by the confidence of each of its bounding boxes, giving every bounding box a class-specific confidence score:

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

The first factor on the left is the class probability predicted by the grid cell; the second and third factors form the confidence predicted by the bounding box. Their product is the probability that the box contains an object of a particular class, weighted by how well the box fits that object.
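A minimal sketch of this multiplication, assuming the 7×7×30 layout described above (indices 4 and 9 are the two confidence slots when B=2):

```python
import torch

S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)  # stand-in for the network output, [7, 7, 30]

box_conf = pred[..., [4, 9]]        # per-box confidence, [7, 7, 2]
class_prob = pred[..., B * 5:]      # per-cell class probabilities, [7, 7, 20]

# class-specific confidence score Pr(Class_i) * IOU, shape [7, 7, 2, 20]
scores = box_conf.unsqueeze(-1) * class_prob.unsqueeze(-2)
```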

After each box has its class-specific confidence score, low-scoring boxes are filtered out with a threshold and NMS is applied to the remaining boxes to produce the final detections.

NMS

This algorithm is not specific to YOLO; every detection algorithm uses it. NMS mainly solves the problem of one object being detected multiple times (a sketch follows the steps below).

  • 1. Among all detection boxes of a given class, find the one with the highest confidence.
  • 2. Compute its IOU with each of the remaining boxes; if the IOU exceeds a threshold (the overlap is too large), discard that box (set its confidence to 0).
  • 3. Repeat the process on the remaining boxes (skipping those already zeroed) until all detection boxes have been processed.
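
A minimal per-class NMS sketch along these lines (boxes given as corner coordinates; the function name and signature are ours):

```python
import torch

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: [N, 4] as (x1, y1, x2, y2); scores: [N]. Returns indices to keep."""
    order = scores.argsort(descending=True)  # step 1: sort by confidence
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)                       # the highest-scoring box survives
        if order.numel() == 1:
            break
        top, rest = boxes[i], boxes[order[1:]]
        # step 2: IOU of the top box with every remaining box
        x1 = torch.max(top[0], rest[:, 0])
        y1 = torch.max(top[1], rest[:, 1])
        x2 = torch.min(top[2], rest[:, 2])
        y2 = torch.min(top[3], rest[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_top = (top[2] - top[0]) * (top[3] - top[1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_top + area_rest - inter)
        # step 3: drop boxes that overlap too much, keep the rest for the next round
        order = order[1:][iou <= iou_threshold]
    return keep
```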

YOLO Implementation Details

  • 1. Each grid cell has 30 dimensions: 8 for the box coordinates, 2 for the box confidences, and 20 for the classes. The coordinates x, y are normalized to 0-1 as offsets relative to the responsible grid cell; w, h are normalized to 0-1 by the image width and height.
  • 2. How should the loss function balance these three parts? The authors simply used sum-squared error for all of them. This causes two problems. First, treating the 8-dimensional localization error as equally important as the 20-dimensional classification error is clearly unreasonable. Second, if a grid cell contains no object, the confidences of its boxes are pushed toward 0; since such cells far outnumber the cells that do contain objects, this overpowers the gradient and can make the network unstable or even diverge. Hence adjustments 3 and 4 below.
  • 3. Give the 8-dimensional coordinate predictions a larger loss weight, denoted $\lambda_{coord}$, set to 5 when training on PASCAL VOC.
  • 4. Give the confidence loss of boxes that contain no object a small loss weight, denoted $\lambda_{noobj}$, set to 0.5 when training on PASCAL VOC.
  • 5. The confidence loss of boxes that contain an object and the classification loss keep the normal weight of 1.

  • 6. For boxes of different sizes, the same absolute offset matters far more for a small box than for a large one, yet sum-squared error penalizes both equally. To soften this, the authors use a trick: the loss regresses the square roots of the box width and height instead of the raw values. For example, a change from 10 to 12 gives $\sqrt{12}-\sqrt{10}\approx 0.30$, while the same change from 100 to 102 gives $\sqrt{102}-\sqrt{100}\approx 0.10$, so the small box is penalized more heavily for the same absolute error.

  • 7. A grid cell predicts several boxes, and we would like each box predictor to specialize in a particular kind of object. Concretely, the predictor whose box has the highest IOU with the ground-truth box is made responsible for that object. This is called specialization of the box predictors.
  • 8. The complete loss function is:

    $$
    \begin{aligned}
    \mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
    &+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
    &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 \\
    &+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
    &+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\,\in\,classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
    \end{aligned}
    $$

    The first two double sums handle the coordinate predictions; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th box of grid cell i is responsible for the object.
    The third double sum is the confidence prediction of boxes that contain an object.
    The fourth double sum is the confidence prediction of boxes that contain no object.
    The last sum is the class prediction, where $\mathbb{1}_{i}^{obj}$ indicates whether the center of an object falls in grid cell i.
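
Before this loss can be computed, every ground-truth box must be encoded into the S×S×(B×5+C) target tensor. A minimal sketch of such an encoder (build_target is our hypothetical helper, assuming boxes arrive as normalized (cx, cy, w, h) with class indices; w and h are stored as square roots so that compute_iou in the YoloLoss below can square them back):

```python
import torch

def build_target(boxes, labels, S=7, B=2, C=20):
    """Encode ground-truth boxes into an [S, S, B*5+C] target tensor.

    boxes: [N, 4] as normalized (cx, cy, w, h); labels: [N] class indices.
    """
    target = torch.zeros(S, S, B * 5 + C)
    for (cx, cy, w, h), label in zip(boxes.tolist(), labels.tolist()):
        col = min(int(cx * S), S - 1)      # responsible cell
        row = min(int(cy * S), S - 1)
        x_off = cx * S - col               # x, y as offsets within the cell
        y_off = cy * S - row
        cell = torch.tensor([x_off, y_off, w ** 0.5, h ** 0.5, 1.0])
        for b in range(B):                 # both box slots get the same target
            target[row, col, b * 5:b * 5 + 5] = cell
        target[row, col, B * 5 + int(label)] = 1.0
    return target
```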

Drawbacks of YOLO

  • YOLO handles objects that are very close together, and small objects in groups, poorly: each grid cell predicts only two boxes and they share a single class.
  • It generalizes weakly when objects of a known class appear in test images with new or unusual aspect ratios or configurations.
  • Because of how the loss function is designed, localization error is the main source of detection error; handling objects of very different sizes in particular still needs work.

Code

YOLOLoss

```python
import torch
import torch.nn as nn
from torch.nn import functional


class YoloLoss(nn.Module):
    def __init__(self, n_batch, B, C, lambda_coord, lambda_noobj, use_gpu=False):
        """
        :param n_batch: batch size (kept for interface compatibility, unused here)
        :param B: number of bounding boxes per grid cell
        :param C: number of classes
        :param lambda_coord: weight for the coordinate loss
        :param lambda_noobj: weight for the confidence loss of boxes without objects
        """
        super(YoloLoss, self).__init__()
        self.n_batch = n_batch
        self.B = B  # assume there are two bounding boxes
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj
        self.use_gpu = use_gpu  # masks below follow the device of the inputs

    def compute_iou(self, bbox1, bbox2):
        """
        Compute the intersection over union of two sets of boxes,
        each box being [x, y, sqrt(w), sqrt(h)].
        :param bbox1: (tensor) bounding boxes, size [N,4]
        :param bbox2: (tensor) bounding boxes, size [M,4]
        :return: (tensor) IOU matrix, size [N,M]
        """
        # recover the top-left and bottom-right corners [x1,y1,x2,y2];
        # w and h are stored as square roots, hence the **2
        b1x1y1 = bbox1[:, :2] - bbox1[:, 2:]**2  # [N, (x1,y1)=2]
        b1x2y2 = bbox1[:, :2] + bbox1[:, 2:]**2  # [N, (x2,y2)=2]
        b2x1y1 = bbox2[:, :2] - bbox2[:, 2:]**2  # [M, (x1,y1)=2]
        b2x2y2 = bbox2[:, :2] + bbox2[:, 2:]**2  # [M, (x2,y2)=2]
        box1 = torch.cat((b1x1y1.view(-1, 2), b1x2y2.view(-1, 2)), dim=1)  # [N,4]=[x1,y1,x2,y2]
        box2 = torch.cat((b2x1y1.view(-1, 2), b2x2y2.view(-1, 2)), dim=1)  # [M,4]=[x1,y1,x2,y2]
        N = box1.size(0)
        M = box2.size(0)

        tl = torch.max(
            box1[:, :2].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, :2].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        br = torch.min(
            box1[:, 2:].unsqueeze(1).expand(N, M, 2),
            box2[:, 2:].unsqueeze(0).expand(N, M, 2),
        )

        wh = (br - tl).clamp(min=0)        # non-overlapping boxes get zero width/height
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N,]
        area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]

        iou = inter / (area1 + area2 - inter)
        return iou

    def forward(self, pred_tensor, target_tensor):
        """
        :param pred_tensor: [batch, SxSx(Bx5+C)]
        :param target_tensor: [batch, S, S, Bx5+C]
        :return: total loss
        """
        n_elements = self.B * 5 + self.C
        batch = target_tensor.size(0)
        target_tensor = target_tensor.view(batch, -1, n_elements)  # [batch, S*S, n_elements]
        pred_tensor = pred_tensor.view(batch, -1, n_elements)
        # index 4 is the confidence of the first box: cells with an object
        # have target confidence > 0, cells without have 0
        coord_mask = target_tensor[:, :, 4] > 0
        noobj_mask = target_tensor[:, :, 4] == 0
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)

        # cells that contain an object
        coord_target = target_tensor[coord_mask].view(-1, n_elements)
        coord_pred = pred_tensor[coord_mask].view(-1, n_elements)
        class_pred = coord_pred[:, self.B*5:]
        class_target = coord_target[:, self.B*5:]
        box_pred = coord_pred[:, :self.B*5].contiguous().view(-1, 5)     # [n_cells*B, 5]
        box_target = coord_target[:, :self.B*5].contiguous().view(-1, 5)

        # cells that contain no object
        noobj_target = target_tensor[noobj_mask].view(-1, n_elements)
        noobj_pred = pred_tensor[noobj_mask].view(-1, n_elements)

        # confidence loss for cells without objects: only the confidence
        # entries (index 4 of each box slot) enter this loss
        noobj_target_mask = torch.zeros_like(noobj_target, dtype=torch.bool)
        for i in range(self.B):
            noobj_target_mask[:, i*5+4] = True
        noobj_target_c = noobj_target[noobj_target_mask]
        noobj_pred_c = noobj_pred[noobj_target_mask]
        noobj_loss = functional.mse_loss(noobj_pred_c, noobj_target_c, reduction='sum')

        # losses for cells with objects: pick the responsible predictor per cell
        coord_response_mask = torch.zeros_like(box_target, dtype=torch.bool)
        coord_not_response_mask = torch.ones_like(box_target, dtype=torch.bool)
        for i in range(0, box_target.size(0), self.B):
            box1 = box_pred[i:i+self.B]    # the B predicted boxes of one cell
            box2 = box_target[i:i+self.B]  # the corresponding targets
            iou = self.compute_iou(box1[:, :4], box2[:, :4])
            max_iou, max_index = iou.max(0)  # predictor with the highest IOU responds
            coord_response_mask[i+max_index] = True
            coord_not_response_mask[i+max_index] = False

        # 1. loss of the responsible predictors
        box_pred_response = box_pred[coord_response_mask].view(-1, 5)
        box_target_response = box_target[coord_response_mask].view(-1, 5)
        contain_loss = functional.mse_loss(box_pred_response[:, 4],
                                           box_target_response[:, 4], reduction='sum')
        loc_loss = functional.mse_loss(box_pred_response[:, :2],
                                       box_target_response[:, :2], reduction='sum') + \
                   functional.mse_loss(box_pred_response[:, 2:4],
                                       box_target_response[:, 2:4], reduction='sum')
        # 2. the non-responsible predictors are selected here, but their
        # confidence loss is not added to the total below
        box_pred_not_response = box_pred[coord_not_response_mask].view(-1, 5)
        box_target_not_response = box_target[coord_not_response_mask].view(-1, 5)

        # class prediction loss (classification is also done by regression)
        class_loss = functional.mse_loss(class_pred, class_target, reduction='sum')

        # total loss
        total_loss = self.lambda_coord * loc_loss + contain_loss \
            + self.lambda_noobj * noobj_loss + class_loss
        return total_loss
```

YOLO's strength is that an image can be fed straight into the network, which directly regresses the box positions and classes. The hard part of the code is therefore the YoloLoss computation above; the network itself is straightforward.

```python
import torch.nn as nn


class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        return x.view(x.size(0), -1)


class YOLO_V1(nn.Module):
    def __init__(self):
        super(YOLO_V1, self).__init__()
        C = 20  # number of classes
        print("\n------Initiating YOLO v1------\n")
        self.conv_layer1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=7//2),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(192),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer3 = nn.Sequential(
            nn.Conv2d(in_channels=192, out_channels=128, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer4 = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer5 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=2, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        )
        self.conv_layer6 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1)
        )
        self.flatten = Flatten()
        self.conn_layer1 = nn.Sequential(
            nn.Linear(in_features=7*7*1024, out_features=4096),
            nn.Dropout(),
            nn.LeakyReLU(0.1)
        )
        self.conn_layer2 = nn.Sequential(nn.Linear(in_features=4096, out_features=7 * 7 * (2 * 5 + C)))

    def forward(self, input):
        conv_layer1 = self.conv_layer1(input)
        conv_layer2 = self.conv_layer2(conv_layer1)
        conv_layer3 = self.conv_layer3(conv_layer2)
        conv_layer4 = self.conv_layer4(conv_layer3)
        conv_layer5 = self.conv_layer5(conv_layer4)
        conv_layer6 = self.conv_layer6(conv_layer5)
        flatten = self.flatten(conv_layer6)
        conn_layer1 = self.conn_layer1(flatten)
        output = self.conn_layer2(conn_layer1)
        return output
```
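
A quick smoke test wiring the two pieces together (all values here are dummy data; the box and the class index are arbitrary):

```python
import torch

model = YOLO_V1()
criterion = YoloLoss(n_batch=2, B=2, C=20, lambda_coord=5.0, lambda_noobj=0.5)

images = torch.rand(2, 3, 448, 448)            # dummy input batch
target = torch.zeros(2, 7, 7, 30)              # empty target tensor
box = torch.tensor([0.5, 0.5, 0.3, 0.3, 1.0])  # (x, y, sqrt(w), sqrt(h), confidence)
target[:, 3, 3, 0:5] = box                     # one object in cell (3, 3) ...
target[:, 3, 3, 5:10] = box                    # ... written into both box slots
target[:, 3, 3, 10 + 6] = 1.0                  # its class (index 6)

pred = model(images)                           # [2, 7*7*30]
loss = criterion(pred, target)
print(loss.item())
```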

