# FaceNet论文阅读

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches.

To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method

128 bytes per face

## Introduction

The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
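As a minimal sketch (not the paper's code), verification with such an embedding reduces to thresholding the squared L2 distance; the function name and threshold value here are illustrative assumptions:

```python
import numpy as np

def same_person(emb_a, emb_b, threshold=1.0):
    """Verify two faces by thresholding the squared L2 distance between
    their embeddings; the threshold value here is arbitrary."""
    return float(np.sum((emb_a - emb_b) ** 2)) < threshold

# Toy unit-norm embeddings (FaceNet constrains ||f(x)||_2 = 1).
a = np.array([1.0, 0.0])
b = np.array([0.99, np.sqrt(1 - 0.99 ** 2)])  # nearly identical direction
c = np.array([0.0, 1.0])                      # orthogonal, squared distance 2
```

With these toy vectors, `same_person(a, b)` is true and `same_person(a, c)` is false, mirroring the small/large distance behaviour described above.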

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training

The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions)

Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network

FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19].

Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin

Note: a triplet consists of two matching face images and one non-matching face; the goal is to keep the anchor-positive (ap) distance below the anchor-negative (an) distance by a margin.

The thumbnails are tight crops of the face area; no 2D or 3D alignment is performed, other than scale and translation.

Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains.

To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.

## Related Work

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

1. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers.

2. The second architecture is based on the Inception model of Szegedy et al., which was recently used as the winning approach for ImageNet 2014 [16].

We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification:

Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities.

Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch.

The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities.

Note: the verification loss is similar to the triplet loss we employ [12, 19]: it minimizes the L2 distance between faces of the same identity while enforcing a margin between the distances of faces of different identities.

## Method

two different core architectures:
1. The Zeiler&Fergus [22] style networks
2. The recent Inception [16] type networks

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system.

To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering.

Namely, we strive for an embedding f(x), from an image x into a feature space $\mathbb{R}^d$, such that the squared distance between all faces of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.

The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities

Note: the triplet loss enforces a margin between each anchor-positive pair and all negatives; this lets the faces of one identity live on a manifold while still keeping them separated, and thus distinguishable, from other identities.

## Triplet Loss

Here we want to ensure that an image $x_i^a$ (anchor) of a specific person is closer to all other images $x_i^p$ (positive) of the same person than it is to any image $x_i^n$ (negative) of any other person.

$$||f(x_i^a)-f(x_i^p)||_2^2+\alpha < ||f(x_i^a)-f(x_i^n)||_2^2,\quad\forall(x_i^a,x_i^p,x_i^n)\in\mathcal{T}$$
Here $\alpha$ is the margin enforced between positive and negative pairs, and $\mathcal{T}$ is the set of all possible triplets in the training set, with cardinality N.

$$\sum_i^N[||f(x_i^a)-f(x_i^p)||^2_2-||f(x_i^a)-f(x_i^n)||^2_2+\alpha]_{+}$$
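A minimal NumPy sketch of this hinge loss over a batch of triplets (array shapes and names are my own, not the paper's):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge triplet loss: sum_i [ ||f_a - f_p||^2 - ||f_a - f_n||^2 + alpha ]_+ .
    f_a, f_p, f_n: (N, d) arrays of anchor/positive/negative embeddings."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)  # squared anchor-positive distances
    d_an = np.sum((f_a - f_n) ** 2, axis=1)  # squared anchor-negative distances
    return float(np.sum(np.maximum(d_ap - d_an + alpha, 0.0)))

# Triplet 0 easily satisfies the margin constraint (contributes 0);
# triplet 1 has equal ap/an distances, so it contributes exactly alpha.
f_a = np.array([[0.0, 0.0], [0.0, 0.0]])
f_p = np.array([[0.1, 0.0], [1.0, 0.0]])
f_n = np.array([[2.0, 0.0], [1.0, 0.0]])
loss = triplet_loss(f_a, f_p, f_n)  # 0.0 + 0.2
```

The `[...]_+` hinge means triplets that already satisfy Eq. (1) contribute zero gradient, which is exactly why the next paragraph argues for selecting hard triplets.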

Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model.

## Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1)

This means that, given $x_i^a$, we want to select the hard positive and the hard negative:

$$\mathrm{argmax}_{x_i^p}||f(x_i^a)-f(x_i^p)||^2_2$$

$$\mathrm{argmin}_{x_i^n}||f(x_i^a)-f(x_i^n)||^2_2$$
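As a hedged NumPy sketch, this hard-positive / hard-negative selection for a single anchor might look as follows (the function name and candidate arrays are illustrative assumptions):

```python
import numpy as np

def hard_exemplars(f_anchor, f_positives, f_negatives):
    """Return indices of the hard positive (farthest same-identity embedding)
    and the hard negative (closest other-identity embedding) for one anchor."""
    d_pos = np.sum((f_positives - f_anchor) ** 2, axis=1)  # anchor-positive distances
    d_neg = np.sum((f_negatives - f_anchor) ** 2, axis=1)  # anchor-negative distances
    return int(np.argmax(d_pos)), int(np.argmin(d_neg))

f_a = np.array([0.0, 0.0])
f_p = np.array([[0.1, 0.0], [0.9, 0.0]])  # positives: index 1 is farthest
f_n = np.array([[2.0, 0.0], [0.5, 0.0]])  # negatives: index 1 is closest
hp, hn = hard_exemplars(f_a, f_p, f_n)    # hard positive 1, hard negative 1
```

In practice these argmax/argmin operations run only over a subset or mini-batch, for the infeasibility reasons given below.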

It is infeasible to compute the argmin and argmax across the whole training set.

Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

1. Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

2. Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.

To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.
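A sketch of this per-identity quota sampling (the ~40-faces-per-identity figure follows the paper; the function name, identity count, and API are my own assumptions, and the randomly sampled extra negatives mentioned above are omitted for brevity):

```python
import random

def sample_minibatch(images_by_identity, faces_per_identity=40,
                     num_identities=45, seed=0):
    """Draw a mini-batch with a fixed quota of faces per identity, so each
    anchor has a guaranteed minimum number of in-batch positives."""
    rng = random.Random(seed)
    n = min(num_identities, len(images_by_identity))
    chosen = rng.sample(sorted(images_by_identity), n)
    batch = []
    for ident in chosen:
        imgs = images_by_identity[ident]
        k = min(faces_per_identity, len(imgs))
        batch.extend((ident, img) for img in rng.sample(imgs, k))
    return batch

# Toy data: 3 identities with 5 images each, quota of 2 faces per identity.
data = {f"id{i}": [f"img{i}_{j}" for j in range(5)] for i in range(3)}
batch = sample_minibatch(data, faces_per_identity=2, num_identities=3)
```

Every identity in the batch then has at least one positive available for every anchor, which is what makes in-batch anchor-positive distances meaningful.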

Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives.

We don’t have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.

We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.

Note: combining offline and online generation might allow smaller batch sizes, but the experiments were inconclusive.

Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. f(x) = 0). In order to mitigate this, it helps to select $x_i^n$ such that

$$||f(x_i^a)-f(x_i^p)||^2_2<||f(x_i^a)-f(x_i^n)||^2_2$$

We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin α.
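A NumPy sketch of this semi-hard selection rule (the function name and fallback behaviour are illustrative assumptions):

```python
import numpy as np

def semi_hard_negative(f_a, f_p, negatives, alpha=0.2):
    """Pick a semi-hard negative: farther from the anchor than the positive,
    yet still inside the margin alpha, so the triplet remains active.
    negatives: (M, d) array of candidate negative embeddings."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - negatives) ** 2, axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + alpha)  # the semi-hard band
    if not mask.any():
        return None  # caller would fall back, e.g. to a random negative
    idx = np.flatnonzero(mask)
    return int(idx[np.argmin(d_an[idx])])  # hardest semi-hard candidate

f_a = np.array([0.0, 0.0])
f_p = np.array([0.5, 0.0])                       # d_ap = 0.25
negs = np.array([[0.6, 0.0], [2.0, 0.0], [0.55, 0.0]])
picked = semi_hard_negative(f_a, f_p, negs)      # index 2 lies closest inside the band
```

Excluding the hardest negatives (those with $d_{an} \le d_{ap}$) is what protects against the collapsed-model local minimum described above.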

As mentioned before, correct triplet selection is crucial for fast convergence.

On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20].

On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient.

The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches.

Note: the main constraint on batch size is the way we select hard, relevant triplets from within the mini-batches.

In most experiments we use a batch size of around 1,800 exemplars


## Deep Convolutional Networks

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin α is set to 0.2.

The first category, shown in Table 1, adds 1×1×d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.

The second category we use is based on GoogLeNet style Inception models [16]. These models have 20× fewer parameters (around 6.6M–7.5M) and up to 5× fewer FLOPS (between 500M and 1.6B).