Oleg Zabluda's blog
Wednesday, September 21, 2016
OpenFace version 0.2.0 that improves the accuracy from 76.1% to 92.9%, [...], and decreases the deep neural network training time from a week to a day. This blog post summarizes OpenFace 0.2.0 and intuitively describes the accuracy- and performance-improving changes.
The network computes a 128-dimensional embedding on a unit hypersphere and is optimized with a triplet loss function as defined in the FaceNet paper. (http://arxiv.org/abs/1503.03832). A triplet is a 3-tuple of an anchor embedding, positive embedding (of the same person), and negative embedding (of a different person). The triplet loss minimizes the distance between the anchor and positive and penalizes small distances between the anchor and negative that are “too close.”
The original OpenFace training code randomly selects anchor and positive images from the same person and then finds what the FaceNet paper describes as a ‘semi-hard’ negative. The images are passed through three different neural networks with shared parameters so that a single network can be extracted at the end to be used as the final model.

Using three networks with shared parameters is a valid optimization approach, but inefficient because of compute and memory constraints. We can only send 100 triplets through three networks at a time on our Tesla K40 GPU with 12GB of memory. Suppose we sample 20 images per person from 15 people in the dataset. Selecting every combination of 2 images from each person for the anchor and positive images and then selecting a hard-negative from the remaining images gives 15*(20 choose 2) = 2850 triplets. This requires 29 forward and backward passes to process 100 triplets at a time, even though there are only 300 unique images. In attempt to remove redundant images, the original OpenFace code doesn’t use every combination of two images from each person, but instead randomly selects two images from each person for the anchor and positive.

Bartosz’s insight is that the network doesn’t have to be replicated with shared parameters and that instead a single network can be used on the unique images by mapping embeddings to triplets.

Now, we can sample 20 images per person from 15 people in the dataset and send all 300 images through the network in a single forward pass on the GPU to get 300 embeddings. Then on the CPU, these embeddings are mapped to 2850 triplets that are passed to the triplet loss function, and then the derivative is mapped back through to the original image for the backwards network pass. 2850 triplets all with a single forward and single backward pass!

Another change in the new training code is that given an anchor-positive pair, sometimes a “good” negative image from the sampled images can’t be found. In this case, the triplet loss function isn’t helpful and the triplet with the anchor-positive pair is not used.


| |


Powered by Blogger