How do facial recognition algorithms work in 2021?

Aniket Maurya

Aug 3, 2021 · 4 min read

Updated: Sep 8, 2021

Building a robust facial recognition system that is free of bias (racial or gender) is not an easy task. Algorithms don't create bias. It comes from humans.

The last decade was full of new state-of-the-art algorithms and groundbreaking research in the field of Deep Learning. New Computer vision algorithms were introduced.

It all started in 2012, when AlexNet, a deep convolutional neural network, achieved high accuracy on the ImageNet dataset (a dataset with more than 14 million images).

How do humans recognize a face?

Probably, the neurons in the brain first pick the face out of the scene (separating it from the person's body and the background), extract its facial features, and use those features to identify the person. In effect, we have been trained on an enormous dataset with an enormously large neural network.

Facial recognition in machines is implemented the same way. First, we apply a face detection algorithm to find faces in the scene. Then we extract facial features from the detected faces. Finally, we use a classification algorithm to identify the person.
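
The three stages can be sketched as a small pipeline. This is only an illustration of the flow, not the system described in this post: the names `FaceRecognizer`, `fake_detect`, `fake_embed`, and `fake_classify` are hypothetical, and the stand-in functions return fixed values instead of running real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected face
Embedding = Sequence[float]      # feature vector describing one face


@dataclass
class FaceRecognizer:
    """Chains the three stages; each stage is a pluggable callable."""
    detect: Callable[[object], List[Box]]      # stage 1: image -> face boxes
    embed: Callable[[object, Box], Embedding]  # stage 2: image + box -> features
    classify: Callable[[Embedding], str]       # stage 3: features -> identity

    def recognize(self, image: object) -> List[Tuple[Box, str]]:
        results = []
        for box in self.detect(image):
            features = self.embed(image, box)
            results.append((box, self.classify(features)))
        return results


# Hypothetical stand-ins so the sketch runs without any trained model.
def fake_detect(image):
    return [(10, 20, 64, 64)]


def fake_embed(image, box):
    return [0.1, 0.2, 0.3]


def fake_classify(features):
    return "alice"


recognizer = FaceRecognizer(fake_detect, fake_embed, fake_classify)
print(recognizer.recognize(image=None))
```

Keeping each stage behind a plain callable means a detector can be swapped for another one without touching the other two stages.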

The workflow of a Facial Recognition Algorithm

Workflow of facial recognition systems

1. Face detection

Face detection is a specialized version of object detection in which there is only one class of object to detect: the human face. Just as Computer Science has trade-offs between computational time and space, Machine Learning algorithms involve a trade-off between inference speed and accuracy. There are many object detection algorithms out there, and each one balances speed and accuracy differently.

We evaluated different state-of-the-art object detection algorithms:

OpenCV (Haar-Cascade)
MTCNN
YoloV3 and Yolo-Tiny
SSD
BlazeFace
ShuffleNet and Faceboxes

To build a robust face detection system, we need an algorithm that is accurate and fast enough to run in real time on a GPU as well as on a mobile device.
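
One simple way to check the "fast enough" part is to time a detector over a batch of frames. The sketch below is an assumption about how such a measurement could look, not the benchmark used for this post. `measure_fps` and `dummy_detector` are hypothetical names, and the dummy workload stands in for a real model.

```python
import time
from typing import Callable, Iterable


def measure_fps(detector: Callable[[object], object], frames: Iterable[object]) -> float:
    """Return the average number of frames per second processed by `detector`."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        detector(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def dummy_detector(frame):
    # Placeholder workload standing in for a real face detector.
    return sum(i * i for i in range(1000))


fps = measure_fps(dummy_detector, frames=range(200))
print(f"{fps:.0f} frames/s")
```

Running the same helper on each target device (GPU, CPU, phone) gives comparable numbers for the speed side of the trade-off.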

Accuracy

In real-time inference on streaming video, people can have different poses, occlusions, and lighting effects on their faces. It is important to precisely detect faces in various lighting conditions as well as poses.

Detecting faces in various poses and lighting conditions

OpenCV (Haar-Cascade)

We started with the Haar-Cascade implementation in OpenCV, an open-source computer vision library written in C and C++.

Pros: Because the library is written in C and C++, it is very fast for inference in real-time systems.

Cons: This implementation could not detect faces in profile, and it performed poorly across different poses and lighting conditions.

MTCNN

MTCNN (Multi-task Cascaded Convolutional Networks) is based on deep learning. It uses a cascade of deep convolutional neural networks to detect faces.

Pros: It had better accuracy than the OpenCV Haar-Cascade method.

Cons: Higher run time than the Haar-Cascade method.

YOLOv3

YOLO (You Only Look Once) is a state-of-the-art deep learning algorithm for object detection. It stacks many convolutional layers to form a deep CNN model. (Deep means the model has many layers, so its architecture is large and complex.)

The original YOLO model can detect 80 different object classes with high accuracy. We used this model to detect only one class: the face. We trained it on the WiderFace dataset (an image dataset containing 393,703 labeled faces).

There is also a miniature version of the YOLO algorithm, Yolo-Tiny, which trades some accuracy for less computation time. We trained a Yolo-Tiny model on the same dataset, but its bounding box results were not consistent.
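
A common way to put a number on bounding box quality is intersection over union (IoU) between a predicted box and a labeled one. The post does not say how consistency was judged, so treat this as an illustrative assumption. The `iou` helper and the `(x1, y1, x2, y2)` box format are choices made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (5, 5, 15, 15)
labeled = (0, 0, 10, 10)
print(f"IoU = {iou(predicted, labeled):.3f}")
```

An IoU of 1.0 means the boxes match exactly and 0.0 means they do not overlap. A detector whose IoU swings widely on similar frames would show up as inconsistent.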

Pros: Very accurate in our evaluation. Faster than MTCNN.

Cons: Because the network has a very large number of layers, it needs more computational resources. It is therefore slow on CPUs and mobile devices, and on a GPU it takes more VRAM because of its large architecture.

SSD

SSD (Single Shot Detector) is also a deep convolutional neural network model like YOLO.

Pros: Good accuracy. It can detect various poses, illumination, and occlusions. Good inference speed.

Cons: Inferior to the YOLO model. Although its inference speed was good, it was still not fast enough to run on a CPU, a low-end GPU, or mobile devices.

BlazeFace

As its name suggests, BlazeFace is a blazingly fast face detection algorithm released by Google. It accepts a 128x128 input image. Its inference time is sub-millisecond on mobile GPUs, and it is optimized for use on mobile phones. It is so fast for two reasons:

It is a specialized face detector model, unlike YOLO and SSD, which were originally created to detect a large number of classes. Thus BlazeFace has a smaller Deep Convolutional Neural Network architecture than YOLO and SSD.
It uses Depthwise Separable Convolution instead of standard Convolution layers, which leads to fewer computations.
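
The saving from depthwise separable convolution can be checked with a little arithmetic. The sketch below counts multiply-accumulate operations for one layer, assuming stride 1 and an output the same size as the input. The layer sizes are illustrative values, not BlazeFace's actual configuration.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 conv mixes the channels
    return depthwise + pointwise


# Illustrative layer: 64 x 64 feature map, 32 -> 64 channels, 3 x 3 kernel.
std = standard_conv_macs(64, 64, 32, 64, 3)
sep = separable_conv_macs(64, 64, 32, 64, 3)
print(f"standard:  {std:,} MACs")
print(f"separable: {sep:,} MACs")
print(f"reduction: {std / sep:.1f}x")
```

For this example layer the separable version needs roughly 7.9 times fewer multiply-accumulates. In general the ratio is 1 / (1/c_out + 1/k²), so the saving grows with the number of output channels and the kernel size.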

Pros: Very good inference speed and accurate face detection.

Cons: This model is optimized for detecting faces from a mobile phone camera, so it expects the face to cover most of the image. It does not work well when the face is small, so it performs poorly on CCTV camera images.
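
A rough calculation shows why small faces are a problem for a 128x128 input. The sketch assumes the whole frame is resized directly to the model input, with no cropping or tiling. The frame and face sizes are made-up illustrative values, and `face_size_after_resize` is a hypothetical helper.

```python
def face_size_after_resize(face_px, frame_w, frame_h, input_size=128):
    """Approximate (width, height) in pixels of a square face after the
    whole frame is resized to input_size x input_size."""
    return (face_px * input_size / frame_w, face_px * input_size / frame_h)


# Selfie-style frame: a 400 px face in a 720 x 720 image.
selfie = face_size_after_resize(400, 720, 720)

# CCTV-style frame: a 40 px face in a 1920 x 1080 image.
cctv = face_size_after_resize(40, 1920, 1080)

print(f"selfie face: {selfie[0]:.1f} x {selfie[1]:.1f} px")
print(f"cctv face:   {cctv[0]:.1f} x {cctv[1]:.1f} px")
```

Under these assumptions the selfie face still spans about 71 pixels after resizing, while the CCTV face shrinks to only a few pixels, which leaves the network very little to work with.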

Faceboxes

The latest face detection algorithm we used is Faceboxes. Like BlazeFace, it is a deep convolutional neural network with a small architecture, designed for just one class: the human face. It runs in real time on a CPU. Its accuracy is comparable to YOLO for face detection, and it can precisely detect both small and large faces in an image.

Pros: Fast inference speed and good accuracy.

Cons: Evaluation is in progress.

2. Feature extraction

After detecting faces in an image, we crop the faces and feed them to a feature extraction algorithm, which creates a face embedding: a multi-dimensional vector (usually 128- or 512-dimensional) representing the features of the face. We used the FaceNet algorithm to create the face embeddings.

The embedding vectors represent a person's facial features. The embeddings of two different images of the same person will therefore be close together, while the embeddings of different people will be farther apart. The distance between two vectors is calculated using Euclidean distance.
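
The distance comparison can be sketched in a few lines. The 4-dimensional vectors below are made-up toy values (real embeddings would have 128 or 512 dimensions), and the `threshold` of 1.0 is a hypothetical cut-off chosen for illustration, not a value from this post.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_person(a, b, threshold=1.0):
    """Treat two embeddings as the same person if they are closer than `threshold`."""
    return euclidean_distance(a, b) < threshold


# Toy embeddings: two photos of one person and one photo of someone else.
anna_photo_1 = [0.10, 0.90, 0.30, 0.20]
anna_photo_2 = [0.15, 0.85, 0.35, 0.20]
bob_photo = [0.90, 0.10, 0.70, 0.60]

print(euclidean_distance(anna_photo_1, anna_photo_2))  # small distance
print(euclidean_distance(anna_photo_1, bob_photo))     # larger distance
print(same_person(anna_photo_1, anna_photo_2), same_person(anna_photo_1, bob_photo))
```

In practice a threshold like this would be tuned on a validation set of known matching and non-matching pairs.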

3. Face classification

After getting the face-embedding vectors, we trained a classification algorithm, K-nearest neighbors (KNN), to identify the person from their embedding vector.

Suppose in an organization there are 1000 employees. We create face-embeddings of all the employees and use the embedding vectors to train a classification algorithm that accepts face-embedding vectors as input and returns the person's name.
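
A minimal KNN classifier over embeddings might look like the sketch below. It is a from-scratch illustration using only the standard library, not the implementation used for this post. The 2-dimensional vectors, the employee names, and the `knn_classify` helper are made up for the example.

```python
import math
from collections import Counter


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the most common label among the k training vectors nearest to `query`."""
    # Pair every training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(query, vec), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy "employee database": three embeddings per person.
train_vectors = [
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],  # alice
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],  # bob
]
train_labels = ["alice", "alice", "alice", "bob", "bob", "bob"]

print(knn_classify([0.12, 0.88], train_vectors, train_labels))
print(knn_classify([0.82, 0.18], train_vectors, train_labels))
```

KNN has no real training step beyond storing the embeddings, so adding a new employee only means appending their vectors and name to the lists.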

A user could apply a filter that modifies specific pixels in an image before putting it on the web. These changes are imperceptible to the human eye but are very confusing for facial recognition algorithms - ThalesGroup

New tech brings new opportunities

Advancements in facial recognition have taken great leaps. But this is only the beginning of the technological revolution. Imagine how powerful the duo of facial recognition algorithms and chatbot technology is. It's never too late to become a part of this movement.

Thank you for reading!