Using Videos For Facial Recognition
Facial recognition has been rapidly attracting interest in the computer vision field. In the past few years, various facial recognition frameworks have achieved high accuracy on still-image datasets.
However, there has always been vast interest in video-based facial recognition, driven by ever-growing practical needs: security systems, criminal investigation, credit card verification, attendance tracking, phone unlocking, forensic investigations, web applications, and business intelligence.
This technology can be applied in many settings, notably to help protect citizens. As shown in Figure 1, the global facial recognition market is projected to grow significantly over the next decade, driven especially by the increasing demand for surveillance systems.
Figure 1: Global face recognition market by application.
Facial recognition mainly consists of the following steps:
Face detection. A face detector locates the faces in an image and marks each one with a bounding box.
Face alignment. The face images are cropped and scaled so that the facial features fall at fixed locations. This step usually relies on a facial landmark detector.
Face representation. The pixel values of a face image are transformed into feature vectors in this stage. Ideally, all the faces of the same subject should map to similar feature vectors.
Face matching. The feature vectors are compared against those of the known faces in a dataset in order to find which person each face belongs to.
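The matching step can be illustrated with a minimal sketch. It assumes the earlier stages have already produced one feature vector per face, and it uses cosine similarity with a fixed threshold, which is one common choice rather than the only one. The names match_face and gallery, the threshold value, and the toy three-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def match_face(probe, gallery, threshold=0.5):
    """Return (identity, score) for the most similar gallery entry.

    The identity is None when even the best score is below the threshold.
    The threshold of 0.5 is an illustrative assumption, not a tuned value.
    """
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Hypothetical feature vectors; real systems use far more dimensions.
gallery = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.2, 0.9]}
identity, score = match_face([0.8, 0.2, 0.1], gallery)
print(identity, round(score, 3))
```

In practice the feature vectors would come from the representation step, and the threshold would be chosen on validation data.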
The goal of facial recognition is to locate and identify the faces in the given images, capturing only the important objects in specific areas while ignoring the rest.
Unlike still-image methods, video-based facial recognition can exploit the spatio-temporal information carried by a series of frames.
This makes it possible to collect samples of the desired subject under a wide range of conditions and provides richer information about the input data. These samples can then be used to improve the accuracy of video-based recognition methods.
Several approaches to facial recognition already exist. In early research, image processing was used to describe the geometry of faces.
Researchers adopted statistical subspace methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), based on hand-crafted features.
The main idea was to input entire faces and let the feature extractors describe the texture of an image at different locations.
After that, the 3D histograms of texture (3DHoTs) model is used to combine the features with their texture information in sequential order to perform recognition.
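To make the subspace idea concrete, here is a minimal sketch of PCA that estimates only the first principal direction, using power iteration in plain Python. It is an illustration under stated assumptions, not the method of any particular paper: the function names are hypothetical, the two-dimensional toy rows stand in for flattened face images, and a real system would keep many components and use a linear algebra library.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def first_principal_component(rows, iterations=100):
    """Estimate the top principal direction by power iteration.

    Returns (mean, direction). The start vector is an arbitrary
    non-uniform choice; if it happens to be orthogonal to the top
    direction, power iteration cannot recover that direction.
    """
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    dim = len(mu)
    v = [float(j + 1) for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # Apply X^T X to v without building the covariance matrix.
        proj = [sum(c * vi for c, vi in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance along the current direction
        v = [x / norm for x in w]
    return mu, v


def project(row, mu, v):
    """Coordinate of one sample along the principal direction."""
    return sum((x - m) * vi for x, m, vi in zip(row, mu, v))


# Toy data lying close to the diagonal of the plane.
rows = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]]
mu, v = first_principal_component(rows)
print([round(project(r, mu, v), 3) for r in rows])
```

Each sample is reduced to its coordinate along the learned direction, which is the same compression that subspace methods apply to face images at a much larger scale.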
Another popular approach to facial recognition is rank pooling. This approach first detects and extracts the objects’ temporal motions and then ranks these features in temporal order.
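As a rough illustration of the idea, the sketch below collapses a sequence of per-frame feature vectors into a single vector that reflects how the features change over time. Full rank pooling fits a ranking model to the frames and uses its parameters as the video descriptor; this sketch instead uses a simplified closed-form weighting, alpha_t = 2t - T - 1, which should be read as an assumption of the example. The function name and the toy data are hypothetical.

```python
def approximate_rank_pooling(frame_features):
    """Pool a time-ordered list of feature vectors into one vector.

    Frame t (1-based) out of T frames gets the weight 2t - T - 1, so
    early frames count negatively and late frames positively. Features
    that stay constant over time cancel out; features that grow over
    time end up positive.
    """
    total_frames = len(frame_features)
    if total_frames == 0:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for t, features in enumerate(frame_features, start=1):
        alpha = 2 * t - total_frames - 1
        for j in range(dim):
            pooled[j] += alpha * features[j]
    return pooled


# Toy sequence: the first feature grows over time, the second is constant.
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
print(approximate_rank_pooling(frames))
```

Reversing the order of the frames flips the sign of the pooled vector, which shows that the descriptor encodes temporal order rather than just an average.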
However, the major issues with this research are the scale of the recognition, its efficiency, and the difficulty of dealing with unconstrained backgrounds. Current models do not support complete frame mapping; only crucial features are detected and recognized.
The algorithms in these models should be improved and optimized to analyze the entire domain. In addition, the current models are not computationally efficient and may require substantial resources and time. Better use of GPUs can help here.
Thanks to the emergence of deep learning methods, convolutional neural networks (CNNs) can overcome these challenges for video-based facial recognition.
CNNs can be trained on large amounts of data. After training, the model can be used to recognize objects that are not present in the training dataset. Furthermore, deeper architectures can capture rich semantic features.
A video is often captured in an unconstrained environment and is therefore likely to contain low-quality frames. In order to detect the faces in those frames, some research aims to exploit the relationships between instances across video frames.
Perhaps the data from neighboring frames can be combined or aggregated, or used to recover the current frame?
Building on this idea, we can combine deep learning methods with attention mechanisms. The input to the CNN is an entire video, which carries rich temporal coherence information. The attention model is used to identify the valuable frames.
Since the frames in a video vary in pose, illumination, facial expression, and image quality, not all frames are useful for facial recognition. Discarding the useless frames lets the network extract meaningful features for recognizing faces.
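The frame-weighting idea can be sketched as follows. The example assumes each frame already has a feature vector and a scalar quality score; in a real system both would come from trained networks, whereas here they are hypothetical hand-written numbers. A softmax turns the scores into weights, so that low-quality frames contribute almost nothing to the aggregated descriptor. This is a soft version of discarding frames, chosen for illustration only.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(frame_features, quality_scores):
    """Attention-weighted average of per-frame feature vectors.

    Returns (pooled_vector, weights). Frames with low scores receive
    weights close to zero and are effectively ignored.
    """
    if not frame_features or len(frame_features) != len(quality_scores):
        raise ValueError("need one quality score per frame")
    weights = softmax(quality_scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * f[j] for w, f in zip(weights, frame_features))
        for j in range(dim)
    ]
    return pooled, weights


# Hypothetical video: the middle frame is blurry and gets a low score.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = [5.0, -5.0, 5.0]
pooled, weights = aggregate_frames(features, scores)
print([round(x, 4) for x in pooled], [round(w, 4) for w in weights])
```

The pooled vector stays close to the features of the two high-scoring frames, which is the behavior the attention model is meant to provide.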
Going forward, new methods may be applied to facial recognition that could cause several of these methods to become obsolete. This field will surely attract a huge amount of scientific research in the near future.
Written by Qilin Guo
Edited by Alexander Fleiss, Xujia Ma, Michael Ding, Jack Argiro, Gihyen Eom, Calvin Ma & Helen Wu