Face Detection and Match with TypeScript, HTML5 and Cloud
The posts in this series build on one another to piece together an application that will ultimately snap a photo of a person, auto-crop to the person's face and validate that person against a database of registered, whitelisted faces.
In part 0, this article, we introduce our mini project.
In part 1 we will use TypeScript to build the app, HTML5 JavaScript APIs to snap a photo with the user's device, and Azure to extract facial data from the image we collect.
In part 2 we will introduce face-api.js to detect and auto-crop our faces in the client.
In part 3 we will extend the application with the ability to upload 'registered' user faces, so we can later match new photos against this whitelist and mark each attempt as matched or unmatched against our registered users.
In part 4 we will explore alternate AI cloud providers to achieve our functionality.
Lastly, in part 5, we will add biometric security to our client using WebAuthn! The specification is still being finalized, but browsers already support the mechanism today, which means we can implement fingerprint security right in our TypeScript client.
First we will give a brief overview of current approaches to facial detection. We won't go into too much depth here since there are already some excellent resources available for learning the machine learning techniques involved.
Sneak Peek Behind the Scenes
So, getting to the point: most facial recognition today is done via a method called deep learning. Well… more precisely, the tools use the product of deep learning.
Deep learning attempts to find patterns in input data, which is just arrays of numbers (or arrays of arrays of numbers, to the Nth degree). The more data you feed to the deep learning routine, the stronger the pattern recognition will be (if it finds any). The product at the end of this is called a model. A model is a set of numbers that we calculate our input data against when we want to make a prediction in production.
When we give a deep learning routine an image, we are giving it a matrix of numbers: width x height x 3 (RGB). So we are providing a 3-dimensional matrix of number values. The same concept applies if we were using audio files, tabular data, etc… it's all just matrices of numbers.
Starting from the very first image (number matrix) we give to our deep learning routine, we are 'steering' our model towards a less incorrect version of itself on every iteration. If you think about it, when we first start out our model is random (or initialized to zero), so its performance is essentially 50/50. The probability of it predicting patterns correctly is 50%; you might as well flip a coin.
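To make that concrete, here is a minimal TypeScript sketch (assuming the snapped photo is already drawn onto an HTMLCanvasElement) that turns the canvas pixels into the matrix of RGB values described above:

```typescript
// Minimal sketch: turning a canvas image into the matrix of numbers a deep
// learning routine consumes. Assumes `canvas` already holds our snapped photo.
function imageToMatrix(canvas: HTMLCanvasElement): number[][][] {
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2D context unavailable");

  const { width, height } = canvas;
  const { data } = ctx.getImageData(0, 0, width, height); // flat RGBA bytes

  const matrix: number[][][] = [];
  for (let y = 0; y < height; y++) {
    const row: number[][] = [];
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;                      // 4 bytes per pixel (RGBA)
      row.push([data[i], data[i + 1], data[i + 2]]);      // keep RGB, drop alpha
    }
    matrix.push(row);
  }
  return matrix; // height x width x 3 matrix of 0-255 values
}
```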
So we need to update our model on every iteration and the training routine does so by measuring the error between the predictions it came up with and what the truth is. The delta of truth minus prediction is an indicator of how much the training routine needs to alter the model so it will perform slightly better on the next training image.
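As a toy illustration of that loop (a single made-up weight, not any real training framework), here is what "measure the error, nudge the model" looks like in TypeScript:

```typescript
// Toy illustration of the "measure error, nudge the model" loop described above.
// One weight and one input -- real training does this across millions of parameters.
let weight = 0;                 // our "model", initialized to zero
const learningRate = 0.01;

function trainStep(input: number, truth: number): void {
  const prediction = weight * input;
  const error = truth - prediction;        // delta of truth minus prediction
  weight += learningRate * error * input;  // alter the model to be slightly less wrong
}

// Feed it examples of "output = 2 * input"; the weight drifts toward 2 over the iterations.
for (let i = 0; i < 1000; i++) {
  const x = Math.random() * 10;
  trainStep(x, 2 * x);
}
console.log(weight.toFixed(3)); // ≈ 2.000
```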
After several thousand iterations (it could even be millions), our model is tuned to make very accurate predictions on input data.
The most popular deep learning technique for images is the Convolutional Neural Network (CNN). Rather than treating each data point in the input (each RGB value… remember, we're talking about images here) as an independent value, CNNs perform much better by splitting the image up into 'convolutions' and using a filtering technique that recognizes patterns and repeated features: think of what a mouth basically looks like, or eyes, ears or hairlines. This approach is also much more flexible about where those features appear in the image.
You can think of it as the CNN searching the entire image for feature patterns and for the position of these features relative to one another.
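Here is a rough sketch of that filtering idea in TypeScript. The image is grayscale for brevity and the kernel is a hand-written vertical-edge detector rather than anything a CNN actually learned, but it shows how a small filter slides across the whole image and responds to a pattern wherever it occurs:

```typescript
// Slide a small kernel over the image and score how strongly each neighbourhood
// matches the pattern the kernel encodes. Values here are illustrative only.
function convolve(image: number[][], kernel: number[][]): number[][] {
  const k = kernel.length;                 // assume a square k x k kernel
  const out: number[][] = [];
  for (let y = 0; y + k <= image.length; y++) {
    const row: number[] = [];
    for (let x = 0; x + k <= image[0].length; x++) {
      let sum = 0;
      for (let ky = 0; ky < k; ky++) {
        for (let kx = 0; kx < k; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);                       // large values = the pattern was found here
    }
    out.push(row);
  }
  return out;
}

// A classic vertical-edge kernel: responds wherever brightness changes left-to-right,
// no matter where in the image that edge happens to sit.
const verticalEdge = [
  [1, 0, -1],
  [1, 0, -1],
  [1, 0, -1],
];
```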
The inner workings of deep learning use neural net layers. In the simplest explanation I can come up with, these networks break a non-linear problem down into a bunch of linear equations (plus non-linear activations). Being able to turn a non-linear problem into a system of linear ones is important because now we can apply some calculus to calculate error and so on. The more layers the network has, the 'deeper' the model. Hence the 'deep' in deep learning. I won't go into detail on this because Andrew Ng is a BOSS at explaining it…
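For the curious, a single layer boils down to a linear step (weights and a bias) followed by a non-linear activation, and stacking several of them is what makes a model "deep". A minimal, hand-rolled TypeScript sketch (not a real framework; in practice the numbers come from training):

```typescript
// Rough sketch of what one neural-net "layer" computes: a linear step (weights + bias)
// followed by a non-linear activation. Stacking several of these is the "deep" part.
type Layer = { weights: number[][]; biases: number[] };

const relu = (v: number) => Math.max(0, v);   // a common non-linear activation

function forward(input: number[], layer: Layer): number[] {
  return layer.weights.map((row, i) => {
    const linear = row.reduce((sum, w, j) => sum + w * input[j], layer.biases[i]);
    return relu(linear);                       // non-linearity applied after the linear math
  });
}

// Running an input through a stack of layers = a forward pass through a deep model.
function predict(input: number[], layers: Layer[]): number[] {
  return layers.reduce((activations, layer) => forward(activations, layer), input);
}
```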
I sincerely recommend checking out the Deep Learning Specialization on Coursera for a MUCH more in-depth dive into deep learning. The specialization is spear-headed by Andrew Ng, one of THE top dogs in this field, and the way he explains these concepts comes across as unintimidating and easy to learn, even if you're rusty on math.
The tools we will be using abstract these models away, so you can take advantage of facial detection without (mostly) knowing how it happens, but I would still strongly recommend learning how deep learning and image processing work. Sites like Coursera and edX make learning this material easy.
Moving On
Next we will work on obtaining an image from an HTML5 web application for us to use in our workflows.
You can check out the next post in this series here: Part 1