Deep Learning Based DIY Smart Camera System

Have fun building a deep learning based project that will make your security camera smarter.

The main aim of this article is to design a deep learning based smart security system that relies only on the images from RGB cameras.
This system can be applied to home/office security cameras.


The proposed system will be able to do the following tasks:

  1. Calculate no. of people in the room.
  2. Face Recognition
  3. Activity Tracking: To control peripherals in the room (AC/lights)

fig 1. Basic model

Let’s start!

Face Detection

Before face recognition can run, we first need face detection. There are multiple models available for face detection; some are built into OpenCV itself. Face detection algorithms can also be built from scratch using CNNs, but the main obstacle is assembling a proper dataset. If we could create a large enough dataset to train a CNN-based model, we could use our own model instead of the built-in algorithms in OpenCV or Dlib.
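To make this concrete, here is a minimal sketch of face detection with OpenCV's bundled Haar cascade. The video source index and the detector parameters are assumptions; tune them for your own camera.

```python
import cv2

# Load the frontal-face Haar cascade that ships with OpenCV
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # assumed: first attached camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # scaleFactor/minNeighbors control the speed-accuracy trade-off
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 255), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```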

Attributes to consider:

  1. Accuracy
  2. Speed
  3. Computation Power
  4. Use case (Actual working)

Accuracy:

Here’s a comparison of some of these algorithms on different datasets, where the accuracy score is computed from the number of detected faces against the actual ground truth.

However, counting detections this way is a misleading way to analyse these algorithms. A better way is to compare their AP_50 scores: AP_X is the average precision when a detection counts as correct only if its box overlaps the ground-truth box by at least X% (IoU ≥ 0.X). By this measure, the Dlib models have higher accuracy than the OpenCV models.
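As a quick illustration, here is the overlap measure behind AP_50: the intersection over union (IoU) of a detected box and a ground-truth box. Boxes are assumed to be (x, y, w, h) tuples, as OpenCV detectors return them.

```python
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# A detection counts as a true positive for AP_50 when iou(...) >= 0.5
```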

Speed and Computation

As shown in the figure, MMOD (with GPU) is significantly faster than the other models, but it requires a capable GPU, i.e. more computation.
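Published speed numbers vary with hardware, so a rough way to compare detectors on your own machine is to time each one over the same frames and report FPS. The test image and iteration count below are assumptions.

```python
import time
import cv2

def measure_fps(detect, frames):
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# Example: benchmark the Haar cascade from the earlier snippet
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
frames = [cv2.imread("sample.jpg")] * 100  # assumed test image on disk
gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
print("Haar FPS:", measure_fps(lambda g: cascade.detectMultiScale(g), gray))
```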

Use Case:

The Dlib models are very useful when the input images have high resolution, but they achieve high accuracy only under ideal conditions. They perform poorly on low-resolution images and in low-light environments, and they have high computational needs. Security cameras produce low-resolution footage in which faces may not be aligned and may be far from the camera. Moreover, a corporate office can have many cameras, and running heavy algorithms on every stream is computationally expensive. The OpenCV algorithms are better at detecting smaller and misaligned faces, and they are computationally cheaper than the alternatives. Hence it is better to use the built-in OpenCV algorithms for detection.

Anti-spoofing

The only problem with the OpenCV algorithms is that they can be spoofed easily, and they may detect false faces. Anti-spoofing techniques should be used when the system serves security purposes.

fig 4. Statues

Both the fig 3 and fig 4 cases can be handled by eye-tracking algorithms: a blink indicates that the person is alive. But detecting eyes for every face in the wide-angle footage of a camera is not feasible. In our use case the images come from indoor CCTV cameras, so the fig 3 scenario is unlikely (this assumption does not hold for high-security settings). To solve the fig 4 issue we can track faces over time. Statues are stationary objects, so face tracking will report zero movement for their faces at all times, and we can simply ignore any face with zero movement in our input, as sketched below.
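Here is a minimal sketch of that zero-movement filter: track each face's centre across frames and drop faces whose centre never leaves a small jitter region. The track IDs, window length and thresholds are all assumptions.

```python
from collections import defaultdict

MOVE_THRESHOLD = 3.0   # pixels; below this we treat the face as static
WINDOW = 300           # frames (~10 s at 30 fps) before declaring a statue

history = defaultdict(list)  # track_id -> list of (cx, cy) face centres

def is_live_face(track_id, box):
    x, y, w, h = box
    history[track_id].append((x + w / 2, y + h / 2))
    pts = history[track_id][-WINDOW:]
    history[track_id] = pts  # keep memory bounded
    if len(pts) < WINDOW:
        return True  # not enough evidence yet; assume live
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    # A face confined to a tiny region for the whole window is a "statue"
    return (max(xs) - min(xs) > MOVE_THRESHOLD) or \
           (max(ys) - min(ys) > MOVE_THRESHOLD)
```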

Face Recognition

After performing the face detection task, we can use face recognition models to identify the person. Multiple face recognition models are available, and recent research papers have published models with better-than-human accuracy.

These face recognition models are created by big organisations like Facebook and Google, which have trained them on 10⁶–10⁷ images with unique identities; the datasets are not available to the public. So, to create a new custom face recognition model, we would have to build an even larger dataset and would need far more computation. Models like VGGFace and CoCo are open source, can be used by any organisation, and have high accuracy.

In our use case, based on the room and office type, the no. of people per camera should not exceed 100. So we can favour models with higher accuracy over models designed to handle large crowds. Based on the research, we can use the VGGFace model in our use case.

VGGFace has high accuracy in recognising faces at any angle. The recent model is trained on the VGGFace2 dataset, which contains more than 3.3 million images.

The reason to use the VGGFace model over more recent models is that the recent models are trained on smaller datasets compared to VGGFace, and they are designed to optimise specific loss functions rather than general use. Most of them are also not open source and require licensing.

Actual Implementation:

The face recognition algorithm can be backed by a database containing face data for each employee, so that the algorithm can return an identity for those employees. Special face tracking can be triggered when the face (person) is unknown.
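As a sketch of this step, the snippet below assumes the open-source keras_vggface package (a Keras port of the VGGFace/VGGFace2 weights); the in-memory dict standing in for the employee database, the cosine-distance threshold, and the input being a 224x224 RGB face crop are all assumptions.

```python
import numpy as np
from keras_vggface.vggface import VGGFace
from keras_vggface.utils import preprocess_input

# ResNet50 backbone trained on VGGFace2, without the classifier head,
# so the network outputs a 2048-d embedding per face
model = VGGFace(model="resnet50", include_top=False,
                input_shape=(224, 224, 3), pooling="avg")

def embed(face_rgb_224):
    x = face_rgb_224.astype("float32")[np.newaxis]
    x = preprocess_input(x, version=2)  # version=2 for VGGFace2 models
    return model.predict(x)[0]

def identify(face_img, employee_db, threshold=0.5):
    """Return the closest employee by cosine distance, or None if unknown."""
    e = embed(face_img)
    best, best_dist = None, threshold
    for name, ref in employee_db.items():
        dist = 1 - np.dot(e, ref) / (np.linalg.norm(e) * np.linalg.norm(ref))
        if dist < best_dist:
            best, best_dist = name, dist
    return best  # None triggers the "unknown person" tracking path
```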

Counting no. of people

Once the system has performed face detection, we can use this data to count the no. of people in the room. In our use case, every part of the office is covered by CCTV cameras, so the count can be derived from the CCTV footage itself: by counting the no. of faces detected by each camera, we can estimate the no. of people in the room. The real challenge arises when multiple cameras in the same room have overlapping fields of view.

This problem can be solved with a previously created database holding an ID for each camera and its position (room no.) within the office. If two or more cameras in the same room detect the same person (via face recognition), we can ignore that person's face count from all but one of those cameras, as in the sketch below.
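A minimal version of this de-duplication: group detections by room using a camera-to-room table and count unique recognised identities once per room. The camera IDs, room labels and data shapes are assumptions.

```python
camera_room = {"cam_01": "room_A", "cam_02": "room_A", "cam_03": "room_B"}

def people_per_room(detections):
    """detections: list of (camera_id, identity_or_track_id) tuples."""
    rooms = {}
    for cam, ident in detections:
        rooms.setdefault(camera_room[cam], set()).add(ident)
    # The same identity seen by two cameras in one room is counted once
    return {room: len(idents) for room, idents in rooms.items()}

print(people_per_room([("cam_01", "alice"), ("cam_02", "alice"),
                       ("cam_02", "bob"), ("cam_03", "carol")]))
# -> {'room_A': 2, 'room_B': 1}
```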

The proposed solution is based entirely on face detection, but in reality we will not be able to detect faces at every moment. To back up this solution we can use object (human body) detection as well. Models like YOLO are designed specifically for object detection, and by using transfer learning we can fine-tune YOLO for person detection. To integrate the two detectors and keep the counts consistent, we can calculate the percentage overlap between face boxes and body boxes.

fig 8. The red bounding box is detected using object detection and the yellow one with face detection.

We assign the face data to the object data when the percentage overlap is >90%. But in a situation like fig 8, where the body is detected but the face is turned away from the camera and not detected, we use the object data alone instead of looking for face data. Hence, in this case we rely more on object data than on face data. A minimal version of this assignment rule is sketched below.
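Here the overlap is taken as the fraction of the face box that lies inside the person box; the (x, y, w, h) box format and the 0.9 threshold follow the text, while the pairing strategy is an assumption.

```python
def overlap_fraction(face, person):
    fx, fy, fw, fh = face
    px, py, pw, ph = person
    ix1, iy1 = max(fx, px), max(fy, py)
    ix2, iy2 = min(fx + fw, px + pw), min(fy + fh, py + ph)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    return inter / (fw * fh) if fw * fh else 0.0

def assign_faces(faces, persons):
    """Pair each person box with a face box, falling back to body-only."""
    pairs = []
    for person in persons:
        match = next((f for f in faces
                      if overlap_fraction(f, person) > 0.9), None)
        pairs.append((person, match))  # match is None when the face is hidden
    return pairs
```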

Another problem arises when two or more cameras have overlapping views and create a dark zone, where an object in front of a camera hides the objects behind it, as shown in fig 9. With the existing camera placement, solving this requires object data from the other cameras: we can estimate the number of people in a room by counting the no. of people leaving the adjacent rooms. The simple solution would be to position the cameras so that their views neither overlap nor create a dark zone. Still, to predict the no. of people more accurately, depth-sensing cameras are the ideal choice.

Activity Tracking

The main purpose of activity tracking is to control peripherals like the AC, fans and lights in the room. Ideally, we use activity tracking for energy saving. There are two ways to tackle this challenge.

  1. Tracking based on the previously created face and object data.
  2. Individual tracking based on skeleton data.

Tracking based on face data and object data

Our main goal is to save energy by controlling peripherals. To control the AC/lights in the room, we can use the previously gathered count of people, and we can use the face data as well.

We can set a timer for each face or object in each camera's view. If a person has been in the room for only a short time, we don't need to turn on the AC; we turn on peripherals like the AC/fan only after a threshold time has passed.

fig 10. Time tracking by face data

This solution is simple to implement and can give near-perfect results. The main issue occurs when the subject is not facing the camera or has just turned away: the counter would restart as new. This can be mitigated with the object data we've collected, and by giving each face a buffer period after it disappears (e.g. when the head turns away from the camera) before its timer resets, as in the sketch below. There is an even more robust way to track time for each person.
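A minimal sketch of the timer-plus-buffer logic: each identity accumulates presence time, survives short disappearances, and trips the AC once past a threshold. All the durations here are assumptions.

```python
import time

THRESHOLD = 10 * 60   # seconds of presence before switching the AC on
BUFFER = 30           # seconds a face may vanish before its timer resets

timers = {}  # identity -> {"first_seen": t, "last_seen": t}

def update(identity, now=None):
    now = now or time.time()
    t = timers.get(identity)
    if t is None or now - t["last_seen"] > BUFFER:
        timers[identity] = {"first_seen": now, "last_seen": now}  # (re)start
    else:
        t["last_seen"] = now

def should_turn_on_ac(identity, now=None):
    now = now or time.time()
    t = timers.get(identity)
    return t is not None and now - t["first_seen"] >= THRESHOLD
```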

Individual tracking based on skeleton data

As we are using only RGB cameras in our use case, skeleton tracking is slightly overkill for our system, but it yields more data and more detail about employees' actions.

Spatiotemporal volume-based methods require depth data to learn the skeleton during the training phase. Once trained, they can track primitive actions like walking, sitting, bending and standing. By combining a CNN with an LSTM network, we can build a model that recognises these actions over time (see the sketch below). Then, as in our previous approach, we can combine this with time tracking to know how long a person has been in the room.
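A minimal Keras sketch of the CNN + LSTM idea: a small CNN encodes each frame, TimeDistributed applies it across a clip, and an LSTM classifies the action. The input shape, clip length and action classes are assumptions, not the paper's architecture.

```python
from tensorflow.keras import layers, models

NUM_ACTIONS = 4  # e.g. walking, sitting, bending, standing

# Per-frame encoder: a tiny CNN producing a fixed-size feature vector
frame_encoder = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

model = models.Sequential([
    # 16-frame clips of 64x64 RGB images
    layers.TimeDistributed(frame_encoder, input_shape=(16, 64, 64, 3)),
    layers.LSTM(64),                       # temporal reasoning over the clip
    layers.Dense(NUM_ACTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```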

To save energy more efficiently, we can combine the no. of persons (count data) with the ambient temperature and the actual illumination of the room.
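For instance, a simple control rule along these lines might look as follows; the set-points and sensor inputs are assumptions, and in a real deployment they would come from the building's sensors and relays.

```python
def control_peripherals(people_count, ambient_temp_c, lux):
    actions = {"lights": False, "ac": False}
    if people_count > 0:
        actions["lights"] = lux < 300          # only if the room is dim
        actions["ac"] = ambient_temp_c > 26.0  # only if it is actually warm
    return actions

print(control_peripherals(people_count=3, ambient_temp_c=28.5, lux=120))
# -> {'lights': True, 'ac': True}
```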

In this way, we can create a DIY smart camera system based on deep learning algorithms.

Final proposed system

Final model

References:

  1. Face Detection in 2 Minutes using OpenCV & Python — Towards Data Science
  2. Using Face Detection with your Security Cameras / CCTV Camera World Knowledge Base
  3. A Gentle Introduction to Deep Learning for Face Recognition
  4. [1804.06655] Deep Face Recognition: A Survey
  5. Face recognition with Keras and OpenCV — Above Intelligent™ AI/ML
  6. How to Develop a Face Recognition System Using FaceNet in Keras
  7. OpenCV Face Recognition — PyImageSearch
  8. VGGFace2
  9. Exploring STIP-based models for recognizing human interactions in TV videos — ScienceDirect
  10. Human activity recognition with analysis of angles between skeletal joints using a RGB-depth sensor — İnce — ETRI Journal — Wiley Online Library
  11. Paper on tracking
  12. Face Detection — OpenCV, Dlib and Deep Learning ( C++ / Python )
