Computer Vision

Computer Vision for surveillance and security

Context

Computer vision is on the rise. The emergence of deep learning, combined with the falling cost of both computing power and data storage and an explosion in the number of pictures and videos taken, has led to a boom in potential uses of computer vision. Our client sells software that helps defense entities and security companies monitor physical sites and detect potential threats. Detection, once limited by a human’s capacity to stay focused on a screen without interruption, can now in theory be automated and run at scale 24/7 with the help of computer vision.

Nevertheless, there is a huge gap between theory and practice when it comes to computer vision, and the challenges in developing and deploying such a detection system are many. In our case they included the size of a picture (100+ Mb), the need for real-time processing, the noise in a picture (think of fog and reflections on the sea), the size of the objects to be detected (sometimes fewer than 12 pixels), the variability of those objects (if you want to detect a threatening boat, how do you define it?), and the limited number of labeled examples. Our client wanted support from experts to crack some of these challenges and called on MFG Labs to help in their R&D phase and find an appropriate solution.

Solutions

Preliminary work

The first part of such a mission is to be clear on what you want to achieve. In data projects, this usually means defining a relevant metric to follow and deciding how to optimize it. How do you define the ability to correctly detect a threat in a video? In our case, we set two objectives: maximize recall, i.e. make sure that no threat goes undetected, and make sure that each target is correctly detected, meaning that an object counts as detected only when the detection covers at least 50% of its surface.
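As a minimal sketch of these two objectives, the following Python computes the fraction of a labeled box covered by a detection and the resulting recall. The box layout (x_min, y_min, x_max, y_max), the function names and the reading of "50% of the surface" as coverage of the labeled box (rather than intersection over union) are assumptions made for illustration, not the project's actual evaluation code.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def coverage(target: Box, detection: Box) -> float:
    """Fraction of the target's surface overlapped by the detection."""
    intersection = (
        max(target[0], detection[0]),
        max(target[1], detection[1]),
        min(target[2], detection[2]),
        min(target[3], detection[3]),
    )
    target_area = area(target)
    if target_area == 0:
        return 0.0
    return area(intersection) / target_area


def recall(targets: List[Box], detections: List[Box],
           min_coverage: float = 0.5) -> float:
    """Share of labeled targets covered at least `min_coverage` by some detection."""
    if not targets:
        raise ValueError("recall is undefined without labeled targets")
    found = sum(
        1 for t in targets
        if any(coverage(t, d) >= min_coverage for d in detections)
    )
    return found / len(targets)


if __name__ == "__main__":
    labeled = [(0, 0, 10, 10), (20, 20, 30, 30)]
    detected = [(0, 0, 10, 6), (100, 100, 110, 110)]
    print(recall(labeled, detected))
```

In this toy example the first target is 60% covered and counts as detected, while the second is missed. Note that the sketch lets one detection match several targets, which a stricter evaluation would forbid.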

In parallel, we also spent some time exploring the dataset. Our dataset was made of 84,000 pictures in raw and JPEG format, of which around 10,000 were already labeled with 26,000 bounding boxes covering 900 objects. This dataset represents various video sequences taken while monitoring sea and port activities, cut into individual images to be analyzed. It was chosen to represent a range of situations in which conditions differ (weather and sea conditions, distance to the target or object to detect, etc.). The exploratory analysis helped us understand all the object classes we had to work on and their typical distribution, and start thinking about algorithmic approaches to test.

We were also able to get a sense of the specificities of IR images, to test different ways of converting them to a classical 8-bit JPEG image (a prerequisite for most computer vision algorithms), and to test basic temporal and spatial filters that highlight targets in some videos: if we see it better, so will the machine!
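One common way to bring a raw frame down to 8 bits is a percentile-based contrast stretch, sketched below in plain Python. This illustrates the general idea only: the bit depth of the raw IR frames is not stated here (the example assumes values wider than 8 bits), and the function name and percentile defaults are hypothetical. In practice this would be vectorised with an array library.

```python
from typing import List

Frame = List[List[int]]  # rows of raw pixel intensities


def to_8bit(frame: Frame, low_pct: float = 1.0, high_pct: float = 99.0) -> Frame:
    """Linearly map raw intensities to 0..255, clipping at two percentiles.

    Clipping at percentiles keeps a few very hot or very cold pixels
    from compressing the rest of the dynamic range.
    """
    values = sorted(v for row in frame for v in row)

    def percentile(p: float) -> int:
        # Nearest-rank percentile on the sorted pixel values.
        return values[round(p / 100 * (len(values) - 1))]

    lo, hi = percentile(low_pct), percentile(high_pct)
    if hi == lo:
        # Flat image: nothing to stretch.
        return [[0 for _ in row] for row in frame]

    return [
        [round((min(max(v, lo), hi) - lo) * 255 / (hi - lo)) for v in row]
        for row in frame
    ]


if __name__ == "__main__":
    raw = [[1000, 2000], [3000, 4000]]
    print(to_8bit(raw))
```

The choice of percentiles is a trade-off: a tighter window boosts contrast on faint targets but saturates bright areas such as sun reflections.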

The conclusion of this preliminary analysis was that different solutions may be needed along two dimensions: target size (very small and undefined targets vs. targets on which we can see the specific shape of a boat) and background type (noisy and cluttered sea vs. clear and steady sea). Our client’s priority was noisy backgrounds and small targets, but not missing big and obvious targets was also a requirement.

We then chose two different approaches for designing a detection algorithm suited to our client’s priority.

1st approach: Anomaly Detection algorithms

In the first approach, we consider targets as anomalies in the picture. This is especially true for very small boats isolated at sea, where we can barely distinguish anything more than a group of pixels that does not look like sea.

We tested a variety of image anomaly detection algorithms, such as mixed spatial and temporal filters, Robust Principal Component Analysis (RPCA), the NL-means method with a-contrario theory, and homemade variations taking the temporal dimension into account.
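To illustrate the general idea of treating a target as a deviation from a background model, here is a minimal sketch of a purely temporal filter: each pixel is compared with its median over previous frames. This is a simplified stand-in, not the RPCA or NL-means methods listed above, and the function names and threshold are hypothetical.

```python
from statistics import median
from typing import List, Tuple

Frame = List[List[float]]  # rows of pixel intensities


def temporal_anomaly_map(history: List[Frame], current: Frame) -> Frame:
    """Absolute deviation of each pixel from its median over past frames."""
    rows, cols = len(current), len(current[0])
    return [
        [
            abs(current[r][c] - median(f[r][c] for f in history))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def detect_anomalies(history: List[Frame], current: Frame,
                     threshold: float) -> List[Tuple[int, int]]:
    """(row, col) of the pixels whose deviation exceeds the threshold."""
    scores = temporal_anomaly_map(history, current)
    return [
        (r, c)
        for r, row in enumerate(scores)
        for c, score in enumerate(row)
        if score > threshold
    ]


if __name__ == "__main__":
    past = [
        [[10, 11, 10], [12, 10, 11]],
        [[11, 10, 10], [11, 10, 12]],
        [[10, 10, 11], [12, 11, 11]],
    ]
    now = [[10, 11, 10], [12, 10, 50]]  # one bright pixel appears
    print(detect_anomalies(past, now, threshold=5))
```

A fixed threshold like this one is exactly what produces false positives on a noisy sea, which is why more robust background models are needed in practice.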

Performances

Whereas RPCA showed good results on uncluttered sea and is especially good at detecting very small targets, NL-means performs better overall, in particular on noisy sea and with small and medium targets. NL-means also has the advantage of being suitable for different target sizes, whereas the RPCA parameters are highly dependent on target size.

Yet these approaches had significant drawbacks. They are computationally intensive, and a real-time implementation on a full-HD picture (100+ Mb) is very challenging. They also produce many false positives: our goal is to maximize recall (catch all true positives), which inevitably lets some false positives through, but a trade-off must be found to avoid too many false alerts for the end user. Taking the temporal dimension into account significantly reduces the number of false positives, but we also lose some targets.

2nd approach: Object detection with deep learning

In the second approach, we consider targets as objects with a specific shape; we are now talking about object detection rather than anomaly detection. In the first approach we tried to model the background; here we learn specifically what a boat is and how to recognize it.

Convolutional networks have led to impressive results in object classification and detection. However, they depend on learning from huge datasets, and we did not have that much data. One way of circumventing this challenge is transfer learning: taking models that have been pre-trained on millions of images for general object classification, and fine-tuning them in order to:

  • be effective on infrared images, given that existing algorithms have been trained in the visible spectrum
  • recognize objects corresponding to our targets.

To get a quick first sense of how well this approach could suit our problem, we fine-tuned a simple pre-trained classification network on images from our dataset. Classification of the different boats performed well: even though there were some misclassifications between boat classes, the confusion matrix showed that very few boats were classified as background. This work validated the relevance of a deep learning approach for our challenge and our dataset.

There are already very efficient algorithms for object detection in pictures. To get the most out of deep learning for our challenge, we selected Faster R-CNN, one of the best state-of-the-art algorithms in terms of detection performance while still offering reasonable speed. There are faster algorithms, but with lower detection performance, and we first wanted to see whether we could achieve a significant gain in detection performance over the previous methods. We used the implementation from Facebook’s Detectron2 library, released in October 2019.

With successive fine-tuning steps on the different sub-networks of the model, we succeeded in adapting a model pre-trained on the visible spectrum to IR images and to boat detection exclusively.

Performances

We achieved an “absolute” recall of slightly more than 55%, which means that, over all images of the test set, we detected more than half of the targets. However, to correctly interpret the real performance of our algorithm for our client’s use case, we had to consider tracking metrics, which reflect two things:

  • we only consider a target detected (and then displayed on the end user’s screen) if we detect it 3 times in a row
  • once a target is detected, we can anticipate its trajectory, which allows us to lose it occasionally

Put differently, we can afford to miss a target several times, provided we have first detected it correctly (3 times in a row on a trajectory). With this metric, we showed that we were able to detect more than 80% of the targets, a huge improvement over the current solution, especially on noisy backgrounds and for small targets.
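The difference between frame-level recall and this tracking-based recall can be sketched as follows. The code assumes that detections have already been associated with each labeled target frame by frame; the data layout, the function names and the toy values are hypothetical and only illustrate the "3 times in a row" rule.

```python
from typing import Dict, List, Optional


def first_confirmation(hits: List[bool], needed: int = 3) -> Optional[int]:
    """Frame index at which `needed` consecutive detections are first reached."""
    run = 0
    for i, hit in enumerate(hits):
        run = run + 1 if hit else 0
        if run >= needed:
            return i
    return None


def frame_recall(hits_per_target: Dict[str, List[bool]]) -> float:
    """Share of (target, frame) pairs in which the target was detected."""
    total = sum(len(hits) for hits in hits_per_target.values())
    detected = sum(sum(hits) for hits in hits_per_target.values())
    return detected / total


def track_recall(hits_per_target: Dict[str, List[bool]], needed: int = 3) -> float:
    """Share of targets confirmed at least once by consecutive detections."""
    confirmed = sum(
        1 for hits in hits_per_target.values()
        if first_confirmation(hits, needed) is not None
    )
    return confirmed / len(hits_per_target)


if __name__ == "__main__":
    # True means the target was detected in that frame (toy data).
    sequences = {
        "boat_a": [True, True, True, False, False, True],
        "boat_b": [False, True, False, True, True, True],
        "boat_c": [True, False, True, False, True, False],
    }
    print(frame_recall(sequences), track_recall(sequences))
```

In this toy data, boat_a and boat_b are each missed in two frames out of six yet still count as detected once confirmed, whereas boat_c is never seen three times in a row. The same rule filters out isolated false positives, which rarely persist over three consecutive frames.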

Interestingly, the targets with the poorest detection rate were big targets, because they were under-represented in the dataset, which highlights a potential for even better performance.

What about false positives? This project is more a “recall” project than a “precision” one; however, we must check that we don’t flood the end user with false alarms. We had 30% precision on object detection, which is far higher than all previous methods and validates the ability of the CNN approach to identify the key features of a boat. Here once more, though, we have to think globally and integrate the tracking metric: most false positives won’t pass the tracking filter (it is very unlikely that we find 3 false positives following what seems to be a plausible trajectory).

Results

Object detection with deep learning has shown clear advantages over anomaly detection techniques in both pure detection performance and speed.

Still, the algorithms we tested were not able to cope with a large picture in real time (the equivalent of 30 full-HD pictures). We identified room for improvement on both the algorithmic and the hardware/computational side that could improve processing speed by up to an order of magnitude, which would allow an implementation in our client’s software while maintaining good detection performance.