Recently, I have been working extensively on my university project, "Synthesizing information and simulation for search and rescue operations in any urban affected disaster", with a few of my batch-mates under Prof Maya Menon.
To explain in layman's terms, our project aims to propose an efficient approach for navigating an unknown disaster environment and identifying human victims (dead or alive) through a team of robots (UAVs and UTVs) using the Robot Operating System (ROS), i.e. it will be a software simulation of a given situation. We decided to divide the project into parts, and my work focuses on identifying humans, so in this post I will write about how object detection algorithms have evolved over time and how and why we chose YOLO for our project.
So, what is object detection?
Object detection is the task of identifying objects in an image and localizing each one with a bounding box.
Conventional approaches to object detection re-purposed image classifiers (techniques for identifying the type of object in an image) as object detectors using region proposals. A region proposal method would suggest a redundant set of overlapping bounding boxes inside the image as potentially useful areas, and a classifier would then try to identify the type of object in each box. So, let's dive deeper into how these algorithms really worked.
Let’s go back to 2001.
It was 2001 when the first algorithm that really worked for real-time face detection (with a webcam) came out: the Viola-Jones algorithm (developed by Paul Viola and Michael Jones, of course). Like earlier approaches, it relied on hand-coded features (simple Haar-like rectangles) fed into a classifier, in this case a cascade of boosted classifiers trained with AdaBoost, which made training very slow but detection pretty fast. Although the algorithm can be trained to detect a variety of objects, it was primarily used for face detection. Its main weakness was detecting faces in different configurations, such as tilted or partially turned faces.
In 2005, a much more efficient technique (which is still used for various purposes), Histograms of Oriented Gradients (HOG), was published by Navneet Dalal and Bill Triggs. It is still widely used for human detection. The main idea is this: for every single pixel in the input image (say, an image of a face), compare how dark it is to its surrounding pixels, and draw an arrow in the direction in which the image gets darker. Repeating this for every pixel replaces each pixel with an arrow (a gradient), showing the flow from light to dark across the entire image. Once this grid of gradients is built, the image is broken up into small squares of 16 × 16 pixels each; for each square, we count how many gradients point in each major direction and replace the square with the strongest arrow direction. The end result looks like the essence (basic structure) of a face, a feature map made of just arrows. Finally, using a similarity measure (e.g. Euclidean distance) against a known face pattern, with a threshold value set, the algorithm decides whether the image contains a face or not.
HOG algorithm (2005)
The biggest disadvantage of this algorithm is that the features in this feature map are hand-designed rather than learned automatically, which is what happens in neural networks.
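The per-pixel "arrow" step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the full HOG descriptor (which also normalizes histograms over overlapping blocks); the 16 × 16 cell size and 9 orientation bins are the values mentioned above and in the original paper.

```python
import numpy as np

def gradient_histograms(image, cell=16, bins=9):
    """Simplified HOG idea: per-pixel gradients, then one
    orientation histogram per cell x cell square of pixels."""
    img = image.astype(np.float64)
    # How quickly does brightness change in y and x at each pixel?
    gy, gx = np.gradient(img)
    magnitude = np.hypot(gx, gy)
    # Direction of each "arrow", folded into [0, 180) degrees
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    h, w = img.shape
    hists = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            a = angle[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            m = magnitude[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            # Count how strongly gradients point in each major direction
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            hists[i, j] = hist
    return hists

# A 32x32 image that gets darker from top to bottom: every arrow
# points the same way, so one histogram bin dominates in each cell.
img = np.tile(np.arange(32.0), (32, 1)).T
hog_map = gradient_histograms(img, cell=16, bins=9)
```

The strongest bin per cell is what gets drawn as the "strongest arrow direction" in the feature map.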
But what's the idea behind neural networks?
The basic idea behind a neural network is to simulate (copy in a simplified but reasonably faithful way) lots of densely interconnected brain cells inside a computer so you can get it to learn things, recognize patterns, and make decisions in a humanlike way. The amazing thing about a neural network is that you don't have to program it to learn explicitly: it learns all by itself, just like a brain!
But it isn't a brain. It's important to note that neural networks are (generally) software simulations: they're made by programming very ordinary computers, working in a very traditional fashion with their ordinary transistors and serially connected logic gates, to behave as though they're built from billions of highly interconnected brain cells working in parallel. No-one has yet attempted to build a computer by wiring up transistors in a densely parallel structure exactly like the human brain. In other words, a neural network differs from a human brain in exactly the same way that a computer model of the weather differs from real clouds, snowflakes, or sunshine. Computer simulations are just collections of algebraic variables and mathematical equations linking them together (in other words, numbers stored in boxes whose values are constantly changing). They mean nothing whatsoever to the computers they run inside—only to the people who program them.
To learn more about neural networks and how they work, I would suggest reading about them here.
In 2012, the deep learning era began, and CNNs started being used for object classification and detection.
CNNs, like ordinary neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function and responds with an output. The whole network has a loss function, and all the tips and tricks we developed for neural networks still apply to CNNs. Pretty straightforward, right? (Read more about CNNs here.)
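The "weighted sum plus activation" step is small enough to write out directly. This is a minimal sketch of a single artificial neuron; the weights, bias and sigmoid activation below are illustrative choices, not values from any particular network.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs plus a
    bias term, squashed through a sigmoid activation function."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid: maps z to (0, 1)

# Hypothetical inputs and learned parameters, just for illustration
out = neuron(np.array([0.5, -1.0, 2.0]),
             np.array([0.8, 0.2, 0.1]),
             bias=0.1)
```

A network is just many of these wired together, with the weights and biases adjusted during training to reduce the loss.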
Early CNN-based detection used classifiers (like VGGNet) in a sliding-window fashion: the input image was divided into squares, and the classifier was slid across every position in the image to identify objects. The major drawback was that this needed huge computational power and was a very brute-force approach (basically, it was dumb).
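To see why this is so expensive, here is a minimal sketch of the sliding-window step. The window size and stride are arbitrary illustrative values; each crop it yields would have to go through the full classifier separately.

```python
import numpy as np

def sliding_windows(image, win=64, stride=32):
    """Brute-force sliding window: yield every window position and
    its crop, so each crop can be fed to an image classifier."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield (x, y, win, win), image[y:y+win, x:x+win]

# Even a tiny 128x128 image produces 9 crops at one scale; real
# detectors had to repeat this over many scales and positions.
crops = list(sliding_windows(np.zeros((128, 128)), win=64, stride=32))
```

Every crop means one full forward pass through the classifier, which is exactly the redundancy that region proposals, and later YOLO, were designed to avoid.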
In 2014, R-CNN was released. Before feeding the input image into a CNN, R-CNN created candidate bounding boxes using selective search: it looked at the image through windows of different sizes and, for each size, tried to group adjacent pixels by color, texture or intensity to identify likely objects. (Create bounding boxes beforehand - feed those boxes into a CNN - compute a list of features and eventually class values from them.)
But in 2016, when the YOLO (You Only Look Once) algorithm was released, it outperformed all of these algorithms: R-CNN and all of its variants. YOLO is a state-of-the-art, real-time object detection system that takes a completely different approach. It is not a traditional classifier re-purposed as an object detector; instead, it actually looks at the image just once and produces fast, accurate results. There have been significant improvements since the first version of YOLO was released.
The basic idea behind YOLO's working is pretty simple. You give it an input image, it resizes it to 416 × 416 pixels, and the image goes through the convolutional network in a single pass, coming out the other end as a 13 × 13 × 125 tensor describing the bounding boxes predicted for each grid cell. All you need to do then is compute the final scores for the bounding boxes and throw away the ones below a threshold value, e.g. 30%.
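The thresholding step can be sketched as follows. This assumes the YOLOv2-style layout of that 13 × 13 × 125 tensor: each of the 13 × 13 grid cells predicts 5 boxes, and each box carries 4 coordinates, an objectness score and 20 class probabilities (5 × 25 = 125). For simplicity the sketch also assumes the scores have already been passed through their activations (sigmoid/softmax), which the real network does as part of decoding.

```python
import numpy as np

def filter_boxes(output, threshold=0.30, n_anchors=5, n_classes=20):
    """Keep only the boxes whose final score (objectness times the
    best class probability) clears the threshold."""
    # 13 x 13 cells, each with n_anchors boxes of (4 coords + 1 + classes)
    grid = output.reshape(13, 13, n_anchors, 5 + n_classes)
    detections = []
    for row in range(13):
        for col in range(13):
            for a in range(n_anchors):
                pred = grid[row, col, a]
                box, objectness, class_probs = pred[:4], pred[4], pred[5:]
                score = objectness * class_probs.max()
                if score >= threshold:
                    detections.append(
                        (row, col, box, int(class_probs.argmax()), score))
    return detections
```

In practice a non-maximum suppression pass follows, to merge the overlapping boxes that survive the threshold.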
To learn more about how YOLO and its different versions work, I would suggest reading here.
- The main challenge in victim detection will be integrating the YOLOv3 object detection algorithm with thermal sensor data to identify whether a human victim (object) is alive or not.
- Another challenge will be integrating our customised YOLOv3 algorithm with ROS.
YOLO is a very powerful and fast algorithm for object detection. My team members and I are looking forward to diving deep into the algorithm and, of course, extending this blog post with how we moved forward and implemented it.
We read the following papers and articles, which I think would be useful for you as well:
- "Rapid Object Detection using a Boosted Cascade of Simple Features", 2001: Authored by Paul Viola & Michael Jones
- "Histograms of Oriented Gradients for Human Detection", 2005: Authored by Navneet Dalal & Bill Triggs
- "ImageNet Classification with Deep Convolutional Neural Networks", 2012: Authored by Alex Krizhevsky, Ilya Sutskever & Geoffrey E. Hinton
- "Rich feature hierarchies for accurate object detection and semantic segmentation", 2014: Authored by Ross Girshick, Jeff Donahue, Trevor Darrell & Jitendra Malik
- "You Only Look Once: Unified, Real-Time Object Detection", 2016: Authored by Joseph Redmon, Santosh Divvala, Ross Girshick & Ali Farhadi
- "YOLO9000: Better, Faster, Stronger" (YOLOv2), 2017: Authored by Joseph Redmon & Ali Farhadi
- "YOLOv3: An Incremental Improvement", 2018: Authored by Joseph Redmon & Ali Farhadi
- "The best explanation of Convolutional Neural Networks on the Internet!": Authored by Harsh Pokharna
- "Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3": Authored by Jonathan Hui