Region-based convolutional networks for accurate object detection and segmentation

What is CNN?

Source: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

Fig. 1. Architecture of convolutional neural networks

Artificial Intelligence and Deep Learning have been witnessing massive growth in current years because of the advancement of technology. Researchers and programmers work on multitudinous aspects of the field to make astounding things happen. One of the main domains is the sphere of Computer Vision.

The agenda for this field is to enable machines to examine the world as humans do, perceive it in an analogous manner, and indeed use the knowledge for a multitude of tasks similar to Image & Video recognition, Image Analysis & Classification, Object Detection, Language Processing, etc. The progress in Deep Learning applied to Computer Vision has been constructed and perfected with time, primarily over one method — a Convolutional Neural Network.

A Convolutional Neural Network - CNN is a type of neural network that differs from other neural networks because of its convolutional layer, the convolutional layer has a number of components; input data, filters (matrices used to extract features), and a feature map, the feature detector will iterate over the input and check if the feature is present by the use of filters, this process is known as convolution.

What is R-CNN?

Source: https://analyticsindiamag.com/r-cnn-vs-fast-r-cnn-vs-faster-r-cnn-a-comparative-guide/

Fig. 2. Architecture of region-based convolutional neural networks

Region-based CNN i.e., R-CNN is an object recognition model that uses CNN on various ‘regions’ to carry out various operations, segmentation, localization, etc. It uses selective search to identify several bounded areas and then extracts features from each of them for independently performing classification for each area, this technique is called region proposal. It was first innovated in the paper “Rich feature hierarchies for accurate object detection and semantic segmentation”.

An input image received by the R-CNN goes through various stages:

Region proposal algorithm with selective search recognizes and ‘selects’ various regions for passing through CNN.
CNN will extract features from each region by passing each region through it giving output.
Feature vectors are then ‘fed’ to an SVM (Support Vector Machine) for classification.

Object Detection and Segmentation with R-CNN

Source: https://www.semanticscholar.org/paper/Region-Based-Convolutional-Networks-for-Accurate-Girshick-Donahue

Fig. 3. R-CNN regions with CNN features

Object detection describes several tasks including detection, recognition, localization, and classification of objects within virtual media like images or videos.

Object Segmentation consists of the process of drawing the bounded region or area of an object within an image.

Selective Search

In computer vision, object detection, and segmentation is the process of detecting and segmenting objects in images. There are many different methods for doing this, but one of the most popular is selective search. Selective search is a method of object detection that starts by creating a set of initial proposals, or candidate regions, which are then refined using several different algorithms. Selective Search is a technique for finding objects in images. It is a bottom-up approach that starts with small objects and then expands them to larger ones. The idea is to take a small image and find all the objects in it. This is done by first finding all the edges in the image, then grouping the edges into objects. The process is then repeated with larger images until all the objects in the image have been found.

The goal is to eventually produce a set of high-quality object proposals, which can then be used for further processing, such as object classification or instance segmentation. The R-CNN (region-based convolutional neural network) is a type of neural network that is particularly well-suited for this task. R-CNNs have been used for a variety of different tasks, including object detection, semantic segmentation, and even human pose estimation. One of the advantages of using R-CNNs for object detection is that they can be trained end-to-end, meaning that the entire process can be learned from data, without the need for manual feature engineering. This is a significant advantage over traditional object detection methods, which often require a lot of hand-tuning to get good results.

The advantage of Selective Search is that it is fast and accurate. It is also able to find objects in images that are cluttered or have multiple objects. The disadvantages of Selective Search are that it is not as accurate as some other methods, and it can be slow.

Feature Extraction

Feature extraction is a process of dimensionality reduction whereby an input image is transformed into a set of features that can be used to describe the image. This is done by convolutional filters that extract features from the image based on their spatial relationship to one another. The extracted features are then input into a neural network for classification. A similar is true for the architecture present inside the R-CNN network. In the next step, a feature vector is created taking each region proposal, that was obtained in the previous step, and representing the image in a much smaller dimension using a Convolutional Neural Network (CNN).

For feature extraction, the popular ‘AlexNet’ Feature Extractor is used. It has 5 convolutional and 2 fully connected layers. This step primarily helps the convolution layer to learn basic image features. To learn a) The visual characteristics of the novel image types—distorted region proposals—and b) Specific target classes of the smaller dataset for the detection task, the network must now be fine-tuned. To extract the classes relevant to the detection task from the region proposals, the classification network is fine-tuned.

The input of the AlexNet is always the same with the size 227 x 227. The extent by which the initial bounding box might be dilated to take into account context from the surrounding area is indicated by an additional parameter p. Following is the structure of CNN.

Source: https://medium.com/@selfouly/r-cnn-3a9beddfd55a

Fig. 4. Block diagram for Feature extraction using CNN

With a (1/10)th of the initial pre-training learning rate, the network is trained using SGD. To make sure there is enough representation from the positive classes during training, they sample 32 windows that are positive across all classes in each iteration, together with 96 windows from the background class to create a mini-batch of 128.

Classification

Object detection requires the accurate classification of objects in all regions. Therefore, the features that are extracted in the previous stage undergo classification in this stage. In region-based classification neural networks, this classification is implemented using SVM which is followed by a bounding box regressor. The SVM algorithm is recommended in R-CNN for detection and classification purposes as it provides accurate results by separating the classes with a logical hyperplane in N-dimensional space. This implementation of SVM is improved in further advancements of R-CNN such as fast R-CNN and faster R-CNN by combining it with different transfer learning algorithms. These improvements resulted in increased accuracy and speed of detection.

In this stage, the feature vector generated after the extraction of features is passed to the binary SVM model. The model needs to be trained independently for each class. The SVM model takes the output of CNN architecture that is extracted feature vectors as an input. After the classification of feature vectors, it outputs the result in the form of a confidence score. The confidence score indicates the positive detection of the targeted object in the specified area.

In R-CNN, the SVM model can not be trained parallelly with the CNN architecture (maybe AlexNet, which is a CNN-based architecture for object detection). This is because we need the feature vectors provided by CNN architecture to train the SVM model. This drawback is resolved in fast R-CNN and faster R-CNN.

Types of R-CNN

1. Fast R-CNN

Fast R-CNN as its name suggests is a fast version of regular R-CNN, instead of using a max pool layer we utilize the region of interest pooling. ROI pooling produces feature maps of a single size from the various sized input regions by doing max-pooling of the inputs. The main advantage of this method is the use of a single feature map for all regions proposed enabling us to pass an entire image to the CNN.

2. Faster R-CNN

Faster R-CNN makes the R-CNN even faster, it uses a region proposal network rather than a selective search for creating a pool of regions, for this, it uses a unique CNN which we call a region proposal network. The network takes the feature map as preliminary input and region proposals as output.

3. Mask R-CNN

Mask R-CNN is a faster R-CNN with a Region of Interest Alignment Layer instead of an ROI pooling layer. This alignment layer uses bilinear interpolation to keep the spatial features on the feature maps, which is more proper for pixel-level projections. The output map of this network is of the same size for all regions of interest.

Comparison

 
     R-CNN
          Fast R-CNN
          Faster R-CNN
          Mask R-CNN
           Region proposal method
            Selective         search
S           Selective search
            Region proposal network
           Region proposal network
Pooling
          Max Pooling
            ROI Pooling
      ROI Pooling
        ROI Alignment
    Computation
            High  
             computation
         time
           High 
              computation          
                  time         
            Low 
           computation time
             Low 
          computation   time

Table. 1. Comparison between R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN

Conclusion

Region Based CNN’s and their variants work through the aforementioned techniques to reduce computation time and make neural networks more capable of object detection tasks with the combined use of feature maps, region searching techniques, and neural networks. R-CNN is a powerful and innovative tool in Deep Learning to help advance the field and its application in object detection and segmentation, the full potential of this technology is still being discovered and utilized for various problems in artificial intelligence. It is a pioneering technique in the hand of any Deep Learning engineer.

This blog is a small endeavor to compile and make accessible information regarding the same and to contribute to the pool of knowledge available.

Region-based convolutional networks for accurate object detection and segmentation

Search This Blog