DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Dec 26, 2020

Introduction: This article contains my views on the paper by Liang-Chieh Chen, George Papandreou (Senior Member, IEEE), Iasonas Kokkinos (Member, IEEE), Kevin Murphy, and Alan L. Yuille (Fellow, IEEE).

Understanding some terms:

Semantic Segmentation: The task of labelling each pixel in an image with its object class. It plays a significant role in computer vision and pattern recognition systems. It is used in autonomous driving and live video, and it is related to tasks such as image classification, object detection, scene understanding, scene parsing, and shape and size detection.

Atrous Convolution:

Atrous convolution, also known as dilated convolution, upsamples the filters by inserting zeros (holes) between the filter values. It was originally used in the wavelet transform and is now applied to convolutions in deep learning. The term “atrous” comes from the French “à trous”, meaning “with holes”. Thus, it is also called “algorithme à trous” or the “hole algorithm”. It is used for dense feature extraction.

Benefits of Atrous Convolution

•Atrous convolution allows us to enlarge the field of view of filters to incorporate larger context.

•It controls the resolution at which feature responses are computed in deep convolutional networks.

•It finds the best trade-off between accurate localization (small field of view) and context assimilation (large field of view).

The rate r sets the spacing between filter values. When r = 1, this is standard convolution; when r > 1, it is atrous convolution, with r − 1 zeros inserted between consecutive filter values.
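To make the role of the rate r concrete, here is a minimal 1-D sketch in NumPy. The function name `atrous_conv1d` and the toy signal are my own illustrative choices, not code from the paper; the loop simply evaluates y[i] = Σ_k x[i + r·k]·w[k] at the positions where the filter fits entirely inside the input.

```python
import numpy as np

def atrous_conv1d(x, w, rate=1):
    """1-D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k] (valid positions only)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    span = rate * (len(w) - 1) + 1      # effective field of view of the filter
    n_out = len(x) - span + 1           # number of positions where the filter fits
    y = np.zeros(n_out)
    for i in range(n_out):
        for k in range(len(w)):
            y[i] += x[i + rate * k] * w[k]
    return y

x = np.arange(10, dtype=float)          # toy input signal (hypothetical)
w = np.array([1.0, 0.0, -1.0])          # toy 3-tap filter (hypothetical)

y_standard = atrous_conv1d(x, w, rate=1)  # r = 1: standard convolution, field of view 3
y_atrous = atrous_conv1d(x, w, rate=2)    # r = 2: same 3 weights, field of view 5
```

The filter keeps the same three weights in both calls; only the spacing between the sampled inputs changes, which is how the field of view grows without adding parameters.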

Atrous Spatial Pyramid Pooling (ASPP)

•ASPP is used to segment objects at multiple scales. It probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields of view.

•It thereby captures objects and image context at multiple scales, which improves segmentation accuracy.

In ASPP, parallel atrous convolutions with different rates are applied to the input feature map, and their outputs are fused together.

In the following image, we can see the pyramid-like structure formed by the multiple rates used in ASPP. Filters of this type do the pooling in ASPP, and their fused output determines the segmentation accuracy.
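The parallel-branch idea can be sketched in a few lines of NumPy. This is a simplified single-channel illustration, not the paper's implementation: the helper names `atrous_conv2d` and `aspp`, the random 3×3 kernels, and the fusion by summation are assumptions of this sketch. The rates (6, 12, 18, 24) are used here only as example values.

```python
import numpy as np

def atrous_conv2d(feature, kernel, rate):
    """Same-size 2-D atrous convolution of a single-channel map (zero padding, odd kernel)."""
    kh, kw = kernel.shape
    pad_h, pad_w = rate * (kh // 2), rate * (kw // 2)
    padded = np.pad(feature, ((pad_h, pad_h), (pad_w, pad_w)))
    h, w = feature.shape
    out = np.zeros((h, w), dtype=float)
    for a in range(kh):
        for b in range(kw):
            # Each filter tap reads the input shifted by rate * (tap offset).
            out += kernel[a, b] * padded[a * rate:a * rate + h, b * rate:b * rate + w]
    return out

def aspp(feature, kernels, rates):
    """Apply one atrous branch per rate in parallel and fuse the branches by summation."""
    branches = [atrous_conv2d(feature, k, r) for k, r in zip(kernels, rates)]
    return np.sum(branches, axis=0)

rng = np.random.default_rng(0)
feature = rng.standard_normal((32, 32))                   # hypothetical feature map
rates = (6, 12, 18, 24)                                   # example sampling rates
kernels = [rng.standard_normal((3, 3)) for _ in rates]    # hypothetical branch filters
fused = aspp(feature, kernels, rates)                     # same spatial size as the input
```

Every branch sees the same feature map but with a different field of view, so the fused output mixes context from several scales at each position.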

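The CRF model uses two kernels in its pairwise potential: the first is a bilateral kernel that depends on both pixel positions and pixel colors, and the second is a Gaussian kernel that depends only on pixel positions.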
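As a rough illustration of the two kernels, the sketch below computes the kernel value for a single pair of pixels. The function name `pairwise_kernel` and all default weights and bandwidths are illustrative assumptions, not the tuned values from the paper. In the full CRF this value would only be charged as a penalty when the two pixels receive different labels.

```python
import numpy as np

def pairwise_kernel(p_i, p_j, c_i, c_j, w1=4.0, w2=3.0,
                    sigma_alpha=50.0, sigma_beta=5.0, sigma_gamma=3.0):
    """Bilateral kernel (position + color) plus Gaussian kernel (position only).

    All default parameter values are illustrative assumptions.
    """
    p_i, p_j = np.asarray(p_i, dtype=float), np.asarray(p_j, dtype=float)
    c_i, c_j = np.asarray(c_i, dtype=float), np.asarray(c_j, dtype=float)
    d_pos = np.sum((p_i - p_j) ** 2)    # squared distance between pixel positions
    d_col = np.sum((c_i - c_j) ** 2)    # squared distance between RGB colors
    bilateral = w1 * np.exp(-d_pos / (2 * sigma_alpha ** 2)
                            - d_col / (2 * sigma_beta ** 2))
    gaussian = w2 * np.exp(-d_pos / (2 * sigma_gamma ** 2))
    return bilateral + gaussian

# Two nearby pixels with similar colors couple strongly...
k_similar = pairwise_kernel((10, 10), (10, 12), (200, 30, 30), (198, 32, 29))
# ...while the same two positions with very different colors couple weakly.
k_different = pairwise_kernel((10, 10), (10, 12), (200, 30, 30), (20, 200, 40))
```

The bilateral term is what lets the CRF respect edges: neighbouring pixels across a strong color boundary are barely encouraged to share a label.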

Challenges in the paper:

First Challenge:

The repeated combination of max-pooling and downsampling (‘striding’) in consecutive DCNN layers reduces the feature resolution.

Solution:

Remove the downsampling operator from the last few max-pooling layers of the DCNN and instead upsample the filters in the subsequent convolutional layers, so that feature maps are computed at a higher sampling rate. This is done with atrous convolution, which helps in dense prediction tasks.
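A small 1-D experiment shows why this works. It is a sketch under my own assumptions (the helper `dilate_kernel`, the toy signal, and the smoothing filter are hypothetical). Filtering a signal that was downsampled by 2 gives responses at only half of the positions, while filtering the full-resolution signal with a rate-2 atrous filter gives the same responses at those positions plus the ones in between.

```python
import numpy as np

def dilate_kernel(w, rate):
    """Insert rate - 1 zeros between consecutive filter values."""
    w = np.asarray(w, dtype=float)
    dilated = np.zeros(rate * (len(w) - 1) + 1)
    dilated[::rate] = w
    return dilated

x = np.sin(np.arange(16) / 2.0)         # toy full-resolution signal (hypothetical)
w = np.array([0.25, 0.5, 0.25])         # toy smoothing filter (hypothetical)

# (a) Downsample by 2, then filter: responses at only half of the positions.
y_sparse = np.correlate(x[::2], w, mode="valid")

# (b) Keep full resolution and filter with the rate-2 atrous kernel instead.
y_dense = np.correlate(x, dilate_kernel(w, 2), mode="valid")

# Every second dense response reproduces the sparse result exactly.
same_at_shared_positions = np.allclose(y_dense[::2], y_sparse)
```

The dense output has twice as many responses with the same number of filter weights, which is the resolution gain that atrous convolution buys for dense prediction.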

Second Challenge:

The existence of objects at multiple scales.

Solution:

The standard approach is to present rescaled versions of the same image to the DCNN and then aggregate the feature or score maps.

DeepLab handles this more efficiently with Atrous Spatial Pyramid Pooling, which segments objects at multiple scales.

Third Challenge:

Reduced localization accuracy. An object-centric classifier requires invariance to spatial transformations, which inherently limits the spatial accuracy of a DCNN.

Solution:

Combine the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which improves the localization of object boundaries.

Atrous convolution, followed by bilinear interpolation of the score maps and a fully connected CRF, yields an accurate segmentation with clear object boundaries, as we can see here.
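The bilinear interpolation step can be sketched as two passes of 1-D linear interpolation, one along each axis. This is a minimal NumPy illustration, not the paper's code: the function name `bilinear_upsample`, the corner-aligned sampling grid, and the toy 4×5 score map are assumptions of this sketch. It covers only the upsampling of a coarse score map, not the CRF that follows.

```python
import numpy as np

def bilinear_upsample(score_map, factor):
    """Upsample a 2-D score map by `factor` using separable linear interpolation.

    Assumes a corner-aligned sampling grid (first and last samples are preserved).
    """
    h, w = score_map.shape
    rows = np.linspace(0, h - 1, h * factor)    # target row coordinates
    cols = np.linspace(0, w - 1, w * factor)    # target column coordinates
    # Pass 1: interpolate along the width of every row.
    tmp = np.stack([np.interp(cols, np.arange(w), row) for row in score_map])
    # Pass 2: interpolate along the height of every column.
    return np.stack([np.interp(rows, np.arange(h), col) for col in tmp.T], axis=1)

coarse = np.arange(20, dtype=float).reshape(4, 5)   # hypothetical coarse score map
dense = bilinear_upsample(coarse, factor=8)         # shape (32, 40)
```

Because the coarse score maps are fairly smooth, this cheap interpolation recovers a full-resolution map, and the CRF then sharpens the boundaries.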

Result:

Finally, with these techniques the accuracy (mean IOU) reaches 77.65%, up from around 68% for the baseline ResNet-101 model.