Obstacle Detection and Avoidance in Indian Roads using YOLOv3 and Mask RCNN ensemble
Obstacle detection is a critical task in developing autonomous vehicles. Sensors such as SONAR and LiDAR are accurate and work in real time, but they are expensive. To address this, extensive research has focused on computer vision and deep learning-based approaches. In India, where road conditions pose numerous challenges and contribute to high accident rates, the need for automated driving solutions is evident. This study uses a custom dataset of roughly 900 annotations on 512 daytime images across 10 classes. The proposed system combines YOLOv7 and Mask R-CNN through an ensemble method and outperforms contemporary research on popular obstacle detection datasets such as INRIA, KAIST, and COCO2017. Additionally, an obstacle avoidance system based on fuzzy logic is implemented, using avoidance masks of different lengths to navigate around detected obstacles. The system has been tested on real-world dash cam videos, where it avoided obstacles effectively.
Dataset
The dataset was collected with a dash cam mounted on the car's windshield, behind the rear-view mirror. The dash cam saved its recording as one video file per minute. These videos were then converted into frames at 30 frames per second of video. To capture varying obstacle scenarios, one frame was saved every 3 seconds, that is, every 90th frame, or about 20 frames per one-minute video. The obstacles visible in these frames were then annotated by hand for training.
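The sampling step can be sketched as follows. This is a minimal illustration, not the project's extraction script: the function name `sampled_frame_indices` and its parameters are assumptions, and it only computes which frame indices to keep (reading the video itself, for example with OpenCV, is left out).

```python
def sampled_frame_indices(duration_s, fps=30, interval_s=3.0):
    """Return the indices of the frames kept when saving one frame every `interval_s` seconds."""
    step = int(round(fps * interval_s))   # 30 fps * 3 s = every 90th frame
    total = int(duration_s * fps)         # 60 s * 30 fps = 1800 frames in a one-minute clip
    return list(range(0, total, step))


indices = sampled_frame_indices(60)
print(len(indices), indices[:3])  # 20 [0, 90, 180]
```

Under these assumptions a one-minute clip yields 20 saved frames, spaced 90 frames apart.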
After annotation, the dataset contained 914 annotations across 512 images, spread over 10 classes. The ten classes are listed below:
- Speed breaker
- Divider
- Car
- Tata ace
- Bus
- Barricade
- Auto
- Potholes
- Bike
- Pedestrian

Obstacle detection: Ensemble approach
Mask R-CNN is a state-of-the-art model for instance segmentation, built on top of Faster R-CNN. Faster R-CNN is a region-based convolutional neural network that returns a bounding box for each object, together with its class label and a confidence score.
To understand Mask R-CNN, let's first discuss the architecture of Faster R-CNN, which works in two stages:
The first stage consists of two networks: a backbone (ResNet, VGG, Inception, etc.) and a region proposal network. These networks run once per image and produce a set of region proposals. Region proposals are regions of the feature map that are likely to contain an object.
In the second stage, the network predicts a bounding box and an object class for each region proposed in the first stage. Proposed regions can differ in size, whereas the fully connected layers of the network always require a fixed-size vector to make predictions. The proposed regions are therefore brought to a fixed size using either RoI pooling (which is very similar to max pooling) or the RoIAlign method.
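To make the fixed-size step concrete, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool` and the `(y0, x0, y1, x1)` end-exclusive region format are assumptions made for illustration. Real implementations work on multi-channel tensors, and RoIAlign replaces the integer bins used here with bilinear interpolation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (y0, x0, y1, x1), end-exclusive, to an out_h x out_w grid."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Integer bin edges along the rows; every bin covers at least one row.
        r0 = y0 + (i * h) // out_h
        r1 = max(y0 + ((i + 1) * h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            # Integer bin edges along the columns; every bin covers at least one column.
            c0 = x0 + (j * w) // out_w
            c1 = max(x0 + ((j + 1) * w) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(feature_map, (0, 0, 2, 2)))  # [[1, 2], [7, 8]]
print(roi_max_pool(feature_map, (0, 2, 4, 6)))  # [[10, 12], [22, 24]]
```

A 2x2 region and a 4x4 region both come out as 2x2 grids, which is what lets the fully connected layers accept proposals of any size.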
YOLO models are single-stage object detectors. In a YOLO model, image frames are featurized through a backbone. These features are combined and mixed in the neck and then passed to the head of the network, where YOLO predicts the locations and classes of the objects around which bounding boxes should be drawn.
The efficiency of the convolutional layers in the YOLO network's backbone is essential to fast inference. WongKinYiu started down the path of maximal layer efficiency with Cross Stage Partial Networks.
In YOLOv7, the authors build on earlier research on this topic, taking into account both the amount of memory needed to hold the layers and the distance a gradient has to travel to back-propagate through them. The shorter the gradient path, the more effectively the network can learn. The final layer aggregation they chose is E-ELAN, an extended version of the ELAN computational block.
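This document does not spell out how the outputs of the two detectors are combined. One common option, shown here purely as an assumption, is to match boxes of the same class by intersection over union (IoU), average the scores of matched pairs, and keep unmatched detections from either model. The names `iou` and `fuse`, the dictionary layout, and the 0.5 threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(yolo_dets, mask_dets, iou_thr=0.5):
    """Fuse two detection lists; each detection is a dict with 'box', 'label', 'score'.

    Hypothetical rule: same label and IoU >= iou_thr counts as the same object.
    """
    fused, used = [], set()
    for y in yolo_dets:
        best_j, best_iou = None, iou_thr
        for j, m in enumerate(mask_dets):
            if j in used or m["label"] != y["label"]:
                continue
            overlap = iou(y["box"], m["box"])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append(dict(y, source="yolo"))
        else:
            used.add(best_j)
            m = mask_dets[best_j]
            # Keep the Mask R-CNN entry (it would carry the mask) and average the scores.
            fused.append(dict(m, score=(y["score"] + m["score"]) / 2, source="both"))
    # Detections seen only by Mask R-CNN are kept as they are.
    fused += [dict(m, source="mask_rcnn") for j, m in enumerate(mask_dets) if j not in used]
    return fused


yolo = [{"box": (10, 10, 50, 50), "label": "Car", "score": 0.9},
        {"box": (100, 100, 120, 120), "label": "Bike", "score": 0.6}]
mask = [{"box": (12, 12, 50, 50), "label": "Car", "score": 0.8},
        {"box": (200, 200, 240, 240), "label": "Potholes", "score": 0.7}]
print([(d["label"], d["source"]) for d in fuse(yolo, mask)])
# [('Car', 'both'), ('Bike', 'yolo'), ('Potholes', 'mask_rcnn')]
```

Keeping unmatched detections favors recall, which seems reasonable for obstacle detection, but the actual fusion rule used in this project may differ.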


Obstacle Avoidance
The fuzzy logic system uses avoidance masks to judge the distance between obstacles and the vehicle. The vehicle's speed is a crucial factor in this judgment: higher speeds require a clearance of about 9 meters for safe obstacle avoidance, whereas moderate speeds may need clearances of 6 or 3 meters. To accommodate these distances, three sets of avoidance masks have been generated, each corresponding to a fixed distance on the ground. The blue lines cover the area representing the 3-meter clearance, while the green and red lines mark the areas with 6-meter and 9-meter clearances, respectively. The distance between the colored circles represents the width of the vehicle, so any object outside these masks will not impede the vehicle's movement.
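The geometric part of this check can be sketched as a point-in-polygon test against the mask for the chosen clearance. This is a crisp stand-in, not the fuzzy logic system itself. The trapezoid coordinates (for an assumed 1280x720 frame) and the names `AVOIDANCE_MASKS`, `hinders`, and `mask_color` are hypothetical.

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: True if pt = (x, y) lies inside the polygon (list of vertices)."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # this edge crosses the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical image-space trapezoids for a 1280x720 frame, keyed by clearance in meters.
# The bottom edge spans the assumed vehicle width; a longer clearance reaches higher in the image.
AVOIDANCE_MASKS = {
    3: [(340, 720), (940, 720), (844, 600), (436, 600)],
    6: [(340, 720), (940, 720), (780, 520), (500, 520)],
    9: [(340, 720), (940, 720), (740, 470), (540, 470)],
}


def hinders(obstacle_points, clearance_m):
    """True if any point of the obstacle's mask falls inside the avoidance mask."""
    poly = AVOIDANCE_MASKS[clearance_m]
    return any(point_in_polygon(p, poly) for p in obstacle_points)


def mask_color(obstacle_points, clearance_m):
    """Red if the obstacle would hinder the vehicle at this clearance, green otherwise."""
    return "red" if hinders(obstacle_points, clearance_m) else "green"


pedestrian = [(640, 550)]          # a point ahead of the vehicle, fairly far away
print(mask_color(pedestrian, 3))   # green
print(mask_color(pedestrian, 6))   # red
```

The same obstacle is ignored with the 3-meter mask but flagged with the 6-meter mask, which is how a higher speed, and therefore a longer mask, makes the system react earlier.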

Results
The following figures show the obstacle detection and obstacle avoidance pipeline at work. An object whose mask is colored red would hinder the vehicle. An object whose mask is colored green would not.





