Object Detection using Faster-RCNN in PyTorch¶
In this post, we will explore the Faster-RCNN object detector with PyTorch. We will use the pretrained Faster-RCNN model with a ResNet-50 backbone.
Understanding model inputs and outputs:¶
The pretrained Faster-RCNN ResNet-50 model we are going to use expects the input image tensor to be in the form [n, c, h, w] where
- n is the number of images
- c is the number of channels; for RGB images it's 3
- h is the height of the image
- w is the width of the image
The model will return
- Bounding boxes [x0, y0, x1, y1] of all predicted objects, as a tensor of shape (N, 4), where N is the number of objects the model detected in the image.
- Labels of all predicted classes.
- Scores of each predicted label.
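To make these shapes concrete, here is a minimal sketch (the model loading is repeated from the next section, and the 300x400 image size is an arbitrary choice) that runs the model on a random tensor and inspects the returned dictionary:
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# a random "image" of shape [c, h, w] with values in [0, 1];
# the model takes a list of such tensors, one entry per image
dummy = torch.rand(3, 300, 400)
with torch.no_grad():
    out = model([dummy])

print(out[0].keys())            # dict_keys(['boxes', 'labels', 'scores'])
print(out[0]['boxes'].shape)    # torch.Size([N, 4])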
Load model¶
Now we load the pretrained Faster-RCNN ResNet-50 model, along with the COCO dataset category names.
# import necessary libraries
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
import torch
import torchvision.transforms as T
import torchvision
import numpy as np
import cv2
import warnings
warnings.filterwarnings('ignore')
# load model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# set to evaluation mode
model.eval()
# load the COCO dataset category names
# we will use the same list for this notebook
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
We can see some N/A entries in the list: a few classes were removed in later versions of the dataset, so their label IDs are unused. We will go with the list given by PyTorch.
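The integer labels returned by the model index directly into this list, for example:
print(COCO_INSTANCE_CATEGORY_NAMES[1])   # person
print(COCO_INSTANCE_CATEGORY_NAMES[3])   # car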
Object detection pipeline¶
We define two functions for model inference:
- get_prediction takes img_path and confidence as input, and returns the predicted bounding boxes and classes.
- detect_object uses the get_prediction function and visualizes the result.
def get_prediction(img_path, confidence):
"""
get_prediction
parameters:
- img_path - path of the input image
- confidence - threshold value for prediction score
method:
- Image is obtained from the image path
- the image is converted to image tensor using PyTorch's Transforms
- image is passed through the model to get the predictions
    - class labels and box coordinates are obtained, but only predictions
      with a score > threshold are kept.
"""
    img = Image.open(img_path)
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    with torch.no_grad():
        pred = model([img])
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(pred[0]['labels'].numpy())]
    # cast the box coordinates to int so OpenCV's drawing functions accept them
    pred_boxes = [[(int(i[0]), int(i[1])), (int(i[2]), int(i[3]))] for i in list(pred[0]['boxes'].numpy())]
    pred_score = list(pred[0]['scores'].numpy())
    # scores are returned in decreasing order, so keep everything up to the
    # last detection whose score exceeds the threshold
    keep = [idx for idx, score in enumerate(pred_score) if score > confidence]
    if not keep:
        return [], []
    pred_boxes = pred_boxes[:keep[-1] + 1]
    pred_class = pred_class[:keep[-1] + 1]
    return pred_boxes, pred_class
def detect_object(img_path, confidence=0.5, rect_th=2, text_size=2, text_th=2):
"""
    detect_object
parameters:
- img_path - path of the input image
- confidence - threshold value for prediction score
- rect_th - thickness of bounding box
- text_size - size of the class label text
    - text_th - thickness of the text
method:
- prediction is obtained from get_prediction method
- for each prediction, bounding box is drawn and text is written
with opencv
- the final image is displayed
"""
boxes, pred_cls = get_prediction(img_path, confidence)
    # OpenCV loads images in BGR order; convert to RGB for matplotlib
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    for i in range(len(boxes)):
        # draw the bounding box and write the class label at its top-left corner
        cv2.rectangle(img, boxes[i][0], boxes[i][1], color=(0, 255, 0), thickness=rect_th)
        cv2.putText(img, pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 0), thickness=text_th)
plt.figure(figsize=(20,30))
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.show()
Making predictions¶
Now we are ready to use the model to do inference. Let's look at a few examples.
Example 1
!wget -nv https://www.goodfreephotos.com/cache/other-photos/car-and-traffic-on-the-road-coming-towards-me.jpg -O traffic.jpg
detect_object('./traffic.jpg', confidence=0.7)
The result is a bit surprising: the model not only detected the three cars in the picture, but also the person inside one of the cars, who is barely visible.
Example 2
!wget -nv https://pixnio.com/free-images/2018/12/10/2018-12-10-18-38-14-1196x900.jpg -O traffic2.jpg
detect_object('./traffic2.jpg', confidence=0.7)
It looks like we are getting quite accurate predictions with the model.
Example 3
!wget -nv https://storage.needpix.com/rsynced_images/pedestrian-zone-456909_1280.jpg -O pedestrian.jpg
detect_object('./pedestrian.jpg', confidence=0.7)
Comparing inference time for CPU and GPU¶
Let's take a look at the model's inference time on CPU and GPU. I am using Google Colab for this experiment.
import time
def check_inference_time(image_path, gpu=False):
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
img = Image.open(image_path)
transform = T.Compose([T.ToTensor()])
img = transform(img)
if gpu:
model.cuda()
img = img.cuda()
else:
model.cpu()
img = img.cpu()
    start_time = time.time()
    with torch.no_grad():
        pred = model([img])
    if gpu:
        # wait for all queued GPU kernels to finish so the timing is accurate
        torch.cuda.synchronize()
    end_time = time.time()
    return end_time - start_time
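One caveat: the first CUDA call in a process pays one-off initialization costs. A discarded warm-up run (our addition here, not part of the original experiment) keeps that overhead out of the measured average:
# warm-up: one GPU inference whose timing we discard, so one-off
# CUDA initialization does not inflate the measured average
_ = check_inference_time('./traffic.jpg', gpu=True)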
cpu_time = sum([check_inference_time('./traffic.jpg', gpu=False) for _ in range(10)])/10.0
gpu_time = sum([check_inference_time('./traffic.jpg', gpu=True) for _ in range(10)])/10.0
print('\n\nAverage time taken by the model with GPU = {:.3f}s\nAverage time taken by the model with CPU = {:.3f}s'.format(gpu_time, cpu_time))
plt.bar([0.1, 0.2], [cpu_time, gpu_time], width=0.08)
plt.ylabel('Time/s')
plt.xticks([0.1, 0.2], ['CPU', 'GPU'])
plt.title('Inference time of Faster-RCNN with Resnet-50 backbone on CPU and GPU')
plt.show()
On Google Colab, inference with the Faster-RCNN model is approximately 45 times faster on GPU than on CPU.