Instance Segmentation using Mask-RCNN and PyTorch

Instance segmentation is a combination of two problems:

  • Object Detection
  • Semantic Segmentation

In this post, we will explore the Mask-RCNN model with PyTorch. We will use the pretrained Mask-RCNN model with ResNet-50 as the backbone.

Understanding model inputs and outputs:

The pretrained Mask-RCNN ResNet-50 model we are going to use expects its input as a list of n image tensors, each of the form [c, h, w], where

  • n is the number of images
  • c is the number of channels; for RGB images it is 3
  • h is the height of the image
  • w is the width of the image

The model returns a list of dicts, one per input image, each containing (N is the number of detections):

  • boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with x values between 0 and W and y values between 0 and H

  • labels (Tensor[N]): the predicted label for each detection

  • scores (Tensor[N]): the score of each prediction

  • masks (Tensor[N, 1, H, W]): the predicted soft masks for each instance, with values in the 0-1 range. To obtain the final segmentation masks, the soft masks can be thresholded, generally at 0.5 (mask >= 0.5), as the sketch below illustrates
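To make these shapes concrete, here is a minimal sketch (not part of the original notebook) that runs the model on a random image tensor and prints the shape of each output field:

import torch
import torchvision

# load the pretrained model and switch it to inference mode
sketch_model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

dummy = torch.rand(3, 300, 400)        # one [c, h, w] image with values in [0, 1]
with torch.no_grad():
  out = sketch_model([dummy])[0]       # list of images in, list of dicts out

print(out['boxes'].shape)              # torch.Size([N, 4])
print(out['labels'].shape)             # torch.Size([N])
print(out['scores'].shape)             # torch.Size([N])
print(out['masks'].shape)              # torch.Size([N, 1, H, W])
binary_masks = (out['masks'] > 0.5).squeeze(1)   # threshold the soft masks to 0/1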

Load model

Now we load the pretrained Mask-RCNN ResNet-50 model, along with the COCO dataset category names.

In [0]:
# import necessary libraries
%matplotlib inline
from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision.transforms as T
import torchvision
import numpy as np

import cv2
import random
import warnings
warnings.filterwarnings('ignore')
In [2]:
# load model
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# set to evaluation mode
model.eval()

# load COCO category names
COCO_CLASS_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /root/.cache/torch/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
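The 'N/A' entries are placeholders: COCO category ids run from 1 to 90, but only 80 classes are actually used, so the unused ids are padded out to keep the indexing aligned. A predicted label indexes directly into this list:

print(COCO_CLASS_NAMES[1])   # 'person'
print(COCO_CLASS_NAMES[3])   # 'car'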

Instance segmentation pipeline

We define three utility functions used for model inference:

  • get_coloured_mask gives a predicted object mask a random colour for visualization
  • get_prediction takes an image path and a confidence threshold as input, and returns the predicted masks, bounding boxes, and classes
  • segment_instance uses get_prediction and displays the visualization result
In [0]:
def get_coloured_mask(mask):
  """
  get_coloured_mask
    parameters:
      - mask - a binary (0/1) mask predicted for one object
    method:
      - the mask of each predicted object is given a random colour for visualization
  """
  colours = [[0, 255, 0], [0, 0, 255], [255, 0, 0], [0, 255, 255], [255, 255, 0],
             [255, 0, 255], [80, 70, 180], [250, 80, 190], [245, 145, 50],
             [70, 150, 250], [50, 190, 190]]
  # build one channel at a time, then stack them into an RGB image
  r = np.zeros_like(mask).astype(np.uint8)
  g = np.zeros_like(mask).astype(np.uint8)
  b = np.zeros_like(mask).astype(np.uint8)
  r[mask == 1], g[mask == 1], b[mask == 1] = colours[random.randrange(0, len(colours))]
  coloured_mask = np.stack([r, g, b], axis=2)
  return coloured_mask
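As a quick sanity check, we can call it on a hypothetical toy mask (not one of the model's predictions):

toy_mask = np.zeros((4, 4), dtype=np.uint8)   # tiny binary mask
toy_mask[1:3, 1:3] = 1                        # a 2x2 square "object"
print(get_coloured_mask(toy_mask).shape)      # (4, 4, 3) -- an RGB overlay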
In [0]:
def get_prediction(img_path, confidence):
  """
  get_prediction
    parameters:
      - img_path - path of the input image
      - confidence - threshold used to keep or discard a prediction
    method:
      - image is obtained from the image path
      - the image is converted to an image tensor using PyTorch's transforms
      - the image is passed through the model to get the predictions
      - masks, classes and bounding boxes are obtained from the model, and the
        soft masks are made binary (0 or 1), i.e. the segment of e.g. a cat is
        made 1 and the rest of the image is made 0
  """
  img = Image.open(img_path).convert("RGB")   # ensure 3 channels
  transform = T.Compose([T.ToTensor()])
  img = transform(img)
  pred = model([img])
  pred_score = list(pred[0]['scores'].detach().numpy())
  # index of the last prediction above the threshold
  # (scores are sorted in descending order; assumes at least one match)
  pred_t = [pred_score.index(x) for x in pred_score if x > confidence][-1]
  # squeeze out the channel dim only, so a single detection keeps shape [1, H, W]
  masks = (pred[0]['masks'] > 0.5).squeeze(1).detach().cpu().numpy()
  pred_class = [COCO_CLASS_NAMES[i] for i in list(pred[0]['labels'].numpy())]
  # convert box corners to int tuples so they can be passed to OpenCV directly
  pred_boxes = [[(int(i[0]), int(i[1])), (int(i[2]), int(i[3]))] for i in list(pred[0]['boxes'].detach().numpy())]
  masks = masks[:pred_t + 1]
  pred_boxes = pred_boxes[:pred_t + 1]
  pred_class = pred_class[:pred_t + 1]
  return masks, pred_boxes, pred_class
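Once an image is on disk, the function can also be used on its own; an illustrative call (traffic.jpg is downloaded in Example 1 below):

masks, boxes, classes = get_prediction('./traffic.jpg', confidence=0.7)
print(len(masks))    # number of detections kept above the threshold
print(classes[:3])   # first few predicted class names, e.g. ['car', 'car', ...]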
In [0]:
def segment_instance(img_path, confidence=0.5, rect_th=2, text_size=2, text_th=2):
  """
  segment_instance
    parameters:
      - img_path - path to the input image
      - confidence - confidence threshold used to keep or discard a prediction
      - rect_th - rectangle thickness
      - text_size - text size
      - text_th - text thickness
    method:
      - prediction is obtained by get_prediction
      - each mask is given a random colour
      - each mask is blended onto the image in the ratio 1:0.5 with opencv
      - final output is displayed
  """
  masks, boxes, pred_cls = get_prediction(img_path, confidence)
  img = cv2.imread(img_path)
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  for i in range(len(masks)):
    rgb_mask = get_coloured_mask(masks[i])
    img = cv2.addWeighted(img, 1, rgb_mask, 0.5, 0)
    cv2.rectangle(img, boxes[i][0], boxes[i][1], color=(0, 255, 0), thickness=rect_th)
    cv2.putText(img, pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 0), thickness=text_th)
  plt.figure(figsize=(20, 30))
  plt.imshow(img)
  plt.xticks([])
  plt.yticks([])
  plt.show()

Making predictions

Now we are ready to run inference with the model. Let's look at a few examples. We use the same images as in the Faster-RCNN object detection post; since Mask-RCNN and Faster-RCNN share the same detection structure, similar detection results are expected.

Example 1

In [16]:
!wget -nv https://www.goodfreephotos.com/cache/other-photos/car-and-traffic-on-the-road-coming-towards-me.jpg -O traffic.jpg
segment_instance('./traffic.jpg', confidence=0.7)
2020-06-14 22:49:51 URL:https://www.goodfreephotos.com/cache/other-photos/car-and-traffic-on-the-road-coming-towards-me_800.jpg?cached=1522560655 [409997/409997] -> "traffic.jpg" [1]

The result is a bit surprising. The model not only detected the three cars in the picture, but also the person in the car, who is very indistinct.

Example 2

In [17]:
!wget -nv https://pixnio.com/free-images/2018/12/10/2018-12-10-18-38-14-1196x900.jpg -O traffic2.jpg
segment_instance('./traffic2.jpg', confidence=0.7)
2020-06-14 23:04:05 URL:https://pixnio.com/free-images/2018/12/10/2018-12-10-18-38-14-1196x900.jpg [189333/189333] -> "traffic2.jpg" [1]

It looks like we are getting quite accurate predictions with the model.

Example 3

In [19]:
!wget -nv https://storage.needpix.com/rsynced_images/pedestrian-zone-456909_1280.jpg -O pedestrian.jpg
segment_instance('./pedestrian.jpg', confidence=0.7)
2020-06-14 23:04:51 URL:https://storage.needpix.com/rsynced_images/pedestrian-zone-456909_1280.jpg [409534/409534] -> "pedestrian.jpg" [1]

Comparing inference time for CPU and GPU

Let's take a look at the inference time of the model on CPU and on GPU. I am using Google Colab for this experiment.

In [0]:
import time

def check_inference_time(image_path, gpu=False):
  # build a fresh model and prepare the image tensor
  model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
  model.eval()
  img = Image.open(image_path).convert("RGB")
  transform = T.Compose([T.ToTensor()])
  img = transform(img)
  # move both the model and the image to the chosen device
  if gpu:
    model.cuda()
    img = img.cuda()
  else:
    model.cpu()
    img = img.cpu()
  # time a single forward pass (for stricter GPU timing one would also
  # call torch.cuda.synchronize() before reading the clock)
  start_time = time.time()
  pred = model([img])
  end_time = time.time()
  return end_time - start_time
In [22]:
cpu_time = sum([check_inference_time('./traffic.jpg', gpu=False) for _ in range(10)])/10.0
gpu_time = sum([check_inference_time('./traffic.jpg', gpu=True) for _ in range(10)])/10.0


print('\n\nAverage time taken by the model with GPU = {}s\nAverage time taken by the model with CPU = {}s'.format(gpu_time, cpu_time))

Average time taken by the model with GPU = 0.18893978595733643s
Average time taken by the model with CPU = 5.276700663566589s
In [23]:
plt.bar([0.1, 0.2], [cpu_time, gpu_time], width=0.08)
plt.ylabel('Time/s')
plt.xticks([0.1, 0.2], ['CPU', 'GPU'])
plt.title('Inference time of Mask-RCNN with Resnet-50 backbone on CPU and GPU')
plt.show()

