Batch inference: detect objects in videos
This guide shows you how to detect objects in videos using Torch and Ray Data. If you want to detect objects in terabytes of videos, this guide is for you.
You'll read videos, preprocess frames, run a model, and save detection results.
Prerequisites
Before you begin, complete the following steps:
- Onboard onto Anyscale.
- Configure write access to an S3 bucket.
- Create a workspace with the ML image.
1. Install Decord
The video API depends on Decord to load video frames. To install it, run the following command:
pip install --user decord
To learn more about installing dependencies into your environment, see Anyscale environment.
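To confirm the installation succeeded, you can import Decord in a Python session. This is an optional sanity check; the version attribute is assumed to be present and isn't required for the rest of the guide.
import decord
# If the import succeeds, Decord is installed. The version lookup is best-effort.
print("Decord version:", getattr(decord, "__version__", "unknown"))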
2. Read video frames
VideoDatasource is the primary API for loading videos. It loads video frames into a Dataset, where each row represents a single frame.
Call ray.data.read_datasource() and pass in a VideoDatasource.
import ray
from ray.anyscale.data import VideoDatasource
dataset = ray.data.read_datasource(
    VideoDatasource(),
    # The videos used here are free to use from pexels.com.
    # Credits go to Pavel Danilyuk (https://www.pexels.com/@pavel-danilyuk/).
    paths="s3://anonymous@ray-example-data/video-dataset/",
    # To determine which video a frame belongs to, set `include_paths=True`.
    include_paths=True,
)
Next, call Dataset.take() to inspect rows.
rows = dataset.take(1)
Each row should look like this:
{'frame': array([[[...]]], dtype=uint8), 'frame_index': 0, 'path': 'ray-example-data/video-dataset/sample_video_000.mp4'}
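Because include_paths=True adds a path column, you can also check how many frames were decoded from each video. This is an optional sketch that assumes the dataset from the previous step:
# Optional: count frames per source video using the `path` column.
frames_per_video = dataset.groupby("path").count()
frames_per_video.show()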
You can visualize frames with PIL:
from PIL import Image
Image.fromarray(rows[0]["frame"]).show()
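If you're working in a headless environment where .show() can't open a window, you can save the frame to a file instead:
# Alternative for headless environments: write the frame to disk and open it locally.
Image.fromarray(rows[0]["frame"]).save("frame.png")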
3. Preprocess frames
Call Dataset.map() to preprocess your dataset. The transform converts each frame's dtype from uint8 to float and scales the values accordingly.
from typing import Dict
import numpy as np
from torchvision import transforms
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_V2_Weights
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
transform = transforms.Compose([transforms.ToTensor(), weights.transforms()])
def transform_frame(row: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    row["frame"] = transform(row["frame"])
    return row
dataset = dataset.map(transform_frame)
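To verify that the transform was applied, you can inspect a transformed row. This optional check assumes the dataset from the step above; the exact type of the frame column can vary with your Ray version.
# Optional: confirm that frames are now float values scaled to [0, 1].
sample = dataset.take(1)[0]
print(type(sample["frame"]), getattr(sample["frame"], "dtype", None))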
4. Detect objects
Implement a callable that performs inference. Set up your model in __init__ and invoke the model in __call__.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
class DetectObjects:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = fasterrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.9)
        self.model.to(self.device)
        self.model.eval()

    def __call__(self, batch: Dict[str, np.ndarray]):
        inputs = [torch.from_numpy(frame).to(self.device) for frame in batch["frame"]]
        with torch.inference_mode():
            outputs = self.model(inputs)
        return {
            "path": batch["path"],
            "frame": batch["frame"],
            "frame_index": batch["frame_index"],
            "labels": [output["labels"].detach().cpu().numpy() for output in outputs],
            "boxes": [output["boxes"].detach().cpu().numpy() for output in outputs],
        }
Then, call Dataset.map_batches(). You should configure num_gpus_in_cluster appropriately for your cluster; one way to compute it is shown after the example.
num_gpus_in_cluster = 1
results = dataset.map_batches(
    DetectObjects,
    compute=ray.data.ActorPoolStrategy(size=num_gpus_in_cluster),
    batch_size=4,  # Choose the largest batch size that fits in GPU memory.
    num_gpus=1,  # Number of GPUs per worker.
)
5. Inspect results
Call Dataset.take() to inspect the inference results.
rows = results.take(1)
Each result row should look like this:
{'path': 'ray-example-data/video-dataset/sample_video_000.mp4', 'frame': array([[[...]]], dtype=float32), 'frame_index': 46, 'labels': array([ 1, 86, 64, 64, 64]), 'boxes': array([[...]], dtype=float32)}
You can visualize the results with TorchVision's draw_bounding_boxes().
from torchvision.utils import draw_bounding_boxes
import torchvision.transforms.functional as F
image = torch.as_tensor(rows[0]["frame"] * 255, dtype=torch.uint8)
boxes = torch.as_tensor(rows[0]["boxes"])
image_with_boxes = draw_bounding_boxes(image, boxes, colors="red")
F.to_pil_image(image_with_boxes).show()
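The labels are COCO category IDs. To show human-readable class names next to the boxes, you can look the IDs up in the weights metadata. This is a short sketch that assumes the weights object from step 3 and the rows from above.
# Map COCO category IDs to class names and draw them alongside the boxes.
categories = weights.meta["categories"]
label_names = [categories[label] for label in rows[0]["labels"]]
image_with_labels = draw_bounding_boxes(image, boxes, labels=label_names, colors="red")
F.to_pil_image(image_with_labels).show()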
6. Write results to S3
Call Dataset.write_parquet() and pass in a URI pointing to a folder in S3. Your nodes must have write access to the folder. To write results to other formats, see Saving tensor data.
results.write_parquet("s3://sample-bucket/my-inference-results")
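To analyze the detections later, for example in another job, you can read them back with ray.data.read_parquet(). Replace the bucket URI with the one you wrote to.
# Load the saved detections back into a Dataset for downstream analysis.
saved = ray.data.read_parquet("s3://sample-bucket/my-inference-results")
saved.show(1)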