Batch inference: detect objects in videos
This guide shows you how to detect objects in videos using Torch and Ray Data. If you want to detect objects in terabytes of videos, this guide is for you.
You'll read videos, preprocess frames, run a model, and save detection results.
Prerequisites
Before you begin, complete the following steps:
- Onboard onto Anyscale.
- Configure write access to an S3 bucket.
- Create a workspace with the ML image.
1. Install Decord
The video API depends on Decord to load video frames. To install it, run the following command:
pip install --user decord
To learn more about installing dependencies into your environment, see Anyscale environment.
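To confirm the installation succeeded, you can import Decord in a Python session. This is an optional sanity check; the version attribute is assumed to be present and isn't required for the rest of the guide.
import decord
# If the import succeeds, Decord is installed. The version lookup is best-effort.
print("Decord version:", getattr(decord, "__version__", "unknown"))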
2. Read video frames
VideoDatasource is the primary API for loading videos. It loads video frames into a Dataset, where each row represents a single frame.
Call ray.data.read_datasource() and pass in a VideoDatasource.
import ray
from ray.anyscale.data import VideoDatasource
dataset = ray.data.read_datasource(
    VideoDatasource(),
    # The videos used here are free to use from pexels.com.
    # Credits go to Pavel Danilyuk (https://www.pexels.com/@pavel-danilyuk/).
    paths="s3://anonymous@ray-example-data/video-dataset/",
    # To determine which video a frame belongs to, set `include_paths=True`.
    include_paths=True,
)
Next, call Dataset.take() to inspect rows.
rows = dataset.take(1)
Each row should look like this:
{'frame': array([[[...]]], dtype=uint8), 'frame_index': 0, 'path': 'ray-example-data/video-dataset/sample_video_000.mp4'}
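Because include_paths=True adds a path column, you can also check how many frames were decoded from each video. This is an optional sketch that assumes the dataset from the previous step:
# Optional: count frames per source video using the `path` column.
frames_per_video = dataset.groupby("path").count()
frames_per_video.show()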
You can visualize frames with PIL:
from PIL import Image
Image.fromarray(rows[0]["frame"]).show()
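If you're working in a headless environment where .show() can't open a window, you can save the frame to a file instead:
# Alternative for headless environments: write the frame to disk and open it locally.
Image.fromarray(rows[0]["frame"]).save("frame.png")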
3. Preprocess frames
Call Dataset.map() to preprocess your dataset. The transform converts each frame's dtype from uint8 to float and scales the values accordingly.
from typing import Dict
import numpy as np
from torchvision import transforms
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_V2_Weights
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
transform = transforms.Compose([transforms.ToTensor(), weights.transforms()])
def transform_frame(row: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    row["frame"] = transform(row["frame"])
    return row
dataset = dataset.map(transform_frame)
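To verify that the transform was applied, you can inspect a transformed row. This optional check assumes the dataset from the step above; the exact type of the frame column can vary with your Ray version.
# Optional: confirm that frames are now float values scaled to [0, 1].
sample = dataset.take(1)[0]
print(type(sample["frame"]), getattr(sample["frame"], "dtype", None))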
4. Detect objects
Implement a callable that performs inference. Set up your model in __init__ and invoke the model in __call__.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
class DetectObjects:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = fasterrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.9)
        self.model.to(self.device)
        self.model.eval()

    def __call__(self, batch: Dict[str, np.ndarray]):
        inputs = [torch.from_numpy(frame).to(self.device) for frame in batch["frame"]]
        with torch.inference_mode():
            outputs = self.model(inputs)
        return {
            "path": batch["path"],
            "frame": batch["frame"],
            "frame_index": batch["frame_index"],
            "labels": [output["labels"].detach().cpu().numpy() for output in outputs],
            "boxes": [output["boxes"].detach().cpu().numpy() for output in outputs],
        }
Then, call Dataset.map_batches(). You should configure num_gpus_in_cluster appropriately for your cluster; one way to compute it is shown after the example.
num_gpus_in_cluster = 1
results = dataset.map_batches(
    DetectObjects,
    compute=ray.data.ActorPoolStrategy(size=num_gpus_in_cluster),
    batch_size=4,  # Choose the largest batch size that fits in GPU memory.
    num_gpus=1,  # Number of GPUs per worker.
)
5. Inspect results
Call Dataset.take() to inspect the inference results.
rows = results.take(1)
Each result row should look like this:
{'path': 'ray-example-data/video-dataset/sample_video_000.mp4', 'frame': array([[[...]]], dtype=float32), 'frame_index': 46, 'labels': array([ 1, 86, 64, 64, 64]), 'boxes': array([[...]], dtype=float32)}
You can visualize the results with TorchVision's draw_bounding_boxes().
from torchvision.utils import draw_bounding_boxes
import torchvision.transforms.functional as F
image = torch.as_tensor(rows[0]["frame"] * 255, dtype=torch.uint8)
boxes = torch.as_tensor(rows[0]["boxes"])
image_with_boxes = draw_bounding_boxes(image, boxes, colors="red")
F.to_pil_image(image_with_boxes).show()
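The labels are COCO category IDs. To show human-readable class names next to the boxes, you can look the IDs up in the weights metadata. This is a short sketch that assumes the weights object from step 3 and the rows from above.
# Map COCO category IDs to class names and draw them alongside the boxes.
categories = weights.meta["categories"]
label_names = [categories[label] for label in rows[0]["labels"]]
image_with_labels = draw_bounding_boxes(image, boxes, labels=label_names, colors="red")
F.to_pil_image(image_with_labels).show()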
6. Write results to S3
Call Dataset.write_parquet() and pass in a URI pointing to a folder in S3. Your nodes must have write access to the folder. To write results to other formats, see Saving tensor data.
results.write_parquet("s3://sample-bucket/my-inference-results")
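To analyze the detections later, for example in another job, you can read them back with ray.data.read_parquet(). Replace the bucket URI with the one you wrote to.
# Load the saved detections back into a Dataset for downstream analysis.
saved = ray.data.read_parquet("s3://sample-bucket/my-inference-results")
saved.show(1)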