Ray Data metadata API

AnyscaleFileMetadataProvider

AnyscaleFileMetadataProvider(use_sampling: bool = False)

A metadata provider that implements proprietary optimizations.

Metadata providers fetch information like file sizes. Ray Data uses this information to effectively parallelize reads.
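As a rough illustration of why file sizes help with parallelism, the sketch below groups files into read tasks that each cover a similar number of bytes. It is a generic greedy heuristic written for this page, not Ray Data's actual scheduling logic; the function name `plan_read_tasks` and the example sizes are hypothetical.

```python
import heapq


def plan_read_tasks(file_sizes, num_tasks):
    """Greedily assign files to tasks so each task reads a similar number of bytes.

    file_sizes: dict mapping path -> size in bytes (hypothetical metadata).
    Returns a list of (total_bytes, [paths]) tuples, one per task.
    """
    # Min-heap of (bytes assigned so far, task index).
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    tasks = [[] for _ in range(num_tasks)]

    # Place the largest files first; each goes to the currently lightest task.
    by_size = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in by_size:
        assigned, idx = heapq.heappop(heap)
        tasks[idx].append(path)
        heapq.heappush(heap, (assigned + size, idx))

    totals = {idx: assigned for assigned, idx in heap}
    return [(totals[i], tasks[i]) for i in range(num_tasks)]


# Hypothetical file sizes in bytes.
sizes = {"a.jpg": 400, "b.jpg": 300, "c.jpg": 200, "d.jpg": 100}
plan = plan_read_tasks(sizes, num_tasks=2)
```

Without size metadata, a planner could only split by file count, which can leave one task with most of the bytes.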

Parameters

  • use_sampling: If your input paths point to files and your files have similar sizes, set this to True. It optimizes reads by fetching less metadata.

Warning: If your input paths point to directories, don't use sampling. Your program will exhibit undefined behavior.
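One way to avoid enabling sampling by accident is to check the input paths first. The helper below is a hypothetical heuristic, not part of any Ray or Anyscale API: it treats a path as a file only if its last segment has an extension. A directory whose name contains a dot would fool it, so an authoritative check would have to query the filesystem.

```python
import posixpath
from urllib.parse import urlparse


def looks_like_file(path):
    """Heuristic: True if the last path segment has a file extension."""
    # For URIs such as s3://bucket/key, urlparse puts the key in .path;
    # plain local paths end up entirely in .path as well.
    name = posixpath.basename(urlparse(path).path)
    return bool(name) and posixpath.splitext(name)[1] != ""


def safe_to_sample(paths):
    """Return True only if every path looks like a file, not a directory."""
    return all(looks_like_file(p) for p in paths)


directory = ["s3://anonymous@air-example-data/AnimalDetection"]
files = [
    "s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_000001.jpg"
]

# Assumption: pass the result as use_sampling only when the check succeeds.
use_sampling_for_directory = safe_to_sample(directory)
use_sampling_for_files = safe_to_sample(files)
```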

Examples

Pass AnyscaleFileMetadataProvider to functions like read_images.

import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

ds = ray.data.read_images(
    "s3://anonymous@air-example-data/AnimalDetection",
    meta_provider=AnyscaleFileMetadataProvider(),
)

If your input paths point to files and your files have similar sizes, enable sampling to speed up your program.

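import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

# Build an explicit list of file paths (not directories), as sampling requires.
paths = [
    f"s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_{i:06d}.jpg"
    for i in range(5000)
]
ds = ray.data.read_images(
    paths,
    meta_provider=AnyscaleFileMetadataProvider(use_sampling=True),
    # Some generated paths may not exist; skip them instead of raising an error.
    ignore_missing_paths=True,
)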
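To see why similar file sizes matter for sampling, the sketch below estimates the total size of a dataset from a small random sample. It does not describe how AnyscaleFileMetadataProvider samples internally, which this page doesn't document; `estimate_total_bytes` and both size lists are made up for illustration.

```python
import random


def estimate_total_bytes(sizes, sample_size, seed=0):
    """Estimate sum(sizes) by extrapolating the mean of a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(sizes, min(sample_size, len(sizes)))
    mean = sum(sample) / len(sample)
    return mean * len(sizes)


# Hypothetical datasets: 5000 files each.
similar = [100_000 + (i % 7) * 1_000 for i in range(5000)]  # sizes within 6% of each other
skewed = [1_000] * 4990 + [50_000_000] * 10                 # a few huge files

errors = {}
for name, sizes in [("similar", similar), ("skewed", skewed)]:
    estimate = estimate_total_bytes(sizes, sample_size=50)
    errors[name] = abs(estimate - sum(sizes)) / sum(sizes)
    print(f"{name}: relative error {errors[name]:.1%}")
```

With similar sizes, any sample gives an estimate close to the true total. With skewed sizes, the estimate is far off whether or not the sample includes one of the huge files.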