AnyscaleFileMetadataProvider API

Info: If you want to access this feature, contact the Anyscale team.


AnyscaleFileMetadataProvider

AnyscaleFileMetadataProvider(use_sampling: bool = False)

A metadata provider that implements proprietary optimizations.

Metadata providers fetch information like file sizes. Ray Data uses this information to effectively parallelize reads.
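To illustrate why file sizes matter for parallelism, here is a minimal sketch that packs files into a fixed number of read tasks so that each task handles a similar number of bytes. This is not Ray Data's actual scheduling logic; the function name `pack_files_into_tasks`, the file names, and the sizes are all hypothetical.

```python
import heapq


def pack_files_into_tasks(file_sizes, num_tasks):
    """Greedily assign files to tasks, largest first, to balance bytes per task.

    Illustrative only: this is an assumption about how sizes *could* be used,
    not a description of Ray Data internals.
    """
    # Min-heap of (bytes assigned so far, task index).
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    tasks = [[] for _ in range(num_tasks)]

    # Place the largest files first, always into the currently lightest task.
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        assigned_bytes, idx = heapq.heappop(heap)
        tasks[idx].append(path)
        heapq.heappush(heap, (assigned_bytes + size, idx))
    return tasks


# Hypothetical file sizes in bytes.
sizes = {"a.jpg": 100, "b.jpg": 60, "c.jpg": 50, "d.jpg": 10}
print(pack_files_into_tasks(sizes, num_tasks=2))
# → [['a.jpg', 'd.jpg'], ['b.jpg', 'c.jpg']]
```

Without size information, a reader could only split by file count, which can leave one task with far more bytes than another.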

Parameters

  • use_sampling: If your input paths point to files and your files have similar sizes, set this to True. It optimizes reads by fetching less metadata.

Warning: If your input paths point to directories, don't use sampling. Your program will exhibit undefined behavior.
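The provider's sampling strategy is proprietary and isn't described here. The sketch below only illustrates the general idea of fetching less metadata when files have similar sizes: it looks up sizes for a small random sample and assumes every file matches the sample mean. The names `estimate_sizes_by_sampling` and `fetch_size` are hypothetical and not part of any Anyscale or Ray API.

```python
import random


def estimate_sizes_by_sampling(paths, fetch_size, sample_size=3, seed=0):
    """Estimate per-file sizes from a small sample instead of querying every file.

    `fetch_size` is a hypothetical callable that returns the size in bytes for
    one path (for example, a single metadata request to object storage).
    """
    if not paths:
        return {}
    rng = random.Random(seed)
    sampled = rng.sample(paths, min(sample_size, len(paths)))
    # Only the sampled files trigger a metadata lookup.
    mean_size = sum(fetch_size(p) for p in sampled) / len(sampled)
    # Assume every file is about as large as the sample mean.
    return {p: mean_size for p in paths}


# Hypothetical usage: 1,000 files that are all roughly 2 KiB.
lookups = []


def fake_fetch_size(path):
    lookups.append(path)
    return 2048


paths = [f"image_{i:06d}.jpg" for i in range(1000)]
estimates = estimate_sizes_by_sampling(paths, fake_fetch_size, sample_size=5)
print(len(lookups), len(estimates))
```

The estimate is only as good as the assumption that files have similar sizes, which is why the parameter description limits sampling to that case.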

Examples

Pass AnyscaleFileMetadataProvider to functions like read_images.

import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

ds = ray.data.read_images(
    "s3://anonymous@air-example-data/AnimalDetection",
    meta_provider=AnyscaleFileMetadataProvider(),
)

If your input paths point to files and your files have similar sizes, enable sampling to speed up your program.

import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

paths = [
    f"s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_{i:06d}.jpg"
    for i in range(5000)
]
ds = ray.data.read_images(
    paths,
    meta_provider=AnyscaleFileMetadataProvider(use_sampling=True),
    ignore_missing_paths=True,
)