Ray Data metadata API
AnyscaleFileMetadataProvider
AnyscaleFileMetadataProvider(use_sampling: bool = False)
A metadata provider that implements proprietary optimizations.
Metadata providers fetch information like file sizes. Ray Data uses this information to effectively parallelize reads.
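To illustrate why file sizes help with parallelization, here is a minimal sketch of how known sizes could be used to spread files evenly across read tasks. It is not Ray Data's or Anyscale's actual planner. The function plan_read_tasks, the file names, and the sizes are all hypothetical and exist only for this illustration.

```python
import heapq


def plan_read_tasks(file_sizes, num_tasks):
    """Greedily assign files to tasks so per-task byte totals stay balanced.

    Hypothetical helper for illustration only; not part of Ray Data.
    """
    # Min-heap of (total_bytes, task_index): the lightest task is always on top.
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    tasks = [[] for _ in range(num_tasks)]
    # Place the largest files first (longest-processing-time heuristic).
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        tasks[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return tasks


# Made-up sizes in bytes.
sizes = {"a.jpg": 900, "b.jpg": 500, "c.jpg": 400, "d.jpg": 100}
tasks = plan_read_tasks(sizes, num_tasks=2)
print(tasks)
```

Without size information, a planner can only split by file count, which may leave one task with most of the bytes.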
Parameters
use_sampling: If your input paths point to files and your files have similar sizes, set this to True. Sampling optimizes reads by fetching less metadata.
Warning
If your input paths point to directories, don't use sampling. With directory paths, sampling causes undefined behavior.
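The general idea behind sampling can be sketched as follows: fetch sizes for a few files and assume the rest are about the same. The provider's real optimizations are proprietary, so this is only an assumption-laden illustration. The names estimate_sizes and fake_fetch_size, and the sample count, are hypothetical.

```python
import random


def estimate_sizes(paths, fetch_size, sample_count=3, seed=0):
    """Estimate every file's size from the mean size of a small random sample.

    Hypothetical helper for illustration only; it assumes each path is a
    single file and that files have similar sizes.
    """
    if not paths:
        return {}
    rng = random.Random(seed)
    sample = rng.sample(paths, min(sample_count, len(paths)))
    # Only the sampled paths trigger a (potentially slow) metadata request.
    mean_size = sum(fetch_size(p) for p in sample) / len(sample)
    return {p: mean_size for p in paths}


calls = []


def fake_fetch_size(path):
    # Stand-in for a remote metadata request; records each call.
    calls.append(path)
    return 1000


paths = [f"file_{i}.jpg" for i in range(100)]
estimates = estimate_sizes(paths, fake_fetch_size)
print(len(calls), "metadata requests for", len(estimates), "files")
```

The estimate is only as good as the assumption that sizes are similar; if sizes vary widely, the mean of a small sample can be far from the true size of any given file.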
Examples
Pass AnyscaleFileMetadataProvider as the meta_provider argument to read functions like read_images.
import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

ds = ray.data.read_images(
    "s3://anonymous@air-example-data/AnimalDetection",
    meta_provider=AnyscaleFileMetadataProvider(),
)
If your input paths point to files and your files have similar sizes, enable sampling to speed up your program.
import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

# Every entry is a path to an individual file, not a directory.
paths = [
    f"s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_{i:06d}.jpg"
    for i in range(5000)
]

ds = ray.data.read_images(
    paths,
    meta_provider=AnyscaleFileMetadataProvider(use_sampling=True),
    # Skip generated paths that don't exist instead of raising an error.
    ignore_missing_paths=True,
)