Ray Data metadata API
AnyscaleFileMetadataProvider
AnyscaleFileMetadataProvider(use_sampling: bool = False)
A metadata provider that implements proprietary optimizations.
Metadata providers fetch information like file sizes. Ray Data uses this information to effectively parallelize reads.
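To see why file sizes help with parallelization, consider the following minimal sketch. It is an illustration only, not Ray Data's actual scheduling logic: the function name `group_files_by_size` and the greedy strategy are assumptions made for this example. Given known file sizes, it assigns files to a fixed number of read tasks so that each task reads a similar number of bytes.

```python
import heapq
from typing import Dict, List


def group_files_by_size(file_sizes: Dict[str, int], num_tasks: int) -> List[List[str]]:
    """Hypothetical helper: greedily balance files across read tasks by byte count."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")

    # Min-heap of (bytes assigned so far, task index).
    heap = [(0, i) for i in range(num_tasks)]
    groups: List[List[str]] = [[] for _ in range(num_tasks)]

    # Place the largest files first, always on the currently lightest task.
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, index = heapq.heappop(heap)
        groups[index].append(path)
        heapq.heappush(heap, (total + size, index))
    return groups


sizes = {"a.jpg": 100, "b.jpg": 60, "c.jpg": 40, "d.jpg": 10}
groups = group_files_by_size(sizes, num_tasks=2)
```

Without size information, the only option would be to split files by count, which can leave one task with most of the bytes.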
Parameters
use_sampling: If your input paths point to files and your files have similar sizes, set this to True. It optimizes reads by fetching less metadata.
Warning
If your input paths point to directories, don't use sampling. Your program will exhibit undefined behavior.
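The idea of fetching less metadata can be sketched as follows. This is a conceptual illustration, not the provider's implementation, which is proprietary: `estimate_total_size` and its `get_size` callback are hypothetical names. The sketch fetches sizes for a small random sample of files and extrapolates to the full list, which is reasonable only when every path is a single file and the files have similar sizes.

```python
import random
from typing import Callable, Sequence


def estimate_total_size(
    paths: Sequence[str],
    get_size: Callable[[str], int],
    sample_size: int = 10,
    seed: int = 0,
) -> int:
    """Hypothetical helper: estimate total bytes from a sample of file sizes."""
    if not paths:
        return 0
    rng = random.Random(seed)
    # Fetch metadata for only a few paths instead of all of them.
    sample = rng.sample(list(paths), min(sample_size, len(paths)))
    mean_size = sum(get_size(p) for p in sample) / len(sample)
    # Extrapolate, assuming the unsampled files are about the same size.
    return int(mean_size * len(paths))


calls = []


def fake_get_size(path: str) -> int:
    # Stand-in for a real metadata request; records each lookup.
    calls.append(path)
    return 2048


paths = [f"file_{i}.jpg" for i in range(1000)]
estimate = estimate_total_size(paths, fake_get_size)
```

Here only 10 of the 1,000 files are inspected. The extrapolation has no meaningful interpretation for a directory path, since a directory stands for an unknown number of files.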
Examples
Pass AnyscaleFileMetadataProvider to functions like read_images.
import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

# The input is a directory, so leave sampling disabled (the default).
ds = ray.data.read_images(
    "s3://anonymous@air-example-data/AnimalDetection",
    meta_provider=AnyscaleFileMetadataProvider(),
)
If your input paths point to files and your files have similar sizes, enable sampling to speed up your program.
import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

# Every entry is an explicit file path, so sampling can be enabled.
paths = [
    f"s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_{i:06d}.jpg"
    for i in range(5000)
]
ds = ray.data.read_images(
    paths,
    meta_provider=AnyscaleFileMetadataProvider(use_sampling=True),
    # Skip generated paths that don't exist instead of raising an error.
    ignore_missing_paths=True,
)
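Because sampling is only appropriate for file paths, it can help to check the inputs before enabling it. The following is a rough heuristic sketch, not part of the Anyscale API: `all_paths_look_like_files` is a hypothetical helper that treats a path as a file only if its last component ends with one of the given extensions. A directory whose name ends in an image extension would fool it, so treat the result as a hint.

```python
import posixpath
from typing import Iterable, Tuple
from urllib.parse import urlparse


def all_paths_look_like_files(
    paths: Iterable[str],
    extensions: Tuple[str, ...] = (".jpg", ".jpeg", ".png"),
) -> bool:
    """Hypothetical helper: True if every path ends in a known file extension."""
    paths = list(paths)
    for path in paths:
        # Take the last path component, ignoring the URI scheme and bucket.
        name = posixpath.basename(urlparse(path).path)
        if not name.lower().endswith(extensions):
            return False
    # An empty input gives no evidence either way, so report False.
    return bool(paths)


file_paths = [
    f"s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_{i:06d}.jpg"
    for i in range(3)
]
directory_paths = ["s3://anonymous@air-example-data/AnimalDetection"]

sampling_ok_for_files = all_paths_look_like_files(file_paths)
sampling_ok_for_directory = all_paths_look_like_files(directory_paths)
```

The result could then be passed as the `use_sampling` argument, so that directory inputs fall back to the default of `False`.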