AnyscaleMetadataProvider API
Info: If you want to access this feature, contact the Anyscale team.
AnyscaleMetadataProvider
AnyscaleMetadataProvider(use_sampling: bool = False)
A metadata provider that implements proprietary optimizations.
Metadata providers fetch information like file sizes. Ray Data uses this information to effectively parallelize reads.
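As a rough illustration of why file sizes help with parallelization, the toy sketch below groups files into read tasks that each cover a similar number of bytes. It is not Ray Data's actual scheduling logic. The `plan_read_tasks` helper, the greedy strategy, and the sample sizes are all assumptions made for this example.

```python
import heapq
from typing import Dict, List


def plan_read_tasks(file_sizes: Dict[str, int], num_tasks: int) -> List[List[str]]:
    """Greedily assign files to tasks so each task reads a similar number of bytes.

    Hypothetical helper for illustration; not part of Ray Data or ray_extensions.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    # Min-heap of (bytes assigned so far, task index).
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    tasks: List[List[str]] = [[] for _ in range(num_tasks)]
    # Place the largest files first, always on the least-loaded task.
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        tasks[idx].append(path)
        heapq.heappush(heap, (load + size, idx))
    return tasks


# Made-up file sizes in bytes, standing in for fetched metadata.
sizes = {"a.jpg": 900, "b.jpg": 500, "c.jpg": 400, "d.jpg": 100}
tasks = plan_read_tasks(sizes, num_tasks=2)
print(tasks)
```

Without size information, a planner could only split by file count, which may leave one task with far more bytes to read than another.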
Parameters
use_sampling: If your input paths point to files and your files have similar sizes, set this to True. It optimizes reads by fetching less metadata.
Warning: If your input paths point to directories, don't use sampling. Your program will exhibit undefined behavior.
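One way to avoid enabling sampling on directory paths by accident is to check the paths first. The sketch below is a heuristic only: `looks_like_file` and `can_use_sampling` are hypothetical helpers, not part of `ray_extensions`, and they assume that a file path ends in a component with an extension. The check does not verify that the files have similar sizes.

```python
import posixpath
from typing import List


def looks_like_file(path: str) -> bool:
    """Return True if the last path component has a file extension.

    Heuristic only: a directory name can contain a dot, and a file can
    lack an extension, so treat the result as a hint.
    """
    if path.endswith("/"):
        return False
    _, ext = posixpath.splitext(posixpath.basename(path))
    return bool(ext)


def can_use_sampling(paths: List[str]) -> bool:
    """Return True only if every input path looks like a file."""
    return bool(paths) and all(looks_like_file(p) for p in paths)


directory_paths = ["s3://anonymous@air-example-data/AnimalDetection"]
file_paths = [
    "s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_000001.jpg",
    "s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_000002.jpg",
]

print(can_use_sampling(directory_paths))
print(can_use_sampling(file_paths))
# The result could then be passed on, for example:
# AnyscaleFileMetadataProvider(use_sampling=can_use_sampling(file_paths))
```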
Examples
Pass AnyscaleFileMetadataProvider to functions like read_images.
import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

ds = ray.data.read_images(
    "s3://anonymous@air-example-data/AnimalDetection",
    meta_provider=AnyscaleFileMetadataProvider()
)
If your input paths point to files and your files have similar sizes, enable sampling to speed up your program.
import ray
from ray_extensions.data import AnyscaleFileMetadataProvider

paths = [
    f"s3://anonymous@air-example-data/AnimalDetection/JPEGImages/2007_{i:06d}.jpg"
    for i in range(5000)
]
ds = ray.data.read_images(
    paths,
    meta_provider=AnyscaleFileMetadataProvider(use_sampling=True),
    ignore_missing_paths=True
)
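The intuition for why sampling suits files of similar size can be sketched without Ray. The toy example below estimates the total size of a file list from a small random sample, so it makes far fewer size lookups than there are files. This is not how AnyscaleFileMetadataProvider works internally, since its optimizations are proprietary. `estimate_total_size` and `fake_get_size` are hypothetical stand-ins for real metadata requests.

```python
import random
from typing import Callable, List


def estimate_total_size(
    paths: List[str],
    get_size: Callable[[str], int],
    sample_size: int,
    seed: int = 0,
) -> int:
    """Estimate the total bytes across paths from a random sample of sizes.

    Hypothetical helper for illustration; the estimate is only reasonable
    when the files have similar sizes.
    """
    if not paths:
        return 0
    rng = random.Random(seed)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    # Only the sampled files trigger a size lookup.
    mean_size = sum(get_size(p) for p in sample) / len(sample)
    return int(mean_size * len(paths))


calls = []


def fake_get_size(path: str) -> int:
    """Stand-in for a metadata request; every file is 1000 bytes here."""
    calls.append(path)
    return 1000


example_paths = [f"file_{i}.jpg" for i in range(100)]
estimate = estimate_total_size(example_paths, fake_get_size, sample_size=10)
print(estimate, len(calls))
```

If file sizes vary widely, a small sample can give a poor estimate, which is why sampling fits datasets with similarly sized files.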