Best practices for handling large datasets
This guide highlights best practices for handling large datasets on a distributed cluster such as an Anyscale workspace. Understanding how developing in a workspace differs from developing locally sets the context for these recommendations.
In traditional development, you might be used to:
- Keeping all data files alongside your code
- Loading entire datasets into memory
- Running everything on a single machine
Anyscale's distributed environment works differently:
- Your code runs across multiple machines (nodes)
- The head node orchestrates work across worker nodes
- Data needs to be accessible to all nodes efficiently
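If you want to see this topology from code, Ray can report the nodes and resources in the current cluster. The following is a minimal sketch, assuming it runs inside a workspace where Ray is already installed:

import ray

# Connect to the running Ray cluster (a no-op if already connected).
ray.init(ignore_reinit_error=True)

# One entry per node: the head node plus any worker nodes.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("CPU", 0))

# Aggregate resources across all nodes.
print(ray.cluster_resources())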
Data storage options: When to use what
Option 1: Working directory of workspace (good for small files)
For small files (code, configs, small datasets), storing directly in your workspace works well:
import ray
from ray.data import read_csv
# Reading a small CSV file from workspace.
dataset = read_csv("/home/ray/default/data/small_config.csv")
The working directory of a workspace is typically best for:
- Configuration files
- Small datasets (up to a few hundred MB)
- Temporary results
- Quick experiments
The following are considerations for choosing workspace storage:
- Workspace storage limits depend on your cluster configuration.
- Anyscale snapshots all files every 5 minutes (unless ignored).
- Add large or temporary files to .anyscaleignore to prevent snapshot bloat.
- For persistence, Ray packs the default working directory, with a limit of ~10 GB.
- For task submission, Ray packs the current working directory, and the limit is 500 MiB. This is a Ray Core limit.
# .anyscaleignore
temp_results/
*.tmp
intermediate_data/
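.anyscaleignore controls what goes into workspace snapshots; the working directory that Ray packs for task submission is configured separately through the runtime environment. The following is a minimal sketch, assuming the same temporary directories as above, that keeps the packed working directory small by excluding them (the excludes field accepts .gitignore-style patterns):

import ray

# Exclude large or temporary files so the packed working directory
# stays under the 500 MiB task-submission limit.
ray.init(
    runtime_env={
        "working_dir": ".",
        "excludes": ["temp_results/", "*.tmp", "intermediate_data/"],
    }
)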
Option 2: NFS storage
NFS provides shared storage that can be accessed by all nodes:
# Reading from NFS storage
dataset = read_csv("/mnt/cluster_storage/medium_dataset.csv")
NFS storage is typically best for:
- Medium-sized datasets (up to 10GB)
- Files shared across workspaces, jobs, or services
- Development and iteration workflows
- Temporary storage of model weights
- Scenarios where object storage setup isn't necessary
The following are considerations for choosing NFS:
- Performance may degrade under heavy disk I/O.
- All nodes access the same mount at /mnt/cluster_storage (see the sketch after this list).
- Better suited for development than production workloads.
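Because every node mounts the same path, a file written on one node is immediately visible to tasks running on other nodes. The following is a minimal sketch; the file name and contents are placeholders:

import ray

ray.init(ignore_reinit_error=True)

# Write a small artifact (for example, model weights) to shared NFS storage.
checkpoint_path = "/mnt/cluster_storage/example_checkpoint.txt"
with open(checkpoint_path, "w") as f:
    f.write("weights-v1")

@ray.remote
def read_shared(path):
    # Worker nodes see the same mount, so no explicit data transfer is needed.
    with open(path) as f:
        return f.read()

print(ray.get(read_shared.remote(checkpoint_path)))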
Option 3: Cloud object storage (best for larger datasets)
For workloads with substantial data, use object storage:
import ray
from ray.data import read_parquet
# Reading from cloud storage.
dataset = read_parquet("s3://your-bucket/large-dataset/")
Cloud object storage is typically best for:
- Large datasets (GB or TB scale)
- Data that needs to persist beyond workspace lifecycle
- Shared datasets used by multiple workspaces
- Production pipelines
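Writing results back to object storage works the same way. The following is a minimal sketch of a read-transform-write round trip; the bucket and prefix names are placeholders, and add_label stands in for your own transform:

import numpy as np
import ray
from ray.data import read_parquet

dataset = read_parquet("s3://your-bucket/large-dataset/")

def add_label(batch):
    # Placeholder transform: add a constant column to each batch.
    n = len(next(iter(batch.values())))
    batch["label"] = np.full(n, "processed")
    return batch

# Blocks stream from object storage, through the transform, and back out.
dataset.map_batches(add_label).write_parquet("s3://your-bucket/processed/")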
Common pitfalls
Pitfall 1: Memory limitations
Problem: Loading a large dataset into the head node memory
# DON'T DO THIS with large datasets.
import pandas as pd
df = pd.read_csv("/home/ray/default/huge_dataset.csv") # Could crash your workspace.
Solution: Use Ray Data to stream and process data
# DO THIS instead.
import ray
from ray.data import read_csv
dataset = read_csv("s3://your-bucket/huge_dataset.csv")
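Ray Data reads lazily and streams blocks through the cluster, so the full dataset never has to fit in head-node memory. If you do need to consume results locally, iterate over bounded batches rather than materializing everything; the batch size below is just an example:

# Process the data in bounded-size batches instead of one giant DataFrame.
for batch in dataset.iter_batches(batch_size=4096, batch_format="pandas"):
    # `batch` is a small pandas DataFrame; only a few batches are in memory at a time.
    print(len(batch))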
Pitfall 2: Workspace snapshot limits
Problem: Including large temporary files in snapshots
# This generates a large file that Anyscale includes in snapshots.
with open("/home/ray/default/large_temp_file.dat", "wb") as f:
    f.write(b"\0" * 5_000_000_000)  # 5 GB file.
Solution: Add the file to .anyscaleignore, or store it in object storage
# In .anyscaleignore
large_temp_file.dat
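Alternatively, write large intermediate artifacts somewhere that is never snapshotted, such as object storage. The following is a minimal sketch using boto3 (assumed to be installed); the bucket name, key, and payload size are placeholders, and you need write access to the bucket:

import io
import boto3

s3 = boto3.client("s3")

# Stream the temporary data straight to object storage instead of the workspace.
s3.upload_fileobj(io.BytesIO(b"\0" * 1_000_000), "your-bucket", "tmp/large_temp_file.dat")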
Pitfall 3: Data transfer bottlenecks
Problem: Transferring data from the head node to workers inefficiently
# DON'T DO THIS with large data.
@ray.remote
def process(data):
    # Data gets serialized and sent to each worker.
    return process_func(data)

big_data = load_big_dataset()  # On the head node.
futures = [process.remote(big_data) for _ in range(10)]  # Copies sent to each worker.
Solution: Use Ray Data for efficient distribution
# DO THIS instead.
dataset = ray.data.read_parquet("s3://your-bucket/data/")
processed = dataset.map_batches(process_func)
Conclusion
- Small files can live in your workspace; just manage them with .anyscaleignore
- Larger datasets belong in cloud object storage
- Let Ray Data handle the heavy lifting of distributed data processing