Storage
While file storage works well for your code, AI workloads often need access to large amounts of data, whether it's data for training and fine-tuning or common storage for model checkpoints.
This guide describes:
- Different types of storage
- How to access these different storage types
- How to configure them
Anyscale provides out-of-the-box storage options for different use cases:
- Local storage for a node
- Object storage
- Storage shared across nodes
Local storage for a node
Anyscale supports the Non-Volatile Memory Express (NVMe) interface for accessing SSD storage volumes, which provide additional temporary storage for instances. Anyscale gives each node its own volume and disk and doesn't share them with other nodes. This storage option enables higher access speed, lower latency, and scalability. Access local storage at /mnt/local_storage.

See NVMe configuration for details on how to configure it.
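As a sketch, writing scratch data to the local volume looks like the following. The stand-in directory below substitutes for /mnt/local_storage so the commands also run outside an Anyscale node:

```shell
# On an Anyscale node, use /mnt/local_storage; a temp dir stands in here.
SCRATCH="${LOCAL_STORAGE_DIR:-/tmp/local_storage_demo}"
mkdir -p "$SCRATCH"

# Write a throwaway file; this data is temporary and visible to this node only.
echo "intermediate shard" > "$SCRATCH/shard-0.txt"
cat "$SCRATCH/shard-0.txt"
```

Because the volume is per-node and temporary, treat anything under this path as disposable scratch space.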
Object storage
For every Anyscale cloud, Anyscale configures the default object storage bucket during the cloud deployment. All workspace, job, and service clusters within an Anyscale cloud have permission to read and write to its default bucket.
Use the following environment variables to access the default bucket:
- ANYSCALE_CLOUD_STORAGE_BUCKET: the name of the storage bucket for the cloud. The bucket name format is anyscale-production-data-{cloud_id}. You can customize this format if you choose to bring your own bucket.
- ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: the region of the storage bucket for the cloud.
- ANYSCALE_ARTIFACT_STORAGE: the URI to the pre-generated path for storing your artifacts while keeping them separate from Anyscale-generated ones.
  - AWS: s3://<org_id>/<cloud_id>/artifact_storage/
  - GCP: gs://<org_id>/<cloud_id>/artifact_storage/
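For illustration, here is what these variables could look like on a hypothetical AWS cloud. The cloud ID and org ID below are made up; on a real Anyscale cluster these variables are already set for you:

```shell
# Stand-in IDs for illustration only; real values come from your Anyscale cloud.
CLOUD_ID="cld_abc123"
ORG_ID="org_xyz789"

# Default bucket name format (customizable if you bring your own bucket).
ANYSCALE_CLOUD_STORAGE_BUCKET="anyscale-production-data-${CLOUD_ID}"
ANYSCALE_ARTIFACT_STORAGE="s3://${ORG_ID}/${CLOUD_ID}/artifact_storage"

echo "$ANYSCALE_CLOUD_STORAGE_BUCKET"
echo "$ANYSCALE_ARTIFACT_STORAGE"
```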
Within the bucket, Anyscale stores managed data in the {organization_id}/ path. For cloud-specific managed data, Anyscale further groups the data into an {organization_id}/{cloud_id} path. Anyscale stores some log files in legacy folders.
Anyscale writes system- and user-generated files, for example log files, to this bucket. Don't delete or edit Anyscale-managed files; doing so may lead to unexpected data loss and degrades the platform experience for features such as log viewing and log downloading. Use $ANYSCALE_ARTIFACT_STORAGE to separate your files from Anyscale-generated ones.
Anyscale offers 100 GB of free storage. If you need more storage, contact Anyscale support.
Storage shared across nodes
Anyscale automatically mounts a Network File System (NFS) on workspace, job, and service clusters, with three shared storage options mounted by default for common permission groups:
- /mnt/cluster_storage is accessible to all nodes of a workspace, job, or service cluster.
- /mnt/user_storage is private to the Anyscale user but accessible from every node of all their workspace, job, and service clusters.
- /mnt/shared_storage is accessible to all Anyscale users of the same Anyscale cloud. Anyscale mounts it on every node of all the clusters in the same cloud.
NFS storage is accessible to all users on your Anyscale cloud. Don't store sensitive data or secrets there that you don't want other users in your cloud to access.
Cluster storage
/mnt/cluster_storage is a directory on NFS that Anyscale mounts on every node of the workspace, job, or service cluster and that persists throughout the lifecycle of the cluster. This storage is useful for files that the head node and all the worker nodes need to access. For example:
- TensorBoard logs
- Common data files that all workers need to access with a stable path
Following are some behaviors to note about cluster storage:
- Anyscale doesn't clone the cluster storage when you clone a workspace.
- New jobs and service updates launch new clusters, so /mnt/cluster_storage doesn't persist in these cases.
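For example, a training script might write TensorBoard logs under cluster storage so that every node reads them from the same stable path. The stand-in directory below substitutes for /mnt/cluster_storage when you run this outside an Anyscale cluster:

```shell
# On an Anyscale cluster, use /mnt/cluster_storage; a temp dir stands in here.
CLUSTER_STORAGE="${CLUSTER_STORAGE_DIR:-/tmp/cluster_storage_demo}"
LOG_DIR="$CLUSTER_STORAGE/tensorboard/run-1"
mkdir -p "$LOG_DIR"

# Any node in the cluster can read and write under this shared path.
echo "event data" > "$LOG_DIR/events.out"
ls "$LOG_DIR"
```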
User storage
/mnt/user_storage is a directory on NFS specific to an Anyscale user. The user who creates the workspace, job, or service cluster can access this storage from every node. This storage is useful for files you need across multiple workspace, job, or service clusters.
Shared storage
/mnt/shared_storage is a directory on NFS that all Anyscale users of the same Anyscale cloud can access. Anyscale mounts it on every node of every cluster in the same cloud. This storage is useful for model checkpoints and other artifacts that you want to share with your team.
NFS storage usually has connection limits, which vary by cloud provider. See Changing the default disk size for more information. To increase the capacity of GCP Filestore instances, see the GCP documentation.
Anyscale-hosted clouds use s3fs to mount the shared storage.
Access storage
Upload and download files (workspaces)
- The easiest way to transfer small files to and from a workspace cluster is the VS Code web UI.
- You can also commit files to Git and run git pull from the workspace cluster.
- For large files, use object storage, for example Amazon S3 or Google Cloud Storage, and access the data from there.
Upload and download data to object storage
Anyscale provides a default cloud storage path, private to each cloud, at $ANYSCALE_ARTIFACT_STORAGE. All nodes launched within your cloud have access to read and write files at this path.

To copy files between your workspace cluster and cloud storage, use the standard aws s3 cp and gcloud storage cp commands.
Write to artifact storage:

```shell
echo "hello world" > /tmp/input.txt
aws s3 cp /tmp/input.txt $ANYSCALE_ARTIFACT_STORAGE/saved.txt
```

Read from artifact storage:

```shell
aws s3 cp $ANYSCALE_ARTIFACT_STORAGE/saved.txt /tmp/output.txt
cat /tmp/output.txt
```
Anyscale scopes permissions on the cloud storage bucket backing $ANYSCALE_ARTIFACT_STORAGE to only the specified path, so calls made to the root of the underlying bucket, for example HeadObject, may be rejected with an ACCESS_DENIED error. Avoid making calls to any paths that don't explicitly have the $ANYSCALE_ARTIFACT_STORAGE/ prefix.
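Because only paths under the prefix are accessible, it can help to guard a target URI before issuing a copy. A minimal sketch, using stand-in URIs rather than real bucket paths:

```shell
# Stand-in value; on a real cluster this env var is already set.
ANYSCALE_ARTIFACT_STORAGE="s3://org_xyz/cld_abc123/artifact_storage"

# Reject any target URI that falls outside the artifact-storage prefix.
check_prefix() {
  case "$1" in
    "$ANYSCALE_ARTIFACT_STORAGE"/*) echo "ok" ;;
    *) echo "denied" ;;
  esac
}

check_prefix "$ANYSCALE_ARTIFACT_STORAGE/saved.txt"   # prints "ok"
check_prefix "s3://org_xyz/cld_abc123/other.txt"      # prints "denied"
```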
Access private cloud storage
To access private cloud storage buckets that aren't managed by Anyscale, configure permissions using the patterns below.
AWS S3
- Add the credentials as tracked environment variables under the Dependencies tab of a workspace. Anyscale automatically propagates them to all Ray workloads run through the workspace, job, or service.
- Then, use the bucket directly in your Ray app code, or similar:

```python
ds = ray.data.read_parquet("s3://<your-bucket-name>/<path>")
```

GCS
- Add a path to the credential as a tracked environment variable under the Dependencies tab of a workspace, using the variable name GOOGLE_APPLICATION_CREDENTIALS. An example value is ./google-service-account.json. Anyscale automatically propagates it to all Ray workloads run through the workspace, job, or service.
- Then, use the bucket directly in your Ray app code, or similar:

```python
ds = ray.data.read_parquet("gcs://<your-bucket-name>/<path>")
```
To use private cloud storage buckets concurrently with $ANYSCALE_ARTIFACT_STORAGE, pass the generated access keys or service account directly into the calls to the cloud provider API, instead of setting them as process-wide global variables.
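One way to scope a credential to a single call in a shell is the per-command environment prefix, which avoids exporting it session-wide. The variable name below is a made-up stand-in for a real credential variable such as AWS_ACCESS_KEY_ID:

```shell
# The prefix form makes MY_CREDENTIAL visible only to the child command.
MY_CREDENTIAL="secret-token" sh -c 'echo "child sees: $MY_CREDENTIAL"'

# The parent shell never set or exported it.
echo "parent sees: ${MY_CREDENTIAL:-unset}"
```

This prints "child sees: secret-token" followed by "parent sees: unset", showing the credential never leaks into the wider session.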
For large-scale workloads, consider co-locating compute and storage to avoid large data egress costs.
Contact Anyscale support for help with accessing your private bucket.
Choosing which storage to use
The choice of storage depends on your performance expectations, file sizes, collaboration needs, and security requirements. Key considerations include:
- NFS is generally slower when a workload generates a large amount of disk I/O, for both reads and writes.
- Don't put large files, such as terabyte-scale datasets, in NFS storage. Use object storage, like an S3 bucket, for files larger than 10 GB.
- To share small files across different workspaces, jobs, or services, user and shared storage are good options.
- Use cluster storage, /mnt/cluster_storage, when developing or iterating, for example to keep model weights loaded without having to set up object storage. For production or high-scale workloads, use object storage.