Storage
While file storage is optimal for storing your code, AI workloads often need access to large amounts of data, whether it's data for training and fine-tuning or shared storage for model checkpoints.
This guide describes:
- Different types of storage available on Anyscale
- How to access these different storage types
- How to configure them
Available storage options on Anyscale:
- Local storage for a node
- Object storage
- Network file system (NFS) shared across nodes
Local storage for a node
Anyscale provides each node with its own volume and disk and doesn't share them with other nodes. This storage option provides higher access speed, lower latency, and scalability. Access local storage at /mnt/local_storage. Anyscale normally deletes data in local storage after instances terminate. To provide a more seamless development workflow, Anyscale Workspaces snapshot and persist your data in the project directory.
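Because /mnt/local_storage is node-local, a file written by a Ray task is visible only on the node that ran the task. The following is a minimal sketch, assuming a running Anyscale cluster; the file name is illustrative.
import ray

ray.init()

@ray.remote
def write_scratch(i: int) -> str:
    # /mnt/local_storage is node-local scratch space: this file exists only on
    # the node that runs the task and is deleted when the instance terminates.
    path = f"/mnt/local_storage/scratch_{i}.txt"  # illustrative file name
    with open(path, "w") as f:
        f.write("temporary node-local data")
    return path

print(ray.get([write_scratch.remote(i) for i in range(4)]))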
Anyscale supports the Non-Volatile Memory Express (NVMe) interface to access SSD storage volumes. This support provides additional temporary storage to the instances. See NVMe configuration for details on how to configure it.
Object storage
Anyscale requires users to set up a default object storage bucket during the deployment of each Anyscale cloud. Users can let Anyscale create the bucket or bring their own. All workspace, job, and service clusters within an Anyscale cloud have permission to read from and write to its default bucket.
Use the following environment variables to access the default bucket:
- ANYSCALE_CLOUD_STORAGE_BUCKET: the name of the default bucket for the cloud.
  - If Anyscale created the bucket, the name follows the format anyscale-production-data-{cloud_id}. You can find the cloud ID on the cloud list page in the console UI.
  - If you bring your own bucket, the name is the one you defined during cloud deployment.
- ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: the region of the default bucket for the cloud.
- ANYSCALE_ARTIFACT_STORAGE: the URI to the pre-generated path for storing your artifacts while keeping them separate from Anyscale-generated ones.
  - AWS: s3://<bucket_name>/<org_id>/<cloud_id>/artifact_storage/
  - GCP: gs://<bucket_name>/<org_id>/<cloud_id>/artifact_storage/
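Anyscale sets these variables automatically inside clusters, so you can read them directly from your code. A minimal sketch in Python:
import os

# These variables are available on every node of a workspace, job, or service cluster.
bucket = os.environ["ANYSCALE_CLOUD_STORAGE_BUCKET"]
region = os.environ["ANYSCALE_CLOUD_STORAGE_BUCKET_REGION"]
artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]

print(f"Default bucket: {bucket} ({region})")
print(f"Artifact storage prefix: {artifact_uri}")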
Within the bucket, Anyscale stores managed data under the {organization_id}/ path. For cloud-specific managed data, Anyscale further groups the data under an {organization_id}/{cloud_id} path. Anyscale stores some log files in legacy folders.
Anyscale writes system- and user-generated files, for example log files, to this bucket. Don't delete or edit Anyscale-managed files; doing so may lead to unexpected data loss and degrades the Anyscale platform experience for features such as log viewing and log downloading. Use $ANYSCALE_ARTIFACT_STORAGE to separate your files from Anyscale-generated ones.
Anyscale offers 100 GB of free object storage. If you need more storage, contact Anyscale support.
Storage shared across nodes
Anyscale automatically mounts Network File System (NFS) storage on workspace, job, and service clusters. Anyscale mounts three shared storage locations by default, each scoped to a different permission group:
- /mnt/cluster_storage is accessible to all nodes of a workspace, job, or service cluster.
- /mnt/user_storage is private to the Anyscale user but accessible from every node of all their workspace, job, and service clusters in the same cloud.
- /mnt/shared_storage is accessible to all Anyscale users of the same Anyscale cloud. Anyscale mounts it on every node of all the clusters in the same cloud.
NFS storage is accessible to all users in your Anyscale cloud. Don't put sensitive data or secrets there that you don't want other users in your cloud to access.
Cluster storage
/mnt/cluster_storage is a directory on NFS that Anyscale mounts on every node of the workspace, job, or service cluster and that persists throughout the lifecycle of the cluster. This storage is useful for storing files that the head node and all the worker nodes need to access. For example:
- TensorBoard logs
- Common data files that all workers need to access with a stable path
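For example, a file written once to cluster storage is readable from tasks scheduled on any node. The following is a minimal sketch, assuming a running Anyscale cluster; the file name is illustrative.
import ray

ray.init()

# Write once; /mnt/cluster_storage is mounted on the head node and every worker node.
with open("/mnt/cluster_storage/config.txt", "w") as f:  # illustrative file name
    f.write("shared settings")

@ray.remote
def read_config() -> str:
    # Tasks may run on any node; they all see the same file at the same path.
    with open("/mnt/cluster_storage/config.txt") as f:
        return f.read()

print(ray.get([read_config.remote() for _ in range(4)]))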
Following are some behaviors to note about cluster storage:
- Anyscale doesn’t clone the cluster storage when you clone a workspace.
- New jobs and service updates launch new clusters, so /mnt/cluster_storage doesn't persist across them.
User storage
/mnt/user_storage is a directory on NFS specific to an Anyscale user. The user who creates the workspace, job, or service cluster can access this storage from every node. This storage is useful for storing files that you need to use across multiple workspace, job, or service clusters.
Shared storage
/mnt/shared_storage is a directory on NFS that all Anyscale users of the same Anyscale cloud can access. Anyscale mounts it on every node of every cluster in the same cloud. This storage is useful for storing model checkpoints and other artifacts that you want to share with your team.
NFS storage usually has connection limits. Different cloud providers may have different limits. See Changing the default disk size for more information.
To increase the capacity of GCP Filestore instances, see the GCP documentation for more information.
Anyscale-hosted clouds use s3fs to mount the shared storage.
Access local storage and storage shared across nodes (NFS-based)
You can only interact with local or NFS-based storage inside running Anyscale clusters.
- When using workspaces, the easiest way to transfer small files to and from the workspace is the VS Code Web UI. Note: downloading a folder isn't currently supported.
- For files in the workspace's working directory, you can use the workspace CLI to pull and push files.
- You can commit files to Git and run git pull from a workspace, job, or service cluster.
- For large files, use object storage, for example Amazon S3 or Google Cloud Storage, and access the data from there.
Access object storage
Anyscale default storage bucket
Anyscale provides a default cloud storage path, private to each cloud, located at $ANYSCALE_ARTIFACT_STORAGE. All nodes launched within your cloud have access to read and write files at this path. You can also access object storage outside Anyscale as long as you configure the permissions correctly.
To copy files from your workspace cluster into cloud storage, you can use the standard aws s3 cp and gcloud storage cp commands.
Write to artifact storage:
echo "hello world" > /tmp/input.txt
aws s3 cp /tmp/input.txt $ANYSCALE_ARTIFACT_STORAGE/saved.txt
Read from artifact storage:
aws s3 cp $ANYSCALE_ARTIFACT_STORAGE/saved.txt /tmp/output.txt
cat /tmp/output.txt
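From Python, you can point readers and writers at the same prefix. The following is a minimal sketch using Ray Data, assuming a running Anyscale cluster; the dataset contents and subpath are illustrative.
import os
import ray

artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]

# Write a small dataset under the artifact storage prefix and read it back.
ds = ray.data.from_items([{"id": i, "value": i * i} for i in range(100)])
ds.write_parquet(f"{artifact_uri}/example_dataset")  # illustrative subpath

ds_back = ray.data.read_parquet(f"{artifact_uri}/example_dataset")
print(ds_back.count())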
Anyscale scopes permissions on the cloud storage bucket backing $ANYSCALE_ARTIFACT_STORAGE to only provide access to the specified path, so calls made to the root of the underlying bucket, for example HeadObject, may be rejected with an ACCESS_DENIED error. Avoid making calls to any paths that don't explicitly have the $ANYSCALE_ARTIFACT_STORAGE/ prefix.
Private storage buckets
To access private cloud storage buckets that aren't managed by Anyscale, configure permissions using the patterns below.
Grant Anyscale clusters access to the private buckets
- Access S3 buckets from Anyscale clouds on AWS
- Access GCS buckets from Anyscale clouds on GCP
- Access storage buckets from Anyscale clouds on Kubernetes
This approach doesn't work for Anyscale-hosted clouds. Contact Anyscale support for help with accessing your private buckets securely.
Load credentials from your secret manager
- Follow the instructions to grant Anyscale clusters access to your secret manager.
- Load the credentials for your private buckets from the secret manager.
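For example, on AWS you can fetch an access key pair from AWS Secrets Manager at runtime and pass it to the S3 client. The following is a minimal sketch; the secret name, its JSON field names, the bucket, and the object key are illustrative.
import json
import boto3

# Fetch the key pair from Secrets Manager instead of hardcoding it.
secrets = boto3.client("secretsmanager")
payload = json.loads(
    secrets.get_secret_value(SecretId="my-private-bucket-credentials")["SecretString"]
)

# Use the retrieved credentials for the private bucket only.
s3 = boto3.client(
    "s3",
    aws_access_key_id=payload["aws_access_key_id"],
    aws_secret_access_key=payload["aws_secret_access_key"],
)
s3.download_file("my-private-bucket", "datasets/my_data.csv", "/tmp/my_data.csv")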
(Not recommended) Store credentials as environment variables or in files
Storing credentials directly in files or code isn't secure. If you have to, use this approach only for development.
Store as environment variables
- Option 1: Add the credentials as tracked environment variables (for example, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) under the Dependencies tab of a workspace or in the Job or Service APIs.
- Option 2: Bake the credentials as environment variables into container images.
Anyscale automatically propagates these environment variables to all Ray workloads run through the workspace, job, or service, so you can access the bucket directly from your Ray application code.
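For example, once AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the environment, boto3 picks them up through its default credential chain without any explicit configuration in code. A minimal sketch; the bucket and object names are illustrative.
import boto3

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment,
# so no credentials appear in the code itself.
s3 = boto3.client("s3")
s3.download_file("my-private-bucket", "datasets/my_data.csv", "/tmp/my_data.csv")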
Setting these environment variables overrides the default AWS/GCS credentials, which blocks access to $ANYSCALE_ARTIFACT_STORAGE. To use private cloud storage buckets concurrently with $ANYSCALE_ARTIFACT_STORAGE, pass the access keys or service account directly into the cloud provider API call instead of setting them as process-wide environment variables.
Use AWS credential file or pass credentials directly
Option 1: Use AWS credential file
Save your credentials into the AWS credential file
# ~/.aws/credentials
[myprofile]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
The AWS CLI can read credentials from the profile you just created:
aws s3 cp your_file.txt s3://bucket/path/ --profile myprofile
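boto3 can read the same profile if you pass it explicitly. The following is a minimal sketch, assuming the [myprofile] profile above; the bucket and paths are illustrative.
import boto3

# Use the named profile from ~/.aws/credentials instead of the default chain.
session = boto3.Session(profile_name="myprofile")
s3 = session.client("s3")
s3.upload_file("your_file.txt", "bucket", "path/your_file.txt")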
Option 2: Use AWS credentials in code
import boto3
# Explicitly provide credentials
s3_client = boto3.client(
's3',
aws_access_key_id='YOUR_ACCESS_KEY',
aws_secret_access_key='YOUR_SECRET_KEY',
aws_session_token='YOUR_SESSION_TOKEN' # optional, if using temporary credentials
)
# Specify your bucket, object key (path), and destination filename
bucket_name = 'my-dataset-bucket'
object_key = 'datasets/my_data.csv'
destination_file = 'my_data.csv'
# Download the file
s3_client.download_file(bucket_name, object_key, destination_file)
print(f"Downloaded {object_key} to {destination_file}")
Use Google Cloud credential file
Get your Google Cloud service account credential file first. The Google Cloud SDK uses the credential file to access your files.
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-key.json"
# make sure you have installed Google Cloud SDK following https://cloud.google.com/sdk/docs/install
gsutil cp your_file.txt gs://bucket/path/
Alternatively, use the Google Cloud Storage Python client to load your credentials.
# make sure you have installed Google Cloud Storage Python SDK with `pip install google-cloud-storage`
from google.cloud import storage
# Explicitly specify your credentials
storage_client = storage.Client.from_service_account_json('path/to/credentials.json')
bucket_name = 'my-dataset-bucket'
blob_name = 'datasets/my_data.csv'
destination_file = 'my_data.csv'
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(destination_file)
print(f"Downloaded {blob_name} to {destination_file}")
Choosing which storage to use
The choice of storage depends on your performance expectations, file sizes, collaboration needs, and security requirements. Key considerations include:
- NFS is generally slower when a workload generates a large amount of disk I/O for both reads and writes.
- Don't put large files, such as terabyte-scale datasets, in NFS storage. Use object storage, such as an S3 bucket, for files larger than 10 GB.
- To share small files across different workspaces, jobs, or services, user and shared storage are good options.
- Use cluster storage, /mnt/cluster_storage, if you're developing or iterating, for example, if you want to keep model weights loaded without having to set up object storage. However, for production or at high scale, use object storage.
- For large-scale workloads, consider co-locating compute and storage to avoid large data egress costs.