Storage
While file storage is optimal for storing your code, AI workloads often need access to large amounts of data, whether it's data for training and fine-tuning or shared storage for model checkpoints.
This guide describes:
- Different types of storage available on Anyscale
- How to access these different storage types
- How to configure them
Available storage options on Anyscale:
- Local storage for a node
- Object storage
- Network file system (NFS) shared across nodes
Local storage for a node
Anyscale provides each node with its own volume and disk and doesn't share them with other nodes. This storage option provides higher access speed, lower latency, and scalability. Access local storage at /mnt/local_storage. Anyscale normally deletes data in local storage after instances terminate. To provide a more seamless development workflow, Anyscale Workspaces snapshot and persist your data in the project directory.
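Because /mnt/local_storage is node-local, a file written by a Ray task is visible only on the node that ran the task. The following is a minimal sketch, assuming a running Anyscale cluster; the file name is illustrative.
import ray

ray.init()

@ray.remote
def write_scratch(i: int) -> str:
    # /mnt/local_storage is node-local scratch space: this file exists only on
    # the node that runs the task and is deleted when the instance terminates.
    path = f"/mnt/local_storage/scratch_{i}.txt"  # illustrative file name
    with open(path, "w") as f:
        f.write("temporary node-local data")
    return path

print(ray.get([write_scratch.remote(i) for i in range(4)]))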
Anyscale supports the Non-Volatile Memory Express (NVMe) interface to access SSD storage volumes. This support provides additional temporary storage to the instances. See NVMe configuration for details on how to configure it.
Object storage
Anyscale requires users to set up a default object storage bucket during the deployment of each Anyscale cloud. Users can let Anyscale create the bucket or bring their own. All workspace, job, and service clusters within an Anyscale cloud have permission to read from and write to its default bucket.
Use the following environment variables to access the default bucket:
- ANYSCALE_CLOUD_STORAGE_BUCKET: the name of the default bucket for the cloud.
  - If Anyscale created the bucket, the name follows the format anyscale-production-data-{cloud_id}. You can find the cloud ID on the cloud list page in the console UI.
  - If you bring your own bucket, the name is the one you defined during cloud deployment.
- ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: the region of the default bucket for the cloud.
- ANYSCALE_ARTIFACT_STORAGE: the URI to the pre-generated path for storing your artifacts while keeping them separate from Anyscale-generated ones.
  - AWS: s3://<bucket_name>/<org_id>/<cloud_id>/artifact_storage/
  - GCP: gs://<bucket_name>/<org_id>/<cloud_id>/artifact_storage/
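Anyscale sets these variables automatically inside clusters, so you can read them directly from your code. A minimal sketch in Python:
import os

# These variables are available on every node of a workspace, job, or service cluster.
bucket = os.environ["ANYSCALE_CLOUD_STORAGE_BUCKET"]
region = os.environ["ANYSCALE_CLOUD_STORAGE_BUCKET_REGION"]
artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]

print(f"Default bucket: {bucket} ({region})")
print(f"Artifact storage prefix: {artifact_uri}")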
Within the bucket, Anyscale stores managed data under the {organization_id}/ path. For cloud-specific managed data, Anyscale further groups the data under an {organization_id}/{cloud_id} path. Anyscale stores some log files in legacy folders.
Anyscale writes system- and user-generated files, for example log files, to this bucket. Don't delete or edit Anyscale-managed files; doing so may lead to unexpected data loss and degrades the Anyscale platform experience for features such as log viewing and log downloading. Use $ANYSCALE_ARTIFACT_STORAGE to separate your files from Anyscale-generated ones.
Anyscale offers 100 GB of free object storage. If you need more storage, contact Anyscale support.
Storage shared across nodes
Anyscale automatically mounts Network File System (NFS) storage on workspace, job, and service clusters. Anyscale mounts three shared storage locations by default, each scoped to a different permission group:
- /mnt/cluster_storage is accessible to all nodes of a workspace, job, or service cluster.
- /mnt/user_storage is private to the Anyscale user but accessible from every node of all their workspace, job, and service clusters in the same cloud.
- /mnt/shared_storage is accessible to all Anyscale users of the same Anyscale cloud. Anyscale mounts it on every node of all the clusters in the same cloud.
NFS storage is accessible to all users in your Anyscale cloud. Don't put sensitive data or secrets there that you don't want other users in your cloud to access.
Cluster storage
/mnt/cluster_storage is a directory on NFS that Anyscale mounts on every node of the workspace, job, or service cluster and that persists throughout the lifecycle of the cluster. This storage is useful for storing files that the head node and all the worker nodes need to access. For example:
- TensorBoard logs
- Common data files that all workers need to access with a stable path
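For example, a file written once to cluster storage is readable from tasks scheduled on any node. The following is a minimal sketch, assuming a running Anyscale cluster; the file name is illustrative.
import ray

ray.init()

# Write once; /mnt/cluster_storage is mounted on the head node and every worker node.
with open("/mnt/cluster_storage/config.txt", "w") as f:  # illustrative file name
    f.write("shared settings")

@ray.remote
def read_config() -> str:
    # Tasks may run on any node; they all see the same file at the same path.
    with open("/mnt/cluster_storage/config.txt") as f:
        return f.read()

print(ray.get([read_config.remote() for _ in range(4)]))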
Following are some behaviors to note about cluster storage:
- Anyscale doesn’t clone the cluster storage when you clone a workspace.
- New jobs and service updates launch new clusters, so /mnt/cluster_storage doesn't persist across them.
User storage
/mnt/user_storage is a directory on NFS specific to an Anyscale user. The user who creates the workspace, job, or service cluster can access this storage from every node. This storage is useful for storing files that you need to use across multiple workspace, job, or service clusters.
Shared storage
/mnt/shared_storage is a directory on NFS that all Anyscale users of the same Anyscale cloud can access. Anyscale mounts it on every node of every cluster in the same cloud. This storage is useful for storing model checkpoints and other artifacts that you want to share with your team.
NFS storage usually has connection limits. Different cloud providers may have different limits. See Changing the default disk size for more information.
To increase the capacity of GCP Filestore instances, see the GCP documentation for more information.
Anyscale-hosted clouds use s3fs to mount the shared storage.
Access local storage and storage shared across nodes (NFS-based)
You can only interact with local or NFS-based storage inside running Anyscale clusters.
- When using workspaces, the easiest way to transfer small files to and from the workspace is the VS Code Web UI. Note: downloading a folder isn't currently supported.
- For files in the workspace's working directory, you can use the workspace CLI to pull and push files.
- You can commit files to Git and run git pull from a workspace, job, or service cluster.
- For large files, use object storage, for example Amazon S3 or Google Cloud Storage, and access the data from there.
Access object storage
Anyscale default storage bucket
Anyscale provides a default cloud storage path, private to each cloud, located at $ANYSCALE_ARTIFACT_STORAGE. All nodes launched within your cloud have access to read and write files at this path. You can also access object storage outside Anyscale as long as you configure the permissions correctly.
To copy files from your workspace cluster into cloud storage, you can use the standard aws s3 cp and gcloud storage cp commands.
Write to artifact storage:
echo "hello world" > /tmp/input.txt
aws s3 cp /tmp/input.txt $ANYSCALE_ARTIFACT_STORAGE/saved.txt
Read from artifact storage:
aws s3 cp $ANYSCALE_ARTIFACT_STORAGE/saved.txt /tmp/output.txt
cat /tmp/output.txt
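From Python, you can point readers and writers at the same prefix. The following is a minimal sketch using Ray Data, assuming a running Anyscale cluster; the dataset contents and subpath are illustrative.
import os
import ray

artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]

# Write a small dataset under the artifact storage prefix and read it back.
ds = ray.data.from_items([{"id": i, "value": i * i} for i in range(100)])
ds.write_parquet(f"{artifact_uri}/example_dataset")  # illustrative subpath

ds_back = ray.data.read_parquet(f"{artifact_uri}/example_dataset")
print(ds_back.count())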
Anyscale scopes permissions on the cloud storage bucket backing $ANYSCALE_ARTIFACT_STORAGE to only provide access to the specified path, so calls made to the root of the underlying bucket, for example HeadObject, may be rejected with an ACCESS_DENIED error. Avoid making calls to any paths that don't explicitly have the $ANYSCALE_ARTIFACT_STORAGE/ prefix.
Private storage buckets
To access private cloud storage buckets that aren't managed by Anyscale, configure permissions using the patterns below.
Grant Anyscale clusters access to the private buckets
- Access S3 buckets from Anyscale clouds on AWS
- Access GCS buckets from Anyscale clouds on GCP
- Access storage buckets from Anyscale clouds on Kubernetes
This approach doesn't work for Anyscale-hosted clouds. Contact Anyscale support for help with accessing your private buckets securely.
Load credentials from your secret manager
- Follow the instructions to grant Anyscale clusters access to your secret manager.
- Load the credentials for your private buckets from the secret manager.
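For example, on AWS you can fetch an access key pair from AWS Secrets Manager at runtime and pass it to the S3 client. The following is a minimal sketch; the secret name, its JSON field names, the bucket, and the object key are illustrative.
import json
import boto3

# Fetch the key pair from Secrets Manager instead of hardcoding it.
secrets = boto3.client("secretsmanager")
payload = json.loads(
    secrets.get_secret_value(SecretId="my-private-bucket-credentials")["SecretString"]
)

# Use the retrieved credentials for the private bucket only.
s3 = boto3.client(
    "s3",
    aws_access_key_id=payload["aws_access_key_id"],
    aws_secret_access_key=payload["aws_secret_access_key"],
)
s3.download_file("my-private-bucket", "datasets/my_data.csv", "/tmp/my_data.csv")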
(Not recommended) Store credentials as environment variables or in files
Storing credentials directly in files or code isn't secure. If you have to, use this approach only for development.
Store as environment variables
- Option 1: Add the credentials as tracked environment variables (for example, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) under the Dependencies tab of a workspace or in the Job or Service APIs.
- Option 2: Bake the credentials as environment variables into container images.
Anyscale automatically propagates these environment variables to all Ray workloads run through the workspace, job, or service, so you can access the bucket directly from your Ray application code.
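For example, once AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the environment, boto3 picks them up through its default credential chain without any explicit configuration in code. A minimal sketch; the bucket and object names are illustrative.
import boto3

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment,
# so no credentials appear in the code itself.
s3 = boto3.client("s3")
s3.download_file("my-private-bucket", "datasets/my_data.csv", "/tmp/my_data.csv")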
Setting these environment variables overrides the default AWS/GCS credentials, which blocks access to $ANYSCALE_ARTIFACT_STORAGE. To use private cloud storage buckets concurrently with $ANYSCALE_ARTIFACT_STORAGE, pass the access keys or service account directly into the cloud provider API call instead of setting them as process-wide environment variables.
Use AWS credential file or pass credentials directly
Option 1: Use AWS credential file
Save your credentials into the AWS credential file
# ~/.aws/credentials
[myprofile]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
The AWS CLI can read credentials from the profile you just created:
aws s3 cp your_file.txt s3://bucket/path/ --profile myprofile
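boto3 can read the same profile if you pass it explicitly. The following is a minimal sketch, assuming the [myprofile] profile above; the bucket and paths are illustrative.
import boto3

# Use the named profile from ~/.aws/credentials instead of the default chain.
session = boto3.Session(profile_name="myprofile")
s3 = session.client("s3")
s3.upload_file("your_file.txt", "bucket", "path/your_file.txt")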
Option 2: Use AWS credentials in code
import boto3
# Explicitly provide credentials
s3_client = boto3.client(
's3',
aws_access_key_id='YOUR_ACCESS_KEY',
aws_secret_access_key='YOUR_SECRET_KEY',
aws_session_token='YOUR_SESSION_TOKEN' # optional, if using temporary credentials
)
# Specify your bucket, object key (path), and destination filename
bucket_name = 'my-dataset-bucket'
object_key = 'datasets/my_data.csv'
destination_file = 'my_data.csv'
# Download the file
s3_client.download_file(bucket_name, object_key, destination_file)
print(f"Downloaded {object_key} to {destination_file}")
Use Google Cloud credential file
Get your Google Cloud service account credential file first. The Google Cloud SDK uses the credential file to access your files.
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-key.json"
# make sure you have installed Google Cloud SDK following https://cloud.google.com/sdk/docs/install
gsutil cp your_file.txt gs://bucket/path/
Alternatively, use the Google Cloud Storage Python client to load your credentials.
# make sure you have installed Google Cloud Storage Python SDK with `pip install google-cloud-storage`
from google.cloud import storage
# Explicitly specify your credentials
storage_client = storage.Client.from_service_account_json('path/to/credentials.json')
bucket_name = 'my-dataset-bucket'
blob_name = 'datasets/my_data.csv'
destination_file = 'my_data.csv'
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(destination_file)
print(f"Downloaded {blob_name} to {destination_file}")
Choosing which storage to use
The choice of storage depends on your performance expectations, file sizes, collaboration needs, and security requirements. Key considerations include:
- NFS is generally slower when a workload generates a large amount of disk I/O for both reads and writes.
- Don't put large files, such as terabyte-scale datasets, in NFS storage. Use object storage, such as an S3 bucket, for files larger than 10 GB.
- To share small files across different workspaces, jobs, or services, user and shared storage are good options.
- Use cluster storage, /mnt/cluster_storage, if you're developing or iterating, for example, if you want to keep model weights loaded without having to set up object storage. However, for production or at high scale, use object storage.
- For large-scale workloads, consider co-locating compute and storage to avoid large data egress costs.