Access blob storage and ADLS
This page provides an overview of configuring access to Azure blob storage and Azure Data Lake Storage (ADLS).
How does Anyscale use Azure blob storage?
When you configure an Anyscale cloud resource on Azure Kubernetes Service (AKS), you configure a blob storage container as the default storage location for system files generated by Anyscale, such as logs and checkpoints.
The managed identity used by the Anyscale operator only needs access to the Azure storage account configured as the default storage location for your Anyscale cloud resource. All managed identities used by clusters must also have access to this storage account.
You can also configure Azure blob storage as a Persistent Volume Claim (PVC) in your AKS cluster and use this PVC for shared storage on Anyscale.
Most Anyscale workloads require read-only access to data in one or more locations and write access to persist assets such as model weights, training checkpoints, and batch inference results to another location. Anyscale recommends configuring access to blob storage or ADLS using user-assigned managed identities and mapping these privileges to Anyscale workloads or users with cloud IAM mapping. You then access these storage locations using cloud URIs.
Configure access to Azure storage
Configure access to Azure storage by adding roles to the managed identity used by your cluster. See Configure managed identities for clusters on Anyscale on AKS.
You can configure access to the entire storage account or a container within the storage account. Anyscale recommends using the following roles to configure access to blob storage:
| Role | Permissions |
|---|---|
| Storage Blob Data Contributor | Read, write, and delete access for Azure blob containers and data. |
| Storage Blob Data Reader | Read-only access to Azure blob containers and data. |
You must have sufficient privileges in your Azure account to assign roles to managed identities. Azure provides many tools for managing resources, including the Azure Portal and CLI.
During development, you might choose to configure access to blob storage using a different method, such as access keys or SAS tokens. While this pattern might unblock users who don't have admin permissions to configure Anyscale IAM mapping, AKS service accounts, or Azure managed identities, Microsoft recommends using Entra ID to configure access. See the Azure docs page Authorize access to data in Azure Storage for all supported access patterns.
Example: Configure read-only access to a container
The following example configures read-only access to a container in a storage account using the Azure CLI.
Requirements
This example has the following requirements:
- You have permission to assign roles in the target resource group, such as Role Based Access Control Administrator.
- You have created a managed identity configured as a service account in your AKS cluster and configured this for use with your Anyscale clusters. See Configure managed identities for clusters on Anyscale on AKS.
- You have created an Azure blob storage or ADLS account and created a container in the account.
- You have configured the Azure CLI on your machine.
Step 1: Define variables for your Azure resources
Define the following variables for the Azure resources you're configuring:

```shell
export SUBSCRIPTION_ID="<your-subscription-id>"
export RESOURCE_GROUP="<your-resource-group-name>"
export STORAGE_ACCOUNT_NAME="<your-storage-account-name>"
export CONTAINER_NAME="<your-container-name>"
export MANAGED_ID_NAME="<your-user-assigned-managed-identity-name>"
```
This example assumes your storage account and managed identity are in the same resource group.
Step 2: Generate the principal ID and scope
Azure manages permissions by assigning roles to principals at a specified scope.
Run the following command to set the principal ID for your managed identity as an environment variable:
```shell
export PRINCIPAL_ID=$(az identity show \
  --name "${MANAGED_ID_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --query principalId \
  --output tsv)
```
Run the following command to define a scope for a blob container in your storage account:
```shell
export SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.Storage/storageAccounts/${STORAGE_ACCOUNT_NAME}/blobServices/default/containers/${CONTAINER_NAME}"
```
Step 3: Assign the role to your managed identity
Run the following command to assign the Storage Blob Data Reader role to your managed identity scoped to a container:
```shell
az role assignment create \
  --assignee "${PRINCIPAL_ID}" \
  --role "Storage Blob Data Reader" \
  --scope "${SCOPE}"
```
To assign permissions at the storage account level, shorten the scope to end with your storage account name.
Use the Storage Blob Data Contributor role to add read, write, and delete permissions.
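As a sketch of both changes together, the following assumes the variables from Step 1 are still exported; the `ACCOUNT_SCOPE` variable name is an illustrative choice, not part of the Azure CLI:

```shell
# A scope that ends at the storage account name applies the role to every
# container in the account (assumes the variables from Step 1 are exported).
export ACCOUNT_SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.Storage/storageAccounts/${STORAGE_ACCOUNT_NAME}"

# Assign the read, write, and delete role at the account level:
#   az role assignment create \
#     --assignee "${PRINCIPAL_ID}" \
#     --role "Storage Blob Data Contributor" \
#     --scope "${ACCOUNT_SCOPE}"
```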
Query data in blob storage
Anyscale recommends using Apache Arrow to interact with files in Azure blob storage and ADLS.
You create an AzureFileSystem object by specifying the account name for your Azure storage account. Arrow uses the managed identity of your cluster to authenticate access to Azure storage. This pattern works for both blob storage and ADLS.
```python
from pyarrow.fs import AzureFileSystem

# Storage account names contain only lowercase letters and numbers.
fs = AzureFileSystem(account_name="mystorageaccount")
```
Your Anyscale cluster must use a managed identity with sufficient privileges to connect to the target Azure storage container. See Configure managed identities for clusters on Anyscale on AKS.
Example: List files in a container
The following example uses Apache Arrow to list the files in a blob storage container:
```python
from pyarrow.fs import AzureFileSystem, FileSelector

fs = AzureFileSystem(account_name="mystorageaccount")
print(fs.get_file_info(FileSelector("my-container")))
```
Example: Read CSV data from blob storage
The following example uses Ray Data to load a CSV file from Azure blob storage:
```python
import ray
from pyarrow.fs import AzureFileSystem

fs = AzureFileSystem(account_name="mystorageaccount")

ds = ray.data.read_csv(
    "container/path/to/file.csv",
    filesystem=fs,
)
print(ds.schema())
```
Example: Write parquet data to blob storage
The following example uses Ray to write a directory of parquet files from a Ray Data dataset:
```python
import ray
from pyarrow.fs import AzureFileSystem

fs = AzureFileSystem(account_name="mystorageaccount")

# Example dataset; replace with your own Ray Data dataset.
ds = ray.data.range(1000)

ds.write_parquet(
    "container/path/to/directory",
    filesystem=fs,
)
```
Troubleshooting
Scoping permissions at the container level doesn't grant access to other containers in the same storage account.
To prevent accidental deletion or modification of production data, use the Storage Blob Data Reader role for read-only access. Use the Storage Blob Data Contributor role when your workload writes, deletes, or updates data.
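To check which roles a managed identity already has, you can list its assignments with the Azure CLI; this sketch assumes `PRINCIPAL_ID` is set as in the example above:

```shell
# List all role assignments for the managed identity, including roles
# inherited from higher scopes such as the resource group or subscription.
az role assignment list \
  --assignee "${PRINCIPAL_ID}" \
  --all \
  --output table
```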
Some libraries might use different drivers to interact with Azure storage. Consult the documentation for your library or tool for instructions on accessing Azure storage.
Some drivers, such as ABFS, use slightly different syntax for blob storage and ADLS. The following command displays whether a storage account has the hierarchical namespace (ADLS) enabled:
```shell
az storage account show --name <storage-account-name> --resource-group <group-name> --query isHnsEnabled
```
It's possible to configure Azure storage to support anonymous public reads, allowing you to share public datasets. You might need to reformat Azure storage URIs from data providers to use them with Ray.
Common URI patterns for Azure storage include the following:
- https://storage_account_name.blob.core.windows.net/container_name/path/to/file.csv
- abfss://container_name@storage_account_name.blob.core.windows.net/path/to/file.csv
- abfss://container_name@storage_account_name.dfs.core.windows.net/path/to/file.csv
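As a sketch, the following hypothetical helper (not part of Ray or Arrow; the function name is an assumption for illustration) converts either URI form into the account name and "container/path" string that `pyarrow.fs.AzureFileSystem` expects:

```python
from urllib.parse import urlparse


def split_azure_uri(uri: str) -> tuple[str, str]:
    """Split an Azure storage URI into (account_name, "container/path").

    Handles https:// blob endpoint URIs and abfs://, abfss:// URIs.
    """
    parsed = urlparse(uri)
    path = parsed.path.lstrip("/")
    if parsed.scheme in ("abfs", "abfss"):
        # abfss://container@account.dfs.core.windows.net/path/to/file
        container, _, host = parsed.netloc.partition("@")
        account = host.split(".")[0]
        return account, f"{container}/{path}"
    # https://account.blob.core.windows.net/container/path/to/file
    account = parsed.netloc.split(".")[0]
    return account, path


account, arrow_path = split_azure_uri(
    "abfss://my-container@mystorageaccount.dfs.core.windows.net/path/to/file.csv"
)
```

You can then pass `account` to `AzureFileSystem(account_name=...)` and use `arrow_path` as the file path in Arrow or Ray Data calls.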
Check the managed identity of your Anyscale cluster
Your Anyscale cloud owner might have configured cloud IAM mapping rules that govern how service accounts and managed identities map to workloads. These rules can vary by user, project, and workload type. See Anyscale cloud IAM mapping.
If you have trouble accessing data after adding a role to a managed identity, make sure you have configured the correct managed identity. This section provides instructions for finding the managed identity used by your Anyscale workload.
Anyscale clusters contain the client ID for the active managed identity as an environment variable. Run the following command from within your Anyscale workspace to display the client ID for your managed identity:
```shell
echo $AZURE_CLIENT_ID
```
To find the name of the managed identity associated with this client ID, run the following command:
```shell
az ad sp list --filter "appId eq '${AZURE_CLIENT_ID}'" --query "[].displayName" -o tsv
```
You can install the Azure CLI in your Anyscale workspace. See the Azure docs for a one-command install.
Run the following command to log in to the CLI using your cluster's managed identity:

```shell
az login --identity
```