Deploy an Anyscale Cloud on GCP
Before you run Ray workloads on Anyscale, an Anyscale organization member must deploy an Anyscale Cloud within a Google Cloud Platform (GCP) environment. This integration enables Anyscale to manage resources like compute instances and storage directly in a GCP project.
1. Prerequisites
- Create a Google Cloud project for Anyscale to operate in.
- Install the Google Cloud CLI.
- Optional: Authenticate the Google Cloud CLI in the project.
2. Install the Anyscale CLI
- Run the following command to install the Anyscale CLI and Python client package:
pip install -U "anyscale[gcp]"
- To authenticate your credentials, run the following command, which fetches and updates the token that confirms your identity in the ~/.anyscale/credentials.json file.
anyscale login
If necessary, log in to the Anyscale console to complete authentication.
3. Choose an Anyscale Cloud deployment method
Deploying an Anyscale Cloud integrates Anyscale's capabilities into your GCP project to leverage its compute, storage, and networking resources for scalable, distributed computing.
You can use one of two different deployment methods that use the Anyscale CLI for Cloud configuration. Choose a method based on your organization's requirements.
- anyscale cloud setup: Use for rapid deployment and a straightforward, low-maintenance solution. Clusters deploy in public subnets and are accessed over public IP addresses without additional networking infrastructure.
- anyscale cloud register: Suited for teams with advanced cloud expertise that need enhanced security, custom private networking, or specific compliance controls.
4. Create an Anyscale Cloud
Based on the deployment method selected from the previous step, create an Anyscale Cloud with the following instructions.
- ⚡ (Auto) cloud setup
- 🎛️ (Custom) cloud register
For the anyscale cloud setup deployment method, Anyscale automatically creates and configures the necessary resources within your GCP project. You deploy Ray clusters in public subnets and access them using public IP addresses without needing to set up additional networking infrastructure like VPNs.
Note: To manually customize resources, use the (Custom) cloud register method instead.
Step 1: Verify permissions in the GCP project
Go to your role in the GCP project and ensure that your user account has the iam.serviceAccounts.setIamPolicy permission enabled so that Anyscale can grant Google Cloud Deployment Manager the permission to set IAM policies.
Step 2: Enable Google Cloud APIs
Enable the Cloud Resource Manager API so that Anyscale can enable all other required APIs.
Step 3: Deploy a new Cloud
Run the following command to deploy a new Cloud:
anyscale cloud setup \
  --provider gcp \
  --name ANYSCALE_CLOUD_NAME \
  --region GCP_COMPUTE_REGION \
  --project-id GCP_PROJECT_ID \
  --enable-head-node-fault-tolerance
--enable-head-node-fault-tolerance: Enables head node fault tolerance in Anyscale Services by configuring an additional Memorystore instance for the Ray Global Control Store. Note that this flag extends the setup time by approximately 10 minutes.
For the anyscale cloud register deployment method, you are responsible for creating and configuring the GCP resources needed to integrate with Anyscale. You define subnets to deploy Ray clusters and access them using public or private IP addresses.
This custom-defined networking requires you to configure the network paths between users, clusters, and the Anyscale Control Plane. Connectivity and network performance between users and clusters depend on your setup.
Note: Due to limited support for certificates on GCP internal load balancers, Anyscale Services created for private network clouds are exposed over http rather than https.
Step 1: Choose method for creating cloud resources
You have three methods for creating custom GCP infrastructure resources to connect to Anyscale:
- (Recommended) Anyscale Provided Terraform module
- Create your own Terraform module
- Create resources manually in the GCP Console
Anyscale-provided Terraform module: Use this predefined set of configurations developed by Anyscale, which simplifies the setup process. Applying this module to your cloud environment configures the required resources in your GCP project. For details and instructions on using this module, see the Anyscale Terraform getting started guide.
Your own Terraform module: You can create custom Terraform modules to tailor cloud resources and configurations to meet compliance requirements.
GCP Console: You can manually create resources in the GCP Console, which offers maximum customization but can be prone to manual errors.
Step 2: Create cloud resources
Cloud resources created in a GCP project must meet a list of minimum requirements to work with Anyscale.
Following the Anyscale Terraform getting started guide satisfies these requirements by default. For all other methods, perform the following steps:
GCP APIs
Enable the following Google Cloud APIs in the GCP project you use to host the Anyscale Cloud. See How to enable an API in your Google Cloud project.
API Name | Service Name | Description |
---|---|---|
Compute Engine API | compute.googleapis.com | Manages VM instances and other Compute Engine resources. |
Cloud Filestore API | file.googleapis.com | Manages file storage for VM instances. |
Cloud Storage API | storage-component.googleapis.com, storage.googleapis.com | Manages object storage for storing and accessing data. |
Certificate Manager API | certificatemanager.googleapis.com | Manages SSL/TLS certificates and related settings. |
Deployment Manager API | deploymentmanager.googleapis.com | Manages infrastructure deployment and calls other Google APIs. |
Cloud Resource Manager API | cloudresourcemanager.googleapis.com | Manages Google Cloud resources like projects, folders, etc. |
Service Usage API | serviceusage.googleapis.com | Manages Google Cloud service usage. |
Cloud Memorystore for Redis API | redis.googleapis.com | Manages in-memory data store services on Google Cloud. Optional for enabling head node fault tolerance in services. |
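As a rough sketch of how the table above could be turned into a single enable call, the following Python snippet assembles a `gcloud services enable` command from the listed service names. The function name `build_enable_command` and its `include_memorystore` parameter are illustrative assumptions, not part of any Anyscale tooling. The snippet only prints the command and does not run it.

```python
import shlex

# Service names copied from the table above.
REQUIRED_SERVICES = [
    "compute.googleapis.com",
    "file.googleapis.com",
    "storage-component.googleapis.com",
    "storage.googleapis.com",
    "certificatemanager.googleapis.com",
    "deploymentmanager.googleapis.com",
    "cloudresourcemanager.googleapis.com",
    "serviceusage.googleapis.com",
]
# Only needed when enabling head node fault tolerance in services.
OPTIONAL_SERVICES = ["redis.googleapis.com"]


def build_enable_command(project_id, include_memorystore=False):
    """Return the argument list for one `gcloud services enable` call."""
    services = list(REQUIRED_SERVICES)
    if include_memorystore:
        services += OPTIONAL_SERVICES
    return ["gcloud", "services", "enable", *services, "--project", project_id]


if __name__ == "__main__":
    # GCP_PROJECT_ID is a placeholder, as in the commands in this guide.
    command = build_enable_command("GCP_PROJECT_ID", include_memorystore=True)
    print(shlex.join(command))
```

Returning an argument list rather than a string keeps the result safe to pass to `subprocess.run` if you later decide to execute it.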
Service accounts
Anyscale Access service account: Anyscale uses this service account to manage GCE instances and Ray clusters in your GCP project.
- The service account must have the owner or editor role on the project.
- The principal must have the Service Account Token Creator role on the service account itself.
Ray Cluster Node service account: This is the default service account attached to Ray clusters. You can modify it for specific workload needs.
- The service account must have read, write, and list permissions on the cloud storage bucket. The broadest way to grant these permissions is to assign the Storage Admin role to both service accounts in the bucket policy.
- To use custom Docker environments, the service account must have the Artifact Registry Reader role.
Workload identity federation
- Navigate to IAM & Admin in GCP and select Workload Identity Pools.
- Create a pool, name it, and note down the name.
- Within the identity pool, create a new provider.
- Choose AWS as the external identity provider.
- Input Anyscale's control plane AWS account number as the AWS account ID. You can obtain this number from Anyscale support.
- Within the workload identity provider, add the following mappings:
{
  "google.subject": "assertion.arn",
  "attribute.aws_role": "assertion.arn.contains('assumed-role') ? assertion.arn.extract('{account_arn}assumed-role/') + 'assumed-role/' + assertion.arn.extract('assumed-role/{role_name}/') : assertion.arn",
  "attribute.arn": "assertion.arn"
}
- Add an attribute condition to restrict access to an organization-specific AWS IAM role in Anyscale's AWS account. Find your organization ID on the admin page.
google.subject.startsWith("arn:aws:sts::<Anyscale AWS #>:assumed-role/gcp_if_<Organization ID>")
- In IAM & Admin, navigate to Service Accounts.
- Select the Anyscale Access service account.
- Add an IAM binding with the following role to the Anyscale Access service account:
{
"role": "roles/iam.workloadIdentityUser",
"members": [
"principalSet://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<POOL_NAME>/attribute.role_name/arn:aws:sts::<Anyscale AWS #>:assumed-role/gcp_if_<Organization ID>"
]
}
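The placeholders in the attribute condition and the IAM binding above are easy to mistype, so the following Python sketch fills them in from four inputs. The function names are hypothetical and every value in the usage example is made up. The member string follows the binding shown above verbatim, including its `attribute.role_name` segment.

```python
import json


def role_arn(anyscale_aws_account, organization_id):
    """ARN prefix of the organization-specific role in Anyscale's AWS account."""
    return f"arn:aws:sts::{anyscale_aws_account}:assumed-role/gcp_if_{organization_id}"


def attribute_condition(anyscale_aws_account, organization_id):
    """Attribute condition for the workload identity provider."""
    arn = role_arn(anyscale_aws_account, organization_id)
    return f'google.subject.startsWith("{arn}")'


def workload_identity_binding(project_number, pool_name,
                              anyscale_aws_account, organization_id):
    """IAM binding for the Anyscale Access service account, as a dict."""
    arn = role_arn(anyscale_aws_account, organization_id)
    member = (
        "principalSet://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_name}/attribute.role_name/{arn}"
    )
    return {"role": "roles/iam.workloadIdentityUser", "members": [member]}


if __name__ == "__main__":
    # All four values below are invented for illustration.
    print(attribute_condition("123456789012", "org_example"))
    binding = workload_identity_binding(
        "1234567890", "anyscale-pool", "123456789012", "org_example"
    )
    print(json.dumps(binding, indent=2))
```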
VPC and subnets
- Go to the VPC network section in the GCP console.
- Select either custom or auto mode for your VPC creation.
- Anyscale requires exactly one subnet to launch instances.
- Create a subnet within your VPC.
- Ensure the subnet CIDR range is /24 or larger (a prefix length of 24 or less). A range of /20 or larger is preferable.
- For valid GCP IPv4 ranges, see the GCP documentation.
- By default, subnets are public with a route to the internet.
- For a private subnet, use the --private-network flag when you register the cloud with anyscale cloud register.
- If you plan to use Anyscale Services on a private network, create a proxy-only subnet.
- This subnet must be in the same region as your cloud.
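The subnet size guidance above can be checked before creating the subnet. The following is a minimal sketch using Python's standard `ipaddress` module. The function name `check_subnet_cidr` and its three return labels are illustrative assumptions.

```python
import ipaddress

MIN_PREFIX = 24        # longest prefix allowed, which is the smallest range
PREFERRED_PREFIX = 20  # ranges with this prefix length or shorter are preferred


def check_subnet_cidr(cidr):
    """Classify an IPv4 subnet range as 'too small', 'ok', or 'preferred'."""
    # strict=True rejects ranges with host bits set, such as 10.0.0.5/24.
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("expected an IPv4 range")
    if network.prefixlen > MIN_PREFIX:
        return "too small"
    if network.prefixlen <= PREFERRED_PREFIX:
        return "preferred"
    return "ok"


if __name__ == "__main__":
    for candidate in ["10.0.0.0/28", "10.0.0.0/24", "10.0.0.0/20"]:
        print(candidate, check_subnet_cidr(candidate))
```

A smaller prefix length means a larger range, so a /20 subnet holds 4,096 addresses while a /24 subnet holds 256.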
Firewall policy
Anyscale requires exactly one firewall policy associated with the VPC. This policy can be either global or regional and should include the following rules:
- Allow ingress from 0.0.0.0/0 on ports 22 and 443. Note: You can scope down the 0.0.0.0/0 range for enhanced security.
- Allow all traffic within the same subnet.
Filestore
Configure your Filestore to meet the following requirements:
- Create a Filestore instance in the same VPC as the cloud.
- Recommended: Create the Filestore instance in the same region as the cloud for optimal performance.
- Recommended: Select a service tier suitable for anticipated scaling. Note: While you can increase Filestore capacity post-creation, the chosen service tier limits this capacity.
Cloud storage bucket
Grant both the Anyscale Access service account and the Ray Cluster Node service account the following permissions to the bucket:
- storage.buckets.get: Allows getting bucket information.
- storage.objects.[get,list,create]: Permits getting, listing, and creating objects in a bucket.
- storage.multipartUploads.[create,listParts,abort]: Enables managing multipart uploads.
Tip: The broadest role that encompasses these permissions is Storage Admin (roles/storage.admin). Assign it to both service accounts in the bucket policy.
- Set up CORS as follows to enable using the Anyscale frontend to view data in the bucket:
[
{
"origin": ["https://*.anyscale.com"],
"method": ["GET", "PUT", "POST", "HEAD", "DELETE"],
"responseHeader": ["*"],
"maxAgeSeconds": 3600
}
]
- Create the bucket in the same region as the cloud for reduced latency.
- Configure the bucket to block all public access for security.
- Use the default uniform bucket-level access control for simplified permissions management.
Anyscale does not assume responsibility for data loss. Implement GCS object versioning and configure lifecycle management policies for data retention. See the GCS Documentation on Object Versioning and Object Lifecycle for more details.
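One way to apply the CORS settings listed above is to write them to a JSON file and pass that file to the gcloud CLI. The following Python sketch writes the file and builds the corresponding `gcloud storage buckets update` command. The function names and the `cors.json` file name are assumptions for illustration. The command is printed, not executed.

```python
import json
import shlex

# CORS rules copied from the configuration shown above.
CORS_CONFIG = [
    {
        "origin": ["https://*.anyscale.com"],
        "method": ["GET", "PUT", "POST", "HEAD", "DELETE"],
        "responseHeader": ["*"],
        "maxAgeSeconds": 3600,
    }
]


def write_cors_file(path):
    """Write the CORS configuration as JSON and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(CORS_CONFIG, handle, indent=2)
    return path


def build_cors_command(bucket_name, cors_file):
    """Return the argument list that applies the CORS file to the bucket."""
    return [
        "gcloud", "storage", "buckets", "update",
        f"gs://{bucket_name}",
        f"--cors-file={cors_file}",
    ]


if __name__ == "__main__":
    # GCS_BUCKET_NAME is a placeholder, as in the register command.
    cors_file = write_cors_file("cors.json")
    print(shlex.join(build_cors_command("GCS_BUCKET_NAME", cors_file)))
```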
Optional: Memorystore instance
Configure your Memorystore instance to meet the following requirements:
- Select the Standard tier with at least 5 GB memory.
- Under the Configure read replicas dropdown, select at least one read replica.
- The instance should be in the same VPC as configured for the cloud.
- Set maxmemory policy to allkeys-lru.
- Disable Transport Layer Security (TLS) on the instance.
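The Memorystore requirements above can be expressed as a small checklist function. The sketch below validates a dictionary describing an instance. The key names (`tier`, `memorySizeGb`, `replicaCount`, `redisConfigs`, `transitEncryptionMode`) and the `STANDARD_HA` tier value are assumed to mirror the Memorystore for Redis instance resource, so verify them against the output of your own tooling before relying on this check. The same-VPC requirement is not checked here.

```python
def memorystore_problems(instance):
    """Return a list of requirement violations; an empty list means none found."""
    problems = []
    if instance.get("tier") != "STANDARD_HA":
        problems.append("tier must be Standard")
    if instance.get("memorySizeGb", 0) < 5:
        problems.append("memory must be at least 5 GB")
    if instance.get("replicaCount", 0) < 1:
        problems.append("at least one read replica is required")
    policy = instance.get("redisConfigs", {}).get("maxmemory-policy")
    if policy != "allkeys-lru":
        problems.append("maxmemory policy must be allkeys-lru")
    # A missing field is treated as TLS disabled (an assumption of this sketch).
    if instance.get("transitEncryptionMode", "DISABLED") != "DISABLED":
        problems.append("TLS must be disabled")
    return problems


if __name__ == "__main__":
    # Example values are invented for illustration.
    example = {
        "tier": "STANDARD_HA",
        "memorySizeGb": 5,
        "replicaCount": 1,
        "redisConfigs": {"maxmemory-policy": "allkeys-lru"},
        "transitEncryptionMode": "DISABLED",
    }
    print(memorystore_problems(example) or "all checks passed")
```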
Step 3: Register Anyscale Cloud on GCP
After setting up the necessary resources, use the following command to register your Anyscale Cloud on GCP:
anyscale cloud register \
  --provider gcp \
  --name ANYSCALE_CLOUD_NAME \
  --region GCP_COMPUTE_REGION \
  --project-id GCP_PROJECT_ID \
  --vpc-name VPC_NAME \
  --subnet-names SUBNET_NAME \
  --file-storage-id FILESTORE_INSTANCE_ID \
  --filestore-location FILESTORE_LOCATION \
  --anyscale-service-account-email ANYSCALE_SERVICE_ACCOUNT_EMAIL \
  --instance-service-account-email INSTANCE_SERVICE_ACCOUNT_EMAIL \
  --firewall-policy-names FIREWALL_POLICY_NAME \
  --cloud-storage-bucket-name GCS_BUCKET_NAME \
  --provider-name PROVIDER_NAME
Optional flags:
- --memorystore-instance-name MEMORYSTORE_NAME: Enables head node fault tolerance in Anyscale Services. You must configure the Memorystore instance according to the detailed resource requirements listed in the previous step.
- --private-network: Enables private networking on private subnets and IP addresses.
- --functional-verify workspace: Launches a test Workspace to verify validity of resources.
- --functional-verify service: Launches a test Service to verify validity of resources.
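Because the register command takes many values, it can help to assemble it from a single settings mapping and fail early when a value is missing. The following Python sketch does that using only the flags shown above. The function name `build_register_command` and the shape of the `settings` dictionary are illustrative assumptions, not an Anyscale API. The command is printed, not executed.

```python
import shlex

# Flags from the register command above, in the same order.
REQUIRED_FLAGS = [
    "name", "region", "project-id", "vpc-name", "subnet-names",
    "file-storage-id", "filestore-location",
    "anyscale-service-account-email", "instance-service-account-email",
    "firewall-policy-names", "cloud-storage-bucket-name", "provider-name",
]


def build_register_command(settings, memorystore_instance_name=None,
                           private_network=False, functional_verify=None):
    """Return the argument list for `anyscale cloud register` on GCP."""
    missing = [flag for flag in REQUIRED_FLAGS if not settings.get(flag)]
    if missing:
        raise ValueError("missing values for: " + ", ".join(missing))
    command = ["anyscale", "cloud", "register", "--provider", "gcp"]
    for flag in REQUIRED_FLAGS:
        command += ["--" + flag, settings[flag]]
    if memorystore_instance_name:
        command += ["--memorystore-instance-name", memorystore_instance_name]
    if private_network:
        command.append("--private-network")
    if functional_verify:
        command += ["--functional-verify", functional_verify]
    return command


if __name__ == "__main__":
    # Placeholder values; replace each with the real resource identifier.
    example = {flag: flag.upper().replace("-", "_") for flag in REQUIRED_FLAGS}
    print(shlex.join(build_register_command(example, private_network=True)))
```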
5. Verify Cloud resources
To ensure that your Anyscale Cloud works as expected, launch a test Workspace or Service from your Cloud. Run the following CLI command to trigger a functional verification:
anyscale cloud verify --name CLOUD_NAME --functional-verify workspace,service
Glossary of cloud resources
- Service accounts:
  - Anyscale Access service account: Used by Anyscale to manage Google Compute Engine (GCE) instances and Ray clusters within your GCP project.
  - Ray Cluster service account: The default service account attached to Ray clusters. You can customize this account to meet the specific needs and permissions required by your workload.
- Workload identity federation: A mechanism that allows Anyscale to exchange control plane AWS role credentials for time-bound GCP service account credentials.
- VPC and subnets:
  - Virtual Private Cloud (VPC): A virtual network within your GCP project. Anyscale deploys each Cloud in its own VPC, providing logical isolation from others.
  - Subnets: Ranges of IP addresses within your VPC. GCP resources like GCE VM instances are attached to these subnets. Anyscale deploys workloads within these defined VPCs and subnets.
- Firewall policy: Collections of firewall rules that secure the cloud environment by controlling incoming and outgoing traffic to GCP resources. Anyscale requires specific firewall rules to enable access to its suite of components and applications, such as Jupyter Labs, Ray Dashboard, Ray Serve endpoints, and Workspaces.
- Filestore: A cloud-based network file system compatible with other GCP services. It offers shared storage, scalable performance, and compliance with regulatory standards. Required for Anyscale Workspaces, Filestore supports applications and workloads that need shared storage.
- Cloud storage bucket: A scalable and secure object storage service. Anyscale uses the specified cloud storage bucket for various functions supporting the management of Ray clusters and Ray applications. This includes general data storage persisting beyond the cluster lifespan and storing model checkpoints for tools like Ray Tune or RLlib.
- The bucket is named in the format anyscale-production-data-{cloud_id}, but you can customize the name if you bring your own bucket. Within the bucket, Anyscale stores managed data in the {organization_id}/ folder. Cloud-specific managed data is grouped further into the {organization_id}/{cloud_id} folder. Some legacy folders still hold Anyscale-managed data, as detailed below.
- Avoid modifying or deleting the data that Anyscale manages and stores on your behalf. If the data is deleted, features such as log viewing and log downloading degrade.
- Logs are stored in the {organization_id}/{cloud_id}/logs and /logs folders. The /logs folder is a legacy location, and Anyscale plans to migrate all logs to the {organization_id}/{cloud_id}/logs folder. These folders hold all job logs, Web Terminal command logs, and Ray logs. For performance reasons, Anyscale stores logs in various formats for different use cases. For example, when streaming logs, Anyscale may produce many small files so that users can download fresher data.
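To make the bucket layout described above concrete, the following Python sketch derives the default bucket name and the two log locations from an organization ID and a cloud ID. The function names and the example IDs are invented for illustration, and the sketch assumes the legacy /logs folder sits at the root of the bucket.

```python
def default_bucket_name(cloud_id):
    """Bucket name in the format used when you do not bring your own bucket."""
    return f"anyscale-production-data-{cloud_id}"


def log_locations(organization_id, cloud_id, bucket_name=None):
    """Return the current and legacy log locations as gs:// URIs."""
    bucket = bucket_name or default_bucket_name(cloud_id)
    return {
        "current": f"gs://{bucket}/{organization_id}/{cloud_id}/logs",
        # Assumed to be at the bucket root, based on the /logs path above.
        "legacy": f"gs://{bucket}/logs",
    }


if __name__ == "__main__":
    # Both IDs below are made up.
    for label, uri in log_locations("org_example", "cld_example").items():
        print(label, uri)
```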
Appendix: Minimal IAM Permissions for cloud commands
This section provides the minimal IAM permissions required for the Anyscale CLI to perform cloud operations. As a GCP administrator, follow these steps to apply the policy:
- Create a new custom role or edit an existing role to include the following permissions. See GCP documentation on Create and manage custom roles for more information.
- Grant the role to the service account that will be used to run the Anyscale CLI. See GCP documentation on Manage access to projects, folders, and organizations for more information.
- ⚡ (Auto) cloud setup
- 🎛️ (Custom) cloud register
Permissions for ⚡ (Auto) cloud setup:
certificatemanager.certmapentries.delete
certificatemanager.certmapentries.get
certificatemanager.certmapentries.list
certificatemanager.certmaps.delete
certificatemanager.certmaps.get
certificatemanager.certmaps.list
certificatemanager.certs.delete
certificatemanager.certs.get
certificatemanager.certs.list
compute.firewallPolicies.get
compute.firewallPolicies.update
compute.firewallPolicies.use
compute.globalOperations.get
compute.networks.setFirewallPolicy
deploymentmanager.deployments.create
deploymentmanager.deployments.get
deploymentmanager.deployments.update
deploymentmanager.manifests.get
deploymentmanager.operations.get
file.instances.get
iam.workloadIdentityPoolProviders.create
iam.workloadIdentityPools.create
iam.workloadIdentityPools.get
resourcemanager.projects.get
resourcemanager.projects.getIamPolicy
resourcemanager.projects.setIamPolicy
servicemanagement.services.bind
serviceusage.operations.get
serviceusage.services.enable
serviceusage.services.get
Permissions for 🎛️ (Custom) cloud register:
certificatemanager.certmapentries.delete
certificatemanager.certmapentries.get
certificatemanager.certmapentries.list
certificatemanager.certmaps.delete
certificatemanager.certmaps.get
certificatemanager.certmaps.list
certificatemanager.certs.delete
certificatemanager.certs.get
certificatemanager.certs.list
compute.firewallPolicies.get
iam.serviceAccounts.getIamPolicy
iam.serviceAccounts.get
storage.buckets.getIamPolicy
compute.subnetworks.get
file.instances.get
compute.networks.get
resourcemanager.projects.getIamPolicy
resourcemanager.projects.get
redis.instances.get
storage.buckets.get
serviceusage.operations.get
serviceusage.services.enable
serviceusage.services.get
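A custom role can be described in a definition file and created from it. The following Python sketch turns a pasted, newline-separated permission list into such a definition with the `title`, `description`, `stage`, and `includedPermissions` fields. The function name, the role title, and the `anyscale-role.json` file name are assumptions for illustration, and the three permissions in the example are only an excerpt of the lists above.

```python
import json


def role_definition(title, permissions_text):
    """Build a custom role definition from newline-separated permissions."""
    # Strip whitespace, skip blank lines, and drop duplicates.
    permissions = sorted(
        {line.strip() for line in permissions_text.splitlines() if line.strip()}
    )
    return {
        "title": title,
        "description": "Minimal permissions for Anyscale cloud commands",
        "stage": "GA",
        "includedPermissions": permissions,
    }


if __name__ == "__main__":
    # Paste the full permission list for your deployment method here.
    excerpt = """
    file.instances.get
    redis.instances.get
    storage.buckets.get
    """
    with open("anyscale-role.json", "w", encoding="utf-8") as handle:
        json.dump(role_definition("Anyscale Cloud Register", excerpt), handle, indent=2)
    # The file can then be passed to the gcloud CLI, for example:
    #   gcloud iam roles create ROLE_ID --project=GCP_PROJECT_ID --file=anyscale-role.json
```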