Accessing a GCS Bucket

This page describes how to interact with a GCS bucket directly from your Anyscale cluster (running on GCP) and how to configure runtime environments to work with GCS.

note

Determining the Cluster's Service Account

By default, Anyscale clusters run with a cloud-specific Service Account (instructions are here).

If you followed the instructions for running with a custom Service Account, use that Service Account for the rest of these instructions.

Accessing Google Cloud Storage Directly from an Anyscale Cluster

To interact with a private Google Cloud Storage Bucket you need both permissions and tooling.

To grant your Service Account (either the Anyscale default Service Account or your own) access to a bucket, follow these steps (adapted from Google's documentation):

  1. Go to the "Permissions" tab of the bucket
  2. Click "Add"
  3. Type the Service Account Email as a "New principal". If you are using the Anyscale default cloud-specific Service Account, you can find the Service Account Email in the Clouds table on the Configurations page in a column called Provider Identity.
  4. Select roles to grant to the Service Account. To give full read, write, and list access, grant your bucket Storage Object Admin and Storage Object Viewer.
  5. Click "Save"
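The console steps above can also be scripted. The following is a minimal sketch using a hypothetical Service Account email and bucket name; it only prints the gcloud commands so you can review them before running anything, and it assumes a reasonably recent Google Cloud SDK:

```shell
#!/bin/sh
# Placeholder values -- substitute your Service Account email and bucket.
SERVICE_ACCOUNT_EMAIL="anyscale-cluster@my-project.iam.gserviceaccount.com"
BUCKET="my-bucket"

# Grant the same two roles as in the console steps above.
# Remove the leading "echo" to actually apply each binding.
for ROLE in roles/storage.objectAdmin roles/storage.objectViewer; do
  echo gcloud storage buckets add-iam-policy-binding "gs://${BUCKET}" \
    --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
    --role="${ROLE}"
done
```

Because the commands are only echoed, you can paste the output into a terminal once you have confirmed the email and bucket are correct.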

Interacting with your bucket

To interact with a bucket from the CLI, install gsutil. Running the following command installs gsutil on a node:

wget -qO- https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-359.0.0-linux-x86_64.tar.gz | tar xvz

Afterwards, you can use gsutil (e.g., to copy a local file to the bucket) as follows:

./google-cloud-sdk/bin/gsutil cp <file> gs://<bucket>
caution

If you install gsutil via pip (as is the case with runtime environments), you may need to add the following to ~/.boto:

[GoogleCompute]
service_account = default

You can create this file by running:

printf "[GoogleCompute]\nservice_account = default\n" > ~/.boto
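The one-liner above can be expanded into a small sketch that also confirms the override landed in the file. The file path is the standard boto location; note this overwrites any existing ~/.boto, so back it up first if you have custom boto settings:

```shell
#!/bin/sh
# Write the GoogleCompute override to ~/.boto.
# Warning: this overwrites any existing ~/.boto file.
BOTO_FILE="${HOME}/.boto"
printf '[GoogleCompute]\nservice_account = default\n' > "${BOTO_FILE}"

# Confirm the override is in place.
grep 'service_account' "${BOTO_FILE}"
```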

Using Local Directory with Anyscale Jobs and Services (on GCP)

With Anyscale Jobs and Services, you can set the working_dir option of the runtime_env to be a local directory. Follow the instructions below to set up permissions for accessing your Google Cloud bucket.

Anyscale uploads your local directory to the specified remote storage, and the cluster downloads it before running the job. External storage allows clusters to be restarted with your working_dir after the initial submission.

Instructions

Set up your environment

  1. Make sure you have gcloud installed. If you have a Mac, you can run brew install --cask google-cloud-sdk to install it; otherwise, follow the instructions in the link.
  2. Authenticate your computer with Google to allow uploading to your GCS bucket by running gcloud auth application-default login in your local terminal. A browser will open and prompt you to sign in with your Google account.

Configure Permissions

  1. Use an existing bucket on Google Cloud or create a new bucket. Your bucket can live in any Google Cloud project.
  2. Configure gcloud to use the same project as your bucket by running gcloud config set project <PROJECT_ID>.
  3. Follow the directions from above to give your Anyscale Cluster permission to access your GCS bucket. You can use the Provider Identity found in the Clouds table on the Configurations page or your own Service Account.

Run a Job or Service

  1. You can now upload to your bucket by specifying upload_path and a local working_dir in the runtime_env of your Anyscale Job. You can find the upload_path for your Google Storage bucket by navigating to Configuration and finding the row called gsutil URI; it should look something like gs://my-bucket. The runtime_env portion of your YAML should look similar to the following:
runtime_env:
  working_dir: "."
  upload_path: "gs://my-test-bucket"
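Putting it together, a complete Anyscale Job config might look like the sketch below. The name and entrypoint values are hypothetical placeholders, and gs://my-test-bucket stands in for your own bucket's gsutil URI:

```yaml
# Hypothetical Anyscale Job config -- substitute your own name,
# entrypoint, and bucket.
name: my-gcs-job
entrypoint: python main.py
runtime_env:
  working_dir: "."
  upload_path: "gs://my-test-bucket"
```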