Monitoring

Check your docs version

This version of the Anyscale docs is deprecated. Go to the latest version for up-to-date information.

When running Anyscale workloads in production, we recommend shipping logs & metrics to a third-party provider to enable rich querying, filtering, and alerting capabilities. In these docs, we'll walk through how to set up these integrations by installing a third-party monitoring tool (Vector) into the Ray container.

Note: The following guide requires creating a new cluster environment, based on either an Anyscale-provided Docker image or a Bring your own Docker environment.

Step 0: Requirements

Vector is a tool for building observability pipelines. It accepts a configuration file that defines a data pipeline consisting of sources, transforms, and sinks. It supports many common third-party monitoring solutions. In this guide, we will use Vector to scrape logs & metrics from Ray, and ship them to a location of your choice.
SupervisorD is a process control system. In this guide, we will use SupervisorD to manage the Vector process.

Step 1: Write a Vector Configuration File

To write and test a configuration file, we recommend using an Anyscale workspace. Start a workspace, and open VS Code (as a text editor).

A Vector configuration file is a directed graph, consisting of one or more sources, transforms, and sinks. Below, we walk through how to build a configuration file to ship Ray logs & metrics to a few third-party providers supported by Vector.

Create a file called vector.yaml, and paste the following configuration in.

Source/Transform Configuration

vector.yaml
sources:
  raw_ray_logs:
    type: file
    fingerprint:
      ignored_header_bytes: 0
      strategy: device_and_inode
    include:
      - /tmp/ray/*/logs/**/job-driver-*.*
      - /tmp/ray/*/logs/**/runtime_env_setup-*.*
      - /tmp/ray/*/logs/**/worker-*.out
      - /tmp/ray/*/logs/**/worker-*.err
      - /tmp/ray/*/logs/**/serve/*.*
    exclude:
      # The session_latest directory is a symlink to an actual session directory,
      # so we intentionally exclude it here so Vector doesn't ingest duplicates.
      - /tmp/ray/session_latest/logs/**/*.*
  raw_ray_metrics:
    type: prometheus_scrape
    endpoints:
      - "${ANYSCALE_RAY_METRICS_ENDPOINT}"
    instance_tag: ScrapeTarget
    scrape_interval_secs: 15

# These transforms add useful attributes to your logs & metrics. To use other environment variables,
# see https://docs.anyscale.com/reference/environment-variables for all available options.
transforms:
  ray_logs:
    type: remap
    inputs: ["raw_ray_logs"]
    source: |-
      .cluster_id = "${ANYSCALE_CLUSTER_ID}"
      .instance_id = "${ANYSCALE_INSTANCE_ID}"
      .node_ip = "${ANYSCALE_NODE_IP}"
  ray_metrics:
    type: remap
    inputs: ["raw_ray_metrics"]
    source: |-
      .tags.cluster_id = "${ANYSCALE_CLUSTER_ID}"
      .tags.instance_id = "${ANYSCALE_INSTANCE_ID}"
      .tags.node_ip = "${ANYSCALE_NODE_IP}"
      .tags = compact(.tags, recursive: true)
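
The configuration above relies on several environment variables, such as ANYSCALE_CLUSTER_ID, being set on the node. As a minimal sketch (the function names here are illustrative and not part of Vector or Anyscale), the following shell helpers list the ${VAR} references in a config file and report any that are unset or empty in the current environment:

```shell
# Print each distinct ${VAR} name referenced in the given config file.
list_config_vars() {
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$1" | tr -d '${}' | sort -u
}

# Print the referenced names that are unset or empty in the current environment.
missing_config_vars() {
  for name in $(list_config_vars "$1"); do
    # Safe to eval: names were matched against [A-Za-z_][A-Za-z0-9_]* above.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Example (run on a cluster node, from the directory containing vector.yaml):
#   missing_config_vars vector.yaml
```

An empty result suggests every referenced variable is available. If you add a sink that references variables that are only set for some workload types, expect those names to be reported when they do not apply.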

Sink Configuration

Then, choose one of the sinks below, and add it to vector.yaml.

AWS CloudWatch
Google Cloud Monitoring
Datadog Logs & Metrics

For AWS CloudWatch, the cluster IAM role needs additional permissions. You can add them in the AWS IAM Console using the policy below. Make sure to replace YOUR_ACCOUNT_ID with your AWS Account ID.

IAM CloudWatch Policy
{
  "Statement": [
    {
      "Action": "cloudwatch:PutMetricData",
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "CloudwatchMetricsWrite"
    },
    {
      "Action": ["logs:DescribeLogStreams", "logs:DescribeLogGroups"],
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "CloudwatchLogsRead"
    },
    {
      "Action": "logs:PutLogEvents",
      "Effect": "Allow",
      "Resource": "arn:aws:logs:*:YOUR_ACCOUNT_ID:log-group:/anyscale*:*",
      "Sid": "CloudwatchLogsEventsWrite"
    },
    {
      "Action": ["logs:CreateLogStream", "logs:CreateLogGroup"],
      "Effect": "Allow",
      "Resource": "arn:aws:logs:*:YOUR_ACCOUNT_ID:log-group:/anyscale*",
      "Sid": "CloudwatchLogsWrite"
    }
  ],
  "Version": "2012-10-17"
}
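
To avoid editing the account ID by hand, you could render the policy from a template. This is a sketch only: it assumes you saved the policy above to a local file (policy.json is a hypothetical name), and render_policy is an illustrative helper, not an AWS tool.

```shell
# Print the policy template with YOUR_ACCOUNT_ID replaced by the given account ID.
# Usage: render_policy 123456789012 policy.json > rendered-policy.json
render_policy() {
  # AWS account IDs are 12-digit numbers; reject anything else early.
  if ! printf '%s' "$1" | grep -Eq '^[0-9]{12}$'; then
    echo "error: expected a 12-digit AWS account ID" >&2
    return 1
  fi
  sed "s/YOUR_ACCOUNT_ID/$1/g" "$2"
}
```

The rendered output can then be pasted into the IAM Console.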

Once the IAM Role has been updated, update vector.yaml to include a sink section as follows:

vector.yaml
sinks:
  cloudwatch_logs:
    region: us-west-2 # Replace with your AWS region
    encoding:
      codec: json
    group_name: "/anyscale/"
    inputs: ["ray_logs"]
    # One of ANYSCALE_JOB_ID / ANYSCALE_SERVICE_ID will be set for jobs / services.
    stream_name: "${ANYSCALE_JOB_ID}${ANYSCALE_SERVICE_ID}/${ANYSCALE_SESSION_ID}"
    type: aws_cloudwatch_logs
  cloudwatch_metrics:
    region: us-west-2 # Replace with your AWS region
    default_namespace: anyscale
    inputs: ["ray_metrics"]
    type: aws_cloudwatch_metrics
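
To preview which CloudWatch log stream a node would write to, the stream_name template can be mirrored in shell. This sketch assumes unset variables resolve to empty strings, as the comment in the sink implies; the helper name is illustrative.

```shell
# Print the stream name the template resolves to in the current environment.
# Unset variables are treated as empty strings (an assumption of this sketch).
cloudwatch_stream_name() {
  echo "${ANYSCALE_JOB_ID:-}${ANYSCALE_SERVICE_ID:-}/${ANYSCALE_SESSION_ID:-}"
}
```

Because only one of ANYSCALE_JOB_ID or ANYSCALE_SERVICE_ID is set, the result has the form ID/SESSION_ID.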

GCP Logging and Cloud Monitoring require additional roles to be added to the Anyscale Cluster Principal. This can be modified in the GCP IAM Console. You will need to add:

roles/logging.logWriter
roles/monitoring.metricWriter
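
If you prefer the gcloud CLI to the console, the two role grants can be expressed as gcloud projects add-iam-policy-binding commands. The sketch below only prints the commands so you can review them before running anything. The helper name is illustrative, and the member format (for example, serviceAccount:NAME@PROJECT.iam.gserviceaccount.com) is an assumption about how your Anyscale Cluster Principal is defined.

```shell
# Print (without running) the gcloud commands that grant both roles.
# Usage: print_gcp_role_bindings PROJECT_ID MEMBER
print_gcp_role_bindings() {
  for role in roles/logging.logWriter roles/monitoring.metricWriter; do
    echo "gcloud projects add-iam-policy-binding $1 --member=$2 --role=$role"
  done
}
```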

Once the IAM Principal has been updated, update vector.yaml to include a sink section as follows:

vector.yaml
sinks:
  gcp_logs:
    encoding:
      timestamp_format: rfc3339
    inputs: ["ray_logs"]
    log_id: anyscale.ray
    project_id: INSERT_GOOGLE_PROJECT_ID_HERE
    resource:
      project_id: INSERT_GOOGLE_PROJECT_ID_HERE
      type: global
    type: gcp_stackdriver_logs

  gcp_metrics:
    inputs: ["ray_metrics"]
    project_id: INSERT_GOOGLE_PROJECT_ID_HERE
    resource:
      project_id: INSERT_GOOGLE_PROJECT_ID_HERE
      type: global
    type: gcp_stackdriver_metrics

For Datadog, identify your Datadog Site (Datadog offers different sites around the world) and update the site value in the example YAML definition.

Warning: Putting API keys into a Docker image is not a best practice. We recommend using Vector secrets in production.

vector.yaml
sinks:
  datadog_logs:
    default_api_key: INSERT_DATADOG_API_KEY_HERE
    inputs: ["ray_logs"]
    site: us5.datadoghq.com # Put your specific Datadog Site here
    type: datadog_logs
  datadog_metrics:
    default_api_key: INSERT_DATADOG_API_KEY_HERE
    inputs: ["ray_metrics"]
    site: us5.datadoghq.com # Put your specific Datadog Site here
    type: datadog_metrics
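
Several of the sink examples contain INSERT_..._HERE placeholders that must be replaced before the config is usable. As a small sketch (find_placeholders is an illustrative name), you could check for leftovers before testing or building an image:

```shell
# Print every line (with its line number) that still contains an
# INSERT_..._HERE placeholder. Follows grep's exit convention:
# returns 0 if placeholders were found, 1 if the file is clean.
find_placeholders() {
  grep -n -E 'INSERT_[A-Z_]+_HERE' "$1"
}

# Example:
#   if find_placeholders vector.yaml; then echo "Replace the values above."; fi
```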

Step 2: Test the Configuration File

Save the Vector configuration above in a file in your local directory (for example, vector.yaml). Then, run the following commands:

# Install Vector.
sudo apt-get update && sudo apt-get install -y curl
curl --proto '=https' --tlsv1.2 -sSfL https://sh.vector.dev | bash
source /home/ray/.profile

# Create a state directory for Vector & make it accessible.
sudo mkdir -p /var/lib/vector/
sudo chmod 777 /var/lib/vector/

# Run Vector
vector --config vector.yaml

# In a new tab, generate fake log content.
mkdir -p /tmp/ray/session_fake/logs/
for i in {1..5000}; do echo "Log Line $i" >> /tmp/ray/session_fake/logs/job-driver-fake.log && echo "Wrote line $i" && sleep 1; done

# Look for warnings / errors in Vector's output. If you don't see any, check your monitoring provider to confirm that logs & metrics are being received.
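
Scanning Vector's output by eye gets tedious on a busy node. The sketch below assumes you captured the output to a file, for example with vector --config vector.yaml 2>&1 | tee vector.log (vector.log is a hypothetical name), and that log lines carry the level names WARN and ERROR. The helper name is illustrative.

```shell
# Print the number of lines in the given log file that contain WARN or ERROR
# as a whole word. Always exits 0, even when the count is 0.
count_vector_problems() {
  grep -c -w -E 'WARN|ERROR' "$1" || true
}

# Example:
#   count_vector_problems vector.log
```

A count of 0 is a good sign, but it is not proof of delivery, so still confirm that data arrives at your provider.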

Step 3: Move to Production

To move to production, first write a SupervisorD configuration file so that Vector starts automatically on cluster startup and runs under a process manager that restarts it on failure. Create a file named supervisord.conf, with contents like the following, in the same workspace as earlier.

supervisord.conf
[program:vector]
user=ray
command=bash --login -c -i "sudo -E /home/ray/.vector/bin/vector --config=/etc/vector/vector.yaml"
autostart=true
autorestart=true
startsecs=0
startretries=50
stdout_logfile=/tmp/ray/vector.log
redirect_stderr=true

Then, follow the instructions below to package both of these configuration files into a Ray container image.

Bring your own Docker
Use Anyscale-provided Docker

If you bring your own Docker image:

On your laptop (or wherever you build your Docker image), change into the directory that contains your Dockerfile.
Copy vector.yaml from your Workspace into this directory.
Copy supervisord.conf from your Workspace into this directory.
Add the following lines to your Dockerfile.

# Install Vector.
RUN curl --proto '=https' --tlsv1.2 -sSfL https://sh.vector.dev | bash -s -- -y

# Write the Vector config.
RUN sudo mkdir -p /etc/vector/ && sudo chmod 777 /etc/vector/
COPY vector.yaml /etc/vector/vector.yaml

# Write the SupervisorD config.
RUN sudo mkdir -p /etc/supervisor/customer.conf.d/ && sudo chmod 777 /etc/supervisor/customer.conf.d/
COPY supervisord.conf /etc/supervisor/customer.conf.d/vector.conf

# Create a state directory for Vector & make it accessible (as in Step 2).
RUN sudo mkdir -p /var/lib/vector/ && sudo chmod 777 /var/lib/vector/

Build & push your Docker image, create a cluster environment with this Docker image, and start an Anyscale Job or Service.

If you use an Anyscale-provided Docker image:

On your Workspace, run cat vector.yaml | base64 -w 0. This is the serialized Vector config.
On your Workspace, run cat supervisord.conf | base64 -w 0. This is the serialized SupervisorD config.
Create a new Cluster Environment. In the post build commands, add the following commands:

# Write the Vector config.
sudo mkdir -p /etc/vector/
echo '<serialized vector config>' | base64 -d | sudo tee /etc/vector/vector.yaml

# Write the SupervisorD config.
sudo mkdir -p /etc/supervisor/customer.conf.d/
echo '<serialized supervisord config>' | base64 -d | sudo tee /etc/supervisor/customer.conf.d/supervisord.conf

# Install Vector.
sudo apt-get update && sudo apt-get install -y curl
curl --proto '=https' --tlsv1.2 -sSfL https://sh.vector.dev | bash -s -- -y

# Create a state directory for Vector.
sudo mkdir -p /var/lib/vector/
sudo chmod 777 /var/lib/vector/

echo "Vector installed & configured to run inside supervisord."

Build the Cluster Environment, and start an Anyscale Job or Service using it.
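
Before pasting the serialized strings into the post build commands, you may want to confirm that they decode back to the original files. The following is a minimal sketch; the helper names are illustrative, and -w 0 assumes GNU base64, as used in the steps above.

```shell
# Print the file as a single-line base64 string (GNU base64; -w 0 disables wrapping).
serialize_config() {
  base64 -w 0 < "$1"
}

# Succeed only if encoding and then decoding reproduces the file byte for byte.
roundtrip_ok() {
  serialize_config "$1" | base64 -d | cmp -s - "$1"
}

# Example:
#   roundtrip_ok vector.yaml && roundtrip_ok supervisord.conf && echo "OK"
```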
