Train dashboard

The Train dashboard streamlines the debugging of Ray Train workloads. This dashboard enables you to gain deeper insights into individual workers' progress, pinpoint stragglers, and identify bottlenecks for faster, more efficient training.

This dashboard should be the starting point for debugging any issue with your Train workload. It links to other pages of the Anyscale dashboard for more detailed information about the workload, such as logs, metrics, tasks, actors, or nodes.

Requirements

  • The Train dashboard requires Ray 2.30.0 or above.
  • Data persistence to view runs from past Ray sessions requires Ray 2.44.0 or above.
  • Use Ray Train V2 for the best debugging experience. The dashboard only shows controller logs and structured worker logs for Ray Train V2 workloads, which you can enable as shown in the sketch after this list.
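
For example, a minimal sketch of enabling Ray Train V2 before running a workload. This assumes the `RAY_TRAIN_V2_ENABLED` environment variable toggle available in recent Ray releases; check the Ray Train documentation for your Ray version:

```python
import os

# Enable Ray Train V2 so the dashboard can show controller logs and
# structured worker logs. Set this before importing ray.train so the
# feature flag is picked up.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

import ray.train
from ray.train.torch import TorchTrainer
```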

Access the Train dashboard

To access the Train dashboard, click the Workloads tab on the Jobs or Workspaces page, then select the Train tab.

The Train dashboard presents information in a hierarchy, with groupings of sessions, runs, attempts, and workers.

Monitor a Train run

Each run corresponds to a single call to trainer.fit() and is tracked as a unique training execution.
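
For example, here's a minimal sketch of a run, assuming a PyTorch workload with TorchTrainer; the run name my-train-run is a placeholder:

```python
import ray.train
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # User-defined distributed training loop; metrics reported here
    # show up under the run in the dashboard.
    ray.train.report({"loss": 0.0})


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(name="my-train-run"),  # Shown in the Name field.
)
result = trainer.fit()  # This single fit() call is tracked as one Train run.
```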

The following table describes the fields displayed for each run:

| Field | Description |
| --- | --- |
| ID | The unique ID of a Train run. |
| Name | The run name set using the ray.train.RunConfig(name) configuration. |
| Status | The current state of the Train run, for example, INITIALIZING, RUNNING, or ERRORED. |
| Status Details | Additional information, such as stack traces if the run failed. |
| Controller Info | The actor ID and logs for the controller managing the run. |

The Train dashboard home page shows a list of runs from the current Ray session. Navigate to an individual run page by clicking on the row that you want to inspect.

Train Runs

Debug Train runs with controller logs

Each run spawns a controller, which is a dedicated actor with the following responsibilities:

  • Spawns and monitors workers
  • Handles failure recovery
  • Logs global information, such as failure handling and scheduling decisions, which you can view on the run page

View the history of Train run attempts

Each run begins with a first run attempt. The run moves on to a new attempt when Ray Train restarts training after a failure, for example, as part of the worker fault tolerance configured with ray.train.FailureConfig.
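
A minimal sketch of configuring automatic retries, reusing the hypothetical train_func from the earlier example; each restart after a failure appears as an additional attempt under the same run:

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_func,  # The training loop from the earlier sketch.
    scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(
        name="my-train-run",
        # Retry the run up to 3 times on failure; each retry shows up
        # as a new attempt in the Train dashboard.
        failure_config=FailureConfig(max_failures=3),
    ),
)
trainer.fit()
```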

The following table describes the fields displayed for each attempt:

| Field | Description |
| --- | --- |
| Attempt # | The attempt index within a run. |
| Status | The state of the attempt, for example, SUCCEEDED, ERRORED, or RUNNING. |
| Status details | Additional information about the attempt, for example, the error that caused the attempt to fail. |

Run attempts are listed at the bottom of the run page, with the latest attempt showing up at the top.

Inspect worker logs and metadata

Each attempt consists of a group of workers executing the user-defined distributed training code.

The following table describes the fields displayed for each worker:

| Field | Description |
| --- | --- |
| Actor ID | The unique Ray actor ID for the worker process. |
| Status | The process status, for example, ALIVE or DEAD. |
| World Rank | The index of the worker within the worker group. |
| Local Rank | The index of the worker among workers on the same node (matches the GPU ID if applicable). |
| Node Rank | The index of the node running the worker. |
| PID | The process ID of the worker. |
| Node IP | The IP address of the node running the worker. |
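
As a rough illustration of how these ranks map to the training code, a worker's ranks can be read inside the training loop through the Ray Train context. A minimal sketch:

```python
import ray.train


def train_func():
    ctx = ray.train.get_context()
    # These values correspond to the World Rank, Local Rank, and Node Rank
    # columns shown for each worker on the run attempt page.
    print(
        f"world_rank={ctx.get_world_rank()} "
        f"local_rank={ctx.get_local_rank()} "
        f"node_rank={ctx.get_node_rank()}"
    )
```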

Navigate to the training worker view by clicking on a worker link in a run attempt. Access worker logs, metrics, and profiling tools from here:

Worker logs