Train dashboard

The Train dashboard streamlines the debugging of Ray Train workloads. This dashboard enables you to gain deeper insights into individual workers' progress, pinpoint stragglers, and identify bottlenecks for faster, more efficient training.

This dashboard should be the starting point for debugging any issue with your Train workload. It links to other pages of the Anyscale dashboard for more detailed information about the workload, such as logs, metrics, tasks, actors, or nodes.

Requirements

  • Data persistence to view runs from past Ray sessions requires Ray 2.44.0 or later.
  • For the best debugging experience, use Ray 2.51.0 or later, where Ray Train V2 is enabled by default and provides controller logs and structured worker logs in the dashboard. For earlier Ray versions, you must manually enable Ray Train V2.
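
The following is a minimal sketch of enabling Ray Train V2 on an older Ray version. It assumes the RAY_TRAIN_V2_ENABLED environment variable toggles the V2 code path on your Ray version and that you set it before Ray Train is imported.

```python
import os

# Assumption: on Ray versions before 2.51.0, this environment variable
# switches Ray Train to the V2 code path. Set it before importing ray.train.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

import ray.train  # noqa: E402  (imported after the toggle on purpose)
```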

Access the dashboard

To access Train workload dashboards, click the Workloads tab on the Jobs or Workspaces page, and then select the Train tab.

The Train dashboard presents information in a hierarchy, with groupings of sessions, runs, attempts, and workers.

Monitor a Train run

Each run corresponds to a single call to trainer.fit() and represents a unique training execution.
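
For example, the following minimal sketch produces a single run in the dashboard. The trainer class, training function, and run name are illustrative placeholders.

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Your distributed training code runs here on each worker.
    ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    # The name set here appears in the run's Name field.
    run_config=RunConfig(name="my-train-run"),
)

# This single fit() call appears as one run in the Train dashboard.
result = trainer.fit()
```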

The following table describes the fields displayed for each run:

Field | Description
ID | The unique ID of a Train run.
Name | The run name set using the ray.train.RunConfig(name) configuration.
Status | The current state of the Train run, for example, INITIALIZING, RUNNING, or ERRORED.
Status Details | Additional info, such as stack traces if the run failed.
Controller Info | Actor ID and logs for the controller managing the run.

The Train dashboard home page shows a list of runs from the current Ray session. Navigate to an individual run page by clicking on the row that you want to inspect.

[Image: Train Runs]

Debug Train runs with controller logs

Each run spawns a controller, which is a dedicated actor with the following responsibilities:

  • Spawns and monitors workers
  • Handles failure recovery
  • Logs global information, including failure handling and scheduling decisions, which you can view on the run page

Error attribution and debugging

When training jobs fail, the Train dashboard provides rich context to help you quickly diagnose and address the root cause. The dashboard automatically surfaces detailed information about failures, eliminating the need to manually piece together logs from different systems.

For each failure, use the dashboard to identify the following:

  • Specific workers affected: Identify which worker processes encountered errors
  • Error classification: Distinguish between application errors—bugs in your training code—and hardware issues, such as GPU failures or out-of-memory conditions
  • Detailed stack traces: View complete error stack traces directly in the run's Status Details field
  • Relevant node logs: Access system-level logs that show hardware-related issues such as GPU errors or memory problems
  • Historical context: When using fault tolerance or elastic training, view error details from previous attempts to understand patterns

The dashboard marks failed runs as ERRORED and provides status details explaining where the error originated. Individual workers also display their own status details, including relevant node logs when hardware failures occur.
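
As a minimal illustration, an unhandled exception raised in your training function surfaces as an application error. The function name below is a placeholder.

```python
def train_loop_per_worker(config):
    # Assumption: an unhandled exception in the training function is reported
    # as an application error; the dashboard marks the run ERRORED and shows
    # the traceback in the run's Status Details field.
    raise RuntimeError("simulated bug in training code")
```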

This comprehensive error context enables faster debugging by providing all relevant information in a single interface, reducing the time spent searching through scattered logs and metrics.

View the history of Train run attempts

Each run begins as a first attempt. The run moves on to a new attempt when the worker group restarts, for example when fault tolerance recovers from a worker or node failure, or when elastic training resizes the worker group.
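
For example, the following minimal sketch enables automatic retries; the run name and retry count are illustrative. The assumption is that each retry after a failure appears as a new attempt in the dashboard.

```python
from ray.train import FailureConfig, RunConfig

# Assumption: with max_failures set, Ray Train retries the run after worker
# failures, and each retry shows up as a new attempt in the dashboard.
run_config = RunConfig(
    name="my-train-run",
    failure_config=FailureConfig(max_failures=3),
)
```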

The following table describes the fields displayed for each attempt:

Field | Description
Attempt # | Attempt index within a run
Status | State of the attempt, for example, SUCCEEDED, ERRORED, or RUNNING
Status details | Additional information about the attempt (for example, the error that caused the attempt to fail)

The run page lists attempts at the bottom, with the most recent attempt at the top.

Inspect worker logs and metadata

Each attempt consists of a group of workers executing the user-defined distributed training code.

The following table describes the fields displayed for each worker:

Field | Description
Actor ID | Unique Ray actor ID for the worker process
Status | Process status (ALIVE, DEAD)
World Rank | Index of the worker within the group
Local Rank | Index among workers on the same node (matches GPU ID if applicable)
Node Rank | Index of the node running the worker
PID | Process ID
Node IP | IP address of the node running the worker
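
The rank fields above correspond to values your training code can read from the Ray Train context. The training function below is a minimal sketch.

```python
import ray.train


def train_loop_per_worker(config):
    ctx = ray.train.get_context()
    # These values match the World Rank, Local Rank, and Node Rank columns
    # shown for each worker in the dashboard.
    world_rank = ctx.get_world_rank()
    local_rank = ctx.get_local_rank()
    node_rank = ctx.get_node_rank()
    print(f"world={world_rank} local={local_rank} node={node_rank}")
```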

Navigate to the training worker view by clicking a worker link in a run attempt. From this view, you can access worker logs, metrics, and profiling tools:

[Image: Worker logs]