Train dashboard
The Train dashboard streamlines the debugging of Ray Train workloads. This dashboard enables you to gain deeper insights into individual workers' progress, pinpoint stragglers, and identify bottlenecks for faster, more efficient training.
This dashboard should be the starting point for debugging any issue with your Train workload. It links to other pages of the Anyscale dashboard for more detailed information about the workload, such as logs, metrics, tasks, actors, or nodes.
Requirements
- The Train dashboard requires Ray 2.30.0 or above.
- Data persistence to view runs from past Ray sessions requires Ray 2.44.0 or above.
- Use Ray Train V2 for a better debugging experience. The dashboard only shows controller logs and structured worker logs when you use Ray Train V2.
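
As a reference, the following sketch shows one way to opt into Ray Train V2 before importing Ray Train modules. It assumes the RAY_TRAIN_V2_ENABLED environment variable is the feature flag for your Ray version.

```python
# A minimal sketch, assuming RAY_TRAIN_V2_ENABLED is the Train V2 feature
# flag for your Ray version. Set it before importing Ray Train modules.
import os

os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

from ray.train.torch import TorchTrainer  # Import Ray Train after setting the flag.
```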
Access the Train dashboard
To access Train workload dashboards, click the Workloads tab in the Jobs or Workspaces page. Then, select the Train tab.
The Train dashboard presents information in a hierarchy, with groupings of sessions, runs, attempts, and workers.
Monitor a Train run
Each run corresponds to a single call to trainer.fit()
and is tracked as a unique training execution.
The following table describes the fields displayed for each run:
| Field | Description |
| --- | --- |
| ID | The unique ID of a Train run. |
| Name | The run name set using the ray.train.RunConfig(name) configuration. |
| Status | The current state of the Train run, for example, INITIALIZING, RUNNING, or ERRORED. |
| Status Details | Additional information, such as stack traces if the run failed. |
| Controller Info | Actor ID and logs for the controller managing the run. |
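
For context, here's a minimal sketch of how a run is created: each call to trainer.fit() appears as one run, and the Name field comes from ray.train.RunConfig(name). The TorchTrainer setup and run name below are illustrative, not prescriptive.

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # User-defined distributed training code; runs on each worker.
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    # The Name field on the run page comes from RunConfig(name=...).
    run_config=RunConfig(name="my-train-run"),
)
result = trainer.fit()  # Tracked as a single Train run on the dashboard.
```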
The Train dashboard home page shows a list of runs from the current Ray session. Navigate to an individual run page by clicking on the row that you want to inspect.
Debug Train runs with controller logs
Each run spawns a controller, which is a dedicated actor with the following responsibilities:
- Spawns and monitors workers
- Handles failure recovery
- Logs global information, including failure handling and scheduling decisions, which you can view on the run page
View the history of Train run attempts
Each run starts with a first attempt. The run moves on to a new attempt when:
- You configure Ray Train worker fault tolerance, and the run retries after encountering a worker failure (see the sketch after this list).
- You enable elastic training, and the worker group scales up or down.
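
The following sketch shows one way to configure worker fault tolerance so that a worker failure triggers a new attempt instead of failing the run; the max_failures value and run name are illustrative.

```python
from ray.train import FailureConfig, RunConfig

run_config = RunConfig(
    name="my-train-run",
    # Retry the run after worker failures; each recovery shows up as a
    # new attempt on the run page. The value 3 is illustrative.
    failure_config=FailureConfig(max_failures=3),
)
```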
The following table describes the fields displayed for each attempt:
| Field | Description |
| --- | --- |
| Attempt # | Attempt index within a run |
| Status | State of the attempt, for example, SUCCEEDED, ERRORED, or RUNNING |
| Status Details | Additional information about the attempt (for example, the error that caused the attempt to fail) |
Run attempts are listed at the bottom of the run page, with the latest attempt at the top of the list.
Inspect worker logs and metadata
Each attempt consists of a group of workers executing the user-defined distributed training code.
The following table describes the fields displayed for each worker:
| Field | Description |
| --- | --- |
| Actor ID | Unique Ray actor ID for the worker process |
| Status | Process status (ALIVE or DEAD) |
| World Rank | Index of the worker within the group |
| Local Rank | Index among workers on the same node (matches GPU ID if applicable) |
| Node Rank | Index of the node running the worker |
| PID | Process ID |
| Node IP | IP address of the node running the worker |
Navigate to the training worker view by clicking a worker link in a run attempt. From this view, you can access worker logs, metrics, and profiling tools.
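
As a reference for how these fields map to code, the following sketch reads the world, local, and node ranks from the train context inside the training function; the training function body is a placeholder.

```python
import ray.train

def train_func():
    ctx = ray.train.get_context()
    # These values correspond to the World Rank, Local Rank, and Node Rank
    # columns in the worker table.
    print(
        f"world_rank={ctx.get_world_rank()} "
        f"local_rank={ctx.get_local_rank()} "
        f"node_rank={ctx.get_node_rank()}"
    )
```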