Train Dashboard
The Train dashboard streamlines the debugging of Ray Train workloads. This dashboard enables you to gain deeper insights into individual workers' progress, pinpoint stragglers, and identify bottlenecks for faster, more efficient training.
This dashboard should be the starting point for debugging any issue with your Train workload. It links to other pages of the Anyscale dashboard for more detailed information about the workload, such as logs, metrics, tasks, actors, or nodes.
Accessing the Train Dashboard
To access Train workload dashboards, click the Workloads tab on the Jobs or Workspaces page. Then, select the Train tab.
Train dashboard overview
The Train Dashboard provides a high-level overview of the training workload and its progress. It starts with a list of Train runs, each of which can have multiple attempts. Each Train attempt can have multiple Train workers, which run your training code. A new attempt is created whenever the Train run retries after a failure or scales up or down during elastic training.
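The run, attempt, and worker hierarchy described above can be pictured as a small data model. The following is only an illustrative sketch: the class names `TrainRun`, `TrainAttempt`, and `TrainWorker`, their fields, and the `start_attempt` helper are hypothetical and are not part of the Ray or Anyscale API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrainWorker:
    # Hypothetical: position of the worker within its attempt.
    rank: int


@dataclass
class TrainAttempt:
    # Hypothetical: 1-based index of the attempt within its run.
    attempt_number: int
    # Hypothetical: free-form note on why the attempt was created.
    reason: str
    workers: List[TrainWorker] = field(default_factory=list)


@dataclass
class TrainRun:
    name: str
    attempts: List[TrainAttempt] = field(default_factory=list)

    def start_attempt(self, num_workers: int, reason: str) -> TrainAttempt:
        """Append a new attempt with a fresh set of workers."""
        attempt = TrainAttempt(
            attempt_number=len(self.attempts) + 1,
            reason=reason,
            workers=[TrainWorker(rank=r) for r in range(num_workers)],
        )
        self.attempts.append(attempt)
        return attempt


# One run that retries once after a failure and then scales up.
run = TrainRun(name="example-run")
run.start_attempt(num_workers=4, reason="initial")
run.start_attempt(num_workers=4, reason="retry after failure")
run.start_attempt(num_workers=8, reason="elastic scale-up")
```

In this sketch, a single run ends up with three attempts, which mirrors how the dashboard lists attempts under each run and workers under each attempt.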
Compatibility
The Train Dashboard requires Ray 2.30.0 or later. Data persistence requires Ray 2.44.0 or later. Ray Train V2 is not required, but it provides a better debugging experience because it supports controller logs and structured worker logs.
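As a quick way to see which of these features a given Ray version qualifies for, the version thresholds above can be checked in a few lines of Python. This is a minimal sketch: the helper names `parse_version` and `train_dashboard_support` are hypothetical, and the check only compares version numbers. It does not query Anyscale or confirm that the dashboard is actually enabled.

```python
import re
from typing import Dict, Tuple

# Minimum Ray versions stated in the compatibility notes above.
DASHBOARD_MIN = (2, 30, 0)
PERSISTENCE_MIN = (2, 44, 0)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract the leading major.minor.patch numbers, ignoring suffixes such as 'rc1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def train_dashboard_support(version: str) -> Dict[str, bool]:
    """Report which Train Dashboard features the given Ray version meets the minimum for."""
    parsed = parse_version(version)
    return {
        "dashboard": parsed >= DASHBOARD_MIN,
        "data_persistence": parsed >= PERSISTENCE_MIN,
    }


# Usage with an installed Ray (assumes `ray` is importable):
#   import ray
#   print(train_dashboard_support(ray.__version__))
```

Comparing parsed integer tuples rather than raw strings avoids ordering mistakes such as treating "2.9.3" as newer than "2.30.0".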