Train dashboard

The Train dashboard streamlines the debugging of Ray Train workloads. This dashboard enables you to gain deeper insights into individual workers' progress, pinpoint stragglers, and identify bottlenecks for faster, more efficient training.

This dashboard should be the starting point for debugging any issue with your Train workload. It links to other pages of the Anyscale dashboard for more detailed information about the workload, such as logs, metrics, tasks, actors, or nodes.

Requirements

  • Data persistence to view runs from past Ray sessions requires Ray 2.44.0 or later.
  • For the best debugging experience, use Ray 2.51.0 or later, where Ray Train V2 is enabled by default and provides controller logs and structured worker logs in the dashboard. For earlier Ray versions, you must manually enable Ray Train V2.
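
The following is a minimal sketch of enabling Ray Train V2 on an older Ray version. It assumes the RAY_TRAIN_V2_ENABLED environment variable toggles the V2 code path on your Ray version and that you set it before Ray Train is imported.

```python
import os

# Assumption: on Ray versions before 2.51.0, this environment variable
# switches Ray Train to the V2 code path. Set it before importing ray.train.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

import ray.train  # noqa: E402  (imported after the toggle on purpose)
```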

Access the dashboard

To access Train workload dashboards, click the Workloads tab on the Jobs or Workspaces page, and then select the Train tab.

The Train dashboard presents information in a hierarchy, with groupings of sessions, runs, attempts, and workers.

Monitor a Train run

Each run corresponds to a single call to trainer.fit() and represents a unique training execution.
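
For example, the following minimal sketch produces a single run in the dashboard. The trainer class, training function, and run name are illustrative placeholders.

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Your distributed training code runs here on each worker.
    ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    # The name set here appears in the run's Name field.
    run_config=RunConfig(name="my-train-run"),
)

# This single fit() call appears as one run in the Train dashboard.
result = trainer.fit()
```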

The following table describes the fields displayed for each run:

Field | Description
ID | The unique ID of a Train run.
Name | The run name set using the ray.train.RunConfig(name) configuration.
Status | The current state of the Train run, for example, INITIALIZING, RUNNING, or ERRORED.
Status Details | Additional info, such as stack traces if the run failed.
Controller Info | Actor ID and logs for the controller managing the run.

The Train dashboard home page shows a list of runs from the current Ray session. Navigate to an individual run page by clicking on the row that you want to inspect.

[Image: Train Runs]

Debug Train runs with controller logs

Each run spawns a controller, which is a dedicated actor with the following responsibilities:

  • Spawns and monitors workers
  • Handles failure recovery
  • Logs global information, including failure handling and scheduling decisions, which you can view on the run page

Error attribution and debugging

When training jobs fail, the Train dashboard provides rich context to help you quickly diagnose and address the root cause. The dashboard automatically surfaces detailed information about failures, eliminating the need to manually piece together logs from different systems.

For each failure, use the dashboard to identify the following:

  • Specific workers affected: Identify which worker processes encountered errors
  • Error classification: Distinguish between application errors—bugs in your training code—and hardware issues, such as GPU failures or out-of-memory conditions
  • Detailed stack traces: View complete error stack traces directly in the run's Status Details field
  • Relevant node logs: Access system-level logs that show hardware-related issues such as GPU errors or memory problems
  • Historical context: When using fault tolerance or elastic training, view error details from previous attempts to understand patterns

The dashboard marks failed runs as ERRORED and provides status details explaining where the error originated. Individual workers also display their own status details, including relevant node logs when hardware failures occur.
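
As a minimal illustration, an unhandled exception raised in your training function surfaces as an application error. The function name below is a placeholder.

```python
def train_loop_per_worker(config):
    # Assumption: an unhandled exception in the training function is reported
    # as an application error; the dashboard marks the run ERRORED and shows
    # the traceback in the run's Status Details field.
    raise RuntimeError("simulated bug in training code")
```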

This comprehensive error context enables faster debugging by providing all relevant information in a single interface, reducing the time spent searching through scattered logs and metrics.

View the history of Train run attempts

Each run begins as a first attempt. The run moves on to a new attempt when the worker group restarts, for example when fault tolerance recovers from a worker or node failure, or when elastic training resizes the worker group.
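
For example, the following minimal sketch enables automatic retries; the run name and retry count are illustrative. The assumption is that each retry after a failure appears as a new attempt in the dashboard.

```python
from ray.train import FailureConfig, RunConfig

# Assumption: with max_failures set, Ray Train retries the run after worker
# failures, and each retry shows up as a new attempt in the dashboard.
run_config = RunConfig(
    name="my-train-run",
    failure_config=FailureConfig(max_failures=3),
)
```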

The following table describes the fields displayed for each attempt:

Field | Description
Attempt # | Attempt index within a run
Status | State of the attempt, for example, SUCCEEDED, ERRORED, or RUNNING
Status details | Additional information about the attempt (for example, the error that caused the attempt to fail)

The run page lists attempts at the bottom, with the most recent attempt at the top.

Inspect worker logs and metadata

Each attempt consists of a group of workers executing the user-defined distributed training code.

The following table describes the fields displayed for each worker:

Field | Description
Actor ID | Unique Ray actor ID for the worker process
Status | Process status (ALIVE, DEAD)
World Rank | Index of the worker within the group
Local Rank | Index among workers on the same node (matches GPU ID if applicable)
Node Rank | Index of the node running the worker
PID | Process ID
Node IP | IP address of the node running the worker
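
The rank fields above correspond to values your training code can read from the Ray Train context. The training function below is a minimal sketch.

```python
import ray.train


def train_loop_per_worker(config):
    ctx = ray.train.get_context()
    # These values match the World Rank, Local Rank, and Node Rank columns
    # shown for each worker in the dashboard.
    world_rank = ctx.get_world_rank()
    local_rank = ctx.get_local_rank()
    node_rank = ctx.get_node_rank()
    print(f"world={world_rank} local={local_rank} node={node_rank}")
```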

Navigate to the training worker view by clicking a worker link in a run attempt. From this view, you can access worker logs, metrics, and profiling tools:

[Image: Worker logs]