Train dashboard profiling tools
On-demand GPU profiling
This feature requires Ray 2.47.0 or later and supports training runs on PyTorch 2.0 or later.
The Train dashboard allows you to take an on-demand GPU profile of a PyTorch training run to generate a trace that shows a timeline of CPU and GPU operations. Use this trace to diagnose training bottlenecks and gain a better understanding of the computation and communication operations that are happening under the hood.
The following is an example of a trace visualization generated from on-demand GPU profiling on Anyscale. This trace can help identify bottlenecks on both the CPU and GPU sides by showing a timeline of CPU operations and GPU kernels. The example trace shows a collective all-reduce operation which is part of the Distributed Data Parallel algorithm for distributed training.
Configure GPU profiling for Anyscale
This feature relies on Dynolog. Complete the following steps to set up dependencies:
- Anyscale base images include Dynolog binaries for all Ray versions 2.47.0 and above. If you are building your own image and not extending an Anyscale base image, install the Dynolog binaries on your container image. See the installation instructions on the Dynolog repo.
- Set the KINETO_USE_DAEMON and KINETO_DAEMON_INIT_DELAY_S environment variables on the training workers. The following example shows how to do this with Ray Train:
import ray.train
import ray.train.torch

trainer = ray.train.torch.TorchTrainer(
    ...,
    run_config=ray.train.RunConfig(
        ...,
        # Environment variables applied to every training worker.
        worker_runtime_env={
            "env_vars": {"KINETO_USE_DAEMON": "1", "KINETO_DAEMON_INIT_DELAY_S": "5"}
        },
    )
)
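To confirm that a worker actually received this configuration, a small check along the following lines can be called at the start of the training function. This is a minimal sketch, not part of Ray or Anyscale: the helper name `missing_profiling_prereqs` is hypothetical, and the assumption that the Dynolog daemon binary is named `dynolog` and sits on `PATH` may not hold for your image.

```python
import os
import shutil

# Environment variables named in the setup steps above.
REQUIRED_ENV_VARS = ("KINETO_USE_DAEMON", "KINETO_DAEMON_INIT_DELAY_S")


def missing_profiling_prereqs(env=None, binary="dynolog"):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    problems = [
        f"environment variable {name} is not set"
        for name in REQUIRED_ENV_VARS
        if not env.get(name)
    ]
    # Assumption: the Dynolog daemon binary is called "dynolog" and is on PATH.
    # Pass binary=None to skip this check.
    if binary and shutil.which(binary) is None:
        problems.append(f"binary {binary!r} not found on PATH")
    return problems
```

Calling `missing_profiling_prereqs()` inside the training function and logging the result shows quickly whether a worker is missing a variable before you attempt a profile.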
Collect a GPU profile
Complete the following steps to generate a profile for your GPU training worker.
- Navigate to an active Train run page.
- Click the GPU Profiling button on one of the workers.
- Enter the profiling duration in the configuration window that appears.
- Wait for profiling to finish. The profiling result is downloaded as a JSON file that you can view in chrome://tracing in a Chrome browser or in the Perfetto trace viewer.
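Besides viewing the trace in a browser, you can summarize the downloaded file with a short script. The following is a minimal sketch that assumes the file follows the Chrome Trace Event Format: either a JSON object with a `traceEvents` list or a bare list of events, where complete events have `"ph": "X"` and a `dur` field in microseconds. The function names are hypothetical, and the category label `"kernel"` for GPU kernels is an assumption you should check against your own trace.

```python
import json
from collections import defaultdict


def summarize_trace(trace, category=None, top=10):
    """Sum the durations (in microseconds) of complete events, grouped by name.

    `trace` is either a dict with a "traceEvents" list or a list of events.
    Returns up to `top` (name, total_duration) pairs, longest first.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        # Only complete events ("ph": "X") carry a duration.
        if event.get("ph") != "X":
            continue
        # Optionally keep a single category, for example GPU kernels.
        if category is not None and event.get("cat") != category:
            continue
        totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


def summarize_trace_file(path, **kwargs):
    """Load a downloaded trace file and summarize it."""
    with open(path) as f:
        return summarize_trace(json.load(f), **kwargs)
```

For example, `summarize_trace_file("trace.json", category="kernel")` would list the names with the largest total duration in that category, where `trace.json` stands in for the path of your downloaded file.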