
Tuning Configs for Cost or Performance

Making models train faster directly reduces cost, but speed is not the only objective you might optimize for. You may not have access to the latest and greatest NVIDIA GPUs and still want to fine-tune LLMs on cheaper GPUs such as L4s or A10s. Within that constraint, you want to 1) unblock yourself (even if it means paying more per FLOP) and 2) find the best configuration while maintaining model quality. In this guide, we discuss the exposed parameters that can help you squeeze more FLOPs out of your GPUs and train models faster.

Overview

Picking the right GPU type and the right combination of configuration values is hard to prescribe. It usually comes down to running benchmarks yourself and quickly understanding the trade-offs between speed and memory. There is a large collection of tutorials on the internet that give a holistic view of how to think about the performance of training LLMs.

When thinking about training models, two questions come up, in order of importance:

  • What parameters should I set to make sure I do not run out of GPU Memory (hit CUDA OOM) during training?
  • What parameters would give me a better throughput and therefore reduce my cost?

For a large model (say a 70B model), the first priority is fitting the model at all. If you cannot train the model even with batch size = 1, tuning the batch size for throughput is moot; you first need settings that reduce memory usage.

How do you benchmark throughput yourself?

For benchmarking, fix the context length to a constant value so that it does not vary across batches. To do this, set the padding strategy to max_length (the default is longest). With longest, the context length of each batch can change at every iteration, so you may survive a few iterations and only hit an out-of-memory error later, which is easy to miss during an initial profile. To remove this confounding factor, make sure your config YAML contains the following:

padding: "max_length"

Then run the LLMForge fine-tuning job for a few iterations and note the time the different steps take. The logs print a table with the metrics you need, and you can also obtain them from WANDB.

llmforge anyscale finetune <path_to_config.yaml>

For GPU memory, you mainly want to check whether you hit any OOM errors during training. You can monitor GPU utilization in the Ray Dashboard while the profiling run is in progress.

A screenshot of the Ray Dashboard showing the Metrics tab, with node GPU and memory usage graphs over time.

For speed, the important metrics during profiling are fwd_time and bwd_time. They are sufficient to capture relative throughput improvements and to understand the trade-offs between different choices.

A screenshot of a table of training results, with the bwd_time and fwd_time parameters highlighted.

Model FLOPs Utilization (MFU) is commonly calculated as a measure of how efficiently the hardware is utilized during training:

MFU = (Approximate model flops during fwd + bwd pass) / (total update time per step) / (Hardware TFLOPS)

Here is a rough breakdown of how to plug in each term:

Approximate model flops during fwd + bwd pass = 2 * 3 * ctx_len * bs_per_device * num_model_params

The factor 2 converts MACs to FLOPs, and the factor 3 accounts for the backward pass taking roughly 2x the FLOPs of the forward pass (1x forward + 2x backward).

total update time per step ~ fwd_time + bwd_time

In reality this should also include data ingestion time, but we assume it is negligible because Ray Data overlaps ingestion with compute.

Hardware TFLOPS = the per-device peak FLOPs from the GPU spec, for the data type and tensor cores being used

For example, for A100s doing bf16 training this is 312 TFLOPS.

You can use this methodology to compute MFUs and compare different configurations with each other.
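As a concrete illustration, here is a minimal sketch of this calculation in Python. All the numbers are placeholders (assumed values roughly in line with the Llama-2-7B LoRA numbers below); substitute the fwd_time and bwd_time from your own logs and your own model and hardware figures.

# Rough MFU and throughput estimate from profiled step times.
# All values below are placeholders; replace them with your own numbers.
ctx_len = 4096            # context length (with padding: "max_length")
bs_per_device = 4         # micro batch size per GPU
num_devices = 8           # e.g. 8xA100-80G
num_model_params = 7e9    # e.g. Llama-2-7B
fwd_time = 1.6            # seconds per step, from the training logs
bwd_time = 3.3            # seconds per step, from the training logs
hardware_flops = 312e12   # A100 bf16 tensor-core peak (FLOPs/s)

# 2 converts MACs to FLOPs; 3 approximates fwd + bwd (~2x fwd) passes.
model_flops = 2 * 3 * ctx_len * bs_per_device * num_model_params
step_time = fwd_time + bwd_time

mfu = model_flops / step_time / hardware_flops
tokens_per_hour = ctx_len * bs_per_device * num_devices * 3600 / step_time

print(f"MFU ~= {mfu:.2f}")                                 # ~0.45
print(f"Throughput ~= {tokens_per_hour / 1e6:.0f} MT/hr")  # ~96 MT/hr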

Here are some performance numbers we measured (in January 2024).

1xP4DE.24xlarge (LoRA)

The numbers below were measured on a 1xP4DE node with 8xA100-80G GPUs for LoRA fine-tuning. The effective cost computed here is based on the AWS on-demand hourly rate ($40.77/hr). You are still charged per instance-hour, but this gives a useful basis for comparison with token-based pricing. This cost does not account for startup time, checkpointing, etc.

Llama-2-7B

| Context Length | Batch size per device | Token Throughput -- TT (MT/hr) | MFU | Effective Cost ($/MT) |
| --- | --- | --- | --- | --- |
| 512 | 64 | 101.47 | 0.47 | 0.74 |
| 1024 | 32 | 103.7 | 0.48 | 0.72 |
| 2048 | 8 | 99.75 | 0.47 | 0.71 |
| 4096 | 4 | 102.58 | 0.48 | 0.69 |

Llama-2-70B

| Context Length | Batch size per device | Token Throughput -- TT (MT/hr) | MFU | Effective Cost ($/MT) |
| --- | --- | --- | --- | --- |
| 512 | 16 | 11.23 | 0.53 | 3.05 |
| 1024 | 8 | 9.65 | 0.45 | 4.77 |
| 2048 | 4 | 8.58 | 0.40 | 4.25 |
| 4096 | 2 | 13.40 | 0.63 | 3.65 |

Mixtral-8x7B

| Context Length | Batch size per device | Token Throughput -- TT (MT/hr) | MFU | Effective Cost ($/MT) |
| --- | --- | --- | --- | --- |
| 512 | 64 | 59.73 | 0.56 | 0.41 |
| 1024 | 32 | 57.20 | 0.53 | 0.40 |
| 2048 | 16 | 56.85 | 0.53 | 0.40 |
| 4096 | 8 | 55.84 | 0.52 | 0.40 |

Configurations

This section covers a set of configurations that allow more customization for throughput optimization. Visit the docs for more info.

Instance type

You can modify the cluster configuration by changing worker_resources and trainer_resources.

The following is how you specify different GPU types in the YAML:

| GPU type | Resource specification in YAML |
| --- | --- |
| A10 | accelerator_type:A10G: 0.001 |
| L4 | accelerator_type:L4: 0.001 |
| A100-40G* | accelerator_type:A100-40G: 0.001 |
| A100-80G* | accelerator_type:A100-80G: 0.001 |
| H100* | accelerator_type:H100: 0.001 |

* subject to availability.

For example, if you want to train on A100-80G GPUs, you can specify the following in the YAML:

num_devices: 8
trainer_resources: {}
worker_resources:
  accelerator_type:A100-80G: 0.001

All of our default configs are set up for A10G machines because of their better availability.

Difference between worker_resources and trainer_resources

For LoRA training you can often ignore trainer_resources and just provide the GPU type required for training under worker_resources, similar to the example above.

However, full-parameter training on heterogeneous clusters (for example, training a 70B model on A10G GPUs) is a bit more convoluted. This is generally not recommended, but when you do not have access to A100- or H100-class machines, it is the only way to train very large models (fp16 and full-parameter).

Ray Train lets you specify a different set of resources for rank 0 than for the other ranks. Rank 0 is responsible for checkpointing and normally needs more CPU RAM than the other workers at checkpointing time, because of implementation details around weight aggregation. In a heterogeneous multi-node setting where you have both small and large nodes with the same GPUs, this can cause problems: it matters where rank 0 is scheduled, and the cluster is not used symmetrically all the time.

A prime example is fine-tuning on A10Gs. On AWS, A10Gs are available in instances ranging from g5.4xlarge, with 1 GPU and little CPU RAM, all the way to g5.48xlarge, with 8 GPUs and much more RAM. When checkpointing a large model like a 70B, the CPU RAM on g5.4xlarge is not sufficient, so we have to define a memory requirement under trainer_resources to ensure that the large instance gets picked for rank 0:

num_devices: 32
# Rank-0 (aka trainer) should have 140 GB memory
# This memory is required for weight aggregation.
trainer_resources:
  memory: 150_323_855_360 # 140 GB memory

# Accelerator type, the value of 0.001 is not important, as long as it is
# between 0 and 1. This ensures that the given accelerator is available for each trainer
# worker.
worker_resources:
  memory: 53_687_091_200 # 50 GB memory
  accelerator_type:A10G: 0.001

This config asks for 32xA10G GPUs but does not specify the instance type (for example, g5.4xlarge vs. g5.48xlarge). However, it specifies that rank 0 must have at least 140 GB of memory available, which forces Anyscale's instance manager to pick a g5.48xlarge for rank 0 and any other A10G instances for the other workers, avoiding CPU RAM OOM issues during checkpointing. This is much better than being forced to use 4xg5.48xlarge nodes, which may not be available on-demand and might be more expensive too. The Anyscale autoscaler prioritizes instances with the highest number of GPUs to maximize locality. If there is insufficient capacity, it proceeds to select multiple instances in decreasing order of their GPU count.

Batch size and Context length

Usually, for large models we want to saturate GPU memory by choosing the largest batch size that does not OOM. If a micro batch size of 4 runs out of memory but 3 fits, use 3 rather than dropping further to 2. Because the sequence-length dimension is flattened into the batch dimension for the linear-layer computations, the effective batch size for those matrix multiplications is sequence_length x batch_size_per_device, so the GPUs remain in their high-performance regime even if the per-device batch size seems small, as the sketch below illustrates.
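As a toy illustration (the numbers here are made up and not tied to any particular model), the row count that each linear layer sees per micro batch is the product of the two dimensions:

# The tokens of every sequence in a micro batch are flattened into the row
# dimension of the linear-layer matmuls, so even a small per-device batch size
# produces large matrix multiplications. Illustrative numbers only.
ctx_len = 4096
bs_per_device = 2
hidden_size = 8192   # hidden dimension of a hypothetical large model

rows_per_matmul = ctx_len * bs_per_device
print(f"Each linear layer multiplies a {rows_per_matmul} x {hidden_size} "
      f"activation matrix per micro batch.")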

To increase the batch size beyond the capacity of the GPUs, you either need to use more instances or use gradient accumulation, which may be the better option when the alternative is going from a single node to multiple nodes.

If you don't specify the batch size or context length, Anyscale infers them automatically using heuristics that may not be optimal for your use case. Anyscale chooses the batch size from a look-up table of previous runs, based on the context length. Anyscale chooses the context length as the maximum of the model's default context length and the 95th percentile of the sequence lengths in the dataset, as sketched below.
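A minimal sketch of that context-length heuristic (the dataset and values are made up, and the actual implementation may differ in details such as tokenization and rounding):

# Pick the context length as the maximum of the model's default context length
# and the 95th percentile of the dataset's sequence lengths (in tokens).
# Toy values for illustration only.
import numpy as np

model_default_context_length = 4096
sequence_lengths = np.array([310, 512, 742, 905, 1204, 1833, 2748, 3970])

p95 = int(np.percentile(sequence_lengths, 95))
context_length = max(model_default_context_length, p95)
print(context_length)  # 4096 for this toy dataset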

To change these configs you can simply do:

context_length: <ctx_length>
train_batch_size_per_device: <bs>
# You can also change the validation batch size
eval_batch_size_per_device: <bs>

DeepSpeed configs

By default, LLMForge uses ZeRO-DP stage 3 with parameter and optimizer-state offloading, which is the safest setting in terms of memory consumption, but not the fastest. You can find other typical DeepSpeed configurations under deepspeed_configs. Find the complete documentation of all the configuration options on the DeepSpeed doc page.

You can try deactivating parameter offloading for smaller models to speed training up, if you still have memory headroom or your context length is small enough to leave some room. You can also change the ZeRO-DP stage to see how it affects training speed for your use case. In the YAML, all you have to do is change something like:

deepspeed:
  config_path: deepspeed_configs/zero_3.json

The ZeRO-DP stages differ in how much of the model state they shard across GPUs. Stage 0 is the baseline without any memory optimizations: all optimizer states, gradients, and parameters are fully replicated on every GPU. Higher stages progressively partition the optimizer states (stage 1), the gradients (stage 2), and the parameters themselves (stage 3) across the data-parallel workers, trading communication for memory.
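To get a feel for why the stage matters, here is a rough back-of-the-envelope sketch of the model-state memory per GPU at each stage, following the analysis in the ZeRO paper (roughly 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and about 12 bytes of Adam optimizer state per parameter). The numbers are assumptions for illustration and exclude activations, buffers, and fragmentation, so treat them only as a rough guide:

# Approximate model-state memory per GPU for each ZeRO-DP stage.
# Illustrative only: a 70B-parameter model trained on 32 GPUs.
num_params = 70e9
num_gpus = 32
P, G, O = 2, 2, 12  # bytes per parameter: fp16 params, fp16 grads, optimizer states

bytes_per_param = {
    0: P + G + O,               # stage 0: everything replicated
    1: P + G + O / num_gpus,    # stage 1: optimizer states sharded
    2: P + (G + O) / num_gpus,  # stage 2: gradients sharded too
    3: (P + G + O) / num_gpus,  # stage 3: parameters sharded as well
}
for stage, bpp in bytes_per_param.items():
    gib = num_params * bpp / 1024**3
    print(f"ZeRO stage {stage}: ~{gib:,.0f} GiB of model state per GPU")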

Activation checkpointing

Activation checkpointing decreases the memory needed for backpropagation by not saving the activations at every layer and instead recomputing them during the backward pass. Recomputation is partially overlapped with the backward pass to minimize the overhead, but you are still spending extra FLOPs, so it has some negative impact on throughput.

So if you have GPU memory to spare, one way to increase throughput is to turn activation checkpointing off. That said, the memory impact is usually so large that you will often see OOM errors after turning it off. To turn it off, update the YAML:

no_gradient_checkpoint: True

Gradient accumulation

Gradient accumulation is another way to increase the effective batch size without using more instances. It is useful when increasing the batch size directly would require going multi-node and the inter-node interconnect is slow; in these cases, increasing the gradient accumulation steps often costs less throughput than going multi-node. To do this, configure the gradient_accumulation_steps parameter in the YAML:

gradient_accumulation_steps: 2
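The effective (global) batch size grows with the accumulation steps. A minimal sketch of the arithmetic, with made-up numbers:

# Gradients are summed over `gradient_accumulation_steps` micro batches before
# each optimizer step, so the optimizer sees a larger effective batch.
# Illustrative numbers only.
train_batch_size_per_device = 8
num_devices = 16
gradient_accumulation_steps = 2

effective_batch_size = (
    train_batch_size_per_device * num_devices * gradient_accumulation_steps
)
print(effective_batch_size)  # 256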