Tuning configs for cost and performance
Making models train faster usually directly reduces cost. However, that's not the only objective you could choose to optimize. For example, you might not have access to high-end NVIDIA GPUs like A100s and H100s, but still want to fine-tune LLMs on cheaper GPUs like L4s or A10s. Within this constraint, you might want to:
- unblock yourself, even if it means paying more per FLOP, and
- optimize for the fastest training configuration while maintaining model quality.
This guide discusses the parameters that help you squeeze more FLOPS out of your GPUs and train models faster.
Overview
Choosing the correct type of GPUs and the appropriate combination of configurations isn't easy to prescribe. It usually requires running benchmarks yourself and being able to quickly understand the trade-offs between speed and memory. The internet has a large collection of tutorials that can give you a holistic view of how to think about the performance of training LLMs. The following two are very good resources:
- https://sumanthrh.com/post/distributed-and-efficient-finetuning/
- https://github.com/stas00/ml-engineering/blob/master/training/performance/README.md (more advanced)
When thinking about training models, two questions come up, in order of importance:
- What parameters should you set to make sure you don't run out of GPU memory (CUDA OOM) during training?
- What parameters result in higher throughput or lower cost?
To fine-tune a large model like a 70B, the first priority is fitting the model into GPU memory. Beyond that, the following factors affect cost and throughput:
- Cluster Shape: The instance type(s) (CPU RAM, the number and type of GPUs) and how many of them you want to use.
- Efficient Training Configurations: Knobs such as `batch_size`, `context_length`, and `gradient_accumulation_steps` that affect per-GPU performance.
- Distributed Training Configurations: How training state is split across your cluster, and thus overall performance.
Often, these different knobs are interdependent, and there is no one-size-fits-all configuration. We have a guide on choosing cluster shape and ZeRO stage that covers two of these in detail. In this guide, we'll focus on how you can tweak the parameters from all the buckets above in your `llmforge` fine-tuning config, and also how you can benchmark throughput yourself.
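To make these knobs concrete, here's a sketch of how the throughput-related fields discussed in this guide fit together in a single `llmforge` config. The values are illustrative placeholders, not recommendations; you'd merge them with the rest of your fine-tuning config (model, data, and the distributed training settings covered in the cluster shape and ZeRO guide):

```yaml
# Sketch of the throughput-related fields covered in this guide.
# Values are placeholders; merge them into your full fine-tuning config.
num_devices: 8                      # cluster shape: how many GPUs
worker_resources:
  accelerator_type:A100-80G: 0.001  # cluster shape: which GPU type
context_length: 4096                # efficient training: sequence length per sample
train_batch_size_per_device: 4      # efficient training: per-GPU batch size
gradient_accumulation_steps: 2      # efficient training: simulate a larger batch
gradient_checkpointing: True        # efficient training: trade compute for memory
liger_kernel: True                  # efficient training: optimized Triton kernels
```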
Configurations
This section covers a set of configurations for customizing and optimizing throughput. See the LLMForge reference page for more information. For details on choosing the ZeRO stage along with the cluster shape, see Choosing cluster shape and ZeRO stage.
Instance type
You can modify the cluster configuration by changing `worker_resources` and `trainer_resources`.

The following table shows how to specify different GPU types in the YAML:
| GPU type | Resource specification in YAML |
|---|---|
| A10 | `accelerator_type:A10G: 0.001` |
| L4 | `accelerator_type:L4: 0.001` |
| A100-40G* | `accelerator_type:A100-40G: 0.001` |
| A100-80G* | `accelerator_type:A100-80G: 0.001` |
| H100* | `accelerator_type:H100: 0.001` |

\* Subject to availability.
For example, if you want to train on A100-80G GPUs, you can specify the following in the YAML:

```yaml
num_devices: 8
trainer_resources: {}
worker_resources:
  accelerator_type:A100-80G: 0.001
```
Anyscale sets up all the default configs for A10G machines for better accessibility.

Additionally, you can specify the cluster shape. See Choosing cluster shape and ZeRO stage for details.
Difference between `worker_resources` and `trainer_resources`

In LoRA training, you can often ignore `trainer_resources` and just provide the GPU type required for training under `worker_resources`, similar to the example above.
However, full-parameter training on heterogeneous clusters (for example, training a 70B model on A10G GPUs) is a bit more involved. It's generally not recommended, but when you don't have access to A100- or H100-class machines, it's the only way to train very large models (fp16 and full-parameter).
Ray Train allows you to specify a different set of resources for rank-0 vs. the other ranks. Rank-0 is responsible for checkpointing and normally needs more CPU RAM than the other workers at checkpointing time, because of implementation details around weight aggregation. In a heterogeneous multi-node setting where small and large nodes have the same GPUs, this can cause a problem: it matters where rank-0 is placed, and the cluster isn't used symmetrically all the time.
A prime example is running fine-tuning on A10Gs. On AWS, A10Gs are available in instance types ranging from `g5.4xlarge`, with 1 GPU and a small amount of CPU RAM, all the way to `g5.48xlarge`, with 8 GPUs and a large amount of CPU RAM. When checkpointing a large model like a 70B, the CPU RAM on `g5.4xlarge` isn't sufficient, so you have to define a `memory` requirement for `trainer_resources` to ensure that rank-0 lands on the large instance:
```yaml
num_devices: 32
# Rank-0 (aka the trainer) should have 140 GB of memory.
# This memory is required for weight aggregation.
trainer_resources:
  memory: 150_323_855_360 # 140 GB memory
worker_resources:
  memory: 53_687_091_200 # 50 GB memory
  # The value of 0.001 isn't important, as long as it's between 0 and 1.
  # It ensures that the given accelerator is available for each worker.
  accelerator_type:A10G: 0.001
```
This config requests 32 A10G GPUs but doesn't specify the instance type, for example `g5.4xlarge` vs. `g5.48xlarge`. However, it specifies that at least 140 GB of memory must be available for rank-0, which forces Anyscale's instance manager to pick a `g5.48xlarge` for rank-0 and any other A10G instances for the remaining workers, avoiding CPU RAM OOM issues during checkpointing. This config is more flexible than forcing Anyscale to use 4x `g5.48xlarge` nodes, which may not be available on-demand and might be more expensive. The Anyscale autoscaler prioritizes instances with the highest number of GPUs to maximize locality. If there's insufficient capacity, it selects multiple instances in decreasing order of their GPU count.
Batch size and context length

Usually, for large models you want to saturate GPU memory by choosing the largest batch size that fits before you OOM. Because the sequence dimension is mostly flattened together with the batch dimension for linear-layer computation, the effective batch size for the batch matrix multiplication is `sequence_length x batch_size_per_device`. For example, a context length of 4096 with a per-device batch size of 4 gives the linear layers 16,384 token rows per device to work on. Thus, GPUs remain in their high-performance regime even if the per-device batch size seems small.

To increase the batch size beyond the capacity of your GPUs, you either need to use more instances or use gradient accumulation, which may be a more cost-effective option when the alternative is scaling from single-node to multi-node training.
If you don't specify `batch_size` or `context_length`, Anyscale automatically infers their values based on heuristics that may not be optimal for your use case. Anyscale chooses the `batch_size` from a lookup table of previous runs based on the `context_length`, and chooses the `context_length` to be the maximum of the model's default context length and the 95th percentile of the sequence lengths in the dataset.
To change these configs, set the following values in the config:

```yaml
context_length: <ctx_length>
train_batch_size_per_device: <bs>
# You can also change the validation batch size.
eval_batch_size_per_device: <bs>
```
Gradient checkpointing
Gradient (activation) checkpointing is a method for lowering memory usage at the cost of extra computation. It selectively keeps a subset of the activations from the forward pass, so the backward pass has to do extra computation to recompute the full set of activations. According to Hugging Face, a general rule of thumb is that gradient checkpointing slows down training by about 20%.

If additional GPU memory is available, disabling gradient checkpointing is a way to potentially increase throughput. By default, LLMForge keeps gradient checkpointing on, but you can configure this by setting `gradient_checkpointing` to `False` in the fine-tuning config:

```yaml
gradient_checkpointing: False
```
Gradient accumulation
Use gradient accumulation to effectively simulate larger batch sizes when you're memory constrained. A side effect of this approach is less frequent optimizer steps, which can lower communication costs when sharding optimizer states across ranks and possibly increase throughput. Configure the number of gradient accumulation steps by setting `gradient_accumulation_steps` in the fine-tuning config:

```yaml
gradient_accumulation_steps: 2
```
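For example, because the effective global batch size is `num_devices * train_batch_size_per_device * gradient_accumulation_steps`, the following hypothetical combination processes 8 x 4 x 2 = 64 sequences per optimizer step. This is a sketch with placeholder values, assuming a per-device batch size of 4 fits in memory for your model and context length:

```yaml
# Hypothetical sketch: effective global batch size =
#   num_devices * train_batch_size_per_device * gradient_accumulation_steps
#   = 8 * 4 * 2 = 64 sequences per optimizer step.
num_devices: 8
train_batch_size_per_device: 4
gradient_accumulation_steps: 2
```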
Liger-Kernel
Liger Kernel is a collection of Triton kernels for LLM training, optimized for speeding up and reducing the memory overhead of full fine-tuning workloads. See the full list of models that you can patch with Liger Kernel. Set the `liger_kernel` field to `True` in the config to use Liger Kernel in LLMForge.

```yaml
liger_kernel: True
```
The table below shows the performance gains obtained from using Liger Kernel for full parameter fine-tuning of a Llama-3-8B model on the GSM8K dataset with LLMForge.
How do you benchmark throughput yourself?
For benchmarking, you want to fix the context length to a constant value so that it doesn't vary across batches. To do this, set the padding strategy to `max_length`, because the default is `longest`. With `longest` padding, you may observe inconsistent errors while profiling because the context length of the batch varies from iteration to iteration. To set the padding strategy, make sure your config has the following:

```yaml
padding: "max_length"
```
Then run LLMForge fine-tuning, letting the job run for a few iterations to note the time different steps take. The logs include a table with the relevant metrics. You can also log these metrics automatically and view them externally by setting up integrations with MLflow and W&B.

```bash
llmforge anyscale finetune <path_to_config.yaml>
```
For GPU memory, check whether you hit any OOM errors during training. You can monitor GPU utilization in the Ray dashboard while the profiling run is in progress.
For speed, the important metrics during profiling are `fwd_time` and `bwd_time`. These metrics are sufficient for capturing the relative throughput improvements and understanding the trade-offs between different choices.
Model FLOPs utilization (MFU) is a metric that measures how efficiently training utilizes the hardware:

MFU = (approximate model FLOPs during fwd + bwd pass) / (total update time per step) / (hardware peak FLOPS)
Below is a rough breakdown of how to approximate each of these terms:

- Approximate model FLOPs during fwd + bwd pass = `2 * 3 * ctx_len * bs_per_device * num_model_params`. The factor of 2 converts MACs to FLOPs, and the factor of 3 accounts for the backward pass taking roughly 2x the FLOPs of the forward pass.
- Total update time per step ≈ `fwd_time + bwd_time`. This calculation should include data ingestion time as well, but you can assume that time isn't significant because Ray Data overlaps ingestion with compute.
- Hardware peak FLOPS = the per-device FLOPS capacity from the GPU spec for the data type that the tensor cores use. For example, for A100s doing bf16 training, this is 312 TFLOPS.
You can use this methodology to compute MFUs and compare different configurations with each other.
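As a sanity check, here's a worked example using the Llama-2-7B row in the table below (context length 512, 101.47 M tokens/hr on 8xA100-80G GPUs), taking the parameter count as roughly 7B. Because `ctx_len * bs_per_device` divided by the step time is just the per-GPU token throughput, you can compute MFU directly from the measured throughput:

$$
\text{tokens/s per GPU} = \frac{101.47 \times 10^6}{8 \times 3600} \approx 3.5 \times 10^3
$$

$$
\text{MFU} \approx \frac{2 \times 3 \times N_{\text{params}} \times \text{tokens/s per GPU}}{\text{peak FLOPS}} = \frac{6 \times 7 \times 10^9 \times 3.5 \times 10^3}{312 \times 10^{12}} \approx 0.47
$$

This matches the MFU reported in the table.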
For more concrete comparisons, below are some relevant performance numbers across different models, measured as of January 2024.
1xP4DE.24xlarge (LoRA)
The numbers below were measured on one AWS p4de.24xlarge node with 8xA100-80G GPUs for LoRA fine-tuning. The effective cost is computed from the hourly on-demand price on AWS ($40.77/hr). You're still charged for the instance hours used, but this calculation gives a good basis for comparison with token-based pricing. This cost doesn't account for startup time, checkpointing, and so on.
Llama-2-7B
| Context length | Batch size per device | Token throughput (M tokens/hr) | MFU | Effective cost ($/M tokens) |
|---|---|---|---|---|
| 512 | 64 | 101.47 | 0.47 | 0.74 |
| 1024 | 32 | 103.7 | 0.48 | 0.72 |
| 2048 | 8 | 99.75 | 0.47 | 0.71 |
| 4096 | 4 | 102.58 | 0.48 | 0.69 |
Llama-2-70B
| Context length | Batch size per device | Token throughput (M tokens/hr) | MFU | Effective cost ($/M tokens) |
|---|---|---|---|---|
| 512 | 16 | 11.23 | 0.53 | 3.05 |
| 1024 | 8 | 9.65 | 0.45 | 4.77 |
| 2048 | 4 | 8.58 | 0.40 | 4.25 |
| 4096 | 2 | 13.40 | 0.63 | 3.65 |
Mixtral-8x7B
| Context length | Batch size per device | Token throughput (M tokens/hr) | MFU | Effective cost ($/M tokens) |
|---|---|---|---|---|
| 512 | 64 | 59.73 | 0.56 | 0.41 |
| 1024 | 32 | 57.20 | 0.53 | 0.40 |
| 2048 | 16 | 56.85 | 0.53 | 0.40 |
| 4096 | 8 | 55.84 | 0.52 | 0.40 |