Fast model loading (alpha)
This is an alpha feature that is subject to change. Contact Anyscale support with any feedback.
You can use the Anyscale optimized streaming safetensors client with RayLLM to speed up model loading times. For more details on how fast model loading works, see the blog post.
Enabling fast model loading requires three steps:
- Downloading and preprocessing the weights for the model.
- Uploading the preprocessed model weights to cloud storage.
- Updating your RayLLM configuration to use the preprocessed model weights.
Follow along with these steps in an Anyscale workspace.
Downloading and preprocessing model weights
First, use the `rayllm anytensor preprocess` CLI to download model weights.
This command downloads the weights from Hugging Face, preprocesses them into the inference format used by RayLLM, and then writes them to the specified output directory.
Important notes:
- RayLLM writes the model weights to local disk. Use the local disk that Anyscale automatically mounts at `/mnt/local_storage`, and make sure you have enough disk space.
- You must perform preprocessing separately for each tensor parallelism degree you plan to use. Specify the degree using the `--tp-degree` flag.
Below is an example of preprocessing model weights for Mistral-7B-Instruct-v0.1 with tensor parallelism degree 1.
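A minimal sketch of the command follows. Only the `--tp-degree` flag is documented above; the `--model-id` and `--output-dir` flag names are assumptions, so run `rayllm anytensor preprocess --help` to confirm the exact interface.

```bash
# Download Mistral-7B-Instruct-v0.1 from Hugging Face, preprocess it into
# RayLLM's inference format for tensor parallelism degree 1, and write it to
# local disk. The --model-id and --output-dir flag names are illustrative
# assumptions; only --tp-degree is documented above.
rayllm anytensor preprocess \
  --model-id mistralai/Mistral-7B-Instruct-v0.1 \
  --tp-degree 1 \
  --output-dir /mnt/local_storage/Mistral-7B-Instruct-v0.1
```

The command prints progress like the following: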
No TP rank provided, generating weights for all ranks of degree 1
Generating weights for TP rank 0 / 0
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 3.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.94it/s]
Writing weights for rank 0 to: /mnt/local_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors
Finished writing /mnt/local_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors, size: 13.49 GB
Uploading preprocessed model weights to cloud storage
Next, upload the preprocessed model weights to cloud storage. Anyscale automatically manages an artifact storage bucket for each cloud, which is available in workspaces through the `$ANYSCALE_ARTIFACT_STORAGE` environment variable.
Below is an example of uploading the preprocessed weights for Mistral-7B-Instruct-v0.1 with tensor parallelism degree 1 to Anyscale-managed artifact storage.
(base) ray@ip-10-0-58-56:~/default$ aws s3 sync /mnt/local_storage/Mistral-7B-Instruct-v0.1/ $ANYSCALE_ARTIFACT_STORAGE/Mistral-7B-Instruct-v0.1/
upload: ../../../mnt/local_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors to s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors
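The example above targets an AWS-hosted cloud. On a GCP-hosted cloud, `$ANYSCALE_ARTIFACT_STORAGE` points at a `gs://` bucket instead, so a `gcloud storage` sketch of the same upload might look like the following (an assumption; adjust for the tools available in your workspace image).

```bash
# GCP equivalent of the aws s3 sync above. Assumes gcloud is installed and
# $ANYSCALE_ARTIFACT_STORAGE resolves to a gs:// URI on GCP clouds.
gcloud storage rsync --recursive \
  /mnt/local_storage/Mistral-7B-Instruct-v0.1/ \
  $ANYSCALE_ARTIFACT_STORAGE/Mistral-7B-Instruct-v0.1/
```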
Updating the RayLLM config to use preprocessed model weights
Finally, update the RayLLM config to use preprocessed model weights for fast loading.
If you don't already have a config file, you can generate one automatically using `rayllm gen-config`.
In the RayLLM config file, locate the `model_loading_config` section and add an `anytensor_config` entry like the example below.
The `model_path` should be the cloud storage prefix that you uploaded the preprocessed model weights to.
You can use the `anyscale://` prefix as an alias for the cloud's artifact storage bucket.
model_loading_config:
  model_id: mistralai/Mistral-7B-Instruct-v0.1
  model_source: mistralai/Mistral-7B-Instruct-v0.1
  anytensor_config:
    # Use preprocessed model weights in the artifact storage bucket for fast loading.
    model_path: anyscale://Mistral-7B-Instruct-v0.1/
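Because `anyscale://` is only an alias, you can presumably also point `model_path` at the full bucket URI shown in the upload output. A sketch, assuming `model_path` accepts plain cloud storage URIs:

```yaml
anytensor_config:
  # Full-URI form of the same path; the bucket is the example cloud's
  # artifact storage from the upload step above.
  model_path: s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/Mistral-7B-Instruct-v0.1/
```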
Now, run the RayLLM config locally using `serve run` or deploy it as a service using `anyscale service deploy`.
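For example (the config filenames below are illustrative; use the files generated by `rayllm gen-config`, and note that CLI flags can vary across versions):

```bash
# Run the config locally inside the workspace.
serve run serve_config.yaml

# Or deploy it as a long-running Anyscale service.
anyscale service deploy -f service_config.yaml
```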
You should see log messages like the following, indicating that RayLLM is using fast model loading:
(EndpointsVLLMEngine pid=14902) 2024-11-18 19:43:01,962 anytensor INFO - Got 1 file to download (13.49 GB total)
(EndpointsVLLMEngine pid=14902) 2024-11-18 19:43:09,974 anytensor INFO - Finished download in 7.99s (1.69 GB/s)
If you see an error message saying that the weights can't be found, check that the path provided in the config exists and that you have uploaded weights for the configured tensor parallelism degree.
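As a quick sanity check, list the uploaded prefix and confirm there is one `rank-*.safetensors` file per rank of your configured tensor parallelism degree (for degree 1, a single `rank-00000.safetensors`):

```bash
# Expect one rank file per TP rank at the configured model_path prefix.
aws s3 ls $ANYSCALE_ARTIFACT_STORAGE/Mistral-7B-Instruct-v0.1/
```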