Fast model loading (alpha)

note

This is an alpha feature that is subject to change. Contact Anyscale support with any feedback.

You can use the Anyscale optimized streaming safetensors client with RayLLM to speed up model loading times. For more details on how fast model loading works, see the blog post.

Enabling fast model loading requires three steps:

  1. Downloading and preprocessing the weights for the model.
  2. Uploading the preprocessed model weights to cloud storage.
  3. Updating your RayLLM configuration to use the preprocessed model weights.

Follow along with these steps in an Anyscale workspace.

Downloading and preprocessing model weights

First, use the rayllm anytensor preprocess CLI to download model weights. This command downloads the weights from Hugging Face, preprocesses them into the inference format used by RayLLM, and then writes them to the specified output directory.

Important notes:

  • RayLLM writes the model weights to local disk. Use the local disk that Anyscale automatically mounts at /mnt/local_storage, and make sure it has enough free space; see the check after this list.
  • You must perform preprocessing separately for each tensor parallelism degree you plan to use. Specify the degree using the --tp-degree flag.
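
For example, to check the free space on the local disk before preprocessing:

df -h /mnt/local_storage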

Below is an example of preprocessing model weights for Mistral-7B-Instruct-v0.1 with tensor parallelism degree 1.
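
The exact command-line interface may vary across RayLLM versions; the sketch below assumes hypothetical --model-id and --output-dir flags (only --tp-degree is documented above), so run rayllm anytensor preprocess --help to confirm the flags for your version.

# Illustrative invocation; all flags except --tp-degree are assumptions.
rayllm anytensor preprocess \
  --model-id mistralai/Mistral-7B-Instruct-v0.1 \
  --tp-degree 1 \
  --output-dir /mnt/local_storage/Mistral-7B-Instruct-v0.1

A run like this produces output similar to the following: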

No TP rank provided, generating weights for all ranks of degree 1
Generating weights for TP rank 0 / 0
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 3.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.94it/s]
Writing weights for rank 0 to: /mnt/local_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors
Finished writing /mnt/local_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors, size: 13.49 GB

Uploading preprocessed model weights to cloud storage

Next, upload the preprocessed model weights to cloud storage. Anyscale manages an artifact storage bucket automatically for each cloud.

Below is an example of uploading the preprocessed weights for Mistral-7B-Instruct-v0.1 with tensor parallelism degree 1 to Anyscale-managed artifact storage.

(base) ray@ip-10-0-58-56:~/default$ aws s3 sync /mnt/local_storage/Mistral-7B-Instruct-v0.1/ $ANYSCALE_ARTIFACT_STORAGE/Mistral-7B-Instruct-v0.1/
upload: ../../../mnt/local_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors to s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/Mistral-7B-Instruct-v0.1/rank-00000.safetensors
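
The example above uses the AWS CLI. If your Anyscale cloud runs on GCP, $ANYSCALE_ARTIFACT_STORAGE is a gs:// URI instead, and a gsutil sketch of the same upload (assuming the same directory layout) would look like:

# Hypothetical GCP equivalent of the aws s3 sync command above.
gsutil -m rsync -r /mnt/local_storage/Mistral-7B-Instruct-v0.1/ $ANYSCALE_ARTIFACT_STORAGE/Mistral-7B-Instruct-v0.1/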

Updating the RayLLM config to use preprocessed model weights

Finally, update the RayLLM config to use preprocessed model weights for fast loading. If you don't already have a config file, you can generate one automatically using rayllm gen-config.

In the RayLLM config file, locate the model_loading_config section and add an anytensor_config entry like the example below. The model_path should be the cloud storage prefix that you uploaded the preprocessed model weights to. You can use the anyscale:// prefix as an alias for the cloud's artifact storage bucket.

model_loading_config:
  model_id: mistralai/Mistral-7B-Instruct-v0.1
  model_source: mistralai/Mistral-7B-Instruct-v0.1
  anytensor_config:
    # Use preprocessed model weights in the artifact storage bucket for fast loading.
    model_path: anyscale://Mistral-7B-Instruct-v0.1/

Finally, run the RayLLM config locally using serve run or deploy it as a service using anyscale service deploy.
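
For example, assuming illustrative file names for the configs that rayllm gen-config generated (check serve run --help and anyscale service deploy --help for the exact arguments):

# Run locally inside the workspace (file name is an assumption).
serve run serve_config.yaml

# Or deploy as an Anyscale service (file name and -f flag usage are assumptions).
anyscale service deploy -f service_config.yaml

You should then see log messages like the following, indicating that RayLLM is using fast model loading.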

(EndpointsVLLMEngine pid=14902) 2024-11-18 19:43:01,962 anytensor INFO - Got 1 file to download (13.49 GB total)
(EndpointsVLLMEngine pid=14902) 2024-11-18 19:43:09,974 anytensor INFO - Finished download in 7.99s (1.69 GB/s)

If you see an error message saying that the weights can't be found, check that the path provided in the config exists and that you uploaded weights for the configured tensor parallelism degree.
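
On an AWS-backed cloud, you can verify from the workspace that the expected files exist at the configured path (assuming the upload layout used above):

# Lists the preprocessed weights; expect one rank-*.safetensors file per TP rank.
aws s3 ls $ANYSCALE_ARTIFACT_STORAGE/Mistral-7B-Instruct-v0.1/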