Serving a Model with Tensor Parallelism
This example explores a slightly more complex serving use case in which a model is deployed with a configurable degree of tensor parallelism (meaning the model's individual tensors are sharded across multiple GPUs). It uses Ray Serve along with DeepSpeed and Hugging Face Transformers to deploy GPT-2 across multiple GPUs as an Anyscale service.
Install the Anyscale CLI
pip install -U anyscale
anyscale login
Deploy the service
Clone the example from GitHub.
git clone https://github.com/anyscale/examples.git
cd examples/serve_tensor_parallel
Deploy the service.
anyscale service deploy -f service.yaml
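The service.yaml file tells Anyscale what to deploy. As a rough, hedged sketch only (the field values below, including the application import path and working directory, are assumptions rather than the repository's actual contents), a minimal Anyscale service config has this shape:

name: tp-service            # matches the name used later with anyscale service terminate
working_dir: .              # upload the example directory with the service
applications:
  - import_path: main:app   # assumed module:binding for the Ray Serve application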
Understanding the example
- Each replica of the model is sharded across a number of InferenceWorker Ray actors. There are tensor_parallel_size (2 by default) of them per model replica. There is an additional coordinator actor called InferenceDeployment, which instantiates the InferenceWorker actors and queries them (see the sketch after this list).
- For each model replica, the InferenceWorker actors use DeepSpeed to communicate and perform inference.
- Ray uses a placement group to reserve colocated resources for all of the actors for a given model. For larger models that span multiple nodes, placement groups can also be used to reserve resources across multiple nodes.
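The following is a minimal, hedged sketch of that pattern, not the repository's actual code: the method names (for example InferenceWorker.generate), the placeholder generate body, and the request format are illustrative assumptions, and the real example initializes DeepSpeed inference over a Hugging Face GPT-2 model inside each worker.

import asyncio

import ray
from ray import serve
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

TENSOR_PARALLEL_SIZE = 2  # number of GPU shards per model replica


@ray.remote(num_gpus=1)
class InferenceWorker:
    """Holds one shard of the model; the real example runs DeepSpeed inference here."""

    def __init__(self, rank: int, world_size: int):
        # In the real example, each worker sets up its shard of GPT-2 with
        # DeepSpeed (e.g., deepspeed.init_inference over a Transformers model).
        self.rank = rank
        self.world_size = world_size

    def generate(self, prompt: str) -> str:
        # Placeholder for the sharded forward pass.
        return f"[rank {self.rank}] output for: {prompt}"


@serve.deployment
class InferenceDeployment:
    """Coordinator: creates the workers in a placement group and fans requests out."""

    def __init__(self):
        # Reserve colocated GPU bundles so all shards of one model replica are
        # scheduled together; larger models can spread bundles across nodes.
        pg = placement_group([{"GPU": 1}] * TENSOR_PARALLEL_SIZE, strategy="PACK")
        ray.get(pg.ready())
        self.workers = [
            InferenceWorker.options(
                scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
            ).remote(rank, TENSOR_PARALLEL_SIZE)
            for rank in range(TENSOR_PARALLEL_SIZE)
        ]

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        # Every shard participates in the forward pass; return rank 0's output.
        outputs = await asyncio.gather(*[w.generate.remote(prompt) for w in self.workers])
        return outputs[0]


app = InferenceDeployment.bind()

Using a PACK placement group keeps the shards of a replica on as few nodes as possible, which reduces cross-node communication during tensor-parallel inference.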
Query the service
The anyscale service deploy command outputs a line that looks like
curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>
From the output, you can extract the service token and base URL. Open query.py and add them to the appropriate fields.
token = "<SERVICE_TOKEN>"
base_url = "<BASE_URL>"
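For reference, the following is a hedged sketch of the kind of request such a script makes; the JSON payload (a "prompt" field) is an assumption, not taken from the repository.

import requests

token = "<SERVICE_TOKEN>"   # from the anyscale service deploy output
base_url = "<BASE_URL>"     # from the anyscale service deploy output

# Send a prompt to the service, authenticating with the bearer token.
resp = requests.post(
    base_url,
    headers={"Authorization": f"Bearer {token}"},
    json={"prompt": "Once upon a time"},
)
resp.raise_for_status()
print(resp.text)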
Query the model
python query.py
View the service in the services tab of the Anyscale console.
Shutdown
Shut down your Anyscale service:
anyscale service terminate -n tp-service