Deploy SGLang Multi-Node Inference

This example deploys SGLang for multi-node tensor-parallel inference using Ray on Anyscale.

Install the Anyscale CLI

pip install -U anyscale
anyscale login

Clone the example

git clone https://github.com/anyscale/examples.git
cd examples/sglang_inference

Batch inference

Run batch inference as an Anyscale job:

anyscale job submit -f job.yaml

Deploy as a service

Deploy as an HTTP endpoint with Ray Serve:

anyscale service deploy -f service.yaml

Wait for the service to be ready:

anyscale service wait --name sglang-inference --state RUNNING --timeout-s 900

The anyscale service deploy command outputs a line that looks like:

curl -H "Authorization: Bearer <SERVICE_TOKEN>" <SERVICE_URL>

Set the environment variables from this output and query the model:

export SERVICE_URL=<SERVICE_URL>
export SERVICE_TOKEN=<SERVICE_TOKEN>

pip install requests
python query.py
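
query.py handles the request for you, but a direct query looks roughly like the sketch below. It is illustrative only: it assumes the service exposes SGLang's OpenAI-compatible /v1/chat/completions route and uses the default MODEL_PATH; check query.py and serve.py in the repo for the actual route and payload.

import os
import requests

# Hypothetical direct query; assumes an OpenAI-compatible chat route.
url = os.environ["SERVICE_URL"].rstrip("/") + "/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['SERVICE_TOKEN']}"}
payload = {
    "model": "Qwen/Qwen3-1.7B",  # the default MODEL_PATH
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}

resp = requests.post(url, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])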

Shut down the service when you're done:

anyscale service terminate --name sglang-inference

Understanding the example

  • serve.py uses Ray Serve's placement_group_bundles to reserve GPUs across multiple nodes for tensor-parallel inference (see the first sketch after this list).
  • driver_offline.py wraps SGLang in a Ray actor for batch inference (see the second sketch after this list).
  • SGLang is imported inside the actor because it initializes CUDA and therefore can't be imported on CPU-only nodes.
  • The default configuration uses TP=4 and PP=2 across 2 nodes (8 GPUs per replica) on A10G GPUs. Other GPU types such as L4, L40S, A100, and H100 also work.
  • The service autoscales from 1 to 4 replicas based on queue depth. See AutoscalingConfig for tuning.
  • The Dockerfile installs the CUDA toolkit and SGLang dependencies on top of the Ray base image.
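
For context on the first bullet, reserving GPUs on several nodes with Ray Serve looks roughly like the following sketch. It is illustrative only: the SGLangDeployment class, the bundle shape, and the autoscaling numbers are stand-ins for what serve.py actually configures.

from ray import serve

TP_SIZE = 4                    # GPUs per pipeline stage
PP_SIZE = 2                    # pipeline stages
NUM_GPUS = TP_SIZE * PP_SIZE   # 8 GPUs per replica, spread across 2 nodes

@serve.deployment(
    # One bundle per GPU. Ray reserves all bundles before starting the
    # replica, so a single replica can span multiple nodes.
    placement_group_bundles=[{"CPU": 1, "GPU": 1} for _ in range(NUM_GPUS)],
    placement_group_strategy="PACK",
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class SGLangDeployment:
    def __init__(self):
        # SGLang is imported and launched here, inside the replica actor,
        # so CUDA is only ever initialized on GPU nodes.
        ...

app = SGLangDeployment.bind()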
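
For the second and third bullets, the batch-inference pattern in driver_offline.py boils down to something like this sketch; SGLangWorker, the engine arguments, and num_gpus are placeholders rather than the example's actual code.

import ray

@ray.remote(num_gpus=1)  # simplified; the real actor reserves the full TP group's GPUs
class SGLangWorker:
    def __init__(self, model_path: str):
        # Import inside the actor so CUDA is initialized on a GPU node,
        # never on the CPU-only driver.
        import sglang as sgl
        self.engine = sgl.Engine(model_path=model_path)

    def generate(self, prompts):
        return self.engine.generate(prompts)

worker = SGLangWorker.remote("Qwen/Qwen3-1.7B")
print(ray.get(worker.generate.remote(["Hello, world!"])))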

Environment variables

Override any variable at deploy or submit time with --env (see the example after the table):

Variable                Default           Description
MODEL_PATH              Qwen/Qwen3-1.7B   Hugging Face model ID
TP_SIZE                 4                 Tensor parallelism (GPUs per pipeline stage)
PP_SIZE                 2                 Pipeline parallelism (number of stages)
NUM_NODES_PER_REPLICA   2                 Nodes per replica
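
For example, switching to a larger model with a different parallelism layout might look like the following. This is an illustrative invocation: the repeated --env KEY=VALUE form and the values shown are placeholders, not copied from the example.

anyscale service deploy -f service.yaml \
  --env MODEL_PATH=Qwen/Qwen3-8B \
  --env TP_SIZE=8 \
  --env PP_SIZE=1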