Ray Serve and RayTurbo Serve
Ray Serve is an open source library for building scalable online inference APIs (a minimal deployment sketch follows the list below).
- Framework-agnostic: Works with PyTorch, TensorFlow, Keras, Scikit-Learn models, or custom Python logic
- Specializes in model composition and multi-model serving in a single service
- LLM-optimized: Streaming responses, dynamic request batching, multi-node/GPU deployment
- Built on Ray Core for distributed scaling across machines
- Provides flexible scheduling support and performance optimizations
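For orientation, a minimal Ray Serve deployment looks like the sketch below. The Summarizer class, its stand-in model, and the replica count are illustrative placeholders rather than anything from the RayTurbo docs.

from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Summarizer:
    def __init__(self):
        # Load a framework-specific model here (PyTorch, TensorFlow, scikit-learn, ...).
        # A trivial stand-in keeps the sketch self-contained.
        self.model = lambda text: text[:100]

    async def __call__(self, request: Request) -> str:
        text = await request.json()
        return self.model(text)

app = Summarizer.bind()
# Run locally with `serve.run(app)` or from the CLI with `serve run module:app`.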
RayTurbo Serve improves on open source Ray Serve with better production readiness and developer experience, plus performance optimizations for large-scale workloads and cost savings through replica compaction and spot instance support. Key capabilities include:
- Fast autoscaling and model loading
- High QPS serving
- Replica compaction
- Zero-downtime incremental rollouts
- Observability
- Multi-AZ services
- Containerized runtime environments
Fast autoscaling and model loading
RayTurbo Serve's fast model loading and startup-time optimizations speed up both autoscaling and cluster startup. In certain experiments, end-to-end scaling time for Llama-3-70B was 5.1x faster on Anyscale than with open source Ray.
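RayTurbo's fast autoscaling builds on the same per-deployment autoscaling_config used in open source Ray Serve. A minimal sketch, assuming a recent Ray release; the deployment name and thresholds are illustrative:

from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        # Scale up when the average number of in-flight requests per replica exceeds this target.
        # Older Ray releases call this field target_num_ongoing_requests_per_replica.
        "target_ongoing_requests": 4,
    },
)
class LlamaDeployment:
    async def __call__(self, prompt: str) -> str:
        ...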
High QPS serving
RayTurbo provides an optimized version of Ray Serve that achieves up to 54% higher QPS and up to 3x the streaming tokens per second for high-traffic serving use cases.
Replica compaction
RayTurbo Serve migrates replicas into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization. Replica compaction is enabled by default. Learn more in this blog.
Zero-downtime incremental rollouts
RayTurbo allows you to perform incremental rollouts and canary upgrades for robust production service management. Unlike KubeRay and open source Ray Serve, RayTurbo performs upgrades with rollback procedures without requiring 2x the hardware capacity.
Observability
RayTurbo provides custom metric dashboards, log search, tracing, and alerting for comprehensive observability into your production services. It can also export logs, metrics, and traces to external observability tools such as Datadog.
Multi-AZ services
RayTurbo enables availability-zone-aware scheduling of Ray Serve replicas for greater resilience to availability-zone failures.
Containerized runtime environments
RayTurbo lets you configure different container images for different Ray Serve deployments, allowing you to package each model's dependencies separately. It includes the same fast container optimizations as fast autoscaling and offers an improved security posture over open source Ray Serve, since it doesn't require installing Podman or running with root permissions.
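As a rough sketch of per-deployment images: open source Ray Serve accepts a runtime_env through ray_actor_options, and on Anyscale a container image is typically referenced with an image_uri field. Treat that exact key and the image tag as assumptions and check the RayTurbo docs for the authoritative syntax.

from ray import serve

@serve.deployment(
    ray_actor_options={
        "runtime_env": {
            # Assumed key: per-deployment container image (confirm the exact field in the RayTurbo docs).
            "image_uri": "anyscale/ray:2.34.0-py311-cu121",
        }
    }
)
class EmbeddingModel:
    async def __call__(self, text: str) -> list:
        ...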
APIs
See the detailed guides for RayTurbo. The example below uses the fast model loading API to load Safetensors weights from remote storage directly onto the GPU:
from typing import Dict

import torch
from accelerate import init_empty_weights
from transformers import MistralConfig, MistralForCausalLM

from ray.anyscale.safetensors.torch import load_file

# IMPORTANT: Initialize the model with *empty weights*.
# When using your own `torch.nn.Module`, you can use torch.nn.utils.skip_init, see:
# https://pytorch.org/tutorials/prototype/skip_param_init.html
with init_empty_weights():
    model = MistralForCausalLM(
        MistralConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.float16)
    )

# Download the model weights directly from the remote location "model_weights_uri" to the GPU.
state_dict: Dict[str, torch.Tensor] = load_file(model_weights_uri, device="cuda")

# Populate the weights in the model class.
model.load_state_dict(state_dict, assign=True)
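With init_empty_weights, the model's parameters are created on the meta device, so no CPU memory is allocated for them up front; load_state_dict(..., assign=True) then assigns the tensors already loaded onto the GPU directly to the module instead of copying them into newly allocated parameters.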