RayLLM Deployment Options
This guide covers the different methods of deploying RayLLM on Anyscale, including:
- Deploying in a workspace with the Ray Serve CLI
- Deploying a production service from a workspace using the Anyscale CLI / SDK
- Deploying a production service from any machine outside of a workspace using the Anyscale CLI / SDK
Deploy within a workspace
RayLLM apps deployed with this method cannot be queried from outside of the workspace, including from the Playground. This method is for quick iteration and testing only.
This method lets you run and query RayLLM on a workspace. Use this method for development, testing, and quick iteration on your model with different configurations. You must have access to the Deploy LLM workspace template.
This template creates a workspace with the latest version of RayLLM installed. Use the rayllm gen-config interactive CLI to generate the config files for your desired models. You can also manually create the config files from scratch by following the RayLLM API and Services documentation.
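For example, you can run the interactive generator from a terminal in the workspace; it walks you through model and hardware choices and writes the Serve config files into your working directory. The exact prompts and output file names depend on your RayLLM version.

```
# Generate a Serve config interactively; answer the prompts to pick a model,
# accelerator type, and other settings.
rayllm gen-config

# Inspect the generated Serve config before deploying (the file name may differ).
cat config.yaml
```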
Once you have the Serve config file, you can start the service by running the following command:
serve run config.yaml [--non-blocking]
Example config file for Llama-3.1-8B
# File name: config.yaml
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
        model_source: meta-llama/Meta-Llama-3.1-8B-Instruct
      runtime_env:
        env_vars:
          HUGGING_FACE_HUB_TOKEN: insert_your_hf_token_here
      generation_config:
        prompt_format:
          assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
          bos: <|begin_of_text|>
          default_system_message: ''
          system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
          system_in_user: false
          trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
          user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
        stopping_sequences:
        - <|end_of_text|>
        - <|eot_id|>
      input_modality: text
      llm_engine: VLLMEngine
      engine_kwargs:
        enable_chunked_prefill: true
        max_num_batched_tokens: 2048
        max_num_seqs: 64
        tokenizer_pool_extra_config:
          runtime_env:
            pip: null
        tokenizer_pool_size: 2
        trust_remote_code: true
      json_mode:
        enabled: false
      lora_config: null
      max_request_context_length: 8192
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          target_ongoing_requests: 32
        max_ongoing_requests: 64
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: false
Check your service's health by looking at the Ray dashboard's Serve metrics page.
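Once the application is running, you can query it from a terminal inside the workspace. The sketch below assumes the default Ray Serve HTTP port (8000) and RayLLM's OpenAI-compatible chat completions route; adjust the port, path, and model ID to match your config.

```
# Check application status from the CLI as an alternative to the dashboard.
serve status

# Send a test request to the locally running RayLLM app.
# Assumes the default Serve port (8000) and the OpenAI-compatible API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is Ray Serve?"}]
      }'
```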
Deploy a production service from a workspace
This method lets you use a workspace to launch a long-running, externally available service that can be queried from any machine. Use this method after you finish tuning the RayLLM config generated by the rayllm gen-config CLI command in the Deploy LLM workspace template. That way, you can launch your LLM as a production service without worrying about the cluster environment, resource specification, or compute configuration.
To deploy a service from a workspace, use the following command:
anyscale service deploy -f config.yaml
This command creates a long-running Anyscale Service using the Serve config and the workspace image. You can query the resulting production service from any machine.
If you deploy the service without token authentication, Anyscale automatically makes it available in the Playground, where you can try it out. If token authentication is enabled, you must add the service to the Playground manually.
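Once the service is live, you can query it from any machine. The example below is a sketch: the base URL and bearer token are placeholders, which you can find on the service's page in the Anyscale console (the token header is only needed if query authentication is enabled).

```
# Placeholders: replace with the URL and token shown on your service's page.
export SERVICE_URL="https://your-service-url.example.com"
export SERVICE_TOKEN="your_service_query_token"

curl "$SERVICE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```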
Deploy a production service from any machine
This method lets you launch a long-running, externally available service from any machine, not just an Anyscale workspace. Use this method to launch RayLLM from your laptop, deploy RayLLM through CI/CD pipelines, and version-control your RayLLM configs.
The machine deploying RayLLM must have an Anyscale token and the Anyscale CLI installed. Take the same config file used when deploying from a workspace, and add the RayLLM image_uri that the Anyscale Service should use. You can find the list of RayLLM images here. Start the service using the anyscale service deploy command.
anyscale service deploy -n your_service_name -f config.yaml
Example changes to a config YAML file to deploy from any machine:
# File name: config.yaml
# applications are the same as above.
applications: ...
# Now image_uri is required. In the workspace, this came from the workspace image.
image_uri: localhost:5555/anyscale/endpoints_aica:1.0.0-7352
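As a sketch, setting up a fresh machine or a CI/CD runner to run the deploy command might look like the following. The CLI installs from PyPI; the ANYSCALE_CLI_TOKEN environment variable is assumed here for non-interactive authentication, so check the Anyscale CLI documentation for the authentication method that fits your setup.

```
# Install the Anyscale CLI.
pip install anyscale

# Authenticate non-interactively, for example in a CI/CD pipeline.
# ANYSCALE_CLI_TOKEN is assumed here; see the CLI docs for other auth options.
export ANYSCALE_CLI_TOKEN="your_anyscale_token"

# Deploy the version-controlled config, which includes the image_uri field above.
anyscale service deploy -n your_service_name -f config.yaml
```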