Introduction
Virtually all code depends on other code to function. For example, my simple Python program depends not only on a Python runtime such as "Python 3.8.3 on macOS" but also, because of its import statements, possibly on numpy, pandas, or thousands of other packages. Python provides tools for managing these dependencies, but the problem can still be hard. Python developers learn that dependencies can be a nightmare, particularly when package A depends on one version of package B, and package C depends on an incompatible version of package B. This document can't address all issues of dependency management, but it can help you deal with how using Ray affects them.
Since Ray provides a way to link a client computer with a Ray cluster, you'll have to consider the relationship between the Python environments on your laptop, in a CI environment, and on Anyscale. Here are some environments to consider, and how your application lifecycle will lead you from one to another until you reach an independently running production environment.
Local environment
When working with Ray locally, the Ray cluster uses the same Python environment as the command line. This is the simplest arrangement, and beyond the normal foibles of dependency management, Ray adds no further barriers. Let's say I have a very simple script that depends on Ray and pkutils, a utility to parse requirements files. What this program does is not important, only that it can find its dependencies.
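import ray
import pkutils

# Start Ray locally; tasks run in the same Python environment as this script.
ray.init()

@ray.remote
def reqs():
    # Runs in a Ray worker, which must also be able to import pkutils.
    return list(pkutils.parse_requirements("requirements.txt"))

print(ray.get(reqs.remote()))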
If you've installed Ray and pkutils, then this script will run; otherwise you'll get the dreaded ModuleNotFoundError: No module named 'ray' message. If you have put your dependencies into a file called requirements.txt, you can ensure they are installed with pip install -r requirements.txt, and then the script will run successfully.
# requirements.txt
anyscale[all]
pkutils
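Before running the script, it can help to confirm which of the listed packages are actually installed in the local environment. The following is a minimal sketch using only the standard library; the helper names requirement_names and missing_packages are hypothetical, and the parsing only handles simple lines such as the ones in the file above (comments, extras, and version specifiers), not the full requirements file syntax.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare distribution names from simple requirements lines."""
    names = []
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # Keep only the leading name, dropping extras like "[all]" and version pins.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# The same lines as the requirements.txt shown above.
lines = ["# requirements.txt", "anyscale[all]", "pkutils"]
names = requirement_names(lines)
print(names)
print(missing_packages(names))
```

An empty list from missing_packages suggests the local environment already satisfies the file; anything it reports is a candidate for pip install -r requirements.txt.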
Anyscale environments
Problem statement
Run the same code on Anyscale:
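import ray
import pkutils

# Connect to Anyscale (anyscale://deps-demo) instead of starting Ray locally.
ray.init("anyscale://deps-demo")

@ray.remote
def reqs():
    # Runs on the cluster, whose Python environment may not include pkutils.
    return list(pkutils.parse_requirements("requirements.txt"))

print(ray.get(reqs.remote()))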
When you run code on Anyscale, it feels like a seamless extension of your laptop, but the environment on the Anyscale cluster is actually separate from the one on your laptop. You may get the dreaded ModuleNotFoundError: No module named 'pkutils' even though your local environment has it.
You need to tell Anyscale about your dependencies.
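One way to see the separation between the two environments is to ask each one which modules it can import. The sketch below is a hedged illustration rather than an Anyscale feature: find_missing_modules is a hypothetical helper built on the standard library's importlib.util.find_spec, and it is shown running locally only.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that this environment cannot import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# "json" ships with Python; the second name is deliberately nonexistent.
print(find_missing_modules(["json", "no_such_module_xyz_123"]))
```

Calling the same helper from inside a function decorated with @ray.remote would report what the cluster is missing, which may differ from what your laptop reports.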
Cluster environments
Machines provisioned by Anyscale come with a certain set of dependencies. Some libraries, in particular the Ray and Anyscale dependencies (see menu on the left under "Reference"), are provided in the base image for any Anyscale-managed cluster. To give you further control, Anyscale provides cluster environments, which allow you to configure dependencies. Cluster environment dependencies are installed at Docker image build time, so they do not increase cluster launch times. Dependencies in cluster environments are also immutable. Usually, once someone has established the requirements for an application, they place the dependencies into a cluster environment to lock them down for the application.
Runtime environments
To tell Anyscale about your dependencies, you may also add a runtime environment to your ray.init() call. The runtime environment can be a list of dependencies or, as in this case, a reference to the requirements file:
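import ray
import pkutils

# The runtime environment asks the cluster to install the packages in requirements.txt.
ray.init("anyscale://deps-demo", runtime_env={"pip": "requirements.txt"})

@ray.remote
def reqs():
    # Runs on the cluster, which now has pkutils installed.
    return list(pkutils.parse_requirements("requirements.txt"))

print(ray.get(reqs.remote()))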
Now the remote function executes successfully. You can also specify requirements directly within the Python code. Runtime environments can specify Anaconda, Debian, and pip packages. However, keep in mind that dependencies in a runtime environment are installed on the cluster at launch time, which can extend the time it takes to launch any given node.
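As a minimal sketch of specifying requirements directly in Python code, the pip entry of the runtime environment can be a list of package names instead of a file path. The helper name pip_runtime_env is hypothetical, and the ray.init() call is left as a comment so the snippet runs without Ray or a cluster.

```python
def pip_runtime_env(packages):
    """Build a runtime environment dict that lists pip packages inline."""
    return {"pip": list(packages)}


runtime_env = pip_runtime_env(["pkutils"])
print(runtime_env)  # {'pip': ['pkutils']}

# With Ray installed and the cluster from the example above available,
# the dict would be passed in place of the requirements file reference:
# ray.init("anyscale://deps-demo", runtime_env=runtime_env)
```

Listing packages inline keeps the dependency declaration next to the code that needs it, while a requirements file is easier to share with pip install on your laptop.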