Read and Write Data from Databricks
Ray Datasets are the standard way to load and exchange data in Ray applications. This guide shows you how to:
- Configure credentials required to connect to Databricks
- Read data from a Databricks table into a Ray Dataset
- Write data in a Ray Dataset to a Databricks table
Who can use this feature
The Ray Databricks APIs are only available to Anyscale customers.
If you want to access this feature, contact the Anyscale team.
Before you begin
The Ray Databricks APIs depend on Requests and the Databricks SQL Connector for Python. To install them, specify the following PyPI packages in your cluster environment:
databricks-sql-connector
requests
To learn more about installing dependencies into your environment, read Anyscale environment.
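After the cluster environment builds, you can confirm both dependencies are importable; note that the import name for databricks-sql-connector differs from its PyPI name:
# Sanity check: both dependencies should import without error.
import requests  # PyPI package: requests
from databricks import sql  # PyPI package: databricks-sql-connector

print(requests.__version__)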
Connecting to Databricks
To connect to Databricks, create a DatabricksDatasource and specify:
- server_hostname: The server hostname of the cluster or SQL warehouse.
- http_path: The HTTP path of the cluster or SQL warehouse.
- access_token: A personal access token for the workspace.
from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)
For detailed instructions on acquiring Databricks connection parameters, read Get started in the Databricks SQL Connector documentation.
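If you'd rather not hardcode credentials, a common pattern is to read the token from an environment variable. The sketch below assumes a DATABRICKS_TOKEN variable; the name is illustrative, not required by the API:
import os

from ray_extensions.data import DatabricksDatasource

# Read the personal access token from the environment rather than
# embedding it in source. DATABRICKS_TOKEN is an assumed variable name.
datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token=os.environ["DATABRICKS_TOKEN"],
)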
Reading data from Databricks
To read a Databricks table into a Ray Dataset, call ray.data.read_datasource and specify a SQL query.
import ray
from ray_extensions.data import DatabricksDatasource
datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)

ds = ray.data.read_datasource(
    datasource,
    sql="SELECT * FROM samples.tpch.supplier",
)
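Once the read completes, you can inspect the result with standard Ray Dataset methods:
# Inspect the loaded Dataset: row count, schema, and a few sample rows.
print(ds.count())
print(ds.schema())
print(ds.take(5))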
Writing data to Databricks
First, create an S3 bucket where Ray can stage temporary files. Your Databricks cluster or warehouse will need access to this bucket. To configure access to the bucket, read Configure S3 access with instance profiles.
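If you don't already have a staging bucket, you can create one with boto3, as in the sketch below; the bucket name and region are placeholders, and you still need to grant your Databricks cluster or warehouse access as described above:
import boto3

# Create the staging bucket (name and region are placeholders).
# For us-east-1, omit CreateBucketConfiguration entirely.
s3 = boto3.client("s3", region_name="us-west-2")
s3.create_bucket(
    Bucket="ray-staging-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)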
Once you've created and configured access to a staging bucket, call Dataset.write_datasource and specify the table you'd like to write to, as well as the URI of the staging bucket.
import ray
from ray_extensions.data import DatabricksDatasource
datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)

ds = ray.data.from_items([
    {"title": "Monty Python and the Holy Grail", "year": 1975, "score": 8.2},
    {"title": "And Now for Something Completely Different", "year": 1971, "score": 7.5},
])

ds.write_datasource(
    datasource,
    table="my_catalog.my_schema.movies",
    stage_uri="s3://ray-staging-bucket",
)
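To confirm the write succeeded, you can read the table back through the same datasource:
# Read the newly written table back to verify the rows arrived.
movies = ray.data.read_datasource(
    datasource,
    sql="SELECT * FROM my_catalog.my_schema.movies",
)
print(movies.count())
print(movies.take(2))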
Next steps
To train a model with data stored in Databricks, visit our Ray Examples library.