Read and Write Data from Databricks

Check your docs version

This version of the Anyscale docs is deprecated. Go to the latest version for up-to-date information.

Ray Datasets are the standard way to load and exchange data in Ray applications. This guide shows you how to:

  1. Configure credentials required to connect to Databricks
  2. Read data from a Databricks table into a Ray Dataset
  3. Write data in a Ray Dataset to a Databricks table

Who can use this feature

The Ray Databricks APIs are only available to Anyscale customers.

If you want to access this feature, contact the Anyscale team.

Before you begin

The Ray Databricks APIs depend on Requests and the Databricks SQL Connector for Python. To install them, specify the following PyPI packages in your cluster environment:

databricks-sql-connector
requests

To learn more about installing dependencies into your environment, read Anyscale environment.
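
Once the cluster environment has been built, a quick check like the one below can confirm that both dependencies are importable. This is only a sketch: the helper name is_importable is made up for illustration, and it assumes the connector is exposed as the databricks.sql module and Requests as requests.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    try:
        # find_spec imports parent packages but not the target module itself.
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example `databricks`) is missing.
        return False


# Assumed import names for databricks-sql-connector and requests.
REQUIRED_MODULES = ["databricks.sql", "requests"]

missing = [name for name in REQUIRED_MODULES if not is_importable(name)]
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Databricks dependencies are importable.")
```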

Connecting to Databricks

To connect to Databricks, create a DatabricksDatasource and specify:

  • server_hostname: The server hostname of the cluster or SQL warehouse.
  • http_path: The HTTP path of the cluster or SQL warehouse.
  • access_token: A personal access token for the workspace.

from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)

For detailed instructions on acquiring Databricks connection parameters, read Get started in the Databricks SQL Connector documentation.
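
To keep the access token out of source code, the connection parameters could be read from environment variables instead of being hard-coded. The sketch below is one way to do that. The environment variable names and the load_databricks_settings helper are assumptions chosen for illustration and are not part of the Ray Databricks APIs.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; use whatever your deployment defines.
REQUIRED_VARS = {
    "server_hostname": "DATABRICKS_SERVER_HOSTNAME",
    "http_path": "DATABRICKS_HTTP_PATH",
    "access_token": "DATABRICKS_TOKEN",
}


def load_databricks_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the connection parameters from `env`, failing if any is unset."""
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: env[var] for param, var in REQUIRED_VARS.items()}


# Demonstrated with an explicit mapping so the sketch runs without real credentials.
settings = load_databricks_settings({
    "DATABRICKS_SERVER_HOSTNAME": "dbc-a1b2345c-d6e7.cloud.databricks.com",
    "DATABRICKS_HTTP_PATH": "/sql/1.0/warehouses/a1b234c567d8e9fa",
    "DATABRICKS_TOKEN": "dbapi...",
})
print(sorted(settings))
```

Because the dictionary keys match the three parameters listed above, the result could be unpacked as DatabricksDatasource(**settings).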

Reading data from Databricks

To read a Databricks table into a Ray Dataset, call ray.data.read_datasource and specify a SQL query.

import ray
from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)
ds = ray.data.read_datasource(
    datasource,
    sql="SELECT * FROM samples.tpch.supplier",
)
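
Because the read is driven by a SQL query, column selection and row limits can be expressed in the query itself rather than applied after loading. The sketch below builds such a query from validated identifiers. The build_select helper is hypothetical, and the column names assume the standard TPC-H supplier schema.

```python
import re
from typing import Optional, Sequence

# Plain SQL identifiers only: letters, digits, and underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_select(table: str, columns: Sequence[str], limit: Optional[int] = None) -> str:
    """Build a SELECT statement for a dotted table name and a list of columns."""
    if not columns:
        raise ValueError("At least one column is required")
    # Validate every part of the table name and every column name.
    for name in [*table.split("."), *columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Column names are assumed from the TPC-H supplier table.
sql = build_select("samples.tpch.supplier", ["s_suppkey", "s_name"], limit=100)
print(sql)
```

The resulting string can be passed as the sql argument of ray.data.read_datasource.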

Writing data to Databricks

First, create an S3 bucket where Ray can stage temporary files. Your Databricks cluster or warehouse will need access to this bucket. To configure access to the bucket, read Configure S3 access with instance profiles.

Once you've created and configured access to a staging bucket, call write_datasource on the Dataset and specify the table you'd like to write to, as well as the URI of the staging bucket.

import ray
from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)
# Build a small in-memory Dataset to write.
ds = ray.data.from_items([
    {"title": "Monty Python and the Holy Grail", "year": 1975, "score": 8.2},
    {"title": "And Now for Something Completely Different", "year": 1971, "score": 7.5},
])
# Ray stages temporary files at stage_uri before loading them into the table.
ds.write_datasource(
    datasource,
    table="my_catalog.my_schema.movies",
    stage_uri="s3://ray-staging-bucket",
)
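
Checking the write arguments before starting the job can catch typos early. The sketch below validates both values locally. The check_write_target helper is hypothetical, and it assumes the fully qualified catalog.schema.table form and the s3:// staging URI shown in the example above. Other forms may also be accepted.

```python
from urllib.parse import urlparse


def check_write_target(table: str, stage_uri: str) -> None:
    """Raise ValueError if the table name or staging URI looks malformed."""
    # Assumption: tables are addressed as catalog.schema.table.
    parts = table.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected catalog.schema.table, got {table!r}")
    # The staging location must be an S3 URI that names a bucket.
    parsed = urlparse(stage_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected an s3:// URI with a bucket, got {stage_uri!r}")


# Passes silently for the values used in the example above.
check_write_target("my_catalog.my_schema.movies", "s3://ray-staging-bucket")
```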

Next steps

To train a model with data stored in Databricks, visit our Ray Examples library.