Ray Data Databricks API

DatabricksDatasource

DatabricksDatasource(
    server_hostname: str,
    http_path: str,
    access_token: str,
    catalog: Optional[str] = None,
    schema: Optional[str] = None,
)

A Datasource that reads from and writes to Databricks.

Parameters

  • server_hostname: The server hostname for the cluster or SQL warehouse.
  • http_path: The HTTP path of the cluster or SQL warehouse.
  • access_token: Your Databricks personal access token for the workspace that hosts the cluster or SQL warehouse.
  • catalog: Initial catalog to use for the connection. Defaults to None.
  • schema: Initial schema to use for the connection. Defaults to None.

For detailed instructions on acquiring Databricks connection parameters, read Get started in the Databricks SQL Connector documentation.

Examples

from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)
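
To avoid hard-coding the access token, the connection parameters can be collected from environment variables before constructing the datasource. The following is a minimal sketch, not part of the documented API: the helper name databricks_connection_kwargs and the environment variable names (DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN, DATABRICKS_CATALOG, DATABRICKS_SCHEMA) are assumptions chosen for illustration.

```python
import os


def databricks_connection_kwargs(env=os.environ):
    """Build keyword arguments for DatabricksDatasource from environment variables.

    The variable names below are assumptions for this sketch, not names
    that the library itself reads.
    """
    required = {
        "server_hostname": "DATABRICKS_SERVER_HOSTNAME",
        "http_path": "DATABRICKS_HTTP_PATH",
        "access_token": "DATABRICKS_TOKEN",
    }
    optional = {
        "catalog": "DATABRICKS_CATALOG",
        "schema": "DATABRICKS_SCHEMA",
    }

    # Fail early with one message that lists every missing required variable.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))

    kwargs = {param: env[var] for param, var in required.items()}

    # catalog and schema default to None, so pass them only when they are set.
    for param, var in optional.items():
        if env.get(var):
            kwargs[param] = env[var]
    return kwargs


# Usage (requires ray_extensions and the variables above to be set):
#   from ray_extensions.data import DatabricksDatasource
#   datasource = DatabricksDatasource(**databricks_connection_kwargs())
```

Passing only the optional values that are set leaves catalog and schema at their documented default of None.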

ray.data.read_datasource

ray.data.read_datasource(
    datasource: DatabricksDatasource,
    *,
    sql: str,
) -> Dataset

Run a SQL query against Databricks and read the result set into a Ray Dataset.

Parameters

  • datasource: A DatabricksDatasource.
  • sql: The SQL query you want to execute.

Returns

A Ray Dataset that contains the query result set.

Examples

import ray
from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)
ds = ray.data.read_datasource(
    datasource,
    sql="SELECT * FROM samples.tpch.supplier",
)

Dataset.write_datasource

Dataset.write_datasource(
    datasource: DatabricksDatasource,
    *,
    table: str,
    stage_uri: str,
) -> None

Write data in a Ray Dataset to a Databricks table.

Note

Your Databricks cluster or warehouse needs read access to the bucket specified by stage_uri. To configure access to the bucket, read Configure S3 access with instance profiles.

Parameters

  • datasource: A DatabricksDatasource.
  • table: The table you want to write to.
  • stage_uri: The URI of an S3 bucket where Ray can temporarily stage files.

Examples

import ray
from ray_extensions.data import DatabricksDatasource

datasource = DatabricksDatasource(
    server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
    access_token="dbapi...",
)
# Build a small in-memory dataset to write.
ds = ray.data.from_items([
    {"title": "Monty Python and the Holy Grail", "year": 1975, "score": 8.2},
    {"title": "And Now for Something Completely Different", "year": 1971, "score": 7.5},
])
# Ray stages files under stage_uri before they are loaded into the table.
ds.write_datasource(
    datasource,
    table="my_catalog.my_schema.movies",
    stage_uri="s3://ray-staging-bucket",
)
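
Because stage_uri must point to an S3 bucket, it can help to check the URI before starting a write. The following is a minimal sketch using only the Python standard library. The helper check_stage_uri is hypothetical and not part of the documented API. It checks only the shape of the URI, not whether the bucket exists or whether Databricks can read it.

```python
from urllib.parse import urlparse


def check_stage_uri(stage_uri: str) -> str:
    """Return stage_uri unchanged if it looks like an S3 URI, else raise ValueError.

    Hypothetical helper for illustration; it does not contact S3.
    """
    parsed = urlparse(stage_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"stage_uri must start with 's3://', got {stage_uri!r}")
    if not parsed.netloc:
        raise ValueError(f"stage_uri is missing a bucket name: {stage_uri!r}")
    return stage_uri


# Usage with the write call shown above:
#   ds.write_datasource(
#       datasource,
#       table="my_catalog.my_schema.movies",
#       stage_uri=check_stage_uri("s3://ray-staging-bucket"),
#   )
```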