Ray Data Databricks API
DatabricksDatasource
DatabricksDatasource(
server_hostname: str,
http_path: str,
access_token: str,
catalog: Optional[str] = None,
schema: Optional[str] = None,
)
A Datasource that reads and writes to Databricks.
Parameters
server_hostname: The server hostname for the cluster or SQL warehouse.
http_path: The HTTP path of the cluster or SQL warehouse.
access_token: Your Databricks personal access token for the workspace that contains the cluster or SQL warehouse.
catalog: Initial catalog to use for the connection. Defaults to None.
schema: Initial schema to use for the connection. Defaults to None.
For detailed instructions on acquiring Databricks connection parameters, read Get started in the Databricks SQL Connector documentation.
Examples
from ray_extensions.data import DatabricksDatasource
datasource = DatabricksDatasource(
server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
access_token="dbapi...",
)
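If you prefer not to hard-code credentials, you can pull the connection parameters from the environment instead. A minimal sketch, assuming the environment variable names below (they are illustrative, not part of the API):
import os

from ray_extensions.data import DatabricksDatasource

# Assumed environment variables; set them to your workspace values.
datasource = DatabricksDatasource(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)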
ray.data.read_datasource
ray.data.read_datasource(
datasource: DatabricksDatasource,
*,
sql: str
) -> Dataset
Read data from a Databricks table into a Ray Dataset.
Parameters
datasource: A DatabricksDatasource.
sql: The SQL query you want to execute.
Returns
A Ray Dataset that contains the query result set.
Examples
import ray
from ray_extensions.data import DatabricksDatasource
datasource = DatabricksDatasource(
server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
access_token="dbapi...",
)
ds = ray.data.read_datasource(
datasource,
sql="SELECT * FROM samples.tpch.supplier"
)
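The returned object is a standard Ray Dataset, so the usual Ray Data operations apply. A short sketch of inspecting the query results, continuing from the example above:
# Inspect the result set like any other Ray Dataset.
print(ds.schema())  # Column names and types of the query result.
print(ds.take(5))   # First five rows as a list of dicts.

# Convert a small result set to pandas for local analysis.
df = ds.to_pandas()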
Dataset.write_datasource
Dataset.write_datasource(
datasource: DatabricksDatasource,
*,
table: str,
stage_uri: str
) -> None
Write data in a Ray Dataset to a Databricks table.
info
Your Databricks cluster or warehouse needs to be able to read the bucket specified by stage_uri. To configure access to the bucket, read Configure S3 access with instance profiles.
Parameters
datasource: A DatabricksDatasource.
table: The table you want to write to.
stage_uri: The URI of an S3 bucket where Ray can temporarily stage files.
Examples
import ray
from ray_extensions.data import DatabricksDatasource
datasource = DatabricksDatasource(
server_hostname="dbc-a1b2345c-d6e7.cloud.databricks.com",
http_path="/sql/1.0/warehouses/a1b234c567d8e9fa",
access_token="dbapi...",
)
ds = ray.data.from_items([
{"title": "Monty Python and the Holy Grail", "year": 1975, "score": 8.2},
{"title": "And Now for Something Completely Different", "year": 1971, "score" 7.5},
])
ds.write_datasource(
datasource,
table="my_catalog.my_schema.movies",
stage_uri="s3://ray-staging-bucket"
)
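A common pattern is to read from Databricks, transform the rows with Ray Data, and write the result back through the same datasource. A minimal sketch, assuming the sample TPC-H table is available in your workspace and using a hypothetical target table name:
import ray

# Read a subset of columns from the sample supplier table.
ds = ray.data.read_datasource(
    datasource,
    sql="SELECT s_suppkey, s_acctbal FROM samples.tpch.supplier",
)

# Flag suppliers with a negative account balance (pandas batch format).
def add_flag(batch):
    batch["negative_balance"] = batch["s_acctbal"] < 0
    return batch

ds = ds.map_batches(add_flag, batch_format="pandas")

# Write the transformed rows back to Databricks.
ds.write_datasource(
    datasource,
    table="my_catalog.my_schema.supplier_balances",  # Hypothetical target table.
    stage_uri="s3://ray-staging-bucket",
)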