LLM Dataset API Reference

Customer-hosted cloud features

note

Some features are only available on customer-hosted clouds. Reach out to support@anyscale.com for info.

LLM Dataset CLI

anyscale llm dataset get Alpha

warning

This command is in early development and may change. Users must be tolerant of change.

Usage

anyscale llm dataset get [OPTIONS] NAME

Retrieves metadata about a dataset.

NAME = Name of the dataset

Example usage:

anyscale llm dataset get my_first_dataset

Retrieve the second latest version of the dataset:

anyscale llm dataset get my_first_dataset -v -1

Options

  • --version/-v: Version of the dataset. A negative integer selects the version that many versions before the latest (for example, -1 selects the second-latest version). Default: latest version.
  • --project: Name of the Anyscale project that the dataset belongs to. If not provided, all projects will be searched.
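The negative-version rule can be read as simple arithmetic on the latest version number. The sketch below uses a hypothetical helper, `resolve_version`, which is not part of the Anyscale CLI or SDK. It assumes versions are numbered consecutively from 1 up to the number of versions, which the sample output (version=3, num_versions=3) suggests but this page doesn't state, and it treats out-of-range values as errors, which is also an assumption.

```python
from typing import Optional


def resolve_version(requested: Optional[int], num_versions: int) -> int:
    """Map a --version value to an absolute version number.

    Hypothetical helper for illustration only; assumes versions run
    from 1 to num_versions.
    """
    if requested is None:
        # No version given: use the latest.
        return num_versions
    if requested < 0:
        # Negative value: count back from the latest version.
        resolved = num_versions + requested
    else:
        resolved = requested
    if not 1 <= resolved <= num_versions:
        raise ValueError(f"version {requested} is out of range")
    return resolved


print(resolve_version(None, 3))  # latest version
print(resolve_version(-1, 3))    # one version before the latest
```

Under these assumptions, `-v -1` on a dataset with three versions resolves to version 2, matching the "second latest" wording above.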

Examples

$ anyscale llm dataset get john_doe/viggo/train.jsonl
Dataset(
id='dataset_123',
name='john_doe/viggo/train.jsonl',
filename='train.jsonl',
storage_uri='s3://anyscale-test-data-cld-123/org_123/cld_123/datasets/dataset_123/3/john_doe/viggo/train.jsonl',
version=3,
num_versions=3,
created_at=datetime.datetime(2024, 1, 1, 0, 0, tzinfo=tzutc()),
creator_id='usr_123',
project_id='prj_123',
cloud_id='cld_123',
description=None
)

anyscale llm dataset upload Alpha

warning

This command is in early development and may change. Users must be tolerant of change.

Usage

anyscale llm dataset upload [OPTIONS] DATASET_FILE

Uploads a dataset, or a new version of a dataset, to your Anyscale cloud.

DATASET_FILE = Path to the dataset file to upload

Example usage:

anyscale llm dataset upload path/to/my_dataset.jsonl -n my_first_dataset

anyscale llm dataset upload my_dataset.jsonl -n second_dataset.jsonl

anyscale llm dataset upload my_dataset2.jsonl -n second_dataset.jsonl --description 'added 3 lines'

NOTE: If you upload a new version from within an Anyscale workspace and provide neither --cloud nor --project, the cloud and project of the workspace are used.

Options

  • --name/-n: Name of a new dataset, or an existing dataset, to upload a new version of.
  • --description: Description of the dataset version.
  • --cloud: Name of the Anyscale cloud to upload a new dataset to. If not provided, the default cloud will be used.
  • --project: Name of the Anyscale project to upload a new dataset to. If not provided, the default project of the cloud will be used.

Examples

$ anyscale llm dataset upload path/to/my_dataset.jsonl -n my_first_dataset

0:00:00 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.1 MB / 5.1 MB Uploading '/path/to/my_dataset.jsonl'

Upload complete!

Dataset(
id='dataset_123',
name='my_first_dataset',
filename='my_dataset.jsonl',
storage_uri='s3://anyscale-test-data-cld-123/org_123/cld_123/datasets/dataset_123/1/my_dataset.jsonl',
version=1,
num_versions=1,
created_at=datetime.datetime(2024, 1, 1, 0, 0, tzinfo=tzutc()),
creator_id='usr_123',
project_id='prj_123',
cloud_id='cld_123',
description=None
)

anyscale llm dataset download Alpha

warning

This command is in early development and may change. Users must be tolerant of change.

Usage

anyscale llm dataset download [OPTIONS] NAME

Downloads a dataset from your Anyscale cloud.

NAME = Name of the dataset to download

Prints the dataset contents to the terminal by default.

Example usage:

anyscale llm dataset download my_first_dataset.jsonl

Save the dataset to a file:

anyscale llm dataset download my_dataset.jsonl -o ~/Downloads/my_dataset.jsonl

Retrieve the second latest version of the dataset:

anyscale llm dataset download my_dataset.jsonl -v -1

Options

  • --version/-v: Version of the dataset to download. A negative integer selects the version that many versions before the latest (for example, -1 selects the second-latest version). Default: latest version.
  • --project: Name of the Anyscale project to download the dataset from. If not provided, all projects will be searched.
  • --output/-o: Path to save the downloaded dataset to. If not provided, the dataset contents will be printed to the terminal.

Examples

$ anyscale llm dataset download train.jsonl
0:00:00 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 711.0 kB / 711.0 kB Downloading 'train.jsonl'

Download complete!

{"messages":[{"content":"hi","role":"user"},{"content":"Hi! How can I help?","role":"assistant"}]}
...
{"messages":[{"content":"bye","role":"user"},{"content":"Goodbye!","role":"assistant"}]}

LLM Dataset SDK

anyscale.llm.dataset.get

Retrieves metadata about a dataset.

:param name: Name of the dataset.
:param version: Version of the dataset. A negative integer selects the version that many versions before the latest (for example, -1 selects the second-latest version). Default: latest version.
:param project: Name of the Anyscale project that the dataset belongs to. If not provided, all projects will be searched.

Example usage:

dataset = anyscale.llm.dataset.get("my_first_dataset")
print(f"Dataset name: '{dataset.name}'") # Dataset name: 'my_first_dataset'

# Get the second latest version of the dataset
prev_dataset = anyscale.llm.dataset.get("my_first_dataset", version=-1)

Returns: Dataset: The Dataset object.

Arguments

  • name (str): Name of the dataset
  • version (int | None) = None: Version of the dataset. A negative integer selects the version that many versions before the latest (for example, -1 selects the second-latest version). Default: latest version.
  • project (str | None) = None: Name of the Anyscale project that the dataset belongs to. If not provided, all projects will be searched.

Returns: Dataset

Examples

import anyscale
from anyscale.llm.dataset import Dataset

dataset: Dataset = anyscale.llm.dataset.get("my_first_dataset")
print(f"Dataset name: '{dataset.name}'") # Dataset name: 'my_first_dataset'

# Get the second latest version of the dataset
prev_dataset = anyscale.llm.dataset.get("my_first_dataset", version=-1)

anyscale.llm.dataset.upload

Uploads a dataset, or a new version of a dataset, to your Anyscale cloud.

:param dataset_file: Path to the dataset file to upload.
:param name: Name of a new dataset, or an existing dataset, to upload a new version of.
:param description: Description of the dataset version.
:param cloud: Name of the Anyscale cloud to upload a new dataset to. If not provided, the default cloud will be used.
:param project: Name of the Anyscale project to upload a new dataset to. If not provided, the default project of the cloud will be used.

Example usage:

anyscale.llm.dataset.upload("path/to/my_first_dataset.jsonl", name="my_first_dataset")
anyscale.llm.dataset.upload("my_dataset.jsonl", "second_dataset")
anyscale.llm.dataset.upload("my_dataset2.jsonl", "second_dataset", description="added 3 lines")

Returns: Dataset: The Dataset object representing the uploaded dataset.

NOTE: If you upload a new version from within an Anyscale workspace and provide neither cloud nor project, the cloud and project of the workspace are used.

Arguments

  • dataset_file (str): Path to the dataset file to upload.
  • name (str): Name of a new dataset, or an existing dataset, to upload a new version of.
  • description (str | None) = None: Description of the dataset version.
  • cloud (str | None) = None: Name of the Anyscale cloud to upload a new dataset to. If not provided, the default cloud will be used.
  • project (str | None) = None: Name of the Anyscale project to upload a new dataset to. If not provided, the default project of the cloud will be used.
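Before uploading, it can help to check locally that each line of the file parses as JSON. This page doesn't say whether the upload itself validates the file, so the sketch below is an optional local check. `find_invalid_lines` is a hypothetical helper, not part of the Anyscale SDK, and the temporary file stands in for a real `dataset_file` path.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def find_invalid_lines(path: Path) -> List[int]:
    """Return the 1-based numbers of non-blank lines that are not valid JSON."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                # Ignore blank lines rather than flagging them.
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(line_number)
    return bad


# Write a small sample file with one valid and one invalid line.
with tempfile.TemporaryDirectory() as tmp:
    dataset_file = Path(tmp) / "my_dataset.jsonl"
    dataset_file.write_text('{"messages": []}\nnot json\n', encoding="utf-8")
    invalid = find_invalid_lines(dataset_file)

print(invalid)
```

If the returned list is empty, the file is at least syntactically valid JSONL and can be passed to `anyscale.llm.dataset.upload` as the `dataset_file` argument.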

Returns: Dataset

Examples

import anyscale

anyscale.llm.dataset.upload("path/to/my_first_dataset.jsonl", name="my_first_dataset")
anyscale.llm.dataset.upload("my_dataset.jsonl", "second_dataset")
anyscale.llm.dataset.upload("my_dataset2.jsonl", "second_dataset", description="added 3 lines")

anyscale.llm.dataset.download

Downloads a dataset from your Anyscale cloud.

:param name: Name of the dataset to download.
:param version: Version of the dataset to download. A negative integer selects the version that many versions before the latest (for example, -1 selects the second-latest version). Default: latest version.
:param project: Name of the Anyscale project to download the dataset from. If not provided, all projects will be searched.

Example usage:

import json

import anyscale

dataset_contents: bytes = anyscale.llm.dataset.download("my_first_dataset.jsonl")
jsonl_obj = [json.loads(line) for line in dataset_contents.decode().splitlines()]

prev_dataset_contents = anyscale.llm.dataset.download("my_first_dataset.jsonl", version=-1)

Returns: bytes: The contents of the dataset file.

Arguments

  • name (str): Name of the dataset to download.
  • version (int | None) = None: Version of the dataset to download. A negative integer selects the version that many versions before the latest (for example, -1 selects the second-latest version). Default: latest version.
  • project (str | None) = None: Name of the Anyscale project to download the dataset from. If not provided, all projects will be searched.

Returns: bytes

Examples

import json

import anyscale

dataset_contents: bytes = anyscale.llm.dataset.download("my_first_dataset.jsonl")
jsonl_obj = [json.loads(line) for line in dataset_contents.decode().splitlines()]

prev_dataset_contents = anyscale.llm.dataset.download("my_first_dataset.jsonl", version=-1)
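Because the download returns raw bytes, the caller does the parsing. The sketch below is a slightly more defensive version of the list comprehension above: it skips blank lines and reports the line number of any record that fails to parse. `parse_jsonl` is a hypothetical helper, and the `sample` bytes stand in for the return value of `anyscale.llm.dataset.download`, so the block runs without an Anyscale account.

```python
import json
from typing import Any, Dict, List


def parse_jsonl(contents: bytes) -> List[Dict[str, Any]]:
    """Decode JSONL bytes into a list of records, skipping blank lines."""
    records = []
    lines = contents.decode("utf-8").splitlines()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records


# Stand-in for the bytes returned by anyscale.llm.dataset.download(...).
sample = (
    b'{"messages":[{"content":"hi","role":"user"},'
    b'{"content":"Hi! How can I help?","role":"assistant"}]}\n'
    b"\n"
    b'{"messages":[{"content":"bye","role":"user"},'
    b'{"content":"Goodbye!","role":"assistant"}]}\n'
)

records = parse_jsonl(sample)
print(len(records))
print(records[0]["messages"][0]["content"])
```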

LLM Dataset Models

Dataset

Metadata about a dataset, which is a file uploaded by a user to their Anyscale cloud.

Fields

  • id (str): The ID of the dataset.
  • name (str): The name of the dataset.
  • filename (str): The file name of the uploaded dataset.
  • storage_uri (str): The URI at which the dataset is stored (e.g., s3://bucket/path/to/test.jsonl).
  • version (int): The version of the dataset.
  • num_versions (int): Number of versions of the dataset.
  • created_at (datetime): The time at which the dataset was uploaded.
  • creator_id (str): The ID of the Anyscale user who uploaded the dataset.
  • project_id (str): The ID of the Anyscale project that the dataset belongs to.
  • cloud_id (str): The ID of the Anyscale cloud that the dataset belongs to.
  • description (str | None): The description of the current dataset version.
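The `storage_uri` field is an ordinary URI, so the standard library can split it into its parts. The sketch below uses the sample value from the CLI output earlier on this page. Only an `s3://` URI appears in this reference, and the path layout isn't documented as stable, so treat the split as illustrative rather than as a contract.

```python
from urllib.parse import urlparse

# Sample value copied from the example output on this page.
storage_uri = (
    "s3://anyscale-test-data-cld-123/org_123/cld_123/datasets/"
    "dataset_123/3/john_doe/viggo/train.jsonl"
)

parsed = urlparse(storage_uri)
scheme = parsed.scheme          # storage scheme
bucket = parsed.netloc          # bucket name for an s3:// URI
key = parsed.path.lstrip("/")   # object key within the bucket

print(scheme)
print(bucket)
print(key)
```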

Python Methods

def to_dict(self) -> Dict[str, Any]
"""Return a dictionary representation of the model."""

Examples

import anyscale
from anyscale.llm.dataset import Dataset

dataset: Dataset = anyscale.llm.dataset.get("my_first_dataset")
print(f"Dataset name: '{dataset.name}'") # Dataset name: 'my_first_dataset'

# Get the second latest version of the dataset
prev_dataset = anyscale.llm.dataset.get("my_first_dataset", version=-1)
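The `to_dict` method listed under Python Methods is useful for logging or storing dataset metadata. The sketch below serializes such a dictionary to JSON. The `dataset_dict` literal is a stand-in built from the fields documented above, not real SDK output, and this page doesn't say whether `to_dict()` returns `created_at` as a `datetime` or as a string. The `encode_datetime` hook (a hypothetical helper) handles the `datetime` case and is never called otherwise.

```python
import datetime
import json

# Stand-in for the result of dataset.to_dict(); values mirror the sample output.
dataset_dict = {
    "id": "dataset_123",
    "name": "my_first_dataset",
    "filename": "my_dataset.jsonl",
    "version": 1,
    "num_versions": 1,
    "created_at": datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc),
    "description": None,
}


def encode_datetime(value):
    """Convert datetime values to ISO 8601 strings for json.dumps."""
    if isinstance(value, datetime.datetime):
        return value.isoformat()
    raise TypeError(f"cannot serialize {type(value).__name__}")


serialized = json.dumps(dataset_dict, default=encode_datetime, indent=2)
print(serialized)
```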