Configure structured output for LLMs
Learn how to deploy models with structured output patterns, including JSON schemas, guided choices, regular expressions, and grammar-based constraints.
To use full tool and function calling support, install `ray>=2.49.0`. Most features aren't available in previous versions (`ray<=2.48.0`), but you can still use JSON outputs, specifically the JSON schema type pattern and the JSON object type pattern.
Understand structured output
When you build applications, unstructured LLM outputs can be difficult to extract and process reliably. Structured outputs let you enforce a specific format (such as JSON), removing ambiguity and making it easier to integrate responses into downstream systems.
Use structured output when your application requires predictable fields or values.
Compared to unstructured text, structured output:
- Follows a fixed schema (for example, JSON, regular expressions, choices) instead of freeform language.
- Minimizes post-processing; responses are machine-readable by design.
- Ensures consistency between responses.
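For example, a structured response can be parsed directly, while freeform text needs brittle extraction logic. A minimal illustration (the values are made up):

```python
import json

# Unstructured: extracting fields requires fragile string parsing
freeform = "Sure! The most iconic 90's car is probably the Lexus IS F, a sedan."

# Structured: machine-readable by design
structured = '{"brand": "Lexus", "model": "IS F", "car_type": "sedan"}'
car = json.loads(structured)
print(car["brand"])  # Lexus
```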
Review supported output formats
Since Ray Serve LLM uses vLLM as the inference engine, you can guide models to produce structured outputs using vLLM's built-in output formats.
| Format | Description | Use case example |
|---|---|---|
| JSON output | Returns structured outputs using a JSON schema. | Schema consistency, configuration generation, API responses. |
| Choices | Restricts output to a predefined list of values. | Classification tasks, form field selection. |
| Regular expressions | Validates output format using regular expression patterns. | Dates, phone numbers, formatted strings, IDs. |
| Grammar | Defines rules using EBNF grammar. | SQL queries, code snippets, templates, custom DSLs. |
| Structural tags | Applies schema constraints to parts of the response. | Structured function calls, markup-like outputs, tool use, XML-style integration. |
Not all models support every output format. To verify which structured formats your model supports, see the vLLM compatibility matrix.
Generate JSON output
Use the JSON output mode to enforce a JSON schema during generation.
This is especially useful when you build applications that require consistent, machine-readable outputs, such as configuration generation, API responses, or data extraction pipelines.
The following are two ways to get your model to return JSON outputs:
- OpenAI supports `response_format.type="json_object"` and later improved it with `response_format.type="json_schema"`.
- vLLM supports the same `response_format` interface, but also offers an earlier format with `guided_json`.
Use JSON schema type (recommended)
The most reliable way to enforce a specific schema in your model's output is to use the `response_format` parameter with type `"json_schema"` along with a defined JSON schema.
Define this schema using a Pydantic model, which is the recommended approach.
To implement JSON schema type:

- Define your schema using a Pydantic model.
- Convert the model to a JSON schema with `.model_json_schema()`.
- Pass it to the server in the `response_format` parameter:
  - Set `type` to `"json_schema"`.
  - Set `json_schema` to a dictionary with keys `name` and `schema`.
  - Set the `schema` key to your JSON schema.
The following example uses Pydantic models to define the JSON Schema:
```python
from openai import OpenAI
from pydantic import BaseModel
from enum import Enum

client = OpenAI(...)

# 1. Define your schema using a Pydantic model
class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

# 2. Make a call with the JSON schema
response = client.chat.completions.create(
    ...
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # 3. Set a `response_format` of type `json_schema` and define the schema there
    response_format={
        "type": "json_schema",
        # 4. Provide both `name` and `schema` (required)
        "json_schema": {
            "name": "car-description",
            "schema": CarDescription.model_json_schema(),  # Convert the Pydantic model to a JSON schema
        },
    },
)
```
Output:

```json
{
  "brand": "Lexus",
  "model": "IS F",
  "car_type": "SUV"
}
```
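Because the response is guaranteed to match the schema, you can parse it straight back into the Pydantic model from the example above. A minimal sketch:

```python
# Parse the guaranteed-valid JSON response into the CarDescription model
car = CarDescription.model_validate_json(response.choices[0].message.content)
print(car.car_type)  # CarType.suv
```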
Use guided JSON
This feature isn't enabled for versions of `ray<=2.48.0`.
Verify compatibility in the vLLM compatibility matrix.
Another method to make your model strictly follow a schema is to use `guided_json`, a vLLM-specific parameter that enforces structured output using decoding backends such as XGrammar or guidance during generation.
As with the previous method, use Pydantic models to define your schemas.
Because `guided_json` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided JSON:

- Define your schema using a Pydantic model.
- Convert the model to a JSON schema with `.model_json_schema()`.
- Pass it to the server using `extra_body={"guided_json": ...}` in your OpenAI client call.
```python
from openai import OpenAI
from pydantic import BaseModel
from enum import Enum

client = OpenAI(...)

# 1. Define your schema using a Pydantic model
class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

# 2. Make a call with the JSON schema
response = client.chat.completions.create(
    ...
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # 3. Pass it to the `guided_json` field as an `extra_body` parameter
    extra_body={
        "guided_json": CarDescription.model_json_schema()  # Convert the Pydantic model to a JSON schema
    },
)
```
Output:

```json
{
  "brand": "Lexus",
  "model": "IS F",
  "car_type": "Coupe"
}
```
Use JSON object type
JSON object mode is an earlier implementation of structured output in the OpenAI API where you set the response format to a JSON object but don't provide any explicit schema.
The model returns a generic JSON object without enforcing a specific schema, which makes this mode simpler for lightweight use cases or undefined schemas. To steer toward a specific schema, you can describe one in the prompt, but this is less reliable than the previous methods.
Use the preceding JSON schema methods for stricter format control and validation.
This approach uses the `response_format` parameter with type `"json_object"`.
```python
from openai import OpenAI

client = OpenAI(...)

response = client.chat.completions.create(
    ...
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # Set the type of `response_format` to `json_object`
    response_format={"type": "json_object"},
)
```
Output:

```json
{
  "brand": "Ford",
  "model": "Mustang",
  "type": "Muscle Car"
}
```
Troubleshoot JSON outputs
Even in structured output mode, LLMs may occasionally produce invalid JSON due to minor formatting errors. The mistakes are often simple, and tools such as `json_repair` can often fix minor issues without losing content.

- Annotate errors: Include error fields in JSON (for example, `"error": "description"`) for troubleshooting.
- System prompts: To reduce errors with the JSON object type method, include a format hint in the prompt:

```json
{"role": "system", "content": "You are a helpful assistant. Always reply in this JSON format: {\"color\": string}"}
```
Configure guided choices
This feature isn't enabled for versions of `ray<=2.48.0`.
If you want the model to choose from a predefined list of options, use `guided_choice`, a vLLM-specific parameter that restricts the model's output to a set of predefined choices during generation.
This is especially useful for classification tasks, form field selection, or any situation where the response must match one of a few allowed values.
Because `guided_choice` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided choices:

- Define your list of valid options.
- Pass it to the server using `extra_body={"guided_choice": ...}` in your OpenAI client call.
```python
from openai import OpenAI

client = OpenAI(...)

# 1. Define the valid choices
choices = ["Purple", "Cyan", "Magenta"]

# 2. Make a call with the choices
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Always reply with one of the choices provided"},
        {"role": "user", "content": "Pick a color"},
    ],
    # 3. Pass it to the `guided_choice` field as an `extra_body` parameter
    extra_body={
        "guided_choice": choices
    },
)
```
Output:

```
Purple
```
Notice how the output follows the required choices. Without guidance, the model might have defaulted to more common options such as red, green, or blue.
Configure guided regular expressions
This feature isn't enabled for versions of `ray<=2.48.0`.
Verify compatibility in the vLLM compatibility matrix.
If you want to constrain the model's output to match a specific pattern, use `guided_regex`, a vLLM-specific feature that enforces regular expression patterns during generation.
This is especially useful for structured outputs such as dates, phone numbers, formatted strings, IDs, or any pattern that needs to follow a strict format.
Because `guided_regex` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided regular expressions:

- Define your regular expression pattern.
- Pass it to the server using `extra_body={"guided_regex": ...}` in your OpenAI client call.
- (Optional) To reduce errors, add an explicit `stop` extra parameter and make sure the regex pattern ends with it.
```python
from openai import OpenAI

client = OpenAI(...)

# 1. Define a regular expression pattern for an email address
email_pattern = r"^customprefix\.[a-zA-Z]+@[a-zA-Z]+\.com\n$"

# 2. Make a call with the regex pattern
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Always reply following the pattern provided"},
        {"role": "user", "content": "Generate an example email address for Alan Turing, who works in Enigma. End your answer with a new line"},
    ],
    # 3. Pass it to the `guided_regex` field as an `extra_body` parameter
    # For more reliability, add a `stop` parameter and include it at the end of your pattern
    extra_body={
        "guided_regex": email_pattern,
        "stop": ["\n"],
    },
)
```
Output:

```
customprefix.alanturing@enigma.com
```

Notice how the output follows the required pattern, starting with `customprefix.`. This constraint couldn't have been inferred from the prompt alone.
Configure guided grammar (advanced)
This feature isn't enabled for versions of `ray<=2.48.0`.
If you want to constrain the model's output to a custom formal grammar, use `guided_grammar`, a vLLM-specific feature that uses a grammar-based decoder during generation.
This is especially useful for generating SQL queries, code snippets, templates, custom DSLs, or any pattern that needs to follow a strict format.
Because `guided_grammar` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided grammar:

- Define your context-free EBNF grammar.
- Pass it to the server using `extra_body={"guided_grammar": ...}` in your OpenAI client call.
```python
from openai import OpenAI

client = OpenAI(...)

# 1. Define the grammar
simplified_sql_grammar = """
start: "SELECT " columns " from " table ";"
columns: column (", " column)?
column: "username" | "email" | "*"
table: "users"
"""

# 2. Make a call with the grammar
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": "Respond with a SQL query using the grammar."},
        {"role": "user", "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table."},
    ],
    # 3. Pass it to the `guided_grammar` field as an `extra_body` parameter
    extra_body={
        "guided_grammar": simplified_sql_grammar
    },
)
```
Output:

```sql
SELECT username, email from users;
```
Although the query is simple enough for a modern LLM to handle without guidance, the output still adheres to the grammar by using uppercase `"SELECT"` and lowercase `"from"`. These formatting rules couldn't have been inferred from the prompt alone; without guidance, the output would probably have used `"FROM"` in all caps instead.
Configure structural tag formatting (advanced)
This feature isn't enabled for versions of `ray<=2.48.0`.
If you want to constrain the model's output to a custom structural tag format, use `structural_tag`, a vLLM-specific feature that uses a grammar-based decoder during generation.
In this mode, the LLM can generate freely, but must follow specific structural rules whenever it encounters a trigger token. You define each structure by a start tag, an end tag, and a schema that constrains the content between them.
This is especially useful for structured function calls, markup-style outputs, tool use, XML-style integration, or any similar pattern.
Because `structural_tag` isn't part of the OpenAI API, you must pass it through the `response_format` parameter so the vLLM engine can intercept and enforce it.
To implement structural tag formatting, do the following:

- Add clear formatting instructions in your system prompt to ensure your model hits the triggers when you want it to.
- Define structure rules with tags (triggers) and their schema.
- Pass them to the server using `response_format={"type": "structural_tag", "structures": [...], "triggers": [...]}` in your OpenAI client call.
```python
# structural_tag.py
from openai import OpenAI

client = OpenAI(...)

# 1. Describe the overall structural constraint in a system prompt
system_prompt = """
You are a helpful assistant.
You can answer user questions and optionally call a function if needed. If calling a function, use the format:
<function=function_name>{"arg1": value1, ...}</function>

Example:
<function=get_weather>{"city": "San Francisco"}</function>

Task:
Start by writing a short description (two sentences max) of the requested city.
Then output a form that uses the tags below, one tag per line.
Finish by writing a short conclusion (two sentences max) on the main touristic things to do there.

Required tag blocks:
<city-name>{"value": string}</city-name>
<state>{"value": string}</state>
<main-borough>{"value": string}</main-borough>
<baseball-teams>{"value": [string]}</baseball-teams>
<weather>{"value": string}</weather>
"""

# 2. Define the structural rules to follow (one per field)
structures = [
    {  # <city-name>{"value": "Boston"}</city-name>
        "begin": "<city-name>",
        "schema": {
            "type": "object",
            "properties": {"value": {"type": "string"}},
            "required": ["value"],
        },
        "end": "</city-name>",
    },
    {  # <state>{"value": "MA"}</state>
        "begin": "<state>",
        "schema": {
            "type": "object",
            "properties": {"value": {"type": "string"}},
            "required": ["value"],
        },
        "end": "</state>",
    },
    {  # <main-borough>{"value": "Charlestown"}</main-borough>
        "begin": "<main-borough>",
        "schema": {
            "type": "object",
            "properties": {"value": {"type": "string"}},
            "required": ["value"],
        },
        "end": "</main-borough>",
    },
    {  # <baseball-teams>{"value": ["Red Sox"]}</baseball-teams>
        "begin": "<baseball-teams>",
        "schema": {
            "type": "object",
            "properties": {
                "value": {
                    "type": "array",
                    "items": {"type": "string"},
                }
            },
            "required": ["value"],
        },
        "end": "</baseball-teams>",
    },
    {  # <weather>{"value": "<function=get_weather>...</function>"}</weather>
        "begin": "<weather>",
        "schema": {
            "type": "object",
            "properties": {
                "value": {
                    "type": "string",
                    "pattern": r"^<function=get_weather>\{.*\}</function>$",
                }
            },
            "required": ["value"],
        },
        "end": "</weather>",
    },
    {  # <function=get_weather>{"city": "San Francisco"}</function>
        "begin": "<function=get_weather>",
        "schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "end": "</function>",
    },
]

# 3. Define the triggers: whenever the model emits one of these prefixes,
#    the matching structure's schema is enforced
triggers = ["<city-name", "<state", "<main-borough", "<baseball-teams", "<weather", "<function="]

# 4. Make a call with the structures and triggers
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": "Tell me about a city in the east coast of the U.S",
        },
    ],
    # 5. Pass the structures and triggers through `response_format` with type `structural_tag`
    response_format={
        "type": "structural_tag",
        "structures": structures,
        "triggers": triggers,
    },
)
```
Output:

```
Let's start with a description of the city you're interested in. For this example, we'll consider New York City, which is a vibrant metropolis located on the east coast of the United States.

<city-name>{"value": "New York City"}</city-name>
<state>{"value": "New York"}</state>
<main-borough>{"value": "Manhattan"}</main-borough>
<baseball-teams>{"value": ["New York Yankees", "New York Mets"]}</baseball-teams>
<weather>{"value": "<function=get_weather>{\"city\": \"New York City\"}</function>"}</weather>

New York City, located in the state of New York, is a bustling city known for its iconic landmarks such as the Statue of Liberty, Central Park, and Times Square. It's also home to two major baseball teams: the New York Yankees and the New York Mets. The weather in New York City can vary greatly depending on the season, so it's always a good idea to check the forecast before visiting.

Conclusion:
New York City offers an incredible array of attractions including Broadway shows, museums such as the Metropolitan Museum of Art and the American Museum of Natural History, and world-class dining experiences. Baseball fans won't want to miss visiting Yankee Stadium or watching a game at Citi Field. Whether you prefer exploring the city's diverse neighborhoods or relaxing in Central Park, New York City has something for everyone.
```
Notice how the model manages to follow nested tag patterns.
Apply best practices
Validate responses
Make sure to validate your model's outputs against your expected format and handle errors gracefully.
```python
from pydantic import BaseModel, ValidationError

# Example schema; replace with the schema you requested from the model
class ColorSchema(BaseModel):
    color: str

response = client.chat.completions.create(...)
output = response.choices[0].message.content

try:
    # Validate the output against the expected JSON schema
    parsed = ColorSchema.model_validate_json(output)
except ValidationError as e:
    print("Validation failed:", e)
```
Enable streaming for responsiveness
Enable `stream=True` in your client call to reduce latency, especially with larger responses.
```python
from openai import OpenAI

client = OpenAI(...)

response = client.chat.completions.create(
    ...
    # Enable streaming mode
    stream=True,
)

# Stream chunk by chunk
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        print(data, end="", flush=True)
```
Streaming mode can lead to incomplete or malformed outputs if system or application errors occur. Make sure to implement proper error handling and fallbacks.
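For example, a minimal sketch that buffers the streamed chunks from the example above and validates the assembled output only after the stream completes (assuming the response is expected to be JSON):

```python
import json

buffer = []
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        buffer.append(data)

raw = "".join(buffer)
try:
    parsed = json.loads(raw)  # Validate only once the stream is complete
except json.JSONDecodeError:
    parsed = None  # Fall back: repair the output or retry the request
```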
Reduce format drift with deterministic sampling
For structured outputs, use low temperature settings to encourage deterministic behavior and reduce format drift.
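For example, a minimal sketch (the message and response format are illustrative):

```python
response = client.chat.completions.create(
    ...
    messages=[{"role": "user", "content": "Pick a color"}],
    response_format={"type": "json_object"},
    # Low temperature makes sampling near-deterministic and reduces format drift
    temperature=0.0,
)
```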
Handle reasoning models
Reasoning models generate internal "thoughts" before producing the final structured output. The guided decoding backend waits until the end of the reasoning segment (for example, a closing `</think>` tag) before enforcing the structured output.
First, pick an appropriate reasoning parser for your model:
```yaml
applications:
- name: my-structured-output-app
  ...
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwq-32B
          model_source: Qwen/QwQ-32B
        ...
        engine_kwargs:
          ...
          reasoning_parser: deepseek-r1  # <-- for QwQ models
```
If you set an appropriate reasoning parser, the response places the thinking process in the `reasoning_content` field and the structured output in `content`:
```python
ChatCompletionMessage(
    role='assistant',
    reasoning_content="Okay, let me think this through step by step. First, Lexus is a brand that...",
    content='{"brand": "Lexus", "model": "IS F", "car_type": "SUV"}',
)
```
Without a reasoning parser, the reasoning text may spill into `content`, often wrapped in `<think>...</think>`, which can break your structured output parsing.
For details on how to configure a reasoning parser, see Deploy a reasoning LLM: Parse reasoning outputs.
Summary
In this guide, you learned how to enforce structured formats in model outputs using JSON schemas, guided choices, regular expressions, grammars, and structural tags. You also picked up tips on troubleshooting, output validation, streaming, and handling reasoning outputs.
To explore related patterns such as function calling or tool use, see LLMs and agentic AI on Anyscale.