Configure structured output for LLMs
Learn how to deploy models with structured output patterns, including JSON schemas, guided choices, regular expressions, and grammar-based constraints.
To use full tool and function calling support, install `ray>=2.49.0`. Most features aren't available in previous versions (`ray<=2.48.0`), but you can still use JSON outputs, specifically the JSON schema type pattern and the JSON object type pattern.
Understand structured output
When you build applications, unstructured LLM outputs can be difficult to extract and process reliably. Structured outputs let you enforce a specific format (such as JSON), removing ambiguity and making it easier to integrate responses into downstream systems.
Use structured output when your application requires predictable fields or values.
Compared to unstructured text, structured output:
- Follows a fixed schema (for example, JSON, regular expressions, choices) instead of freeform language.
- Minimizes post-processing; responses are machine-readable by design.
- Ensures consistency between responses.
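For example, a structured response can be parsed directly, while freeform text needs brittle extraction logic. A minimal illustration (the values are made up):

```python
import json

# Unstructured: extracting fields requires fragile string parsing
freeform = "Sure! The most iconic 90's car is probably the Lexus IS F, a sedan."

# Structured: machine-readable by design
structured = '{"brand": "Lexus", "model": "IS F", "car_type": "sedan"}'
car = json.loads(structured)
print(car["brand"])  # Lexus
```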
Review supported output formats
Since Ray Serve LLM uses vLLM as the inference engine, you can guide models to produce structured outputs using vLLM's built-in output formats.
| Format | Description | Use case example |
|---|---|---|
| JSON output | Returns structured outputs using a JSON schema. | Schema consistency, configuration generation, API responses. |
| Choices | Restricts output to a predefined list of values. | Classification tasks, form field selection. |
| Regular expressions | Validates output format using regular expression patterns. | Dates, phone numbers, formatted strings, IDs. |
| Grammar | Defines rules using EBNF grammar. | SQL queries, code snippets, templates, custom DSLs. |
| Structural tags | Applies schema constraints to parts of the response. | Structured function calls, markup-like outputs, tool use, XML-style integration. |
Not all models support every output format. To verify which structured formats your model supports, see the vLLM compatibility matrix.
Generate JSON output
Use the JSON output mode to enforce a JSON schema during generation.
This is especially useful when you build applications that require consistent, machine-readable outputs, such as configuration generation, API responses, or data extraction pipelines.
The following are two ways to get your model to return JSON outputs:
- OpenAI supports `response_format.type="json_object"` and later improved it with `response_format.type="json_schema"`.
- vLLM supports the same `response_format` interface, but also offers an earlier format with `guided_json`.
Use JSON schema type (recommended)
The most reliable way to enforce a specific schema in your model's output is to use the `response_format` parameter with type `"json_schema"` along with a defined JSON schema.
Define this schema using a Pydantic model, which is the recommended approach.
To implement JSON schema type:

- Define your schema using a Pydantic model.
- Convert the model to a JSON schema with `.model_json_schema()`.
- Pass it to the server in the `response_format` parameter:
  - Set `type` to `"json_schema"`.
  - Set `json_schema` to a dictionary with keys `name` and `schema`.
  - Set the `schema` key to your JSON schema.
The following example uses Pydantic models to define the JSON Schema:
```python
from openai import OpenAI
from pydantic import BaseModel
from enum import Enum

client = OpenAI(...)

# 1. Define your schema using a Pydantic model
class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

# 2. Make a call with the JSON schema
response = client.chat.completions.create(
    ...
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # 3. Set a `response_format` of type `json_schema` and define the schema there
    response_format={
        "type": "json_schema",
        # 4. Provide both `name` and `schema` (required)
        "json_schema": {
            "name": "car-description",
            "schema": CarDescription.model_json_schema(),  # Convert the Pydantic model to a JSON schema
        },
    },
)
```
Output:

```json
{
  "brand": "Lexus",
  "model": "IS F",
  "car_type": "SUV"
}
```
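Because the response is guaranteed to match the schema, you can parse it straight back into the Pydantic model from the example above. A minimal sketch:

```python
# Parse the guaranteed-valid JSON response into the CarDescription model
car = CarDescription.model_validate_json(response.choices[0].message.content)
print(car.car_type)  # CarType.suv
```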
Use guided JSON
This feature isn't enabled for versions of `ray<=2.48.0`.
Verify compatibility in the vLLM compatibility matrix.
Another method to make your model strictly follow a schema is to use `guided_json`, a vLLM-specific parameter that enforces structured output using decoding backends such as XGrammar or guidance during generation.
As with the previous method, use Pydantic models to define your schemas.
Because `guided_json` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided JSON:

- Define your schema using a Pydantic model.
- Convert the model to a JSON schema with `.model_json_schema()`.
- Pass it to the server using `extra_body={"guided_json": ...}` in your OpenAI client call.
```python
from openai import OpenAI
from pydantic import BaseModel
from enum import Enum

client = OpenAI(...)

# 1. Define your schema using a Pydantic model
class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

# 2. Make a call with the JSON schema
response = client.chat.completions.create(
    ...
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # 3. Pass it to the `guided_json` field as an `extra_body` parameter
    extra_body={
        "guided_json": CarDescription.model_json_schema()  # Convert the Pydantic model to a JSON schema
    },
)
```
Output:

```json
{
  "brand": "Lexus",
  "model": "IS F",
  "car_type": "Coupe"
}
```
Use JSON object type
JSON object mode is an earlier implementation of structured output in the OpenAI API where you set the response format to a JSON object but don't provide any explicit schema.
The model returns a generic JSON object without enforcing a specific schema, which makes this mode simpler for lightweight use cases or undefined schemas. To steer toward a specific schema, you can describe one in the prompt, but this is less reliable than the previous methods.
Use the preceding JSON schema methods for stricter format control and validation.
This approach uses the `response_format` parameter with type `"json_object"`.
```python
from openai import OpenAI

client = OpenAI(...)

response = client.chat.completions.create(
    ...
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # Set the type of `response_format` to `json_object`
    response_format={"type": "json_object"},
)
```
Output:

```json
{
  "brand": "Ford",
  "model": "Mustang",
  "type": "Muscle Car"
}
```
Troubleshoot JSON outputs
Even in structured output mode, LLMs may occasionally produce invalid JSON due to minor formatting errors. The mistakes are often simple, and tools such as `json_repair` can often fix minor issues without losing content.

- Annotate errors: Include error fields in JSON (for example, `"error": "description"`) for troubleshooting.
- System prompts: To reduce errors with the JSON object type method, include a format hint in the prompt:

```json
{"role": "system", "content": "You are a helpful assistant. Always reply in this JSON format: {\"color\": string}"}
```
Configure guided choices
This feature isn't enabled for versions of `ray<=2.48.0`.
If you want the model to choose from a predefined list of options, use `guided_choice`, a vLLM-specific parameter that restricts the model's output to a set of predefined choices during generation.
This is especially useful for classification tasks, form field selection, or any situation where the response must match one of a few allowed values.
Because `guided_choice` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided choices:

- Define your list of valid options.
- Pass it to the server using `extra_body={"guided_choice": ...}` in your OpenAI client call.
```python
from openai import OpenAI

client = OpenAI(...)

# 1. Define the valid choices
choices = ["Purple", "Cyan", "Magenta"]

# 2. Make a call with the choices
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Always reply with one of the choices provided"},
        {"role": "user", "content": "Pick a color"},
    ],
    # 3. Pass it to the `guided_choice` field as an `extra_body` parameter
    extra_body={
        "guided_choice": choices
    },
)
```
Output:

```
Purple
```
Notice how the output follows the required choices. Without guidance, the model might have defaulted to more common options such as red, green, or blue.
Configure guided regular expressions
This feature isn't enabled for versions of `ray<=2.48.0`.
Verify compatibility in the vLLM compatibility matrix.
If you want to constrain the model's output to match a specific pattern, use `guided_regex`, a vLLM-specific feature that enforces regular expression patterns during generation.
This is especially useful for structured outputs such as dates, phone numbers, formatted strings, IDs, or any pattern that needs to follow a strict format.
Because `guided_regex` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided regular expressions:

- Define your regular expression pattern.
- Pass it to the server using `extra_body={"guided_regex": ...}` in your OpenAI client call.
- (Optional) To reduce errors, add an explicit `stop` extra parameter and make sure the regex pattern ends with it.
```python
from openai import OpenAI

client = OpenAI(...)

# 1. Define a regular expression pattern for an email address
email_pattern = r"^customprefix\.[a-zA-Z]+@[a-zA-Z]+\.com\n$"

# 2. Make a call with the regex pattern
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Always reply following the pattern provided"},
        {"role": "user", "content": "Generate an example email address for Alan Turing, who works in Enigma. End your answer with a new line"},
    ],
    # 3. Pass it to the `guided_regex` field as an `extra_body` parameter
    # For more reliability, add a `stop` parameter and include it at the end of your pattern
    extra_body={
        "guided_regex": email_pattern,
        "stop": ["\n"],
    },
)
```
Output:

```
customprefix.alanturing@enigma.com
```

Notice how the output follows the required pattern, starting with `customprefix.`. This constraint couldn't have been inferred from the prompt alone.
Configure guided grammar (advanced)
This feature isn't enabled for versions of `ray<=2.48.0`.
If you want to constrain the model's output to a custom formal grammar, use `guided_grammar`, a vLLM-specific feature that uses a grammar-based decoder during generation.
This is especially useful for generating SQL queries, code snippets, templates, custom DSLs, or any pattern that needs to follow a strict format.
Because `guided_grammar` isn't part of the OpenAI API, you must pass it as an `extra_body` parameter so the vLLM engine can intercept and enforce it.
To implement guided grammar:

- Define your context-free EBNF grammar.
- Pass it to the server using `extra_body={"guided_grammar": ...}` in your OpenAI client call.
```python
from openai import OpenAI

client = OpenAI(...)

# 1. Define the grammar
simplified_sql_grammar = """
start: "SELECT " columns " from " table ";"
columns: column (", " column)?
column: "username" | "email" | "*"
table: "users"
"""

# 2. Make a call with the grammar
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": "Respond with a SQL query using the grammar."},
        {"role": "user", "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table."},
    ],
    # 3. Pass it to the `guided_grammar` field as an `extra_body` parameter
    extra_body={
        "guided_grammar": simplified_sql_grammar
    },
)
```
Output:

```sql
SELECT username, email from users;
```
Although the query is simple enough for a modern LLM to handle without guidance, the output still adheres to the grammar by using uppercase `"SELECT"` and lowercase `"from"`. These formatting rules couldn't have been inferred from the prompt alone; without guidance, the output would probably have used `"FROM"` in all caps instead.
Configure structural tag formatting (advanced)
This feature isn't enabled for versions of `ray<=2.48.0`.
If you want to constrain the model's output to a custom structural tag format, use `structural_tag`, a vLLM-specific feature that uses a grammar-based decoder during generation.
In this mode, the LLM can generate freely, but must follow specific structural rules whenever it encounters a trigger token. You define each structure by a start tag, an end tag, and a schema that constrains the content between them.
This is especially useful for structured function calls, markup-style outputs, tool use, XML-style integration, or any similar pattern.
Because `structural_tag` isn't part of the OpenAI API, you must pass it through the `response_format` parameter so the vLLM engine can intercept and enforce it.
To implement structural tag formatting, do the following:

- Add clear formatting instructions in your system prompt to ensure your model hits the triggers when you want it to.
- Define structure rules with tags (triggers) and their schema.
- Pass them to the server using `response_format={"type": "structural_tag", "structures": [...], "triggers": [...]}` in your OpenAI client call.
```python
# structural_tag.py
from openai import OpenAI

client = OpenAI(...)

# 1. Describe the overall structural constraint in a system prompt
system_prompt = """
You are a helpful assistant.
You can answer user questions and optionally call a function if needed. If calling a function, use the format:
<function=function_name>{"arg1": value1, ...}</function>

Example:
<function=get_weather>{"city": "San Francisco"}</function>

Task:
Start by writing a short description (two sentences max) of the requested city.
Then output a form that uses the tags below, one tag per line.
Finish by writing a short conclusion (two sentences max) on the main touristic things to do there.

Required tag blocks:
<city-name>{"value": string}</city-name>
<state>{"value": string}</state>
<main-borough>{"value": string}</main-borough>
<baseball-teams>{"value": [string]}</baseball-teams>
<weather>{"value": string}</weather>
"""

# 2. Define the structural rules to follow (one per field)
structures = [
    {  # <city-name>{"value": "Boston"}</city-name>
        "begin": "<city-name>",
        "schema": {
            "type": "object",
            "properties": {"value": {"type": "string"}},
            "required": ["value"],
        },
        "end": "</city-name>",
    },
    {  # <state>{"value": "MA"}</state>
        "begin": "<state>",
        "schema": {
            "type": "object",
            "properties": {"value": {"type": "string"}},
            "required": ["value"],
        },
        "end": "</state>",
    },
    {  # <main-borough>{"value": "Charlestown"}</main-borough>
        "begin": "<main-borough>",
        "schema": {
            "type": "object",
            "properties": {"value": {"type": "string"}},
            "required": ["value"],
        },
        "end": "</main-borough>",
    },
    {  # <baseball-teams>{"value": ["Red Sox"]}</baseball-teams>
        "begin": "<baseball-teams>",
        "schema": {
            "type": "object",
            "properties": {
                "value": {
                    "type": "array",
                    "items": {"type": "string"},
                }
            },
            "required": ["value"],
        },
        "end": "</baseball-teams>",
    },
    {  # <weather>{"value": "<function=get_weather>...</function>"}</weather>
        "begin": "<weather>",
        "schema": {
            "type": "object",
            "properties": {
                "value": {
                    "type": "string",
                    "pattern": r"^<function=get_weather>\{.*\}</function>$",
                }
            },
            "required": ["value"],
        },
        "end": "</weather>",
    },
    {  # <function=get_weather>{"city": "San Francisco"}</function>
        "begin": "<function=get_weather>",
        "schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "end": "</function>",
    },
]

# 3. Define the triggers: whenever the model emits one of these prefixes,
#    the matching structure's schema is enforced
triggers = ["<city-name", "<state", "<main-borough", "<baseball-teams", "<weather", "<function="]

# 4. Make a call with the structures and triggers
response = client.chat.completions.create(
    ...
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": "Tell me about a city in the east coast of the U.S",
        },
    ],
    # 5. Pass the structures and triggers through `response_format` with type `structural_tag`
    response_format={
        "type": "structural_tag",
        "structures": structures,
        "triggers": triggers,
    },
)
```
Output:

```
Let's start with a description of the city you're interested in. For this example, we'll consider New York City, which is a vibrant metropolis located on the east coast of the United States.

<city-name>{"value": "New York City"}</city-name>
<state>{"value": "New York"}</state>
<main-borough>{"value": "Manhattan"}</main-borough>
<baseball-teams>{"value": ["New York Yankees", "New York Mets"]}</baseball-teams>
<weather>{"value": "<function=get_weather>{\"city\": \"New York City\"}</function>"}</weather>

New York City, located in the state of New York, is a bustling city known for its iconic landmarks such as the Statue of Liberty, Central Park, and Times Square. It's also home to two major baseball teams: the New York Yankees and the New York Mets. The weather in New York City can vary greatly depending on the season, so it's always a good idea to check the forecast before visiting.

Conclusion:
New York City offers an incredible array of attractions including Broadway shows, museums such as the Metropolitan Museum of Art and the American Museum of Natural History, and world-class dining experiences. Baseball fans won't want to miss visiting Yankee Stadium or watching a game at Citi Field. Whether you prefer exploring the city's diverse neighborhoods or relaxing in Central Park, New York City has something for everyone.
```
Notice how the model manages to follow nested tag patterns.
Apply best practices
Validate responses
Make sure to validate your model's outputs against your expected format and handle errors gracefully.
```python
from pydantic import BaseModel, ValidationError

# Example schema; replace with the schema you requested from the model
class ColorSchema(BaseModel):
    color: str

response = client.chat.completions.create(...)
output = response.choices[0].message.content

try:
    # Validate the output against the expected JSON schema
    parsed = ColorSchema.model_validate_json(output)
except ValidationError as e:
    print("Validation failed:", e)
```
Enable streaming for responsiveness
Enable `stream=True` in your client call to reduce latency, especially with larger responses.
```python
from openai import OpenAI

client = OpenAI(...)

response = client.chat.completions.create(
    ...
    # Enable streaming mode
    stream=True,
)

# Stream chunk by chunk
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        print(data, end="", flush=True)
```
Streaming mode can lead to incomplete or malformed outputs if system or application errors occur. Make sure to implement proper error handling and fallbacks.
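For example, a minimal sketch that buffers the streamed chunks from the example above and validates the assembled output only after the stream completes (assuming the response is expected to be JSON):

```python
import json

buffer = []
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        buffer.append(data)

raw = "".join(buffer)
try:
    parsed = json.loads(raw)  # Validate only once the stream is complete
except json.JSONDecodeError:
    parsed = None  # Fall back: repair the output or retry the request
```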
Reduce format drift with deterministic sampling
For structured outputs, use low temperature settings to encourage deterministic behavior and reduce format drift.
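For example, a minimal sketch (the message and response format are illustrative):

```python
response = client.chat.completions.create(
    ...
    messages=[{"role": "user", "content": "Pick a color"}],
    response_format={"type": "json_object"},
    # Low temperature makes sampling near-deterministic and reduces format drift
    temperature=0.0,
)
```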
Handle reasoning models
Reasoning models generate internal "thoughts" before producing the final structured output. The guided decoding backend waits until the end of the reasoning segment (for example, a closing `</think>` tag) before enforcing the structured output.
First, pick an appropriate reasoning parser for your model:
```yaml
applications:
- name: my-structured-output-app
  ...
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwq-32B
          model_source: Qwen/QwQ-32B
        ...
        engine_kwargs:
          ...
          reasoning_parser: deepseek-r1  # <-- for QwQ models
```
If you set an appropriate reasoning parser, the response places the thinking process in the `reasoning_content` field and the structured output in `content`:
```python
ChatCompletionMessage(
    role='assistant',
    reasoning_content="Okay, let me think this through step by step. First, Lexus is a brand that...",
    content='{"brand": "Lexus", "model": "IS F", "car_type": "SUV"}',
)
```
Without a reasoning parser, the reasoning text may spill into `content`, often wrapped in `<think>...</think>`, which can break your structured output parsing.
For details on how to configure a reasoning parser, see Deploy a reasoning LLM: Parse reasoning outputs.
Summary
In this guide, you learned how to enforce structured formats in model outputs using JSON schemas, guided choices, regular expressions, grammars, and structural tags. You also picked up tips on troubleshooting, output validation, streaming, and handling reasoning outputs.
To explore related patterns such as function calling or tool use, see LLMs and agentic AI on Anyscale.