Evaluators
Overview
The evaluators command group manages Arize evaluators — automated evaluation pipelines that can use either LLM-based templates or Python code to assess model outputs.
| Command | Description | Client Method |
|---|---|---|
| evaluators list | List evaluators | get_evaluators |
| evaluators get | Get an evaluator by name | get_evaluator |
| evaluators create-template | Create an LLM template evaluator | create_template_evaluator |
| evaluators create-code | Create a Python code evaluator | create_code_evaluator |
| evaluators edit | Edit evaluator metadata | edit_evaluator |
| evaluators delete | Delete an evaluator | delete_evaluator |
evaluators list
arize_toolkit evaluators list [--search TEXT] [--name TEXT] [--task-type TYPE]
Lists evaluators with optional filtering.
Options
- --search (optional): Search by name substring.
- --name (optional): Filter by exact name.
- --task-type (optional): Filter by type, either template_evaluation or code_evaluation.
Example
arize_toolkit evaluators list
arize_toolkit evaluators list --task-type template_evaluation
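For scripted use, the filters above can be assembled programmatically. The following is a minimal sketch and not part of arize_toolkit: `build_list_command` is a hypothetical helper that only builds the argument list from the flags documented above and rejects task types other than the two listed.

```python
import shlex

# Task types documented for --task-type above.
VALID_TASK_TYPES = {"template_evaluation", "code_evaluation"}


def build_list_command(search=None, name=None, task_type=None):
    """Build the argv for `arize_toolkit evaluators list` (hypothetical helper)."""
    argv = ["arize_toolkit", "evaluators", "list"]
    if search is not None:
        argv += ["--search", search]
    if name is not None:
        argv += ["--name", name]
    if task_type is not None:
        if task_type not in VALID_TASK_TYPES:
            raise ValueError(f"unknown task type: {task_type!r}")
        argv += ["--task-type", task_type]
    return argv


command = build_list_command(task_type="template_evaluation")
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues when a search string contains spaces.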
evaluators get
arize_toolkit evaluators get NAME
Retrieves full details for an evaluator, including its configuration.
Arguments
- NAME: The evaluator name.
Example
arize_toolkit --json evaluators get "hallucination-detector"
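The --json flag in the example makes the output convenient to consume from a script. The sketch below assumes that --json prints a single JSON document to stdout; the output schema is not documented here, so the code does not rely on any particular field. `get_evaluator_json` and its `runner` parameter are hypothetical names introduced for illustration.

```python
import json
import subprocess


def get_evaluator_json(name, runner=subprocess.run):
    """Run `arize_toolkit --json evaluators get NAME` and parse stdout as JSON.

    Assumption: with --json the command prints one JSON document to stdout.
    `runner` is injectable so the function can be exercised without the CLI.
    """
    result = runner(
        ["arize_toolkit", "--json", "evaluators", "get", name],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

With the CLI installed and configured, `get_evaluator_json("hallucination-detector")` would return the parsed details as Python objects.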
evaluators create-template
arize_toolkit evaluators create-template NAME --template TEXT --metric-name NAME [OPTIONS]
Creates an LLM-based template evaluator that uses a prompt to evaluate model outputs.
Arguments
- NAME: Name for the evaluator.
Required Options
- --template: The prompt template string, or @filepath to read from a file. Use {{variables}} for template substitution.
- --metric-name: Name for the output metric.
Key Options
- --commit-message: Version message. Defaults to "Initial version".
- --description: Evaluator description.
- --tag: Tags (repeatable).
- --classification-choices: JSON mapping labels to scores (e.g. '{"Yes":0,"No":1}').
- --direction: Score direction, either maximize or minimize. Defaults to maximize.
- --data-granularity: Granularity, one of span, trace, or session. Defaults to span.
- --include-explanations / --no-explanations: Include LLM explanations. Defaults to enabled.
- --use-function-calling / --no-function-calling: Use function calling. Defaults to disabled.
- --llm-integration-name: LLM integration name.
- --llm-model-name: LLM model name (e.g. gpt-4o).
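Hand-writing the JSON for --classification-choices inside shell quotes is easy to get wrong. The sketch below uses only the Python standard library to serialize a label-to-score mapping and quote it for the shell. The mapping is the one from the option description above; the comment about --direction reflects an assumption about how the direction is applied, not documented behavior.

```python
import json
import shlex

# Label-to-score mapping, matching the example in the option description.
choices = {"Yes": 0, "No": 1}

# json.dumps guarantees valid JSON; shlex.quote makes it safe to paste into a shell.
choices_json = json.dumps(choices)
print("--classification-choices", shlex.quote(choices_json))

# Assumption: with --direction maximize, the label with the highest score
# is the preferred outcome ("No" for this mapping).
best_label = max(choices, key=choices.get)
print(best_label)
```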
Example
# Inline template
arize_toolkit evaluators create-template "hallucination-detector" \
  --template "Does the response contain factual errors?\n\nContext: {{context}}\nResponse: {{output}}" \
  --metric-name hallucination_score \
  --classification-choices '{"Yes": 0, "No": 1}' \
  --description "Detects hallucinations in LLM responses"
# Template from file
arize_toolkit evaluators create-template "relevance-eval" \
  --template @templates/relevance.txt \
  --metric-name relevance_score \
  --llm-model-name gpt-4o
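Before registering a template, it can help to preview how the {{variables}} placeholders read once filled in. The sketch below is a local approximation only: `render_template` is a hypothetical helper, and the actual substitution is performed by Arize and may differ in details such as whitespace handling or missing variables.

```python
import re

# Matches {{name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, values):
    """Replace each {{name}} placeholder with values[name] (local preview only)."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for template variable {key!r}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Does the response contain factual errors?\n\n"
    "Context: {{context}}\nResponse: {{output}}"
)

# List the variables the template expects, then show a filled-in preview.
print(sorted(set(PLACEHOLDER.findall(template))))
print(render_template(template, {
    "context": "Paris is in France.",
    "output": "Paris is the capital of Germany.",
}))
```

Listing the placeholders first is a quick way to check that every variable in the template corresponds to data the evaluator will actually receive.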
evaluators create-code
arize_toolkit evaluators create-code NAME --metric-name NAME --code TEXT --evaluation-class CLASS --span-attribute ATTR... [OPTIONS]
Creates a Python code evaluator. The code must define a class that extends CodeEvaluator with an evaluate method.
Arguments
- NAME: Name for the evaluator.
Required Options
- --metric-name: Name for the output metric.
- --code: Python code string, or @filepath to read from a file.
- --evaluation-class: The class name in the code block.
- --span-attribute: Span attributes to pass as inputs (repeatable, e.g. --span-attribute output --span-attribute input).
Key Options
- --commit-message: Version message. Defaults to "Initial version".
- --description: Evaluator description.
- --tag: Tags (repeatable).
- --data-granularity: Granularity, one of span, trace, or session. Defaults to span.
- --package-imports: Python import statements.
Example
arize_toolkit evaluators create-code "response-length" \
  --metric-name response_length \
  --code @evaluators/length_check.py \
  --evaluation-class ResponseLengthEvaluator \
  --span-attribute output \
  --description "Checks response length"
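A file such as evaluators/length_check.py might look like the sketch below. This page only states that the class must extend CodeEvaluator and define an evaluate method; the import path of CodeEvaluator, the arguments evaluate receives, and the shape of its return value are not documented here. Everything in the sketch is therefore an assumption: a stand-in base class is defined locally so the code runs on its own, evaluate is assumed to receive the --span-attribute values as keyword arguments, and the returned dict and the `MAX_CHARS` threshold are illustrative.

```python
class CodeEvaluator:
    """Local stand-in for the real base class (assumption; use the real one in practice)."""

    def evaluate(self, **span_attributes):
        raise NotImplementedError


class ResponseLengthEvaluator(CodeEvaluator):
    """Scores a response 1.0 if it fits within a character budget, else 0.0."""

    MAX_CHARS = 500  # hypothetical threshold

    def evaluate(self, **span_attributes):
        # Assumption: the attribute named by `--span-attribute output`
        # arrives as the keyword argument `output`.
        output = span_attributes.get("output") or ""
        within_budget = len(output) <= self.MAX_CHARS
        return {
            "score": 1.0 if within_budget else 0.0,
            "label": "ok" if within_budget else "too_long",
        }


print(ResponseLengthEvaluator().evaluate(output="A short answer."))
```

The class name matches the value passed to --evaluation-class in the example, which is the one link between the file and the command that the documentation does confirm.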
evaluators edit
arize_toolkit evaluators edit EVALUATOR_ID [--name NAME] [--description TEXT] [--tag TAG]...
Edits an evaluator's metadata. Does not create a new version.
Arguments
- EVALUATOR_ID: The evaluator ID.
Options
- --name (optional): Updated name.
- --description (optional): Updated description.
- --tag (optional): Updated tags (repeatable).
Example
arize_toolkit evaluators edit "eval-123" --name "hallucination-v2" --tag production
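Because --tag is repeatable, a list of tags expands to one --tag flag per element. A minimal sketch of that expansion follows; `build_edit_command` is a hypothetical helper, not part of arize_toolkit, and it only assembles the arguments shown in the synopsis above. The second tag is an illustrative value.

```python
import shlex


def build_edit_command(evaluator_id, name=None, description=None, tags=()):
    """Build the argv for `arize_toolkit evaluators edit` (hypothetical helper)."""
    argv = ["arize_toolkit", "evaluators", "edit", evaluator_id]
    if name is not None:
        argv += ["--name", name]
    if description is not None:
        argv += ["--description", description]
    for tag in tags:
        # One --tag flag per tag, since the option is repeatable.
        argv += ["--tag", tag]
    return argv


edit_command = build_edit_command(
    "eval-123", name="hallucination-v2", tags=["production", "reviewed"]
)
print(shlex.join(edit_command))
```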
evaluators delete
arize_toolkit evaluators delete EVALUATOR_ID [--yes]
Deletes an evaluator. Prompts for confirmation unless --yes is passed.
Arguments
- EVALUATOR_ID: The evaluator ID.
Options
- --yes: Skip confirmation.
Example
arize_toolkit evaluators delete "eval-123" --yes