Skip to content

Evaluators

Overview

The evaluators command group manages Arize evaluators — automated evaluation pipelines that can use either LLM-based templates or Python code to assess model outputs.

Command Description Client Method
evaluators list List evaluators get_evaluators
evaluators get Get an evaluator by name get_evaluator
evaluators create-template Create an LLM template evaluator create_template_evaluator
evaluators create-code Create a Python code evaluator create_code_evaluator
evaluators edit Edit evaluator metadata edit_evaluator
evaluators delete Delete an evaluator delete_evaluator

evaluators list

arize_toolkit evaluators list [--search TEXT] [--name TEXT] [--task-type TYPE]

Lists evaluators with optional filtering.

Options

  • --search (optional) — Search by name substring.
  • --name (optional) — Filter by exact name.
  • --task-type (optional) — Filter by type: template_evaluation or code_evaluation.

Example

arize_toolkit evaluators list
arize_toolkit evaluators list --task-type template_evaluation

evaluators get

arize_toolkit evaluators get NAME

Retrieves full details for an evaluator, including its configuration.

Arguments

  • NAME — The evaluator name.

Example

arize_toolkit --json evaluators get "hallucination-detector"

evaluators create-template

arize_toolkit evaluators create-template NAME --template TEXT --metric-name NAME [OPTIONS]

Creates an LLM-based template evaluator that uses a prompt to evaluate model outputs.

Arguments

  • NAME — Name for the evaluator.

Required Options

  • --template — The prompt template string, or @filepath to read from a file. Use {{variables}} for template substitution.
  • --metric-name — Name for the output metric.

Key Options

  • --commit-message — Version message. Defaults to "Initial version".
  • --description — Evaluator description.
  • --tag — Tags (repeatable).
  • --classification-choices — JSON mapping labels to scores (e.g. '{"Yes":0,"No":1}').
  • --direction — Score direction: maximize or minimize. Defaults to maximize.
  • --data-granularity — Granularity: span, trace, or session. Defaults to span.
  • --include-explanations / --no-explanations — Include LLM explanations. Defaults to enabled.
  • --use-function-calling / --no-function-calling — Use function calling. Defaults to disabled.
  • --llm-integration-name — LLM integration name.
  • --llm-model-name — LLM model name (e.g. gpt-4o).

Example

# Inline template
arize_toolkit evaluators create-template "hallucination-detector" \
    --template "Does the response contain factual errors?\n\nContext: {{context}}\nResponse: {{output}}" \
    --metric-name hallucination_score \
    --classification-choices '{"Yes": 0, "No": 1}' \
    --description "Detects hallucinations in LLM responses"

# Template from file
arize_toolkit evaluators create-template "relevance-eval" \
    --template @templates/relevance.txt \
    --metric-name relevance_score \
    --llm-model-name gpt-4o

evaluators create-code

arize_toolkit evaluators create-code NAME --metric-name NAME --code TEXT --evaluation-class CLASS --span-attribute ATTR... [OPTIONS]

Creates a Python code evaluator. The code must define a class that extends CodeEvaluator with an evaluate method.

Arguments

  • NAME — Name for the evaluator.

Required Options

  • --metric-name — Name for the output metric.
  • --code — Python code string, or @filepath to read from a file.
  • --evaluation-class — The class name in the code block.
  • --span-attribute — Span attributes to pass as inputs (repeatable, e.g. --span-attribute output --span-attribute input).

Key Options

  • --commit-message — Version message. Defaults to "Initial version".
  • --description — Evaluator description.
  • --tag — Tags (repeatable).
  • --data-granularity — Granularity: span, trace, or session. Defaults to span.
  • --package-imports — Python import statements.

Example

arize_toolkit evaluators create-code "response-length" \
    --metric-name response_length \
    --code @evaluators/length_check.py \
    --evaluation-class ResponseLengthEvaluator \
    --span-attribute output \
    --description "Checks response length"

evaluators edit

arize_toolkit evaluators edit EVALUATOR_ID [--name NAME] [--description TEXT] [--tag TAG]...

Edits an evaluator's metadata. Does not create a new version.

Arguments

  • EVALUATOR_ID — The evaluator ID.

Options

  • --name (optional) — Updated name.
  • --description (optional) — Updated description.
  • --tag (optional) — Updated tags (repeatable).

Example

arize_toolkit evaluators edit "eval-123" --name "hallucination-v2" --tag production

evaluators delete

arize_toolkit evaluators delete EVALUATOR_ID [--yes]

Deletes an evaluator. Prompts for confirmation unless --yes is passed.

Arguments

  • EVALUATOR_ID — The evaluator ID.

Options

  • --yes — Skip confirmation.

Example

arize_toolkit evaluators delete "eval-123" --yes