Evaluate the impact of changes to your LLM application before you launch them to production


Evaluations allow you to assess and compare the performance of changes to your LLM application before launching them to production. This helps you identify performance regressions and ensure new releases will perform as expected with real users.

Common use cases for Evaluations are to assess the performance impact of different models, prompt templates, or RAG configurations. After you launch changes to production, you can track their impact on real users with our Analytics features.

Key Features

Test Set Creation with Versioning

Test Sets are versioned groups of Test Cases that enable you to organize testing scenarios. Each Test Set version represents a snapshot of your Test Cases.

Test Cases

Within each Test Set, you define Test Cases, each specifying a model and a set of initial messages to prompt it with. This lets your evaluations cover the different scenarios your conversational AI system needs to handle.
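As a mental model, a Test Case pairs a model with its initial messages, and a Test Set version is a snapshot of such cases. The sketch below illustrates this structure with hypothetical names (`TestCase`, `TestSetVersion`, and the `"gpt-4o"` model value are illustrative, not the product's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Model under evaluation (hypothetical example value)
    model: str
    # Initial messages that prompt the model, in chat format
    messages: list = field(default_factory=list)

@dataclass
class TestSetVersion:
    # A versioned snapshot of a group of Test Cases
    name: str
    version: int
    test_cases: list = field(default_factory=list)

support_cases = TestSetVersion(
    name="customer-support",
    version=1,
    test_cases=[
        TestCase(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful support agent."},
                {"role": "user", "content": "How do I reset my password?"},
            ],
        ),
    ],
)
```

Bumping the version number while editing the Test Cases gives you the snapshot behavior described above: earlier versions stay intact for later comparison.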

Running Evaluations

When it's time to evaluate a model's performance, run a Test Set version with 1, 3, 5, or 7 iterations; in each iteration, the selected model generates assistant responses for the provided input.

Evaluators and Results

Each Test Set comes with a list of available Evaluators, which analyze the output of each Test Case and return a result of 'Pass', 'Fail', or 'Inconclusive'. This granular feedback lets you pinpoint weaknesses and areas for improvement in your application. We provide both default evaluators that test for common failure cases (e.g. apologizing and refusing to respond), and user-configurable custom LLM evaluators.
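To make the three result categories concrete, here is a minimal sketch of an evaluator in the spirit of the default "apologizing and refusing to respond" check. The function name and the marker phrases are illustrative assumptions, not the product's actual evaluator logic:

```python
def refusal_evaluator(response: str) -> str:
    """Flag a common failure case: the model apologizes and refuses to answer.

    Returns one of the three result categories: 'Pass', 'Fail', 'Inconclusive'.
    """
    # Illustrative marker phrases; a real evaluator would be more robust
    refusal_markers = ("i'm sorry", "i apologize", "i can't help")
    text = response.lower()
    if any(marker in text for marker in refusal_markers):
        return "Fail"
    if not text.strip():
        # Empty output: the evaluator cannot reach a verdict
        return "Inconclusive"
    return "Pass"

refusal_evaluator("I'm sorry, I can't help with that.")  # → "Fail"
refusal_evaluator("Sure, here are the steps to reset your password.")  # → "Pass"
```

A custom LLM evaluator would follow the same contract, delegating the judgment to a model prompt instead of string matching.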

If you specify 3, 5, or 7 iterations, the evaluators run on each generated response, giving you a stronger signal of performance than a single generation would.
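One simple way to combine per-iteration results into a single verdict is a majority vote, which the odd iteration counts (3, 5, 7) make tie-free. This is an illustrative aggregation, not necessarily how the product combines results internally:

```python
from collections import Counter

def aggregate(results: list[str]) -> str:
    """Reduce per-iteration evaluator results to one verdict by majority.

    With an odd number of iterations (3, 5, or 7), a two-way
    Pass/Fail split can never tie.
    """
    counts = Counter(results)
    verdict, _ = counts.most_common(1)[0]
    return verdict

aggregate(["Pass", "Fail", "Pass"])  # → "Pass"
aggregate(["Fail", "Fail", "Inconclusive", "Pass", "Fail"])  # → "Fail"
```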

Result Comparison

Result comparison lets you view the results of each run side by side across different versions of a Test Set, so you can understand the per-Test-Case performance impact of the changes you make.
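Conceptually, this comparison is a per-Test-Case diff between two runs. The sketch below shows the idea with hypothetical data structures (run results as simple case-ID-to-verdict mappings), not the product's actual comparison view:

```python
def compare_runs(run_v1: dict, run_v2: dict) -> dict:
    """Return the Test Cases whose verdict changed between two Test Set versions.

    Each run maps a Test Case identifier to its evaluator verdict.
    """
    changes = {}
    for case_id, old in run_v1.items():
        new = run_v2.get(case_id)
        if new is not None and new != old:
            changes[case_id] = (old, new)
    return changes

compare_runs(
    {"greeting": "Pass", "refund": "Fail"},
    {"greeting": "Pass", "refund": "Pass"},
)
# → {"refund": ("Fail", "Pass")}
```

A regression would show up here as a `("Pass", "Fail")` pair, flagging exactly which Test Case degraded between versions.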
