Overview
Evaluate the impact of changes to your LLM application before you launch them to production
Evaluations allow you to assess and compare the performance of changes to your LLM application before launching them to production. This helps you identify performance regressions and ensure new releases will perform as expected with real users.
Common use cases for Evaluations include assessing the performance impact of different models, prompt templates, or RAG configurations. After you launch changes to production, you can track their impact on real users with our Analytics features.
Key Features
Test Set Creation with Versioning
Test Sets are versioned groups of Test Cases that enable you to organize testing scenarios. Each Test Set version represents a snapshot of your Test Cases.
Test Cases
Within each Test Set, you can define Test Cases, each containing a specific model and a set of initial messages to prompt the model with. This lets you build evaluations that cover the different scenarios your conversational AI system needs to handle.
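As a rough mental model, a Test Case pairs a model with the initial messages used to prompt it. The sketch below is illustrative only; the field names and structure are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Hypothetical shape of a Test Case; the real schema may differ.
    name: str
    model: str                 # model under evaluation, e.g. "gpt-4o"
    messages: list = field(default_factory=list)  # initial messages to prompt with

refund_case = TestCase(
    name="refund-request",
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "I want a refund for my last order."},
    ],
)
```

Each Test Case like this targets one scenario; a Test Set version is simply a snapshot of a collection of them.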
Running Evaluations
When it's time to evaluate a model's performance, run a Test Set version with 1, 3, 5, or 7 iterations. For each iteration, the selected model generates assistant responses for each Test Case's input messages.
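The run loop can be pictured as follows. This is a minimal sketch, not the product's API: `generate(model, messages)` is a placeholder for whatever client call produces an assistant response.

```python
def run_test_set(test_cases, generate, iterations=3):
    """Run each Test Case `iterations` times and collect the responses.

    `generate(model, messages)` is a hypothetical stand-in for the
    model-inference call; it is not a real API of this product.
    """
    # Odd iteration counts (1, 3, 5, 7) avoid ties when results are combined.
    assert iterations in (1, 3, 5, 7)
    results = {}
    for case in test_cases:
        results[case["name"]] = [
            generate(case["model"], case["messages"]) for _ in range(iterations)
        ]
    return results
```

Running with more than one iteration trades compute for a more stable picture of how the model behaves on each scenario.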
Evaluators and Results
Test Sets come with a list of possible Evaluators. These Evaluators analyze the output of each Test Case and return results categorized as 'Pass,' 'Fail,' or 'Inconclusive.' This granular feedback allows you to pinpoint areas of improvement or weaknesses in your application. We provide default evaluators that test for common failure cases (e.g. apologizing or refusing to respond), as well as user-configurable custom LLM evaluators.
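Conceptually, an evaluator is a function from a model response to one of the three verdicts. The toy refusal check below is a stand-in for the "refusing to respond" default evaluator; the real evaluators may use an LLM judge rather than keyword matching, and the marker list is invented for illustration.

```python
def refusal_evaluator(response_text):
    """Toy stand-in for a 'refusing to respond' evaluator.

    Real evaluators may be LLM-based; this keyword check is illustrative only.
    """
    refusal_markers = ("i'm sorry", "i cannot", "i can't help")
    text = response_text.lower()
    if not text.strip():
        return "Inconclusive"   # nothing to judge
    if any(marker in text for marker in refusal_markers):
        return "Fail"           # model refused instead of answering
    return "Pass"
```

Custom LLM evaluators follow the same contract but let you define the judging criteria yourself.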
If you specify 3, 5, or 7 iterations, the evaluations run on each of the generated responses, giving you a stronger signal of performance.
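One simple way to combine per-iteration verdicts into a single result is majority vote. The aggregation rule below is an assumption for illustration; the product may combine iteration results differently.

```python
from collections import Counter

def aggregate(verdicts):
    """Combine per-iteration verdicts ('Pass'/'Fail'/'Inconclusive') by majority.

    Illustrative only; not necessarily how the product aggregates results.
    """
    counts = Counter(verdicts)
    top_verdict, top_count = counts.most_common(1)[0]
    # With three possible outcomes a tie can occur even for odd iteration
    # counts (e.g. one of each); treat ties as Inconclusive.
    if list(counts.values()).count(top_count) > 1:
        return "Inconclusive"
    return top_verdict
```

For example, `aggregate(["Pass", "Pass", "Fail"])` yields `"Pass"`, while one of each verdict yields `"Inconclusive"`.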
Result Comparison
Result comparison lets you view the results of each run side by side across different versions of a Test Set, so you can understand the per-Test Case performance impact of the changes you make.
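At its core, comparing two runs means diffing their per-Test Case results. The sketch below assumes a hypothetical result shape of `{test_case_name: verdict}` and simply reports which Test Cases changed between a baseline run and a candidate run.

```python
def compare_runs(baseline, candidate):
    """Diff per-Test Case verdicts between two runs.

    Assumes a hypothetical result shape: {test_case_name: "Pass"|"Fail"|"Inconclusive"}.
    Returns {name: (before, after)} for every Test Case whose verdict changed.
    """
    changes = {}
    for name in baseline.keys() | candidate.keys():
        before = baseline.get(name, "missing")
        after = candidate.get(name, "missing")
        if before != after:
            changes[name] = (before, after)
    return changes
```

A regression then shows up directly as a `("Pass", "Fail")` entry for the affected Test Case.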