The continuous-eval package offers three categories of metrics based on how they are computed:
- Deterministic metrics: calculated based on statistical formulas
- Semantic metrics: calculated using smaller models
- Probabilistic metrics: calculated by an Evaluation LLM with curated prompts
All metrics come with pros and cons, and there is no one-size-fits-all evaluation pipeline that is optimal for every use case. We aim to provide a wide range of metrics for you to choose from.
Using a metric
There are two ways to use a metric: directly, or through a pipeline.
1. Directly
Each metric has a `__call__` method that takes a dictionary of data and returns a dictionary of results.
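For example, here is a minimal sketch using the retrieval metric `PrecisionRecallF1`; the field names `retrieved_context` and `ground_truth_context` are taken from the library's retrieval examples and may differ between versions:

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# A single datum: the fields the retrieval metric expects
datum = {
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1()
# The datum is unpacked into the metric's __call__;
# the call returns a dictionary of results
print(metric(**datum))
```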
Additionally, each metric has `args`, `schema`, and `help` properties that describe the metric:
- The property `args` is a dictionary of the arguments that can be passed to the metric
- The property `schema` describes the fields the input data is expected to contain
- Finally, the property `help` is a string that describes the metric
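For instance, a short sketch reusing the `PrecisionRecallF1` metric from above; the property names follow the description in this section, and the exact contents of each property may vary:

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

metric = PrecisionRecallF1()
print(metric.args)    # dictionary of arguments the metric accepts
print(metric.schema)  # expected fields in the input data
print(metric.help)    # human-readable description of the metric
```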
2. Through a pipeline
The example below shows how to use a metric through a pipeline, which is the recommended approach when you want to evaluate over a dataset.
Note that it is important to place the code that uses the metric inside the `__main__` block; otherwise the multiprocessing evaluation will not work and will fall back to single-process evaluation.
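Below is a minimal sketch. The `example_data_downloader` helper, the `SingleModulePipeline` and `EvaluationRunner` classes, the `.use(...)` binding, and the dataset field names are assumptions drawn from the library's examples and may differ across versions:

```python
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.metrics.retrieval import PrecisionRecallF1


def main():
    # Download an example retrieval dataset (assumed helper)
    dataset = example_data_downloader("retrieval")

    # Wrap the metric in a single-module pipeline, binding the metric's
    # arguments to fields of the dataset (field names are assumptions)
    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
        ],
    )

    # The runner evaluates the whole dataset, using multiple
    # processes when possible
    runner = EvaluationRunner(pipeline)
    results = runner.evaluate()
    print(results)


if __name__ == "__main__":
    # Keep the evaluation inside the __main__ block so that
    # multiprocessing works
    main()
```

The `__main__` guard matters because the runner distributes per-datum metric computation across worker processes, and on platforms that spawn new processes (e.g. Windows and macOS) each worker re-imports the module; without the guard, the evaluation falls back to a single process.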