Overview of Metrics
Metric Categories
The `continuous-eval` package offers three categories of metrics based on how they are computed:
- Deterministic metrics: calculated based on statistical formulas
- Semantic metrics: calculated using smaller models
- Probabilistic metrics: calculated by an Evaluation LLM with curated prompts
All metrics come with pros and cons, and there is no one-size-fits-all evaluation pipeline that is optimal for every use case. We aim to provide a wide range of metrics for you to choose from.
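As a quick orientation, the sketch below shows where a metric from each category might be imported from. Only `PrecisionRecallF1` (used later on this page) is confirmed here; the generation-metric class names and module path in the comments are assumptions and may differ in your installed version, so check the metrics reference.

```python
# Orientation sketch: one example metric per category.
# Only PrecisionRecallF1 is confirmed on this page; the commented imports are
# assumptions about the generation metrics module.
from continuous_eval.metrics.retrieval import PrecisionRecallF1  # deterministic

# from continuous_eval.metrics.generation.text import DebertaAnswerScores   # semantic (assumed name)
# from continuous_eval.metrics.generation.text import LLMBasedFaithfulness  # probabilistic / LLM-based (assumed name)
```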
Using a metric
There are two ways to use a metric: directly or through a pipeline.
1. Directly
Each metric has a `__call__` method that takes the input data as keyword arguments and returns a dictionary of results.
```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truth_answers": ["Paris"],
}

metric = PrecisionRecallF1()
print(metric(**datum))
```
Additionally, each metric exposes `args`, `schema`, and `help` properties that describe the metric.
The property `args` is a dictionary of the arguments that can be passed to the metric:

```python
>>> print(metric.args)
{
    'retrieved_context': Arg(type=typing.List[str], description='', is_required=True, default=None),
    'ground_truth_context': Arg(type=typing.List[str], description='', is_required=True, default=None)
}
```
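For example, `args` can be used to check a record before computing the metric. A minimal sketch, reusing the `metric` and `datum` objects defined above:

```python
# Minimal sketch: make sure `datum` provides every required argument declared
# by the metric, then call it with only the declared arguments.
missing = [name for name, arg in metric.args.items() if arg.is_required and name not in datum]
if missing:
    raise ValueError(f"Missing required arguments: {missing}")

print(metric(**{name: datum[name] for name in metric.args}))
```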
The property `schema` describes the fields returned by the metric:

```python
>>> print(metric.schema)
{
    'context_precision': Field(type=<class 'float'>, limits=(0.0, 1.0), internal=False, description=None),
    'context_recall': Field(type=<class 'float'>, limits=(0.0, 1.0), internal=False, description=None),
    'context_f1': Field(type=<class 'float'>, limits=(0.0, 1.0), internal=False, description=None)
}
```
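For example, `schema` can be used to post-process a result, e.g. keeping only the fields that are not marked as internal. A minimal sketch, again reusing `metric` and `datum` from above:

```python
# Minimal sketch: filter a metric result down to the public (non-internal)
# fields declared in its schema.
result = metric(**datum)
public_fields = {name for name, field in metric.schema.items() if not field.internal}
print({key: value for key, value in result.items() if key in public_fields})
```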
And finally, the property `help` is a string that describes the metric:

```python
>>> print(metric.help)
"Calculate the precision, recall, and f1 score for the retrieved context given the ground truth context."
```
2. Through a pipeline
This example shows how to use a metric through a pipeline, which is the recommended way when you want to evaluate over a dataset.
```python
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics

if __name__ == "__main__":
    # Let's download the retrieval dataset example
    dataset = example_data_downloader("retrieval")

    # Define the pipeline (system under test)
    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
            RankedRetrievalMetrics().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
        ],
    )

    # We start the evaluation runner and run the metrics over the downloaded dataset
    evalrunner = EvaluationRunner(pipeline)
    metrics = evalrunner.evaluate(dataset)
    print(metrics.aggregate())
```
Note that it is important to place the code that runs the evaluation inside the `if __name__ == "__main__":` block; otherwise the multiprocessing evaluation will not work (and will fall back to single-process evaluation).
More examples
You can find more examples in the examples folder or in the example repository.