Overview of Metrics
Metric Categories
The `continuous-eval` package offers three categories of metrics based on how they are computed:
- Deterministic metrics: calculated based on statistical formulas
- Semantic metrics: calculated using smaller models
- Probabilistic metrics: calculated by an Evaluation LLM with curated prompts
All metrics come with pros and cons, and there is no one-size-fits-all evaluation pipeline that is optimal for every use case. We aim to provide a wide range of metrics for you to choose from.
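As a quick orientation, the sketch below shows where a metric from each category might be imported from. Only `PrecisionRecallF1` (used later on this page) is confirmed here; the generation-metric class names and module path in the comments are assumptions and may differ in your installed version, so check the metrics reference.

```python
# Orientation sketch: one example metric per category.
# Only PrecisionRecallF1 is confirmed on this page; the commented imports are
# assumptions about the generation metrics module.
from continuous_eval.metrics.retrieval import PrecisionRecallF1  # deterministic

# from continuous_eval.metrics.generation.text import DebertaAnswerScores   # semantic (assumed name)
# from continuous_eval.metrics.generation.text import LLMBasedFaithfulness  # probabilistic / LLM-based (assumed name)
```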
Using a metric
There are two ways to use a metric: directly or through a pipeline.
1. Directly
Each metric has a `__call__` method that takes the input data as keyword arguments and returns a dictionary of results.
```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truth_answers": ["Paris"],
}

metric = PrecisionRecallF1()
print(metric(**datum))
```
Additionally, each metric exposes `args`, `schema`, and `help` properties that describe the metric.
The property `args` is a dictionary of the arguments that can be passed to the metric:

```python
>>> print(metric.args)
{
    'retrieved_context': Arg(type=typing.List[str], description='', is_required=True, default=None),
    'ground_truth_context': Arg(type=typing.List[str], description='', is_required=True, default=None)
}
```
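For example, `args` can be used to check a record before computing the metric. A minimal sketch, reusing the `metric` and `datum` objects defined above:

```python
# Minimal sketch: make sure `datum` provides every required argument declared
# by the metric, then call it with only the declared arguments.
missing = [name for name, arg in metric.args.items() if arg.is_required and name not in datum]
if missing:
    raise ValueError(f"Missing required arguments: {missing}")

print(metric(**{name: datum[name] for name in metric.args}))
```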
The property `schema` describes the fields returned by the metric:

```python
>>> print(metric.schema)
{
    'context_precision': Field(type=<class 'float'>, limits=(0.0, 1.0), internal=False, description=None),
    'context_recall': Field(type=<class 'float'>, limits=(0.0, 1.0), internal=False, description=None),
    'context_f1': Field(type=<class 'float'>, limits=(0.0, 1.0), internal=False, description=None)
}
```
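For example, `schema` can be used to post-process a result, e.g. keeping only the fields that are not marked as internal. A minimal sketch, again reusing `metric` and `datum` from above:

```python
# Minimal sketch: filter a metric result down to the public (non-internal)
# fields declared in its schema.
result = metric(**datum)
public_fields = {name for name, field in metric.schema.items() if not field.internal}
print({key: value for key, value in result.items() if key in public_fields})
```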
And finally, the property `help` is a string that describes the metric:

```python
>>> print(metric.help)
"Calculate the precision, recall, and f1 score for the retrieved context given the ground truth context."
```
2. Through a pipeline
This example shows how to use a metric through a pipeline, which is the recommended way when you want to evaluate over a dataset.
```python
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics

if __name__ == "__main__":
    # Let's download the retrieval dataset example
    dataset = example_data_downloader("retrieval")

    # Define the pipeline (system under test)
    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
            RankedRetrievalMetrics().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
        ],
    )

    # We start the evaluation runner and run the metrics over the downloaded dataset
    evalrunner = EvaluationRunner(pipeline)
    metrics = evalrunner.evaluate(dataset)
    print(metrics.aggregate())
```
Note that it is important to place the code that runs the evaluation inside the `if __name__ == "__main__":` block; otherwise the multiprocessing evaluation will not work (and will fall back to single-process evaluation).
More examples
You can find more examples in the examples folder or in the example repository.