Metrics and Tests
Metrics
When defining modules in the pipeline, you can also specify metrics to evaluate the module outputs. The metrics are defined in the `eval` attribute of the module definition.
To specify the input to each metric, you can use the `use` method.
For example, suppose we have a retriever on which we want to use the `PrecisionRecallF1` metric. We can define the retriever as follows:
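A minimal sketch of such a module definition. The import paths and constructor arguments are assumptions based on the description above; `dataset.question` and `dataset.ground_truth_context` are illustrative field selectors that should match the fields of your own dataset.

```python
from typing import List

# Import paths are assumed; adjust them to your installation.
from continuous_eval.eval import Dataset, Module, ModuleOutput
from continuous_eval.metrics.retrieval import PrecisionRecallF1

dataset = Dataset("data/eval_set")  # illustrative dataset location

retriever = Module(
    name="retriever",
    input=dataset.question,
    output=List[str],  # the retriever returns a list of context strings
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(),  # the module's own output
            ground_truth_context=dataset.ground_truth_context,  # dataset field
        ),
    ],
)
```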
The `PrecisionRecallF1` metric expects two inputs: `retrieved_context` and `ground_truth_context`. To use it to evaluate the module, we specify that `retrieved_context` is the module’s output, while `ground_truth_context` is the dataset’s ground truth context (here we used the dataset field).
The `ModuleOutput` class is flexible and allows for custom selectors. Since `PrecisionRecallF1` expects a `List[str]` as input for both arguments, specifying a plain `ModuleOutput` assumes the module actually returns a list of strings. Suppose instead that it returns a list of dictionaries where `"page_content"` is the key for the text we want to evaluate.
We could specify the output as follows:
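A sketch of the same module with a custom selector. It assumes `ModuleOutput` accepts a callable that extracts the value to evaluate from the raw module output; the `Documents` alias and the `DocumentsContent` name are illustrative helpers.

```python
from typing import Dict, List

# The retriever now returns a list of dictionaries instead of plain strings.
Documents = List[Dict[str, str]]

# Selector that pulls the text under "page_content" out of each returned document.
DocumentsContent = ModuleOutput(lambda docs: [doc["page_content"] for doc in docs])

retriever = Module(
    name="retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=DocumentsContent,  # extracted page contents
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)
```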
The evaluation runner will take care of extracting the relevant data from the module output and the dataset.
Tests
Each module can also have tests to ensure the module is working as expected. The tests are defined in the `tests` attribute of the module definition.
Suppose we want to make sure the average precision of the retriever is greater than 0.8.
The `MeanGreaterOrEqualThan` test expects the name of the metric to test, the minimum value, and the name of the test.
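A sketch of the retriever definition extended with such a test. The import path and argument names (`test_name`, `metric_name`, `min_value`) are assumptions based on the description above, and `context_precision` is an illustrative metric key; use whatever key the metric actually reports in your setup.

```python
# Import path assumed; adjust to your installation.
from continuous_eval.eval.tests import MeanGreaterOrEqualThan

retriever = Module(
    name="retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
    tests=[
        MeanGreaterOrEqualThan(
            test_name="Average Precision",    # name shown in the test report
            metric_name="context_precision",  # illustrative metric key
            min_value=0.8,                    # minimum acceptable mean value
        ),
    ],
)
```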
The evaluation runner will run the test and report the results in its output.