Quick Start
If you haven’t installed continuous-eval yet, see the installation instructions first.
Run a single metric
Import the metric of your choice (see all metrics) and get the results.
```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# A dataset is just a list of dictionaries containing the relevant information
datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

# Let's initialize the metric
metric = PrecisionRecallF1()

# Let's calculate the metric for the first datum
print(metric(**datum))
```
Run evaluation over a dataset
In the following code example, we load the example evaluation dataset `retrieval`, create a pipeline with a single module, and select two metric groups, `PrecisionRecallF1` and `RankedRetrievalMetrics`. The aggregated results are printed to the terminal.
```python
from time import perf_counter

from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import Dataset, EvaluationRunner, SingleModulePipeline
from continuous_eval.eval.tests import GreaterOrEqualThan
from continuous_eval.metrics.retrieval import (
    PrecisionRecallF1,
    RankedRetrievalMetrics,
)


def main():
    # Let's download the retrieval dataset example
    dataset_jsonl = example_data_downloader("retrieval")
    dataset = Dataset(dataset_jsonl)

    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,  # type: ignore
                ground_truth_context=dataset.ground_truth_contexts,  # type: ignore
            ),
            RankedRetrievalMetrics().use(
                retrieved_context=dataset.retrieved_contexts,  # type: ignore
                ground_truth_context=dataset.ground_truth_contexts,  # type: ignore
            ),
        ],
        tests=[
            GreaterOrEqualThan(
                test_name="Recall", metric_name="context_recall", min_value=0.8
            ),
        ],
    )

    # We start the evaluation manager and run the metrics
    tic = perf_counter()
    runner = EvaluationRunner(pipeline)
    eval_results = runner.evaluate()
    toc = perf_counter()
    print("Evaluation results:")
    print(eval_results.aggregate())
    print(f"Elapsed time: {toc - tic:.2f} seconds\n")

    print("Running tests...")
    test_results = runner.test(eval_results)
    print(test_results)


if __name__ == "__main__":
    # It is important to run this script in a new process to avoid
    # multiprocessing issues
    main()
```
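To run the example, save it as a script (for instance `quickstart.py`, any filename works) and execute it with `python quickstart.py` from a terminal rather than an interactive session: as the comment above notes, the evaluation may use multiprocessing, so the `if __name__ == "__main__":` guard must stay in place to keep worker processes from re-running the script.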
Continuous-eval is designed to support multi-module evaluation. In this case we instead assume the system is composed of a single module (the retriever), so we can use the `SingleModulePipeline` class to set up the pipeline.
In the pipeline we added both metrics (i.e., `PrecisionRecallF1` and `RankedRetrievalMetrics`) and tests (i.e., `GreaterOrEqualThan` on the recall metric). Read more about this in the Metrics and Tests page.
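Additional tests can be added to the same `tests` list to gate on more than one metric. The sketch below reuses `GreaterOrEqualThan` from the example above; the metric name `context_precision` is an assumption here, so check the keys in your aggregated results for the exact names produced by `PrecisionRecallF1`.

```python
from continuous_eval.eval.tests import GreaterOrEqualThan

tests = [
    # Same test as in the example above
    GreaterOrEqualThan(
        test_name="Recall", metric_name="context_recall", min_value=0.8
    ),
    # Hypothetical second test: "context_precision" is assumed to be one of
    # the keys emitted by PrecisionRecallF1 -- verify against your results
    GreaterOrEqualThan(
        test_name="Precision", metric_name="context_precision", min_value=0.7
    ),
]
```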
Curate a golden dataset
We recommend AI teams invest in curating a high-quality golden dataset (curated by domain experts and checked against user data) to properly evaluate and improve the LLM pipeline. The golden evaluation dataset should be diverse enough to capture the unique design requirements of each LLM pipeline.
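A practical starting point is to store the golden dataset as a JSONL file, one dictionary per line with the same fields as the datum in the first example. The sketch below assumes `Dataset` accepts a path to such a file, as it does for the downloaded `retrieval` example; the filename and field values are placeholders to adapt to your own pipeline.

```python
import json
from pathlib import Path

from continuous_eval.eval import Dataset

# Hypothetical golden dataset curated by domain experts; each record mirrors
# the fields used in the single-metric example above.
golden = [
    {
        "question": "What is the capital of France?",
        "retrieved_context": ["Paris is the capital of France and its largest city."],
        "ground_truth_context": ["Paris is the capital of France."],
        "answer": "Paris",
        "ground_truths": ["Paris"],
    },
]

path = Path("golden_dataset.jsonl")  # placeholder filename
with path.open("w") as f:
    for record in golden:
        f.write(json.dumps(record) + "\n")

dataset = Dataset(path)  # assumption: Dataset loads a JSONL file from a path
```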
Relari also offers custom synthetic dataset generation and augmentation as a service. We have generated granular, pipeline-level datasets for SEC Filing, Company Transcript, Coding Agents, Dynamic Tool Use, Enterprise Search, Sales Contracts, Company Wiki, Slack Conversation, Customer Support Tickets, Product Docs, etc. Contact us if you are interested.