If you haven’t installed continuous-eval yet, go here for installation instructions.
Run a single metric
Import the metric of your choice (see all metrics) and get the results.
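For instance, a retrieval metric such as PrecisionRecallF1 can be called directly on a single data point. Below is a minimal sketch: the sample question and contexts are illustrative, the field names follow the retrieval metrics' expected inputs, and the exact output keys may differ between versions.

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# A single data point with the fields the retrieval metrics expect
datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1()
print(metric(**datum))
# Typically returns precision/recall/F1 over the retrieved context,
# e.g. keys such as context_precision, context_recall, context_f1
```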
Run evaluation over a dataset
In the following code example, we load the retrieval example evaluation dataset, create a pipeline with a single module, and select two metric groups: PrecisionRecallF1 and RankedRetrievalMetrics.
The aggregated results are printed in the terminal.
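A sketch of such a script is shown below. It follows the pattern of the project's examples; the example_data_downloader helper, the .use() bindings, and the EvaluationRunner methods are assumptions about the library's public API and may differ slightly between versions.

```python
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.eval.tests import GreaterOrEqualThan
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics


def main():
    # Download the "retrieval" example dataset
    dataset = example_data_downloader("retrieval")

    # Single-module pipeline: dataset, metric groups, and tests
    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
            RankedRetrievalMetrics().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
        ],
        tests=[
            GreaterOrEqualThan(
                test_name="Recall", metric_name="context_recall", min_value=0.8
            ),
        ],
    )

    # Compute the metrics over the dataset, then run the tests on top of them
    runner = EvaluationRunner(pipeline)
    metrics = runner.evaluate()
    print(metrics.aggregate())  # aggregated results, printed in the terminal

    tests = runner.test(metrics)
    print(tests)


if __name__ == "__main__":
    # Keep the entry point behind a main() guard: evaluation can run in parallel
    main()
```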
Continuous-eval is designed to support multi-module evaluation. Here, however, we assume the system consists of a single module (the retriever), so we can use the SingleModulePipeline class to set up the pipeline.
In the pipeline we add both metric groups (i.e., PrecisionRecallF1 and RankedRetrievalMetrics) and a test (i.e., GreaterOrEqualThan on the recall metric). Read more about this in the Metrics and Tests page.
Curate a golden dataset
We recommend AI teams invest in curating a high-quality golden dataset (curated by domain experts and checked against user data) to properly evaluate and improve the LLM pipeline. The golden dataset should be diverse enough to capture the unique design requirements of each LLM pipeline.
Relari offers custom synthetic dataset generation and augmentation as a service. We have generated granular, pipeline-level datasets for SEC Filing, Company Transcript, Coding Agents, Dynamic Tool Use, Enterprise Search, Sales Contracts, Company Wiki, Slack Conversation, Customer Support Tickets, Product Docs, etc. Contact us if you are interested.