If you haven’t installed continuous-eval yet (`pip install continuous-eval`), see the installation instructions first.
Run a single metric
Import the metric of your choice (see the full list of metrics) and get the result.
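For example, here is a minimal sketch using the `PrecisionRecallF1` retrieval metric on a single datum (the import path `continuous_eval.metrics.retrieval` matches recent releases; adjust it if your version differs):

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# A single datum: the question, what the retriever returned,
# and the context a correct answer should be grounded in.
datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1()
print(metric(**datum))  # prints a dict of precision/recall/F1 scores
```

Deterministic metrics like this one run locally; LLM-based metrics additionally require model credentials to be configured.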
Run evaluation over a dataset
In the following code example, we load the example evaluation dataset `retrieval`, create a pipeline with a single module, and select two metric groups, PrecisionRecallF1 and RankedRetrievalMetrics.
The aggregated results are printed in the terminal, and the per-datum results are saved to metrics_results_retr.jsonl.
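A sketch of this flow, assuming the `EvaluationRunner` / `SingleModulePipeline` API and the `example_data_downloader` helper from recent releases (dataset field names such as `retrieved_contexts` follow the example dataset’s manifest and may differ in yours):

```python
from pathlib import Path

from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics

# Download the `retrieval` example dataset
dataset = example_data_downloader("retrieval")

# One-module pipeline with two metric groups; .use(...) binds each
# metric argument to the corresponding dataset field.
pipeline = SingleModulePipeline(
    dataset=dataset,
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=dataset.retrieved_contexts,
            ground_truth_context=dataset.ground_truth_contexts,
        ),
        RankedRetrievalMetrics().use(
            retrieved_context=dataset.retrieved_contexts,
            ground_truth_context=dataset.ground_truth_contexts,
        ),
    ],
)

# Run the metrics, save the per-datum results, and print the aggregates
runner = EvaluationRunner(pipeline)
results = runner.evaluate()
results.save(Path("metrics_results_retr.jsonl"))
print(results.aggregate())
```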
Curate a golden dataset
We recommend AI teams invest in curating a high-quality golden dataset (curated by domain experts and checked against user data) to properly evaluate and improve the LLM pipeline. The golden dataset should be diverse enough to capture the unique design requirements of each LLM pipeline.
If you don’t have a golden dataset, you can use SimpleDatasetGenerator to create a “silver dataset” as a starting point, which you can then modify and improve.
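As an illustrative sketch only: the import path, constructor arguments (`vector_store_index`, `generator_llm`), and `generate` parameters below are assumptions based on earlier releases, so check the dataset-generation docs for the exact signature. The idea is to point the generator at a vector store built over your documents and let an LLM draft question/answer pairs:

```python
from continuous_eval.generators import SimpleDatasetGenerator  # assumed import path

# Assumption: `db` is a vector store index (e.g. Chroma) already built
# over your document corpus; constructing it is out of scope here.
db = ...

generator = SimpleDatasetGenerator(
    vector_store_index=db,               # assumed parameter name
    generator_llm="gpt-4-1106-preview",  # assumed: any capable generator LLM
)

# Draft a small "silver" dataset, then review and refine it by hand
silver_dataset = generator.generate(
    embedding_vector_size=1536,  # assumed: must match your embedding model
    num_questions=50,            # assumed parameter name
)
```

Treat the output as a draft: have domain experts review, correct, and extend it before relying on it as a benchmark.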
Relari also offers custom synthetic dataset generation and augmentation as a service. We have generated granular, pipeline-level datasets for SEC Filings, Company Transcripts, Coding Agents, Dynamic Tool Use, Enterprise Search, Sales Contracts, Company Wikis, Slack Conversations, Customer Support Tickets, Product Docs, and more. Contact us if you are interested.