Overview of Metrics
Metric Categories
The continuous-eval package offers three categories of metrics based on how they are computed:
- Deterministic: calculated from statistical formulas
- Semantic: calculated using smaller NLP models (e.g., BERT- and DeBERTa-based models)
- LLM-based: calculated by an Evaluation LLM with curated prompts
All metrics come with pros and cons, and there is no one-size-fits-all evaluation pipeline that is optimal for every use case. We aim to provide a wide range of metrics for you to choose from.
The package also offers a way to combine different metrics (Ensemble Metrics) to improve quality and efficiency.
Metric Class
Below is the list of metrics available:
| Module | Category | Metrics |
|---|---|---|
| Retrieval | Deterministic | PrecisionRecallF1, RankedRetrievalMetrics, TokenCount |
| Retrieval | LLM-based | LLMBasedContextPrecision, LLMBasedContextCoverage |
| Text Generation | Deterministic | DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability |
| Text Generation | Semantic | DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity |
| Text Generation | LLM-based | LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency |
| Classification | Deterministic | ClassificationAccuracy |
| Code Generation | Deterministic | CodeStringMatch, PythonASTSimilarity |
| Code Generation | LLM-based | LLMBasedCodeGeneration |
| Agent Tools | Deterministic | ToolSelectionAccuracy |
| Custom | | Define your own metrics |
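To show how these metric classes are used in practice, here is a minimal usage sketch. The exact import path and call signature may differ between continuous-eval versions, so treat the names below as assumptions and check the package reference for your version.

```python
# Minimal sketch, assuming the metric class is importable from the retrieval
# module and is callable with keyword arguments (verify against your version).
from continuous_eval.metrics.retrieval import PrecisionRecallF1

metric = PrecisionRecallF1()
result = metric(
    retrieved_context=["Paris is the capital of France."],
    ground_truth_context=["The capital of France is Paris."],
)
print(result)  # typically a dict of scores, e.g. precision / recall / F1
```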
Retrieval metrics
Deterministic
PrecisionRecallF1
- Definition: Rank-agnostic metrics including Precision, Recall, and F1 of Retrieved Contexts
- Inputs: `retrieved_context`, `ground_truth_context`
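To make the definition concrete, here is a simplified sketch of a rank-agnostic computation. It assumes a retrieved chunk counts as a hit only when it exactly matches a ground-truth chunk; the package's actual matching strategy may be more lenient.

```python
# Illustrative only: simplified precision/recall/F1 over retrieved contexts
# using exact string matching between chunks.
def precision_recall_f1(retrieved_context, ground_truth_context):
    retrieved, relevant = set(retrieved_context), set(ground_truth_context)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"context_precision": precision, "context_recall": recall, "context_f1": f1}
```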
RankedRetrievalMetrics
- Definition: Rank-aware metrics including Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) of retrieved contexts
- Inputs: `retrieved_context`, `ground_truth_context`
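The sketch below illustrates the rank-aware scores for a single query under a simplifying assumption of binary relevance (a retrieved chunk is relevant iff it appears in the ground-truth contexts); it is not the package's implementation.

```python
import math

# Illustrative rank-aware retrieval scores for one query, binary relevance.
def ranked_retrieval_metrics(retrieved_context, ground_truth_context):
    relevant = set(ground_truth_context)
    hits, precisions_at_hits, reciprocal_rank, dcg = 0, [], 0.0, 0.0
    for rank, chunk in enumerate(retrieved_context, start=1):
        if chunk in relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)       # precision@k at each hit
            if reciprocal_rank == 0.0:
                reciprocal_rank = 1.0 / rank              # rank of the first hit
            dcg += 1.0 / math.log2(rank + 1)
    ideal_dcg = sum(1.0 / math.log2(r + 1) for r in range(1, len(relevant) + 1))
    return {
        "average_precision": sum(precisions_at_hits) / len(relevant) if relevant else 0.0,
        "reciprocal_rank": reciprocal_rank,
        "ndcg": dcg / ideal_dcg if ideal_dcg else 0.0,
    }
```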
TokenCount
- Definition: Counts the number of tokens in the retrieved context
- Inputs: `retrieved_context`
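A quick sketch of the idea using the tiktoken tokenizer; which encoder the metric actually uses is configurable, so the `cl100k_base` encoding below is just an assumption.

```python
import tiktoken

# Illustrative token counting over all retrieved chunks (encoder is an assumption).
def token_count(retrieved_context):
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(chunk)) for chunk in retrieved_context)
```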
LLM-based
LLMBasedContextPrecision
- Definition: Precision and Mean Average Precision (MAP) based on the relevance of each retrieved context, as classified by an LLM
- Inputs: `question`, `retrieved_context`
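The sketch below shows only the shape of the computation: an LLM judge labels each chunk as relevant or not, and precision / MAP are computed over those labels. `judge_relevance` is a hypothetical helper standing in for the curated prompt and Evaluation LLM call the package uses internally.

```python
# Sketch of LLM-judged context precision (judge_relevance is hypothetical).
def llm_based_context_precision(question, retrieved_context, judge_relevance):
    labels = [judge_relevance(question, chunk) for chunk in retrieved_context]  # booleans
    precision = sum(labels) / len(labels) if labels else 0.0
    hits, precisions_at_hits = 0, []
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)
    mean_average_precision = (
        sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0
    )
    return {"llm_precision": precision, "llm_map": mean_average_precision}
```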
LLMBasedContextCoverage
- Definition: Proportion of statements in the Ground Truth Answer that can be attributed to the Retrieved Contexts, as determined by an LLM
- Inputs: `question`, `retrieved_context`, `ground_truth_answers`
Text Generation metrics
Deterministic
DeterministicAnswerCorrectness
- Definition: Includes Token Overlap (Precision, Recall, F1), ROUGE-L (Precision, Recall, F1), and BLEU score of Generated Answer vs. Ground Truth Answer
- Inputs: `generated_answer`, `ground_truth_answers`
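As a concrete illustration of one component, here is a token-overlap F1 sketch for a single ground truth; the actual metric also reports ROUGE-L and BLEU, and when multiple ground truths are given one would typically keep the best score. This is not the package's implementation.

```python
# Illustrative token-overlap F1 between generated and ground-truth answers.
def token_overlap_f1(generated_answer, ground_truth_answer):
    gen, ref = generated_answer.lower().split(), ground_truth_answer.lower().split()
    overlap = len(set(gen) & set(ref))
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```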
DeterministicFaithfulness
- Definition: Proportion of sentences in Answer that can be matched to Retrieved Contexts using ROUGE-L precision, Token Overlap precision, and BLEU score
- Inputs: `retrieved_context`, `generated_answer`
FleschKincaidReadability
- Definition: Measures how easy or difficult the LLM-generated answer is to read
- Inputs: `generated_answer`
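For a quick sense of the scores involved, the sketch below uses the textstat library; whether the metric reports Flesch Reading Ease, the Flesch-Kincaid grade level, or both is an assumption to verify against the package reference.

```python
import textstat

# Illustrative readability scores for a generated answer.
generated_answer = "The mitochondrion produces most of the cell's chemical energy."
print(textstat.flesch_reading_ease(generated_answer))   # higher = easier to read
print(textstat.flesch_kincaid_grade(generated_answer))  # approximate US grade level
```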
Semantic
DebertaAnswerScores
- Definition: Entailment and contradiction scores between the Generated Answer and Ground Truth Answer
- Inputs: `generated_answer`, `ground_truth_answers`
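The sketch below shows how entailment and contradiction scores can be obtained from an NLI model with the transformers library; the exact checkpoint the package uses is an assumption here (an MNLI-tuned DeBERTa is shown).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative NLI scoring; the checkpoint choice is an assumption.
name = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def deberta_scores(generated_answer, ground_truth_answer):
    # Premise = ground truth, hypothesis = generated answer.
    inputs = tokenizer(ground_truth_answer, generated_answer,
                       return_tensors="pt", truncation=True)
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    labels = [model.config.id2label[i].lower() for i in range(probs.shape[0])]
    return dict(zip(labels, probs.tolist()))  # e.g. contradiction / neutral / entailment
```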
BertAnswerRelevance
- Definition: Similarity score based on the BERT model between the Generated Answer and Question
- Inputs: `question`, `generated_answer`
BertAnswerSimilarity
- Definition: Similarity score based on the BERT model between the Generated Answer and Ground Truth Answer
- Inputs: `generated_answer`, `ground_truth_answers`
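Both BERT-based metrics boil down to embedding the texts and comparing them; a minimal sketch with sentence-transformers is shown below. The specific checkpoint used by the package is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative semantic similarity; the model name is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

def bert_answer_similarity(generated_answer, ground_truth_answers):
    gen = model.encode(generated_answer, convert_to_tensor=True)
    refs = model.encode(ground_truth_answers, convert_to_tensor=True)
    return float(util.cos_sim(gen, refs).max())  # best match across ground truths
```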
LLM-based
LLMBasedFaithfulness
- Definition: Binary classification by an LLM of whether the statements in the Generated Answer can be attributed to the Retrieved Contexts
- Inputs: `question`, `retrieved_context`, `generated_answer`
LLMBasedAnswerCorrectness
- Definition: Overall correctness of the Generated Answer, given the Question and Ground Truth Answer, as judged by an LLM
- Inputs: `question`, `generated_answer`, `ground_truth_answers`
LLMBasedAnswerRelevance
- Definition: Relevance of the Generated Answer with respect to the Question
- Inputs: `question`, `generated_answer`
LLMBasedStyleConsistency
- Definition: Consistency of style between the Generated Answer and the Ground Truth Answer(s)
- Inputs: `generated_answer`, `ground_truth_answers`
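All of the LLM-based metrics above share the same judge pattern: a curated prompt is rendered with the listed inputs and sent to the Evaluation LLM, whose reply is parsed into a score. The sketch below illustrates that pattern for answer correctness; `call_llm` and the prompt wording are hypothetical stand-ins, not the package's actual prompts.

```python
# Sketch of the LLM-as-judge pattern (prompt and call_llm are hypothetical).
PROMPT = """You are grading an answer.
Question: {question}
Ground truth answer: {ground_truth}
Generated answer: {generated}
Rate the correctness of the generated answer from 1 (wrong) to 5 (fully correct).
Reply with the number only."""

def llm_based_answer_correctness(question, generated_answer, ground_truth_answers, call_llm):
    prompt = PROMPT.format(
        question=question,
        ground_truth="; ".join(ground_truth_answers),
        generated=generated_answer,
    )
    return int(call_llm(prompt).strip())
```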
Classification metrics
Deterministic
ClassificationAccuracy
- Definition: Proportion of correctly identified items out of the total items
- Inputs: `predictions`, `ground_truth_labels`
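The computation is the standard accuracy formula; a minimal sketch for clarity (not the package's implementation):

```python
# Illustrative accuracy: fraction of predictions matching the ground-truth labels.
def classification_accuracy(predictions, ground_truth_labels):
    correct = sum(p == y for p, y in zip(predictions, ground_truth_labels))
    return correct / len(ground_truth_labels) if ground_truth_labels else 0.0
```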
Code Generation metrics
Deterministic
CodeStringMatch
- Definition: Exact and fuzzy match scores between generated code strings and the ground truth code strings
- Inputs: `answer`, `ground_truths`
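A simplified sketch of the idea is shown below; the fuzzy score uses difflib's ratio, which is an assumption about how "fuzzy" is computed rather than the package's exact implementation.

```python
from difflib import SequenceMatcher

# Illustrative exact and fuzzy matching against multiple ground-truth snippets.
def code_string_match(answer, ground_truths):
    exact = any(answer == gt for gt in ground_truths)
    fuzzy = max(SequenceMatcher(None, answer, gt).ratio() for gt in ground_truths)
    return {"exact_match": exact, "fuzzy_match": fuzzy}
```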
PythonASTSimilarity
- Definition: Similarity of Abstract Syntax Trees (ASTs) for Python code, comparing the generated code to the ground truth code
- Inputs: `answer`, `ground_truths`
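To illustrate what comparing ASTs means in practice, the sketch below parses both snippets and compares their sequences of node types; the package's actual similarity computation over ASTs may be more sophisticated, so treat this as a rough sketch.

```python
import ast
from difflib import SequenceMatcher

# Illustrative structural similarity between two Python snippets.
def python_ast_similarity(answer, ground_truth):
    try:
        nodes_a = [type(n).__name__ for n in ast.walk(ast.parse(answer))]
        nodes_b = [type(n).__name__ for n in ast.walk(ast.parse(ground_truth))]
    except SyntaxError:
        return 0.0  # unparsable code gets the lowest score
    return SequenceMatcher(None, nodes_a, nodes_b).ratio()
```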
Agent Tools metrics
Deterministic
ToolSelectionAccuracy
- Definition: Accuracy of the agent in selecting the correct tool(s) for a given task
- Inputs: `tools`, `ground_truths`
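A rough sketch of the idea: the tools the agent actually called are compared, by name, against the expected tool calls. The structure assumed for `tools` and `ground_truths` (lists of dicts with a `name` field) is an assumption, not the package's data model.

```python
# Illustrative tool-selection accuracy; the input structure is an assumption.
def tool_selection_accuracy(tools, ground_truths):
    selected = {t["name"] for t in tools}
    expected = {t["name"] for t in ground_truths}
    return len(selected & expected) / len(expected) if expected else 0.0
```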