Overview of Metrics
Metric Categories
The continuous-eval package offers three categories of metrics based on how they are computed:
- Deterministic: calculated from statistical formulas
- Semantic: calculated using smaller NLP models (e.g., BERT- and DeBERTa-based models)
- LLM-based: calculated by an Evaluation LLM with curated prompts
All metrics come with pros and cons, and there is no one-size-fits-all evaluation pipeline that is optimal for every use case. We aim to provide a wide range of metrics for you to choose from.
The package also offers a way to combine different metrics (Ensemble Metrics) to improve quality and efficiency.
Metric Class
Below is the list of metrics available:
| Module | Category | Metrics |
|---|---|---|
| Retrieval | Deterministic | PrecisionRecallF1, RankedRetrievalMetrics, TokenCount |
| Retrieval | LLM-based | LLMBasedContextPrecision, LLMBasedContextCoverage |
| Text Generation | Deterministic | DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability |
| Text Generation | Semantic | DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity |
| Text Generation | LLM-based | LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency |
| Classification | Deterministic | ClassificationAccuracy |
| Code Generation | Deterministic | CodeStringMatch, PythonASTSimilarity |
| Code Generation | LLM-based | LLMBasedCodeGeneration |
| Agent Tools | Deterministic | ToolSelectionAccuracy |
| Custom | | Define your own metrics |
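To show how these metric classes are used in practice, here is a minimal usage sketch. The exact import path and call signature may differ between continuous-eval versions, so treat the names below as assumptions and check the package reference for your version.

```python
# Minimal sketch, assuming the metric class is importable from the retrieval
# module and is callable with keyword arguments (verify against your version).
from continuous_eval.metrics.retrieval import PrecisionRecallF1

metric = PrecisionRecallF1()
result = metric(
    retrieved_context=["Paris is the capital of France."],
    ground_truth_context=["The capital of France is Paris."],
)
print(result)  # typically a dict of scores, e.g. precision / recall / F1
```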
Retrieval metrics
Deterministic
PrecisionRecallF1
- Definition: Rank-agnostic metrics including Precision, Recall, and F1 of Retrieved Contexts
- Inputs: `retrieved_context`, `ground_truth_context`
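To make the definition concrete, here is a simplified sketch of a rank-agnostic computation. It assumes a retrieved chunk counts as a hit only when it exactly matches a ground-truth chunk; the package's actual matching strategy may be more lenient.

```python
# Illustrative only: simplified precision/recall/F1 over retrieved contexts
# using exact string matching between chunks.
def precision_recall_f1(retrieved_context, ground_truth_context):
    retrieved, relevant = set(retrieved_context), set(ground_truth_context)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"context_precision": precision, "context_recall": recall, "context_f1": f1}
```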
RankedRetrievalMetrics
- Definition: Rank-aware metrics including Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) of retrieved contexts
- Inputs: `retrieved_context`, `ground_truth_context`
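The sketch below illustrates the rank-aware scores for a single query under a simplifying assumption of binary relevance (a retrieved chunk is relevant iff it appears in the ground-truth contexts); it is not the package's implementation.

```python
import math

# Illustrative rank-aware retrieval scores for one query, binary relevance.
def ranked_retrieval_metrics(retrieved_context, ground_truth_context):
    relevant = set(ground_truth_context)
    hits, precisions_at_hits, reciprocal_rank, dcg = 0, [], 0.0, 0.0
    for rank, chunk in enumerate(retrieved_context, start=1):
        if chunk in relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)       # precision@k at each hit
            if reciprocal_rank == 0.0:
                reciprocal_rank = 1.0 / rank              # rank of the first hit
            dcg += 1.0 / math.log2(rank + 1)
    ideal_dcg = sum(1.0 / math.log2(r + 1) for r in range(1, len(relevant) + 1))
    return {
        "average_precision": sum(precisions_at_hits) / len(relevant) if relevant else 0.0,
        "reciprocal_rank": reciprocal_rank,
        "ndcg": dcg / ideal_dcg if ideal_dcg else 0.0,
    }
```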
TokenCount
- Definition: Counts the number of tokens in the retrieved context
- Inputs: `retrieved_context`
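A quick sketch of the idea using the tiktoken tokenizer; which encoder the metric actually uses is configurable, so the `cl100k_base` encoding below is just an assumption.

```python
import tiktoken

# Illustrative token counting over all retrieved chunks (encoder is an assumption).
def token_count(retrieved_context):
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(chunk)) for chunk in retrieved_context)
```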
LLM-based
LLMBasedContextPrecision
- Definition: Precision and Mean Average Precision (MAP) based on the relevance of each retrieved context, as classified by an LLM
- Inputs: `question`, `retrieved_context`
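The sketch below shows only the shape of the computation: an LLM judge labels each chunk as relevant or not, and precision / MAP are computed over those labels. `judge_relevance` is a hypothetical helper standing in for the curated prompt and Evaluation LLM call the package uses internally.

```python
# Sketch of LLM-judged context precision (judge_relevance is hypothetical).
def llm_based_context_precision(question, retrieved_context, judge_relevance):
    labels = [judge_relevance(question, chunk) for chunk in retrieved_context]  # booleans
    precision = sum(labels) / len(labels) if labels else 0.0
    hits, precisions_at_hits = 0, []
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)
    mean_average_precision = (
        sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0
    )
    return {"llm_precision": precision, "llm_map": mean_average_precision}
```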
LLMBasedContextCoverage
- Definition: Proportion of statements in the Ground Truth Answer that can be attributed to the Retrieved Contexts, as determined by an LLM
- Inputs: `question`, `retrieved_context`, `ground_truth_answers`
Text Generation metrics
Deterministic
DeterministicAnswerCorrectness
- Definition: Includes Token Overlap (Precision, Recall, F1), ROUGE-L (Precision, Recall, F1), and BLEU score of Generated Answer vs. Ground Truth Answer
- Inputs: `generated_answer`, `ground_truth_answers`
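As a concrete illustration of one component, here is a token-overlap F1 sketch for a single ground truth; the actual metric also reports ROUGE-L and BLEU, and when multiple ground truths are given one would typically keep the best score. This is not the package's implementation.

```python
# Illustrative token-overlap F1 between generated and ground-truth answers.
def token_overlap_f1(generated_answer, ground_truth_answer):
    gen, ref = generated_answer.lower().split(), ground_truth_answer.lower().split()
    overlap = len(set(gen) & set(ref))
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```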
DeterministicFaithfulness
- Definition: Proportion of sentences in Answer that can be matched to Retrieved Contexts using ROUGE-L precision, Token Overlap precision, and BLEU score
- Inputs: `retrieved_context`, `generated_answer`
FleschKincaidReadability
- Definition: Measures how easy or difficult the LLM-generated answer is to read
- Inputs: `generated_answer`
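For a quick sense of the scores involved, the sketch below uses the textstat library; whether the metric reports Flesch Reading Ease, the Flesch-Kincaid grade level, or both is an assumption to verify against the package reference.

```python
import textstat

# Illustrative readability scores for a generated answer.
generated_answer = "The mitochondrion produces most of the cell's chemical energy."
print(textstat.flesch_reading_ease(generated_answer))   # higher = easier to read
print(textstat.flesch_kincaid_grade(generated_answer))  # approximate US grade level
```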
Semantic
DebertaAnswerScores
- Definition: Entailment and contradiction scores between the Generated Answer and Ground Truth Answer
- Inputs: `generated_answer`, `ground_truth_answers`
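The sketch below shows how entailment and contradiction scores can be obtained from an NLI model with the transformers library; the exact checkpoint the package uses is an assumption here (an MNLI-tuned DeBERTa is shown).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative NLI scoring; the checkpoint choice is an assumption.
name = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def deberta_scores(generated_answer, ground_truth_answer):
    # Premise = ground truth, hypothesis = generated answer.
    inputs = tokenizer(ground_truth_answer, generated_answer,
                       return_tensors="pt", truncation=True)
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    labels = [model.config.id2label[i].lower() for i in range(probs.shape[0])]
    return dict(zip(labels, probs.tolist()))  # e.g. contradiction / neutral / entailment
```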
BertAnswerRelevance
- Definition: Similarity score based on the BERT model between the Generated Answer and Question
- Inputs: `question`, `generated_answer`
BertAnswerSimilarity
- Definition: Similarity score based on the BERT model between the Generated Answer and Ground Truth Answer
- Inputs: `generated_answer`, `ground_truth_answers`
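Both BERT-based metrics boil down to embedding the texts and comparing them; a minimal sketch with sentence-transformers is shown below. The specific checkpoint used by the package is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative semantic similarity; the model name is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

def bert_answer_similarity(generated_answer, ground_truth_answers):
    gen = model.encode(generated_answer, convert_to_tensor=True)
    refs = model.encode(ground_truth_answers, convert_to_tensor=True)
    return float(util.cos_sim(gen, refs).max())  # best match across ground truths
```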
LLM-based
LLMBasedFaithfulness
- Definition: Binary classification by an LLM of whether the statements in the Generated Answer can be attributed to the Retrieved Contexts
- Inputs: `question`, `retrieved_context`, `generated_answer`
LLMBasedAnswerCorrectness
- Definition: Overall correctness of the Generated Answer, given the Question and Ground Truth Answer, as judged by an LLM
- Inputs: `question`, `generated_answer`, `ground_truth_answers`
LLMBasedAnswerRelevance
- Definition: Relevance of the Generated Answer with respect to the Question
- Inputs: `question`, `generated_answer`
LLMBasedStyleConsistency
- Definition: Consistency of style between the Generated Answer and the Ground Truth Answer(s)
- Inputs: `generated_answer`, `ground_truth_answers`
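All of the LLM-based metrics above share the same judge pattern: a curated prompt is rendered with the listed inputs and sent to the Evaluation LLM, whose reply is parsed into a score. The sketch below illustrates that pattern for answer correctness; `call_llm` and the prompt wording are hypothetical stand-ins, not the package's actual prompts.

```python
# Sketch of the LLM-as-judge pattern (prompt and call_llm are hypothetical).
PROMPT = """You are grading an answer.
Question: {question}
Ground truth answer: {ground_truth}
Generated answer: {generated}
Rate the correctness of the generated answer from 1 (wrong) to 5 (fully correct).
Reply with the number only."""

def llm_based_answer_correctness(question, generated_answer, ground_truth_answers, call_llm):
    prompt = PROMPT.format(
        question=question,
        ground_truth="; ".join(ground_truth_answers),
        generated=generated_answer,
    )
    return int(call_llm(prompt).strip())
```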
Classification metrics
Deterministic
ClassificationAccuracy
- Definition: Proportion of correctly identified items out of the total items
- Inputs: `predictions`, `ground_truth_labels`
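The computation is the standard accuracy formula; a minimal sketch for clarity (not the package's implementation):

```python
# Illustrative accuracy: fraction of predictions matching the ground-truth labels.
def classification_accuracy(predictions, ground_truth_labels):
    correct = sum(p == y for p, y in zip(predictions, ground_truth_labels))
    return correct / len(ground_truth_labels) if ground_truth_labels else 0.0
```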
Code Generation metrics
Deterministic
CodeStringMatch
- Definition: Exact and fuzzy match scores between generated code strings and the ground truth code strings
- Inputs: `answer`, `ground_truths`
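A simplified sketch of the idea is shown below; the fuzzy score uses difflib's ratio, which is an assumption about how "fuzzy" is computed rather than the package's exact implementation.

```python
from difflib import SequenceMatcher

# Illustrative exact and fuzzy matching against multiple ground-truth snippets.
def code_string_match(answer, ground_truths):
    exact = any(answer == gt for gt in ground_truths)
    fuzzy = max(SequenceMatcher(None, answer, gt).ratio() for gt in ground_truths)
    return {"exact_match": exact, "fuzzy_match": fuzzy}
```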
PythonASTSimilarity
- Definition: Similarity of Abstract Syntax Trees (ASTs) for Python code, comparing the generated code to the ground truth code
- Inputs: `answer`, `ground_truths`
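To illustrate what comparing ASTs means in practice, the sketch below parses both snippets and compares their sequences of node types; the package's actual similarity computation over ASTs may be more sophisticated, so treat this as a rough sketch.

```python
import ast
from difflib import SequenceMatcher

# Illustrative structural similarity between two Python snippets.
def python_ast_similarity(answer, ground_truth):
    try:
        nodes_a = [type(n).__name__ for n in ast.walk(ast.parse(answer))]
        nodes_b = [type(n).__name__ for n in ast.walk(ast.parse(ground_truth))]
    except SyntaxError:
        return 0.0  # unparsable code gets the lowest score
    return SequenceMatcher(None, nodes_a, nodes_b).ratio()
```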
Agent Tools metrics
Deterministic
ToolSelectionAccuracy
- Definition: Accuracy of the agent in selecting the correct tool(s) for a given task
- Inputs: `tools`, `ground_truths`
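A rough sketch of the idea: the tools the agent actually called are compared, by name, against the expected tool calls. The structure assumed for `tools` and `ground_truths` (lists of dicts with a `name` field) is an assumption, not the package's data model.

```python
# Illustrative tool-selection accuracy; the input structure is an assumption.
def tool_selection_accuracy(tools, ground_truths):
    selected = {t["name"] for t in tools}
    expected = {t["name"] for t in ground_truths}
    return len(selected & expected) / len(expected) if expected else 0.0
```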