AnswerCorrectness

Definition

**Answer Correctness** outputs a score between 0.0 and 1.0 that assesses the overall quality of the answer, given the question and the ground truth answers.

Scoring rubric in the LLM prompt:

  • 0.0 means that the answer is completely irrelevant to the question.
  • 0.25 means that the answer is relevant to the question but contains major errors.
  • 0.5 means that the answer is relevant to the question and is partially correct.
  • 0.75 means that the answer is relevant to the question and is correct.
  • 1.0 means that the answer is relevant to the question and is correct and complete.
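
Because the metric returns a continuous score rather than one of these discrete levels, it can help to snap the raw value to the nearest rubric level when reporting results. A minimal sketch (the `nearest_rubric_level` helper is hypothetical, not part of the library):

RUBRIC_LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]

def nearest_rubric_level(score: float) -> float:
    # Snap a raw correctness score to the closest rubric level for reporting.
    return min(RUBRIC_LEVELS, key=lambda level: abs(level - score))

print(nearest_rubric_level(0.93))  # 1.0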

Example Usage

Required data items: question, answer, ground_truth_answers

from continuous_eval.metrics.generation.text import AnswerCorrectness

datum = {
    "question": "Who wrote 'Romeo and Juliet'?",
    "answer": "Shakespeare wrote 'Romeo and Juliet'",
    "ground_truth_answers": [
        "William Shakespeare wrote 'Romeo and Juliet'",
        "William Shakespeare",
        "Shakespeare",
        "Shakespeare is the author of 'Romeo and Juliet'",
    ],
}

metric = AnswerCorrectness()
print(metric(**datum))

Sample Output

{
    "correctness": 0.9999867895679586,
    "reasoning": "The generated answer correctly identifies Shakespeare as the author of 'Romeo and Juliet'."
}
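
When running the metric over a dataset, the correctness field can be compared against a pass threshold to produce an aggregate score. A minimal sketch, assuming a list of data items shaped like the datum above (the 0.75 cutoff is an illustrative choice matching the "correct" rubric level, not a library default):

from continuous_eval.metrics.generation.text import AnswerCorrectness

metric = AnswerCorrectness()
data = [datum]  # reuse the datum from the Example Usage section above

PASS_THRESHOLD = 0.75  # illustrative cutoff: "correct" per the rubric above

results = [metric(**d) for d in data]
pass_rate = sum(r["correctness"] >= PASS_THRESHOLD for r in results) / len(results)
print(f"Pass rate: {pass_rate:.2%}")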