Probabilistic LLM metrics are LLM-as-a-Judge metrics that return a score distribution with associated confidence levels rather than a single score, making it possible to assess how certain the model is about its evaluation. These distributions are derived from the model's token-level log probabilities.
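To make the mechanism concrete, here is a minimal sketch of the underlying idea (not the library's internal code; the judge model and prompt are placeholders): request a single-token verdict with log probabilities enabled, then renormalize the probabilities of the allowed tokens into a distribution.

```python
# Minimal sketch of the idea (not the library's implementation): ask the judge
# model for a single-token verdict, then renormalize the token log probabilities
# of the allowed values into a score distribution.
import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder judge model
    messages=[{
        "role": "user",
        "content": "Is the answer correct? Reply with Yes or No.\n\n"
                   "Question: What is the capital of France?\n"
                   "Answer: Paris.",
    }],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

allowed = {"Yes", "No"}
top = response.choices[0].logprobs.content[0].top_logprobs
probs = {t.token: math.exp(t.logprob) for t in top if t.token in allowed}
total = sum(probs.values())
distribution = {token: p / total for token, p in probs.items()}
print(distribution)  # e.g. {'Yes': 0.97, 'No': 0.03}
```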
Custom Probabilistic Metric
Similar to the custom LLM-as-a-Judge metric, you can define your own probabilistic metric by extending the ProbabilisticCustomMetric class.
Optionally, you can also add examples to the metric.
Note: See the Current limitations section below for more information about the response format.
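A minimal sketch of what a custom probabilistic metric could look like is shown below. The import path and the constructor arguments (name, criteria, rubric, examples, response_format) mirror the custom LLM-as-a-Judge metric and are assumptions, so check the API reference for the exact signature.

```python
# Sketch only: the import path and argument names are assumptions.
from your_library import ProbabilisticCustomMetric, GoodOrBad

conciseness = ProbabilisticCustomMetric(
    name="Conciseness",
    criteria="Evaluate whether the generated answer is concise and to the point.",
    rubric=(
        "Good: the answer contains no redundant or irrelevant information.\n"
        "Bad: the answer repeats itself or drifts off-topic."
    ),
    # Optional few-shot examples to steer the judge.
    examples=[
        {"answer": "Paris is the capital of France.", "score": "Good"},
    ],
    # Must be a single-token response format (see Current limitations).
    response_format=GoodOrBad,
)
```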
Example Output
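The result is a probability for each allowed value of the response format. The keys and numbers below are illustrative only:

```python
{"Good": 0.87, "Bad": 0.13}
```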
Define a new Probabilistic Metric
Sometimes the criteria, rubric, and examples are not enough to define the metric. In that case, you can define your own probabilistic metric by extending the ProbabilisticMetric class directly.
Classification
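Below is a sketch of what a classification-style metric could look like when subclassing ProbabilisticMetric. The hook name (render_prompt) and the constructor arguments are assumptions and may differ from the actual base class.

```python
# Sketch only: the base-class hooks shown here are assumptions.
from your_library import ProbabilisticMetric, YesOrNo

class ContainsCitation(ProbabilisticMetric):
    """Judges whether the generated answer cites at least one source."""

    def __init__(self):
        super().__init__(name="ContainsCitation", response_format=YesOrNo)

    def render_prompt(self, answer: str) -> str:
        # The judge is expected to reply with a single token ("Yes" or "No"),
        # from which the score distribution is derived.
        return (
            "Does the following answer cite at least one source? "
            "Reply with Yes or No.\n\n"
            f"Answer: {answer}"
        )
```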
Example Output
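Again, the keys and probabilities below are illustrative only:

```python
{"Yes": 0.73, "No": 0.27}
```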
Integer Scoring
In this case, the metric returns a probability distribution over integer values. In addition to the score distribution, the metric can output the weighted score (the probability-weighted average of the integer values) directly using the weighted_score method of the response_format.
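A sketch of integer scoring follows. The import path, constructor arguments, and call interface are assumptions; weighted_score on the response format is the documented way to collapse the distribution into a single value.

```python
# Sketch only: import path, constructor arguments, and call interface are assumptions.
from your_library import ProbabilisticCustomMetric, Integer

relevance = ProbabilisticCustomMetric(
    name="Relevance",
    criteria="Rate how relevant the answer is to the question on a scale from 1 to 5.",
    rubric="1: completely off-topic ... 5: fully and directly answers the question.",
    response_format=Integer,  # single-digit, non-negative integers only
)

distribution = relevance(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
)

# Collapse the distribution into its probability-weighted average.
score = Integer.weighted_score(distribution)  # exact call site may differ
```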
Example Output
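An illustrative distribution and the corresponding weighted score (numbers made up for the example):

```python
{1: 0.02, 2: 0.05, 3: 0.18, 4: 0.45, 5: 0.30}
# weighted score = 1*0.02 + 2*0.05 + 3*0.18 + 4*0.45 + 5*0.30 = 3.96
```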
Current limitations
The response_format must be a single-token value. A few are predefined (GoodOrBad, YesOrNo, Boolean, and Integer), but it is possible to define your own (see the sketch below). For integer scoring, negative values are not supported (they tokenize as two tokens), and neither are values greater than 9.
Arbitrary JSON output formats are not yet supported for probabilistic metrics.
At the moment, only OpenAI models are supported for probabilistic metrics.
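If you define your own response format, each allowed value has to encode to a single token for the judge model's tokenizer. The check below uses tiktoken and is an illustrative helper, not part of the library:

```python
# Illustrative single-token check (not part of the library).
import tiktoken

# Use the encoding that matches your judge model (e.g. o200k_base for gpt-4o
# models, cl100k_base for gpt-4 / gpt-3.5-turbo).
encoding = tiktoken.get_encoding("cl100k_base")

for value in ["High", "Medium", "Low", "-1"]:
    n_tokens = len(encoding.encode(value))
    print(f"{value!r}: {n_tokens} token(s)")  # values with more than 1 token cannot be used
```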