Metric Ensembling
The aim of ensembling different metrics to predict the human label is to combine the strengths and balance out the weaknesses of individual metrics, ultimately leading to more accurate, robust, and reliable predictions.
Each metric might capture different aspects of the data or be sensitive to different patterns, so when we combine them, we often get a more comprehensive view.
What is Conformal Prediction?
Conformal Prediction is a statistical technique that quantifies the confidence of a prediction. In this case, we are trying to predict whether the answer is correct (or faithful). With conformal prediction, instead of just saying “yes” (or “no”), the model tells us “the answer is correct with probability at least 90%”. In essence, conformal prediction doesn’t just give you an answer; it tells you how confident you can be in that answer. If the model is uncertain, conformal prediction reports the datapoint as “undecided”. For the undecided datapoints, we ask the more powerful GPT-4 to judge whether the answer is correct.
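To make the idea concrete, here is a minimal sketch of split conformal prediction for a binary label, in plain Python. It assumes an upstream classifier that already outputs the probability that an answer is correct. The function names (conformal_threshold, prediction_set) and the calibration numbers are made up for illustration, and this is not necessarily how the library implements the technique.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the nonconformity threshold from a calibration set.

    cal_probs[i]:  predicted probability that sample i is positive
    cal_labels[i]: true label of sample i (1 or 0)
    """
    # Nonconformity score: how little probability went to the true label.
    scores = sorted(
        (1.0 - p) if y == 1 else p for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: accept every label.
        return 1.0
    return scores[rank - 1]


def prediction_set(prob_positive, threshold):
    """Return (negative_ok, positive_ok) flags for one new sample."""
    negative_ok = int(prob_positive <= threshold)        # score of label 0
    positive_ok = int(1.0 - prob_positive <= threshold)  # score of label 1
    return negative_ok, positive_ok


# Illustrative calibration data (made up for this sketch).
cal_probs = [0.95, 0.9, 0.85, 0.8, 0.3, 0.2, 0.15, 0.1, 0.05, 0.6]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.1)

confident = prediction_set(0.9, threshold)  # only "positive" survives
undecided = prediction_set(0.5, threshold)  # both labels survive
```

In this sketch a smaller alpha raises the threshold, so more samples end up with both labels in their prediction set.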
Metric Ensembling
The MetricEnsemble class helps you ensemble multiple metrics to predict a ground truth label, such as human labels.
The class leverages the conformal prediction technique to compute a reliable prediction.
Parameters:
training: XYData
: training data; it should contain training.X (the metric outputs, also referred to as features) and training.Y (the ground truth label)
calibration: XYData
: as before, but used to calibrate the conformal predictor
alpha: float
: significance level, defaults to 0.1. The significance level is the probability that the correct label will not be included in the prediction set, so it measures the reliability of the prediction. For example, if alpha is 0.1, then the prediction set will contain the correct label with probability at least 0.9.
random_state: Optional[int]
: random seed, defaults to None
The MetricEnsemble class has the following methods:
predict(self, X: pd.DataFrame, judicator: Optional[Callable] = None)
: it takes as input a dataframe of metric outputs and returns the ensemble's predictions
The predict method returns two numpy arrays:
y_hat
: a binary (1/0) vector with the best-effort predictions of the ensemble
y_set
: a binary array of size (N, 2) where the first column is 1 if, for the significance level set by alpha, the sample can be classified as negative, and the second column is 1 if the sample can be classified as positive.
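As a small illustration of how the rows of y_set can be read, the sketch below maps each row to a status string. Plain Python lists stand in for the numpy array, the helper name describe_row is hypothetical, and the rows are hand-written rather than real output. An all-zero row is not discussed in the text; labelling it "empty" is an assumption of this sketch.

```python
def describe_row(negative_ok, positive_ok):
    """Map one row of y_set to a human-readable status."""
    if negative_ok and positive_ok:
        return "undecided"
    if positive_ok:
        return "positive"
    if negative_ok:
        return "negative"
    # Assumption: all-zero rows are not covered by the text; flag them.
    return "empty"


# Hand-written rows for illustration, not real library output.
y_set = [[0, 1], [1, 0], [1, 1]]
statuses = [describe_row(neg, pos) for neg, pos in y_set]
```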
The set prediction (y_set) can have both columns set to 1, meaning that the ensemble is undecided.
This happens because the particular choice of metrics in the ensemble is not confident enough, or because the significance level alpha is too low (that is, the required confidence 1 - alpha is too high).
In such cases the predict method will call the judicator function (if not None) to make a final decision.
The judicator function takes as input the index of the sample where the predictor is undecided and must return a boolean value (True/False) indicating the final decision.
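The behaviour described above can be sketched in plain Python. This is a stand-in for what predict is described as doing, not the library's code: the helper name resolve_with_judicator is hypothetical, and it assumes the index passed to the judicator is the positional index of the sample.

```python
from typing import Callable, List, Optional, Sequence


def resolve_with_judicator(
    y_hat: Sequence[int],
    y_set: Sequence[Sequence[int]],
    judicator: Optional[Callable[[int], bool]] = None,
) -> List[int]:
    """Return final 0/1 decisions, deferring undecided rows to the judicator."""
    final = [int(v) for v in y_hat]
    if judicator is None:
        return final  # keep the best-effort predictions
    for idx, (negative_ok, positive_ok) in enumerate(y_set):
        if negative_ok == 1 and positive_ok == 1:
            # Assumption: idx is the position of the sample in the input.
            final[idx] = int(judicator(idx))
    return final


# Hypothetical judicator: looks the decision up in a trusted table.
trusted = {2: False}
final = resolve_with_judicator(
    y_hat=[1, 0, 1],
    y_set=[[0, 1], [1, 0], [1, 1]],
    judicator=lambda idx: trusted[idx],
)
```

Only the third sample has both columns set, so it is the only one for which the judicator is consulted.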
Example
In this example we want to use deterministic and semantic metrics to predict the correctness of the answers (as evaluated by a human annotator). When these two kinds of metrics alone are not sufficient to produce a confident prediction, we use the LLM to make the final decision.
First, we compute the deterministic and semantic metrics:
We now split the samples into train, test, and calibration sets and train the classifier.
Note that we are using only the "token_overlap_recall", "deberta_answer_entailment", and "deberta_answer_contradiction" metrics to train the classifier.
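A three-way split of this kind can be sketched with the standard library alone. Here a list of dicts stands in for the dataframe of metric outputs; the helper name three_way_split, the label column "human_label", the 60/20/20 proportions, and the synthetic rows are all assumptions made for illustration.

```python
import random

FEATURES = [
    "token_overlap_recall",
    "deberta_answer_entailment",
    "deberta_answer_contradiction",
]


def three_way_split(rows, label_key, fractions=(0.6, 0.2), seed=0):
    """Shuffle rows and split them into train, calibration, and test parts.

    Each part is returned as (X, y), where X keeps only the FEATURES columns.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = int(len(rows) * fractions[0])
    n_cal = int(len(rows) * fractions[1])
    chunks = (
        indices[:n_train],
        indices[n_train:n_train + n_cal],
        indices[n_train + n_cal:],  # whatever remains is the test part
    )
    parts = []
    for chunk in chunks:
        X = [[rows[i][name] for name in FEATURES] for i in chunk]
        y = [rows[i][label_key] for i in chunk]
        parts.append((X, y))
    return parts


# Synthetic rows for illustration only.
rows = [
    {
        "token_overlap_recall": i / 10,
        "deberta_answer_entailment": 1 - i / 10,
        "deberta_answer_contradiction": 0.0,
        "human_label": i % 2,
    }
    for i in range(10)
]
train, calibration, test = three_way_split(rows, label_key="human_label")
```

Keeping the calibration part separate from the training part matters: the conformal guarantee relies on calibration samples the classifier has not been fitted on.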
Finally, we run the classifier and evaluate the results:
The output would be something like:
Using a judicator
Let’s assume we want to use the LLM to make the final decision when the classifier is undecided.
We can define a judicator function that takes as input the index of the sample where the classifier is undecided and returns a boolean value (True/False) indicating the final decision.
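One way to build such a function is a closure over the evaluated samples, sketched below. The names make_llm_judicator and ask_llm are hypothetical: ask_llm stands for any callable that sends a prompt to an LLM and returns its text reply, and a fake implementation is used here so the sketch runs offline. The prompt wording and the yes/no parsing are assumptions as well.

```python
from typing import Callable, Sequence


def make_llm_judicator(
    questions: Sequence[str],
    answers: Sequence[str],
    ask_llm: Callable[[str], str],
) -> Callable[[int], bool]:
    """Build a judicator that asks an LLM about one sample at a time.

    ask_llm is a hypothetical callable: prompt in, raw text reply out.
    """
    def judicator(index: int) -> bool:
        prompt = (
            "Is the following answer correct? Reply with yes or no.\n"
            f"Question: {questions[index]}\n"
            f"Answer: {answers[index]}"
        )
        reply = ask_llm(prompt)
        # Treat any reply that starts with "yes" as a positive verdict.
        return reply.strip().lower().startswith("yes")

    return judicator


# Stand-in for a real LLM client, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "Yes." if "Paris" in prompt else "No."


judicator = make_llm_judicator(
    questions=["Capital of France?", "Capital of Italy?"],
    answers=["Paris", "Madrid"],
    ask_llm=fake_llm,
)
```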
To use the judicator, we simply pass it to the predict method:
The output would be something like:
Here the predict method called the LLM for the 15.74% of the cases in which the classifier was undecided.
The ensemble is no longer undecided on any sample, and the performance improved.
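A figure such as the share of samples sent to the LLM can be measured by wrapping the judicator in a counter, as in the sketch below. The class name CountingJudicator and the accuracy helper are hypothetical, and the numbers are made up for illustration; they are not the results of the example above.

```python
class CountingJudicator:
    """Wrap a judicator and count how many times it is called."""

    def __init__(self, judicator):
        self._judicator = judicator
        self.calls = 0

    def __call__(self, index):
        self.calls += 1
        return self._judicator(index)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Illustrative use with a stand-in judicator that always answers True.
counting = CountingJudicator(lambda index: True)
undecided_indices = [2, 5]  # made-up positions of undecided samples
decisions = [counting(i) for i in undecided_indices]
call_rate = counting.calls / 8  # 8 is a made-up total number of samples
```

Because the wrapper is itself callable with a single index, it can be used wherever a plain judicator function is expected.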