Metrics and Biases#
Biases in ML systems#
Data Biases#
Sampling bias: Data mismatch with the real distribution
Training a model on data that is convenient or readily available, potentially excluding other cases.
Labelling bias: Inaccurate or inconsistent labels
Historical bias: Data collected in the past may not be representative
Recency bias: Relying only on recent data, which might not be representative of the overall distribution.
Model Biases#
Confirmation Bias: Model could be biased towards existing beliefs
Overfitting: Model fits the training data too closely and fails to generalize
Feature Bias: Features such as gender, race, or zip code (or proxies correlated with them) can lead to biased models
Algorithmic Bias: Algorithms could inherently be biased towards a class or view
Decision Trees prioritize splits with higher information gain, which can lead to unfair predictions that target specific groups.
Adversarial Attacks: Neural networks can be prone to attacks where the model is misled by carefully modified inputs.
Recommender Systems: Can create filter bubbles, where all users are recommended similar items because the underlying data already has biases.
Word Embeddings: Can encode biases present in the training text; certain professions may end up correlated with certain genders.
Metrics#
Classification#
Predicted Class
+---------------+---------------+--------------+
| | Positive | Negative |
+---------------+---------------+--------------+
Actual Class | Positive | TP (True | FN (False |
| | Positive) | Negative) |
+---------------+---------------+--------------+
| Negative | FP (False | TN (True |
| | Positive) | Negative) |
+---------------+---------------+--------------+
Predicted Class
+---------------+---------------+------------------+
| | Positive | Negative |
+---------------+---------------+------------------+
Actual Class | Positive | (TP) | (FN) |
| | (Sensitivity) | (Type-2 Error) |
+---------------+---------------+------------------+
| Negative | (FP) | (TN) |
| |(Type-1 Error) | (Specificity) |
+---------------+---------------+------------------+
Accuracy: (TP+TN) / (TP+FP+FN+TN)
Precision: TP / (TP+FP)
Recall: TP / (TP+FN)
f1-score: 2 * Precision * Recall / (Precision+Recall)
Sensitivity = Recall
Specificity: True negatives among all negatives = TN/(TN+FP)
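A minimal sketch of these counts and ratios, assuming y_true and y_pred are NumPy arrays of 0/1 labels (precision_recall_f1 is an illustrative name, not a library function):
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Counts from the confusion matrix
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1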
ROC: Plot of the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the classification threshold varies:
TPR = Sensitivity = Recall
FPR = FP / (FP + TN) = 1 - TN / (TN + FP) = 1 - Specificity
Because the curve is traced over all thresholds, the ROC curve (and AUC) does not depend on any single classification threshold.
AUC: Area under the ROC curve
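A sketch of how the ROC points and AUC can be computed by sweeping thresholds, assuming NumPy arrays and the import above; roc_points and auc are illustrative names, and the trapezoidal rule only approximates the area:
def roc_points(y_true, y_scores):
    # One (FPR, TPR) point per distinct score threshold, highest threshold first
    thresholds = np.sort(np.unique(y_scores))[::-1]
    positives = np.sum(y_true == 1)
    negatives = np.sum(y_true == 0)
    fprs, tprs = [0.0], [0.0]  # start at (0, 0): nothing predicted positive
    for t in thresholds:
        y_pred = (y_scores >= t).astype(int)
        tprs.append(np.sum((y_true == 1) & (y_pred == 1)) / positives)
        fprs.append(np.sum((y_true == 0) & (y_pred == 1)) / negatives)
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    # Trapezoidal area under the ROC curve (points ordered by increasing FPR)
    return np.trapz(tprs, fprs)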
Log Loss: Cross-entropy loss $\( L(y, \hat{y}) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \)$
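A small sketch of log loss averaged over samples (the formula above sums over i; dividing by n gives the mean form commonly reported), clipping probabilities so log(0) never occurs; the eps value is an arbitrary choice:
def log_loss(y_true, y_prob, eps=1e-15):
    # Clip predicted probabilities away from 0 and 1 before taking logs
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))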
import numpy as np
def confusion_matrix(y_true, y_pred, threshold=0.5):
    # Binarize predicted scores at the given threshold
    y_pred = np.where(y_pred >= threshold, 1, 0)
    mat = np.zeros((2, 2))
    mat[0, 0] = np.sum((y_true == 1) & (y_pred == 1))  # TP
    mat[0, 1] = np.sum((y_true == 1) & (y_pred == 0))  # FN
    mat[1, 0] = np.sum((y_true == 0) & (y_pred == 1))  # FP
    mat[1, 1] = np.sum((y_true == 0) & (y_pred == 0))  # TN
    return mat
confusion_matrix(np.array([0,1,0,1,1]),np.array([0.3,0.9,0.9,0.6,0.4]))
array([[2., 1.],
[1., 1.]])
Imbalanced datasets#
Regression#
Mean Squared Error
R² score
Mean Absolute Error
Mean Absolute Percentage Error
Explained Variance Score (sketched after the code below)
def mse(y_true, y_pred):
    # Mean Squared Error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error (assumes y_true contains no zeros)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def r2_score(y_true, y_pred):
    # R² = 1 - SSE / SST
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - sse / sst
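A minimal sketch of the Explained Variance Score, assuming NumPy arrays (it coincides with R² when the residuals have zero mean):
def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y_true)
    return 1 - np.var(y_true - y_pred) / np.var(y_true)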
Retrieval / Ranking#
Precision@k: Number of relevant documents in the top k retrieved / k
Recall@k: Number of relevant documents in the top k retrieved / total number of relevant documents
def precision_at_k(relevant_docs, retrieved_docs, k):
    # Fraction of the top-k retrieved documents that are relevant
    relevant_in_topk = len(set(relevant_docs).intersection(retrieved_docs[:k]))
    return relevant_in_topk / k

def recall_at_k(relevant_docs, retrieved_docs, k):
    # Fraction of all relevant documents that appear in the top-k retrieved
    relevant_in_topk = len(set(relevant_docs).intersection(retrieved_docs[:k]))
    return relevant_in_topk / len(relevant_docs)
Mean Average Precision: For each query, average the precision at every position in the ranking where a relevant document appears, then average over queries $\( MAP = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|R_i|} \sum_{k=1}^{n_i} P_i(k)\, \text{rel}_i(k) \)$ where R_i is the set of relevant documents for query i, n_i is the number of retrieved documents, and rel_i(k) is 1 if the document at position k is relevant, 0 otherwise.
Predicted ranking: [D1, D2, D3, D4, D5], documents are assumed to be ranked according to scores
Relevance labels: [1,0,1,1,1]
At Position 1: Precision = 1/1 = 1
Position 2: D2 is not relevant, so it does not contribute to the average
Position 3: Precision = 2/3
Position 4: 3/4
Position 5: 4/5
Average precision for this query = 1/4 x (1 + 2/3 + 3/4 + 4/5) ≈ 0.804; MAP averages this over all queries
def average_precision(y_true, y_pred):
    # Assumes y_true holds relevance labels already in ranked order;
    # y_pred is only used for its length (the ranked positions).
    relevant = 0
    precisions = []
    for i in range(len(y_pred)):
        if y_true[i] == 1:
            # Relevant document: record precision at this position
            relevant += 1
            precisions.append(relevant / (i + 1))
    return np.sum(precisions) / len(precisions)
average_precision([1,0,1,1,1],[1,2,3,4,5])
np.float64(0.8041666666666667)
NDCG (Normalized Discounted Cumulative Gain)
Considers the position of the document in the retrieved list $\( DCG_k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)} \)$
At position 1: 1 / log2(2) = 1
Position 2: 1 / log2(3) ≈ 0.63
Position 3: 1 / log2(4) = 0.5
As the position increases, the discount grows, so each document contributes less to the metric
NDCG normalizes DCG by considering ideal rank. $\( nDCG_k = \frac{DCG_k}{IDCG_k} \)$
Ground-truth relevance = [3, 2, 3, 0, 1]
Relevance of the documents in the predicted ranking order = [2, 3, 3, 1, 0]
DCG = 2 / log2(1+1) + 3 / log2(2+1) + 3 / log2(3+1) + 1 / log2(4+1) + 0 / log2(5+1)
Ideal DCG: sort the ground-truth relevances in descending order = [3, 3, 2, 1, 0]
IDCG = 3 / log2(1+1) + 3 / log2(2+1) + 2 / log2(3+1) + 1 / log2(4+1) + 0 / log2(5+1)
def dcg_at_k(relevance_scores, k):
    # DCG_k = sum(rel_i / log2(i + 1)) over the top-k positions
    top_k_scores = np.asarray(relevance_scores[:k])
    denominator = np.log2(np.arange(2, 2 + len(top_k_scores)))
    return np.sum(top_k_scores / denominator)

def ndcg_at_k(y_true, y_pred, k):
    # IDCG: DCG of the ideal ranking (relevances sorted in descending order)
    idcg = dcg_at_k(np.sort(y_true)[::-1], k)
    # DCG: true relevances re-ordered by the predicted scores (descending)
    order = np.argsort(y_pred)[::-1]
    dcg = dcg_at_k(np.take(y_true, order), k)
    return dcg / idcg
Mean Reciprocal Rank (MRR): Average over queries of the reciprocal of the position of the first relevant result $\( MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i} \)$
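A minimal MRR sketch, assuming each query's results are given as a binary relevance list already in ranked order (mean_reciprocal_rank is an illustrative name):
def mean_reciprocal_rank(relevance_lists):
    # relevance_lists: one 0/1 relevance list per query, in ranked order
    reciprocal_ranks = []
    for relevances in relevance_lists:
        rr = 0.0
        for rank, rel in enumerate(relevances, start=1):
            if rel == 1:
                rr = 1.0 / rank  # first relevant result found
                break
        reciprocal_ranks.append(rr)
    return np.mean(reciprocal_ranks)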
LLM Metrics#
Evaluating quality of generative text models
Perplexity: How well the model predicts a sample of the test dataset; lower perplexity indicates better performance. $\( PP = \exp \left( -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i) \right) \)$ Equivalently, PP = exp(L), where L is the cross-entropy loss $\( L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p(x_i = k) \)$
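A small sketch, assuming token_probs holds the model's probability for each observed token in the sample:
def perplexity(token_probs):
    # exp of the average negative log-likelihood per token
    token_probs = np.asarray(token_probs)
    return np.exp(-np.mean(np.log(token_probs)))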
BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated text and reference text (the full definition also multiplies by a brevity penalty for short outputs). $\( BLEU = \exp \left( \sum_{n=1}^{4} w_n \log p_n \right) \)$
Calculate the modified precision for each n-gram order
Reference: “Quick brown fox”.
Generated: “Quick browns fox”.
P@1 (unigrams): 2/3, P@2 (bigrams): 0/2, P@3 (trigrams): 0/1
BLEU = exp(w1 * log(p@1) + w2 * log(p@2) + w3 * log(p@3)); when any p@n is 0, as for the bigrams and trigrams here, smoothing is typically applied to avoid log(0)
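A rough sketch of modified n-gram precision and the simplified BLEU formula above (no brevity penalty), assuming whitespace tokenization; the eps smoothing term is an assumption to avoid log(0), and ngrams/modified_precision/bleu are illustrative names:
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    # Clip each candidate n-gram count by its count in the reference
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(reference, candidate, max_n=4, eps=1e-9):
    # Uniform weights; eps keeps log() finite when a precision is zero
    weights = [1 / max_n] * max_n
    log_p = [w * np.log(modified_precision(reference, candidate, n) + eps)
             for n, w in zip(range(1, max_n + 1), weights)]
    return np.exp(np.sum(log_p))

# bleu("Quick brown fox".split(), "Quick browns fox".split(), max_n=1) -> ~0.667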
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures how much of the reference's n-grams are recovered in the generated text (recall-oriented) $\( RS = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{k=1}^{K} \min (count_{ik}, count_{ik}')}{\sum_{k=1}^{K} count_{ik}'} \)$
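A matching ROUGE-N recall sketch, reusing the ngrams helper and Counter import from the BLEU sketch above (rouge_n is an illustrative name):
def rouge_n(reference, candidate, n=1):
    # Recall: overlapping n-grams divided by the total n-grams in the reference
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

# rouge_n("Quick brown fox".split(), "Quick browns fox".split(), n=1) -> ~0.667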