Metrics and Biases#
Biases in ML systems#
Data Biases#
Sampling bias: Data mismatch with the real distribution
Training a model on data that is convenient or readily available, potentially excluding other cases.
Labelling bias: Inaccurate or inconsistent labels
Historical bias: Data collected in the past may not be representative
Recency bias: Relying only on recent data, which might not be representative of the overall distribution.
Model Biases#
Confirmation Bias: Model could be biased towards existing beliefs
Overfitting: Model fits the training data too closely and fails to generalize
Feature Bias: Features such as gender, race, or zip code (or proxies correlated with them) can lead to biased models
Algorithmic Bias: Algorithms could inherently be biased towards a class or view
Decision Trees prioritize splits with higher information gain, which can lead to unfair predictions that target specific groups.
Adversarial Attacks: Neural networks can be prone to attacks where the model is misled by carefully modified inputs.
Recommender Systems: Can create filter bubbles, where all users are recommended similar items because the underlying data already has biases.
Word Embeddings: Can encode biases present in the training text; certain professions may end up correlated with certain genders.
Metrics#
Classification#
Predicted Class
+---------------+---------------+--------------+
| | Positive | Negative |
+---------------+---------------+--------------+
Actual Class | Positive | TP (True | FN (False |
| | Positive) | Negative) |
+---------------+---------------+--------------+
| Negative | FP (False | TN (True |
| | Positive) | Negative) |
+---------------+---------------+--------------+
Predicted Class
+---------------+---------------+------------------+
| | Positive | Negative |
+---------------+---------------+------------------+
Actual Class | Positive | (TP) | (FN) |
| | (Sensitivity) | (Type-2 Error) |
+---------------+---------------+------------------+
| Negative | (FP) | (TN) |
| |(Type-1 Error) | (Specificity) |
+---------------+---------------+------------------+
Accuracy: (TP+TN) / (TP+FP+FN+TN)
Precision: TP / (TP+FP)
Recall: TP / (TP+FN)
f1-score: 2 * Precision * Recall / (Precision+Recall)
Sensitivity = Recall
Specificity: True negatives among all negatives = TN/(TN+FP)
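A minimal sketch of these counts and ratios, assuming y_true and y_pred are NumPy arrays of 0/1 labels (precision_recall_f1 is an illustrative name, not a library function):
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Counts from the confusion matrix
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1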
ROC: Plot of the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the classification threshold varies:
TPR = Sensitivity = Recall
FPR = FP / (FP + TN) = 1 - TN / (TN + FP) = 1 - Specificity
Because the curve is traced over all thresholds, the ROC curve (and AUC) does not depend on any single classification threshold.
AUC: Area under the ROC curve
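A sketch of how the ROC points and AUC can be computed by sweeping thresholds, assuming NumPy arrays and the import above; roc_points and auc are illustrative names, and the trapezoidal rule only approximates the area:
def roc_points(y_true, y_scores):
    # One (FPR, TPR) point per distinct score threshold, highest threshold first
    thresholds = np.sort(np.unique(y_scores))[::-1]
    positives = np.sum(y_true == 1)
    negatives = np.sum(y_true == 0)
    fprs, tprs = [0.0], [0.0]  # start at (0, 0): nothing predicted positive
    for t in thresholds:
        y_pred = (y_scores >= t).astype(int)
        tprs.append(np.sum((y_true == 1) & (y_pred == 1)) / positives)
        fprs.append(np.sum((y_true == 0) & (y_pred == 1)) / negatives)
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    # Trapezoidal area under the ROC curve (points ordered by increasing FPR)
    return np.trapz(tprs, fprs)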
Log Loss: Cross-entropy loss $\( L(y, \hat{y}) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \)$
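A small sketch of log loss averaged over samples (the formula above sums over i; dividing by n gives the mean form commonly reported), clipping probabilities so log(0) never occurs; the eps value is an arbitrary choice:
def log_loss(y_true, y_prob, eps=1e-15):
    # Clip predicted probabilities away from 0 and 1 before taking logs
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))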
import numpy as np
def confusion_matrix(y_true, y_pred, threshold=0.5):
    # Binarize predicted scores at the given threshold
    y_pred = np.where(y_pred >= threshold, 1, 0)
    mat = np.zeros((2, 2))
    mat[0, 0] = np.sum((y_true == 1) & (y_pred == 1))  # TP
    mat[0, 1] = np.sum((y_true == 1) & (y_pred == 0))  # FN
    mat[1, 0] = np.sum((y_true == 0) & (y_pred == 1))  # FP
    mat[1, 1] = np.sum((y_true == 0) & (y_pred == 0))  # TN
    return mat
confusion_matrix(np.array([0,1,0,1,1]),np.array([0.3,0.9,0.9,0.6,0.4]))
array([[2., 1.],
[1., 1.]])
Imbalanced datasets#
Regression#
Mean Squared Error
R² score
Mean Absolute Error
Mean Absolute Percentage Error
Explained Variance Score (sketched after the code below)
def mse(y_true, y_pred):
    # Mean Squared Error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error (assumes y_true contains no zeros)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def r2_score(y_true, y_pred):
    # R² = 1 - SSE / SST
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - sse / sst
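A minimal sketch of the Explained Variance Score, assuming NumPy arrays (it coincides with R² when the residuals have zero mean):
def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y_true)
    return 1 - np.var(y_true - y_pred) / np.var(y_true)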
Retrieval / Ranking#
Precision@k: Number of relevant documents in the top k retrieved / k
Recall@k: Number of relevant documents in the top k retrieved / total number of relevant documents
def precision_at_k(relevant_docs, retrieved_docs, k):
    # Fraction of the top-k retrieved documents that are relevant
    relevant_in_topk = len(set(relevant_docs).intersection(retrieved_docs[:k]))
    return relevant_in_topk / k

def recall_at_k(relevant_docs, retrieved_docs, k):
    # Fraction of all relevant documents that appear in the top-k retrieved
    relevant_in_topk = len(set(relevant_docs).intersection(retrieved_docs[:k]))
    return relevant_in_topk / len(relevant_docs)
Mean Average Precision: For each query, average the precision at every position in the ranking where a relevant document appears, then average over queries $\( MAP = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|R_i|} \sum_{k=1}^{n_i} P_i(k)\, \text{rel}_i(k) \)$ where R_i is the set of relevant documents for query i, n_i is the number of retrieved documents, and rel_i(k) is 1 if the document at position k is relevant, 0 otherwise.
Predicted ranking: [D1, D2, D3, D4, D5], documents are assumed to be ranked according to scores
Relevance labels: [1,0,1,1,1]
At Position 1: Precision = 1/1 = 1
Position 2: D2 is not relevant, so it does not contribute to the average
Position 3: Precision = 2/3
Position 4: 3/4
Position 5: 4/5
Average precision for this query = 1/4 x (1 + 2/3 + 3/4 + 4/5) ≈ 0.804; MAP averages this over all queries
def average_precision(y_true, y_pred):
    # Assumes y_true holds relevance labels already in ranked order;
    # y_pred is only used for its length (the ranked positions).
    relevant = 0
    precisions = []
    for i in range(len(y_pred)):
        if y_true[i] == 1:
            # Relevant document: record precision at this position
            relevant += 1
            precisions.append(relevant / (i + 1))
    return np.sum(precisions) / len(precisions)
average_precision([1,0,1,1,1],[1,2,3,4,5])
np.float64(0.8041666666666667)
NDCG (Normalized Discounted Cumulative Gain)
Considers the position of the document in the retrieved list $\( DCG_k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)} \)$
At position 1: 1 / log2(2) = 1
Position 2: 1 / log2(3) ≈ 0.63
Position 3: 1 / log2(4) = 0.5
As the position increases, the discount grows, so each document contributes less to the metric
NDCG normalizes DCG by considering ideal rank. $\( nDCG_k = \frac{DCG_k}{IDCG_k} \)$
Ground-truth relevance = [3, 2, 3, 0, 1]
Relevance of the documents in the predicted ranking order = [2, 3, 3, 1, 0]
DCG = 2 / log2(1+1) + 3 / log2(2+1) + 3 / log2(3+1) + 1 / log2(4+1) + 0 / log2(5+1)
Ideal DCG: sort the ground-truth relevances in descending order = [3, 3, 2, 1, 0]
IDCG = 3 / log2(1+1) + 3 / log2(2+1) + 2 / log2(3+1) + 1 / log2(4+1) + 0 / log2(5+1)
def dcg_at_k(relevance_scores, k):
    # DCG_k = sum(rel_i / log2(i + 1)) over the top-k positions
    top_k_scores = np.asarray(relevance_scores[:k])
    denominator = np.log2(np.arange(2, 2 + len(top_k_scores)))
    return np.sum(top_k_scores / denominator)

def ndcg_at_k(y_true, y_pred, k):
    # IDCG: DCG of the ideal ranking (relevances sorted in descending order)
    idcg = dcg_at_k(np.sort(y_true)[::-1], k)
    # DCG: true relevances re-ordered by the predicted scores (descending)
    order = np.argsort(y_pred)[::-1]
    dcg = dcg_at_k(np.take(y_true, order), k)
    return dcg / idcg
Mean Reciprocal Rank (MRR): Average over queries of the reciprocal of the position of the first relevant result $\( MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i} \)$
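A minimal MRR sketch, assuming each query's results are given as a binary relevance list already in ranked order (mean_reciprocal_rank is an illustrative name):
def mean_reciprocal_rank(relevance_lists):
    # relevance_lists: one 0/1 relevance list per query, in ranked order
    reciprocal_ranks = []
    for relevances in relevance_lists:
        rr = 0.0
        for rank, rel in enumerate(relevances, start=1):
            if rel == 1:
                rr = 1.0 / rank  # first relevant result found
                break
        reciprocal_ranks.append(rr)
    return np.mean(reciprocal_ranks)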
LLM Metrics#
Evaluating quality of generative text models
Perplexity: How well the model predicts a sample of the test dataset; lower perplexity indicates better performance. $\( PP = \exp \left( -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i) \right) \)$ Equivalently, PP = exp(L), where L is the cross-entropy loss $\( L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p(x_i = k) \)$
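A small sketch, assuming token_probs holds the model's probability for each observed token in the sample:
def perplexity(token_probs):
    # exp of the average negative log-likelihood per token
    token_probs = np.asarray(token_probs)
    return np.exp(-np.mean(np.log(token_probs)))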
BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated text and reference text (the full definition also multiplies by a brevity penalty for short outputs). $\( BLEU = \exp \left( \sum_{n=1}^{4} w_n \log p_n \right) \)$
Calculate the modified precision for each n-gram order
Reference: “Quick brown fox”.
Generated: “Quick browns fox”.
P@1 (unigrams): 2/3, P@2 (bigrams): 0/2, P@3 (trigrams): 0/1
BLEU = exp(w1 * log(p@1) + w2 * log(p@2) + w3 * log(p@3)); when any p@n is 0, as for the bigrams and trigrams here, smoothing is typically applied to avoid log(0)
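A rough sketch of modified n-gram precision and the simplified BLEU formula above (no brevity penalty), assuming whitespace tokenization; the eps smoothing term is an assumption to avoid log(0), and ngrams/modified_precision/bleu are illustrative names:
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    # Clip each candidate n-gram count by its count in the reference
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(reference, candidate, max_n=4, eps=1e-9):
    # Uniform weights; eps keeps log() finite when a precision is zero
    weights = [1 / max_n] * max_n
    log_p = [w * np.log(modified_precision(reference, candidate, n) + eps)
             for n, w in zip(range(1, max_n + 1), weights)]
    return np.exp(np.sum(log_p))

# bleu("Quick brown fox".split(), "Quick browns fox".split(), max_n=1) -> ~0.667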
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures how much of the reference's n-grams are recovered in the generated text (recall-oriented) $\( RS = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{k=1}^{K} \min (count_{ik}, count_{ik}')}{\sum_{k=1}^{K} count_{ik}'} \)$
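A matching ROUGE-N recall sketch, reusing the ngrams helper and Counter import from the BLEU sketch above (rouge_n is an illustrative name):
def rouge_n(reference, candidate, n=1):
    # Recall: overlapping n-grams divided by the total n-grams in the reference
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

# rouge_n("Quick brown fox".split(), "Quick browns fox".split(), n=1) -> ~0.667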