Metrics and Biases#

Biases in ML systems#

Data Biases#

  • Sampling bias: Data mismatch with the real distribution

    • Training a model on data that is convenient or readily available, potentially excluding other cases.

  • Labelling bias: Inaccurate or inconsistent labels

  • Historical bias: Data collected in the past may not be representative

  • Recency bias: Relying mainly on recent data, which might not be representative of the overall distribution.

Model Biases#

  • Confirmation Bias: Model could be biased towards existing beliefs

  • Overfitting: Model too faithful to the data it was trained on

  • Feature Bias: Features such as gender, race, or zip code can be correlated with sensitive attributes and lead to biased models

  • Algorithmic Bias: Algorithms could inherently be biased towards a class or view

    • Decision Trees prioritize splits with higher information gain, which can lead to unfair predictions that target specific groups.

    • Adversarial Attacks: Neural networks can be prone to attacks where the model is misled by carefully modified inputs.

    • Recommender Systems: Can create filter bubbles, where all users are recommended similar items because the underlying data already has biases.

    • Word Embeddings can encode biases present in the training text; for example, certain professions may be correlated with certain genders.

Metrics#

Classification#

              Predicted Class
              +---------------+---------------+--------------+
              |               |  Positive     |  Negative    |
              +---------------+---------------+--------------+
Actual Class  |  Positive     |  TP (True     |   FN (False  |
              |               |  Positive)    |  Negative)   |
              +---------------+---------------+--------------+
              |  Negative     |  FP (False    |  TN (True    |
              |               |  Positive)    |  Negative)   |
              +---------------+---------------+--------------+


              Predicted Class
              +---------------+----------------+-----------------+
              |               |  Positive      |  Negative       |
              +---------------+----------------+-----------------+
Actual Class  |  Positive     |  TP            |  FN             |
              |               |  (Sensitivity) |  (Type-2 Error) |
              +---------------+----------------+-----------------+
              |  Negative     |  FP            |  TN             |
              |               |  (Type-1 Error)|  (Specificity)  |
              +---------------+----------------+-----------------+
  • Accuracy: (TP+TN) / (TP+FP+FN+TN)

  • Precision: TP / (TP+FP)

  • Recall: TP / (TP+FN)

  • f1-score: 2 * Precision * Recall / (Precision+Recall)

  • Sensitivity=Recall

  • Specificity: True negatives among all negatives = TN/(TN+FP)

  • ROC: Plot of True Positive Rate (y-axis) against False Positive Rate (x-axis) at various classification thresholds:

    • TPR = Sensitivity = Recall

    • FPR: since TNR + FPR = 1 and TNR = Specificity:

      • FPR = 1 - Specificity

      • FPR = 1 - TN/(TN+FP) = FP/(TN+FP)

  • ROC curves are threshold-invariant: they summarize performance across all thresholds rather than at a single operating point.

  • AUC: Area under curve from ROC

  • Log Loss: Cross-entropy loss $\( L(y, \hat{y}) = -\sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \)$

import numpy as np

def confusion_matrix(y_true, y_pred, threshold=0.5):
    # binarize predicted probabilities at the given threshold
    y_pred = np.where(y_pred >= threshold, 1, 0)
    # rows = actual class (positive, negative), columns = predicted class
    mat = np.zeros((2, 2))
    mat[0, 0] = np.sum((y_true == 1) & (y_pred == 1))  # TP
    mat[0, 1] = np.sum((y_true == 1) & (y_pred == 0))  # FN
    mat[1, 0] = np.sum((y_true == 0) & (y_pred == 1))  # FP
    mat[1, 1] = np.sum((y_true == 0) & (y_pred == 0))  # TN
    return mat

confusion_matrix(np.array([0,1,0,1,1]), np.array([0.3,0.9,0.9,0.6,0.4]))
array([[2., 1.],
       [1., 1.]])
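
A minimal sketch of the scalar metrics above, computed from the confusion matrix and the raw probabilities (the function name and the dictionary return value are my own):

def classification_metrics(y_true, y_pred_proba, threshold=0.5, eps=1e-15):
    mat = confusion_matrix(y_true, y_pred_proba, threshold)
    tp, fn = mat[0, 0], mat[0, 1]
    fp, tn = mat[1, 0], mat[1, 1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # = sensitivity = TPR
    specificity = tn / (tn + fp)                 # = 1 - FPR
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # log loss uses the raw probabilities (mean form of the cross-entropy above),
    # clipped to avoid log(0)
    p = np.clip(y_pred_proba, eps, 1 - eps)
    log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "accuracy": accuracy, "f1": f1, "log_loss": log_loss}

classification_metrics(np.array([0,1,0,1,1]), np.array([0.3,0.9,0.9,0.6,0.4]))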

Imbalanced datasets#

Regression#

  • Mean Squared Error

  • R² score

  • Mean Absolute Error

  • Mean Absolute Percentage Error

  • Explained Variance Score

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # assumes y_true contains no zeros
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def r2_score(y_true, y_pred):
    sse = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - sse / sst
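
The explained variance score from the list above is not implemented; a minimal sketch (the formula 1 - Var(residuals)/Var(y_true) is the standard definition, the function name is mine):

def explained_variance(y_true, y_pred):
    # 1 - Var(y_true - y_pred) / Var(y_true); equals R² when the residuals have zero mean
    return 1 - np.var(y_true - y_pred) / np.var(y_true)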

Retrieval / Ranking#

  • Precision@k: Number of relevant docs in the top k retrieved / k

  • Recall@k: Number of relevant docs in the top k retrieved / Total number of relevant docs

def precision_at_k(relevant_docs, retrieved_docs, k):
    # fraction of the top-k retrieved docs that are relevant
    relevant_in_topk = len(set(relevant_docs).intersection(retrieved_docs[:k]))
    return relevant_in_topk / k

def recall_at_k(relevant_docs, retrieved_docs, k):
    # fraction of all relevant docs that appear in the top-k retrieved
    relevant_in_topk = len(set(relevant_docs).intersection(retrieved_docs[:k]))
    return relevant_in_topk / len(relevant_docs)
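
A quick usage example with made-up document IDs:

relevant = ["d1", "d3", "d5"]
retrieved = ["d3", "d2", "d1", "d7", "d5"]   # ranked list returned by the system
precision_at_k(relevant, retrieved, 3)       # 2 of the top 3 are relevant -> 2/3
recall_at_k(relevant, retrieved, 3)          # 2 of the 3 relevant docs are in the top 3 -> 2/3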
  • Mean Average Precision (MAP): Averages, over queries, each query's average precision; average precision averages the precision at every rank where a relevant document appears $\( MAP = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{|R_i|} \sum_{k \in R_i} P@k \right) \)$ where R_i is the set of ranks at which relevant documents appear for query i

    • Predicted ranking: [D1, D2, D3, D4, D5], documents are assumed to be ranked according to scores

    • Relevance labels: [1,0,1,1,1]

    • At Position 1: Precision = 1/1 = 1

    • Position 2: skipped (document D2 is not relevant, so it does not contribute)

    • Position 3: Precision = 2/3

    • Position 4: 3/4

    • Position 5: 4/5

    • Average precision = 1/4 x (1 + 2/3 + 3/4 + 4/5) ≈ 0.804; MAP averages this over all queries

def average_precision(y_true, y_pred):
    # y_true: relevance labels in ranked order (documents assumed sorted by score);
    # y_pred is only used for its length here
    relevant = 0
    precisions = []
    for i in range(len(y_pred)):
        if y_true[i] == 1:
            # relevant document: record precision at this rank
            relevant += 1
            precision_at_i = relevant / (i + 1)
            precisions.append(precision_at_i)
    return np.sum(precisions) / len(precisions)
average_precision([1,0,1,1,1],[1,2,3,4,5])
np.float64(0.8041666666666667)
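
MAP is then just the mean of these per-query average precisions; a minimal sketch reusing average_precision above (the query relevance lists are made up):

def mean_average_precision(relevance_lists):
    # each entry holds one query's relevance labels, already in ranked order
    return np.mean([average_precision(y_true, range(len(y_true)))
                    for y_true in relevance_lists])

mean_average_precision([[1, 0, 1, 1, 1],
                        [0, 1, 0, 0, 0]])   # (0.804 + 0.5) / 2 ≈ 0.652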
  • NDCG (Normalized Discounted Cumulative Gain)

    • Considers the position of the document in the retrieved list $\( DCG_k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)} \)$

    • Position 1 (rel = 1): 1/log2(1+1) = 1

    • Position 2: 1/log2(2+1) ≈ 0.63

    • Position 3: 1/log2(3+1) = 0.5

    • As the position increases, the weightage on the metric decreases

    • NDCG normalizes DCG by considering ideal rank. $\( nDCG_k = \frac{DCG_k}{IDCG_k} \)$

    • Ground-truth relevances (per document D1-D5) = [3, 2, 3, 0, 1]

    • Relevances of the documents in the predicted ranking order = [2, 3, 3, 1, 0]

    • DCG = 2/log2(1+1) + 3/log2(2+1) + 3/log2(3+1) + 1/log2(4+1) + 0/log2(5+1)

    • Ideal DCG: ideal ranking = [3, 3, 2, 1, 0] (relevances sorted in descending order)

    • IDCG = 3/log2(1+1) + 3/log2(2+1) + 2/log2(3+1) + 1/log2(4+1) + 0/log2(5+1)

def dcg_at_k(relevance_scores, k):
    top_k_scores = np.asarray(relevance_scores[:k])
    # positions 1..k are discounted by log2(position + 1)
    discounts = np.log2(np.arange(2, 2 + len(top_k_scores)))
    return np.sum(top_k_scores / discounts)

def ndcg_at_k(y_true, y_pred, k):
    # ideal DCG: ground-truth relevances sorted in descending order
    y_true_sorted = np.sort(y_true)[::-1]
    idcg = dcg_at_k(y_true_sorted, k)
    # actual DCG: ground-truth relevances ordered by descending predicted score
    order = np.argsort(y_pred)[::-1]
    y_true = np.take(y_true, order)
    dcg = dcg_at_k(y_true, k)
    return dcg / idcg
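
Usage with the example above, passing the ground-truth relevances and a hypothetical score vector that ranks the documents as [D2, D1, D3, D5, D4], i.e. relevances [2, 3, 3, 1, 0]:

ndcg_at_k(np.array([3, 2, 3, 0, 1]),        # ground-truth relevances for D1..D5
          np.array([0.8, 0.9, 0.7, 0.1, 0.3]),  # hypothetical predicted scores
          5)                                # ≈ 0.92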
  • Mean Reciprocal Rank (MRR): average of the reciprocal rank of the first relevant result across queries (a minimal sketch follows below) $\( MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i} \)$
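
A minimal sketch of MRR, given each query's binary relevance labels in ranked order (the function name and example queries are made up):

def mean_reciprocal_rank(relevance_lists):
    reciprocal_ranks = []
    for rels in relevance_lists:
        rr = 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel == 1:
                rr = 1.0 / rank      # rank of the first relevant result
                break
        reciprocal_ranks.append(rr)
    return np.mean(reciprocal_ranks)

mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]])   # (1/2 + 1 + 0) / 3 = 0.5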

LLM Metrics#

  • Evaluating quality of generative text models

    • Perplexity: How well the model predicts a sample of the test dataset; lower perplexity indicates better performance. $\( PP = \exp \left( -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i) \right) \)$, i.e. the exponential of the average cross-entropy loss $\( L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p(x_i = k) \)$

    • BLEU (Bilingual Evaluation Understudy Score): Measures n-gram overlap between generated text and reference text. $\( BS = \exp \left( \sum_{n=1}^{4} w_n \log p_n \right) \)$ (in practice also multiplied by a brevity penalty that penalizes short outputs)

      • Calculate precision for each n-gram

      • Reference: “Quick brown fox”.

      • Generated: “Quick browns fox”.

      • P@1 (unigrams): 2/3, P@2 (bigrams): 0/2, P@3 (trigrams): 0/1

      • BLEU = exp(w1 * log(P@1) + w2 * log(P@2) + w3 * log(P@3)); since any zero n-gram precision drives this to zero, smoothing is used in practice (see the n-gram sketch at the end of this section)

    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures the fraction of reference n-grams that appear in the generated text (sketched below) $\( RS = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{k=1}^{K} \min (count_{ik}, count_{ik}')}{\sum_{k=1}^{K} count_{ik}'} \)$ where count'_{ik} is the count of n-gram k in the reference
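
A minimal sketch of the n-gram computations behind BLEU-style precision and ROUGE-N recall (no brevity penalty or smoothing; function names are mine), plus a direct transcription of the perplexity formula:

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    # clipped n-gram precision used inside BLEU: matched candidate n-grams / candidate n-grams
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def rouge_n_recall(candidate, reference, n):
    # ROUGE-N: matched reference n-grams / total reference n-grams
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def perplexity(token_probs):
    # PP = exp(-mean log p(x_i)) over the model's probabilities for the observed tokens
    return np.exp(-np.mean(np.log(token_probs)))

ngram_precision("quick browns fox".split(), "quick brown fox".split(), 1)   # 2/3
rouge_n_recall("quick browns fox".split(), "quick brown fox".split(), 1)    # 2/3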