📊 ML Evaluation Metrics

ChildSafeNet evaluates ML models with a safety-first philosophy: prioritize precision for blocking categories while maintaining sufficient recall to prevent dangerous sites from slipping through.

Precision First · Safety Critical · Auditable Metrics

1. Classification Metrics

ChildSafeNet uses multi-class classification to detect:

  • Benign
  • Phishing
  • Malware
  • Adult
  • Gambling

Because this is a safety-critical system, evaluation is not only about accuracy, but also about minimizing harmful mistakes.


Metric 1: Accuracy

Accuracy measures overall correctness:

Accuracy = (TP + TN) / Total

Limitations:

  • can be misleading when the dataset is imbalanced
  • high accuracy does NOT guarantee good performance on rare harmful classes
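The imbalance pitfall is easy to demonstrate. The sketch below uses made-up labels (not real ChildSafeNet data): a degenerate model that predicts "benign" for everything still scores 95% accuracy while detecting zero phishing sites.

```python
# Illustrative sketch: accuracy can look high on an imbalanced dataset
# even when every rare harmful sample is missed. Labels are made up.

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 95 benign sites, 5 phishing sites; a model that always predicts "benign"
y_true = ["benign"] * 95 + ["phishing"] * 5
y_pred = ["benign"] * 100

print(accuracy(y_true, y_pred))  # 0.95 — yet 0 phishing sites detected
```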

Metric 2: Precision

Precision measures how many predicted positives are actually correct:

Precision = TP / (TP + FP)

In child safety systems, precision for blocking categories is critical.

High precision means:

  • fewer safe websites are blocked incorrectly
  • higher trust for parents

Metric 3: Recall

Recall measures how many real harmful cases were detected:

Recall = TP / (TP + FN)

Higher recall means:

  • more harmful sites detected
  • fewer dangerous sites slip through

Trade-off:

  • increasing recall can reduce precision

Metric 4: F1 Score

The F1 score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Used when:

  • balancing precision and recall
  • dealing with imbalanced datasets
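The four formulas above derive directly from the confusion-matrix counts. Here is a minimal, dependency-free sketch; the counts passed in at the end are illustrative placeholders, not real ChildSafeNet numbers.

```python
# Sketch: accuracy, precision, recall and F1 from raw TP/FP/TN/FN counts,
# matching the formulas in this section. Guards avoid division by zero.

def classification_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(m["precision"], m["recall"])  # 0.8 0.8
```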


2. Multi-Class Evaluation

For multi-class detection:

  • calculate metrics per class
  • use Macro F1 (average of class F1)
  • use Weighted F1 (weighted by support)

Example focus classes:

  • Phishing
  • Adult
  • Gambling
  • Malware

The Benign class is important, but safety categories must be prioritized.
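The difference between the two averages matters here: Macro F1 treats every class equally (so rare harmful classes count as much as Benign), while Weighted F1 is dominated by the large Benign class. A small sketch with illustrative per-class F1 values and supports:

```python
# Sketch: macro F1 averages per-class F1 equally; weighted F1 weights
# each class by its support (number of true samples). Values below are
# illustrative, not real ChildSafeNet results.

def macro_f1(f1_by_class):
    return sum(f1_by_class.values()) / len(f1_by_class)

def weighted_f1(f1_by_class, support_by_class):
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c]
               for c in f1_by_class) / total

f1 = {"Benign": 0.98, "Phishing": 0.70, "Malware": 0.75,
      "Adult": 0.80, "Gambling": 0.77}
support = {"Benign": 900, "Phishing": 30, "Malware": 20,
           "Adult": 30, "Gambling": 20}

print(round(macro_f1(f1), 2))           # 0.80 — rare classes drag it down
print(round(weighted_f1(f1, support), 2))  # higher: dominated by Benign
```

Because Weighted F1 is inflated by the Benign majority, Macro F1 (or per-class metrics) is the better headline number for a safety-focused system.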


3. Confusion Matrix

A confusion matrix tabulates predictions against ground truth, giving counts of:

  • True Positive
  • False Positive
  • True Negative
  • False Negative

In a multi-class setup:

Rows = Actual class
Columns = Predicted class

Used to:

  • identify class confusion patterns
  • detect over-blocking behavior
  • detect under-detection of harmful categories

Example issue:

  • Benign → Adult = False Positive (bad UX)
  • Phishing → Benign = False Negative (security risk)
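A multi-class confusion matrix can be sketched as a nested dict (rows = actual, columns = predicted). The labels below are illustrative; the two lookups at the end correspond to the two example issues above.

```python
# Sketch: multi-class confusion matrix as {actual: {predicted: count}}.
from collections import Counter

CLASSES = ["Benign", "Phishing", "Malware", "Adult", "Gambling"]

def confusion_matrix(y_true, y_pred):
    counts = Counter(zip(y_true, y_pred))
    return {a: {p: counts[(a, p)] for p in CLASSES} for a in CLASSES}

y_true = ["Benign", "Benign", "Phishing", "Adult", "Phishing"]
y_pred = ["Benign", "Adult",   "Benign",  "Adult", "Phishing"]

cm = confusion_matrix(y_true, y_pred)
print(cm["Benign"]["Adult"])     # 1 — over-blocking (bad UX)
print(cm["Phishing"]["Benign"])  # 1 — missed phishing (security risk)
```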

4. Safety Priority Strategy

In child safety systems:

  1. Prefer HIGH precision for block categories
     • avoid blocking educational/safe websites
  2. Maintain acceptable recall for harmful classes
     • do not allow too many dangerous sites through
  3. Monitor the false-positive rate on whitelist domains
  4. Track category-level metrics independently


5. Thresholding Strategy

For probabilistic models, map the model's confidence score to an action using thresholds.

Recommended thresholds:

  • BLOCK if score ≥ 0.85
  • WARN if 0.60 ≤ score < 0.85
  • ALLOW if score < 0.60

Rules:

  • whitelist always overrides to ALLOW
  • blacklist always overrides to BLOCK
  • thresholds configurable in system settings
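The decision rules above can be sketched as a single function. The whitelist/blacklist parameters here are hypothetical stand-ins for the system's actual lookup, and the thresholds mirror the recommended defaults (which the document says are configurable).

```python
# Sketch of the threshold rules: list overrides first, then score bands.
BLOCK_THRESHOLD = 0.85  # configurable in system settings
WARN_THRESHOLD = 0.60

def decide(domain, score, whitelist=frozenset(), blacklist=frozenset()):
    if domain in whitelist:   # whitelist always overrides to ALLOW
        return "ALLOW"
    if domain in blacklist:   # blacklist always overrides to BLOCK
        return "BLOCK"
    if score >= BLOCK_THRESHOLD:
        return "BLOCK"
    if score >= WARN_THRESHOLD:
        return "WARN"
    return "ALLOW"

print(decide("example.org", 0.90))  # BLOCK
print(decide("example.org", 0.70))  # WARN
print(decide("school.edu", 0.90, whitelist={"school.edu"}))  # ALLOW
```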

6. Model Comparison (Before Release)

Before activating a new model version, compare with the previous one:

  • Accuracy
  • Precision (block classes)
  • Recall (block classes)
  • F1 Score
  • False Positive Rate
  • Inference Latency

Reject deployment if:

  • precision drops significantly
  • false positives increase sharply
  • recall for harmful categories decreases too much
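One way to make these rejection criteria concrete is a deployment gate. The tolerance values below are illustrative assumptions, not established project policy; they quantify "significantly", "sharply", and "too much".

```python
# Sketch: pre-release gate comparing a candidate model to the current one.
# Tolerances are illustrative; tune them per category in practice.

def should_deploy(current, candidate,
                  max_precision_drop=0.02,
                  max_fpr_increase=0.01,
                  max_recall_drop=0.03):
    if current["precision"] - candidate["precision"] > max_precision_drop:
        return False  # precision dropped significantly
    if candidate["fpr"] - current["fpr"] > max_fpr_increase:
        return False  # false positives increased sharply
    if current["recall"] - candidate["recall"] > max_recall_drop:
        return False  # recall for harmful categories fell too much
    return True

current = {"precision": 0.95, "recall": 0.90, "fpr": 0.020}
candidate = {"precision": 0.94, "recall": 0.91, "fpr": 0.025}
print(should_deploy(current, candidate))  # True
```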

7. Suggested Reporting Template

When logging a new model version:

  • Model Version ID
  • Training Date
  • Dataset Size
  • Accuracy
  • Macro F1
  • Precision (Adult / Gambling / Phishing)
  • Recall (Adult / Gambling / Phishing)
  • False Positive Rate
  • Notes

Reports are stored in the ModelRegistry table.
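The template could be captured as a typed record before insertion into ModelRegistry. The field names and example values below are illustrative assumptions about the table's shape, not its actual schema.

```python
# Sketch: a report record matching the template fields above.
from dataclasses import dataclass, asdict, field

@dataclass
class ModelEvalReport:
    model_version_id: str
    training_date: str          # ISO 8601 date
    dataset_size: int
    accuracy: float
    macro_f1: float
    precision_by_class: dict    # Adult / Gambling / Phishing
    recall_by_class: dict
    false_positive_rate: float
    notes: str = ""

report = ModelEvalReport(
    model_version_id="v2.3.1",
    training_date="2024-05-01",
    dataset_size=120_000,
    accuracy=0.96,
    macro_f1=0.88,
    precision_by_class={"Adult": 0.93, "Gambling": 0.91, "Phishing": 0.95},
    recall_by_class={"Adult": 0.90, "Gambling": 0.88, "Phishing": 0.92},
    false_positive_rate=0.015,
)
print(asdict(report)["model_version_id"])  # v2.3.1
```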


8. Long-Term Improvements

Future evaluation improvements:

  • ROC curve analysis
  • Precision-Recall curves
  • Drift detection monitoring
  • Per-category threshold tuning
  • Online evaluation dashboard

Summary

ChildSafeNet evaluation focuses on:

  • precision-first safety
  • controlled recall
  • threshold-based decision control
  • safe model versioning
  • auditable metrics tracking

Goal: maximum safe and trustworthy behavior, not maximum accuracy.