
🧠 AI Training: Option Periodic

A stable and auditable retraining strategy: collect continuously, review labels, then train on a fixed schedule with quality gates and rollback.

Stable · Reviewed Labels · Rollback-ready

1. Overview

Periodic Training (Option Periodic) is designed to keep the model stable and the system fast while still learning from new data.
Instead of retraining immediately after each new URL, the system:

  • collects URL samples continuously,
  • requires admin review before samples affect training,
  • retrains on a fixed schedule (daily/weekly),
  • releases new models only if they pass quality gates,
  • keeps a fallback model to roll back safely.

2. Why Periodic?

Training immediately after every new URL can:

  • make models unstable (drift from noisy updates),
  • introduce noisy labels (wrong/uncertain ground truth),
  • slow down the system (training is expensive and can block operations).

Periodic training solves this by separating data collection from model updates, with a controlled release process.


3. Data States

During collection and review, each sample moves through a simple state machine:

Pending → Approved / Rejected
  • Pending: collected but not reviewed yet
  • Approved: trusted sample that can be used for training
  • Rejected: not used for training (invalid / duplicate / low confidence)
Recommended rule

Only Approved samples are eligible for training.
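A minimal sketch of this state machine in Python (the `SampleState` enum and transition table are illustrative, not the project's actual API):

```python
from enum import Enum

class SampleState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

# Only Pending samples can be reviewed; Approved/Rejected are terminal.
_ALLOWED = {
    SampleState.PENDING: {SampleState.APPROVED, SampleState.REJECTED},
    SampleState.APPROVED: set(),
    SampleState.REJECTED: set(),
}

def transition(current: SampleState, target: SampleState) -> SampleState:
    if target not in _ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target

def trainable(state: SampleState) -> bool:
    # Recommended rule: only Approved samples are eligible for training.
    return state is SampleState.APPROVED
```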


4. Pipeline

Step 1: Collect

Collect the URL dataset. For each sample, the extension/API records:

  • the normalized URL
  • the predicted label (Adult/Gambling/Phishing/Benign)
  • the score/confidence
  • a timestamp + source metadata

Saved as: Pending
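The collected fields could be modeled as a small dataclass. `UrlSample` and `normalize_url` are illustrative names, and the normalization shown (lowercasing scheme and host, dropping the fragment) is a minimal assumption, not the project's actual rule:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

def normalize_url(raw: str) -> str:
    # Minimal normalization: lowercase scheme and host, drop the fragment.
    parts = urlsplit(raw.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))

@dataclass
class UrlSample:
    url: str                  # normalized URL
    predicted_label: str      # Adult / Gambling / Phishing / Benign
    score: float              # model confidence in [0, 1]
    source: str               # e.g. "extension" or "api"
    state: str = "pending"    # every new sample starts as Pending
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

sample = UrlSample(url=normalize_url("HTTPS://Example.COM/login#top"),
                   predicted_label="Phishing", score=0.87, source="extension")
```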

Step 2: Review

The admin reviews each Pending sample and decides:

  • Approve if the label is correct
  • Reject if the sample is noise, a duplicate, or unclear
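The review step amounts to a single guarded state change; this helper is a sketch, and field names like `reviewed_by` and `reject_reason` are assumptions:

```python
def review_sample(sample: dict, approve: bool, reviewer: str,
                  reason: str = "") -> dict:
    # Only Pending samples may be reviewed; re-reviewing is an error.
    if sample.get("state") != "pending":
        raise ValueError("only pending samples can be reviewed")
    reviewed = dict(sample)  # keep the original record untouched
    reviewed["state"] = "approved" if approve else "rejected"
    reviewed["reviewed_by"] = reviewer
    if not approve and reason:
        reviewed["reject_reason"] = reason  # e.g. "duplicate", "unclear"
    return reviewed
```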

Step 3: Training

The periodic training job:

  • merges the baseline dataset with the approved dataset
  • runs feature extraction
  • trains the Random Forest + NLP pipeline
  • evaluates metrics
  • registers the new model version
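A minimal sketch of the training step, assuming scikit-learn with character n-gram TF-IDF as the "NLP pipeline" feature extractor; the sample data and hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

def build_pipeline() -> Pipeline:
    # Character n-grams pick up URL tokens like "login", "casino", odd TLDs.
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
        ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

# Merge baseline dataset + approved dataset, then train (toy data).
baseline = [("https://casino-win.example", "Gambling"),
            ("https://docs.python.org", "Benign")]
approved = [("https://secure-login.bank.example.xyz", "Phishing"),
            ("https://news.example.com", "Benign")]
urls, labels = zip(*(baseline + approved))
model = build_pipeline().fit(urls, labels)
```

In the real job this would be followed by metric evaluation on a held-out set and registration of the resulting model version.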

Step 4: Deploy

Safe deployment:

  • keep the previous model as a fallback
  • deploy only if quality gates pass
  • roll back if drift is detected


5. Suggested Schedule

Pick a schedule depending on your environment:

  • Daily: best for demo / rapid iteration
  • Weekly: best for stable production
Practical suggestion

During competitions/demos: train daily, but only when you have enough Approved samples (avoid training on tiny batches).
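That suggestion reduces to a guard the scheduler checks before each run; the threshold of 50 is purely illustrative:

```python
MIN_APPROVED = 50  # illustrative threshold; tune per environment

def should_train(approved_count: int, min_approved: int = MIN_APPROVED) -> bool:
    # Skip a scheduled run when the Approved batch is too small:
    # tiny batches add noise without meaningfully improving the model.
    return approved_count >= min_approved
```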


6. Quality Gates (Release Checklist)

A new model version is released only if it satisfies:

  1. High precision for block labels
  • Adult / Gambling / Phishing must have high precision
  • goal: avoid blocking safe websites (trust is critical)
  2. Low false-positive rate on allowlisted domains
  • allowlisted/trusted domains should rarely be blocked
  • monitor allowlist incidents as a priority
  3. No major regression vs the previous model
  • compare key metrics to the last released version
  • if regression exceeds the threshold → do not deploy
  4. Operational sanity
  • inference latency stays acceptable
  • model size and loading time remain stable
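The checklist can be encoded as one boolean gate function run before deployment; the metric names and thresholds below are illustrative assumptions, not the project's actual values:

```python
BLOCK_LABELS = ("Adult", "Gambling", "Phishing")

def passes_gates(candidate: dict, released: dict,
                 min_precision: float = 0.95,
                 max_allowlist_fp: float = 0.001,
                 max_regression: float = 0.02,
                 max_latency_ms: float = 50.0) -> bool:
    # Gate 1: high precision on block labels (never block safe sites lightly).
    if any(candidate["precision"][label] < min_precision
           for label in BLOCK_LABELS):
        return False
    # Gate 2: low false-positive rate on allowlisted domains.
    if candidate.get("allowlist_fp_rate", 0.0) > max_allowlist_fp:
        return False
    # Gate 3: no major regression vs the previously released model.
    if released["macro_f1"] - candidate["macro_f1"] > max_regression:
        return False
    # Gate 4: operational sanity (inference latency stays acceptable).
    if candidate.get("latency_ms", 0.0) > max_latency_ms:
        return False
    return True
```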

7. Minimal Admin Workflow (Fast Demo)

  1. Open Admin → Review Pending
  2. Approve a small set of correct samples
  3. Run training job (Periodic)
  4. Check metrics summary
  5. Publish model version
  6. Test: scan a few URLs again and verify decisions/logs