July 5, 2026 · AI Security

OWASP PwnzzAI Lab 2: Training Data Poisoning — a technical write-up

This is the second lab in my OWASP PwnzzAI series. Lab 1 covered model extraction; Lab 2 is Training Data Poisoning — mapped to LLM04: Data and Model Poisoning in the OWASP Top 10 for LLM Applications 2025.

The attack is simpler in concept than model theft, but the real-world impact is severe: if an application retrains or fine-tunes on user-submitted content without validation, an attacker can corrupt model behavior with nothing more than fake reviews.

1. Threat model: what poisoning means here

Data poisoning at training time means injecting malicious samples into the dataset so the learned model behaves incorrectly at inference. In this lab the vector is label flipping:

  1. Submit text that reads positive, but label it negative (or vice versa).
  2. Those samples are merged with legitimate pizza reviews and used to retrain.
  3. Logistic regression adjusts token weights to fit the poisoned labels.
  4. At inference, formerly positive phrases score negative — and negative phrases can flip positive.
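
The steps above can be sketched in a few lines. This is a minimal illustration rather than the lab's code: the `Review` type, the `flip_label` helper, and the sample texts are hypothetical, and I am assuming labels are encoded as 1 for positive and 0 for negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    text: str
    label: int            # assumed encoding: 1 = positive, 0 = negative
    poisoned: bool = False


def flip_label(text: str, true_label: int) -> Review:
    """Build a poison sample: honest text, deliberately wrong label."""
    return Review(text=text, label=1 - true_label, poisoned=True)


# Hypothetical clean corpus standing in for the seeded pizza reviews.
clean = [
    Review("excellent pizza with a delicious crust", 1),
    Review("awful service and a disgusting pizza", 0),
]

# Step 1: positive-sounding text labeled negative, and vice versa.
poison = [
    flip_label("excellent delicious amazing pizza", true_label=1),
    flip_label("awful disgusting terrible pizza", true_label=0),
]

# Step 2: the merged set is what an unvalidated retraining job would see.
training_set = clean + poison
for review in training_set:
    print(review.label, review.poisoned, review.text)
```

The `poisoned` flag exists only so the sketch can print which rows were injected. A real pipeline has no such marker, which is exactly why the attack works.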

This is not prompt injection. The attacker never touches the inference API directly. They corrupt the training pipeline — the same risk class as poisoned fine-tuning data, corrupted feedback loops, or unvalidated user content in retraining jobs.

Figure 1 — Lab overview: original sentiment model trained on pizza reviews, with top positive and negative token weights before any attack

2. Target architecture

The lab reuses the same bag-of-words logistic regression stack as Lab 1:

The vulnerability is intentional: the UI accepts arbitrary user comments with attacker-chosen sentiment labels and feeds them straight into retraining with no validation, anomaly detection, or provenance checks.

Figure 2 — Baseline model internals: seeded review corpus and original coefficient distribution (excellent +0.640 positive, disgusting −0.915 negative)
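
To make the stack concrete, here is a minimal sketch of how a bag-of-words logistic regression turns a comment into a score. Only the two coefficients quoted in Figure 2 come from the lab. The zero bias, the whitespace tokenizer, and the function name are my assumptions, and the real vocabulary is much larger.

```python
import math
from collections import Counter

# The two coefficients quoted in Figure 2. Everything else is assumed.
WEIGHTS = {"excellent": 0.640, "disgusting": -0.915}
BIAS = 0.0  # assumption: the lab's intercept is not shown in the write-up


def positive_probability(text: str) -> float:
    """Bag-of-words logistic score: sigmoid(bias + sum(weight * count))."""
    counts = Counter(text.lower().split())
    z = BIAS + sum(WEIGHTS.get(tok, 0.0) * n for tok, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))


print(positive_probability("excellent pizza"))   # above 0.5
print(positive_probability("disgusting pizza"))  # below 0.5
```

Because the score is a plain sum of per-token weights, changing the sign of a handful of coefficients is enough to change the prediction for any sentence built from those tokens.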

3. Attack design: targeted label flipping

The lab hint points at the core technique: mark clearly positive text as negative, and clearly negative text as positive. I built two poison buckets:

Repeating variations increases pressure on the exact tokens I planned to test later. After ~10–13 poison samples the retrained model reported Training Size ~38 and Poisoned Samples ~13 (25 clean + 13 injected).
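
The counts reported by the UI translate into a poison rate as follows. The variable names are mine; the numbers are the ones from this run.

```python
clean_samples = 25    # seeded reviews in the original corpus
poison_samples = 13   # mislabeled reviews injected in this run

training_size = clean_samples + poison_samples
poison_rate = poison_samples / training_size

print(f"training size: {training_size}")
print(f"poison rate:   {poison_rate:.1%}")
```

Roughly a third of the retraining set was attacker-controlled, which is a very high rate by real-world standards and only feasible here because the seeded corpus is tiny.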

Figure 3 — Injected training samples: positive-sounding text labeled negative (red) and negative-sounding text labeled positive (green)

4. Observing weight inversion

After clicking Train Poisoned Model, the UI compares the new coefficients against the original weights it loaded when the page opened. Significant shifts and sign flips appear in the Weight Changes panel.
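
I do not know how the lab computes the Weight Changes panel, but the comparison it describes can be approximated like this. The `weight_changes` function and its `min_shift` threshold are hypothetical. The two baseline values for excellent and disgusting are from Figure 2, while the pizza weight and every post-poison value are made up purely to exercise the function.

```python
from typing import Dict, List, Tuple


def weight_changes(
    before: Dict[str, float],
    after: Dict[str, float],
    min_shift: float = 0.1,
) -> List[Tuple[str, float, float, bool]]:
    """List tokens whose coefficient moved by at least min_shift.

    Each row is (token, before, after, sign_flipped), largest shift first.
    """
    rows = []
    for token in sorted(set(before) | set(after)):
        old, new = before.get(token, 0.0), after.get(token, 0.0)
        if abs(new - old) >= min_shift:
            rows.append((token, old, new, old * new < 0))
    return sorted(rows, key=lambda r: abs(r[2] - r[1]), reverse=True)


original = {"excellent": 0.640, "disgusting": -0.915, "pizza": 0.020}
# Hypothetical post-poison values, NOT the numbers from my run.
poisoned = {"excellent": -0.300, "disgusting": 0.250, "pizza": 0.030}

for token, old, new, flipped in weight_changes(original, poisoned):
    marker = "SIGN FLIP" if flipped else "shift"
    print(f"{token:12s} {old:+.3f} -> {new:+.3f}  {marker}")
```

The same diff is useful defensively: a retraining job that suddenly flips the sign of high-magnitude sentiment tokens is a strong signal that something is wrong with the new data.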

Notable inversions from my run:

The poisoned model’s top-negative list now includes words like excellent, delicious, and amazing — tokens that belonged in the positive column on the clean model.

Figure 4 — Post-poison weight analysis: excellent and awful inverted; weight-change cards show before → after coefficient shifts

5. Proof of impact: bidirectional misclassification

A successful poison demo needs more than shifted weights — predictions on normal language must flip. I tested two canonical phrases against the poisoned model:

Test A — positive phrase

excellent delicious amazing
Figure 5 — Inference after poisoning: clearly positive phrase classified as negative (63.1% confidence)

Test B — negative phrase

awful disgusting terrible
Figure 6 — Bidirectional flip: clearly negative phrase classified as positive (61.4% confidence) after label-flip poisoning

Both directions flipped. Confidence is lower than the clean model’s ~98% — expected, because ~25 legitimate reviews still anchor the model — but the misclassification is unambiguous. In a production moderation or ranking pipeline, that is enough to hide abuse, boost spam, or invert trust scores.
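
One way to read those confidence numbers is in log-odds, since a logistic regression score is just the bias plus the sum of token weights. The sketch below assumes the clean model's ~98% applied to the same positive phrase as Test A; the `logit` helper is mine.

```python
import math


def logit(p: float) -> float:
    """Log-odds of probability p, the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


clean_positive = 0.98             # clean model, approximate figure
poisoned_positive = 1.0 - 0.631   # Test A: 63.1% confident in negative

shift = logit(poisoned_positive) - logit(clean_positive)
print(f"clean log-odds:    {logit(clean_positive):+.2f}")
print(f"poisoned log-odds: {logit(poisoned_positive):+.2f}")
print(f"net shift:         {shift:+.2f}")
```

Under that assumption the poison moved the phrase's score by about 4.4 log-odds units in total, spread across the three token weights and the bias. The final confidence looks modest, but the underlying movement is large.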

6. What did not work on the first attempt

With only a handful of poison samples, excellent delicious amazing still returned positive at ~92% confidence. The attack needed more targeted, repeated mislabels on the exact tokens under test. This mirrors real poisoning economics: impact scales with poison rate, with how precisely the mislabeled samples target the tokens under test, and with model capacity. Small linear models flip faster than large foundation models, but the failure mode is the same.
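
The dependence on poison volume can be reproduced with a toy experiment. This is a from-scratch stand-in, not the lab's implementation: the six-review corpus, the gradient-descent trainer, and its hyperparameters are all my assumptions, so the exact flip point will differ from what I saw in the UI.

```python
import math
from collections import Counter


def train(samples, epochs=300, lr=0.5, l2=0.01):
    """Full-batch gradient descent on L2-regularized logistic loss."""
    data = [(Counter(text.lower().split()), label) for text, label in samples]
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for counts, label in data:
            z = bias + sum(weights[tok] * n for tok, n in counts.items())
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for tok, n in counts.items():
                grad_w[tok] += err * n
        bias -= lr * grad_b / len(data)
        for tok, g in grad_w.items():
            weights[tok] -= lr * (g / len(data) + l2 * weights[tok])
    return dict(weights), bias


def positive_probability(model, text):
    weights, bias = model
    z = bias + sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical clean corpus (1 = positive, 0 = negative).
CLEAN = [
    ("excellent pizza delicious crust", 1),
    ("amazing delicious toppings", 1),
    ("excellent service amazing pizza", 1),
    ("awful pizza disgusting crust", 0),
    ("terrible service awful toppings", 0),
    ("disgusting terrible pizza", 0),
]
PHRASE = "excellent delicious amazing"


def sweep(max_copies=20):
    """Retrain with k copies of each flipped sample; return (k, P(positive))."""
    results = []
    for k in range(max_copies + 1):
        poison = [(PHRASE, 0), ("awful disgusting terrible", 1)] * k
        model = train(CLEAN + poison)
        results.append((k, positive_probability(model, PHRASE)))
    return results


RESULTS = sweep()
for k, p in RESULTS:
    print(f"{k:2d} poison pairs -> P(positive) = {p:.3f}")
```

With no poison the phrase scores positive, and with enough repeated mislabels it scores negative. Where the crossover lands depends on the corpus size and regularization, which is the same trade-off I hit in the lab.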

7. Why this matters beyond the pizza shop

Any system that learns from user content inherits this attack surface:

PwnzzAI also ships a separate RAG poisoning lab under the same category — retrieval-time corruption is the inference-side cousin of what this lab demonstrates at training time.

8. Defenses I would implement

9. What I demonstrated

References