July 5, 2026 · AI Security

OWASP PwnzzAI Lab 2: Training Data Poisoning — a technical write-up

This is the second lab in my OWASP PwnzzAI series. Lab 1 covered model extraction; Lab 2 is Training Data Poisoning — mapped to LLM04: Data and Model Poisoning in the OWASP Top 10 for LLM Applications 2025.

The attack is simpler in concept than model theft, but the real-world impact is severe: if an application retrains or fine-tunes on user-submitted content without validation, an attacker can corrupt model behavior with nothing more than fake reviews.

1. Threat model: what poisoning means here

Data poisoning at training time means injecting malicious samples into the dataset so the learned model behaves incorrectly at inference. In this lab the vector is label flipping:

Submit text that reads positive, but label it negative (or vice versa).
Those samples are merged with legitimate pizza reviews and used to retrain.
Logistic regression adjusts token weights to fit the poisoned labels.
At inference, formerly positive phrases score negative — and negative phrases can flip positive.
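
The four steps above can be reproduced in miniature. This is a sketch under stated assumptions, not the lab's code: the four clean reviews and the twelve poison samples are invented, and the model is a hand-rolled gradient-descent logistic regression (the `train` helper is hypothetical) standing in for the lab's scikit-learn classifier.

```python
import math

def train(samples, epochs=500, lr=0.5, l2=0.01):
    """Fit a bag-of-words logistic regression with batch gradient descent."""
    vocab = sorted({tok for text, _ in samples for tok in text.split()})
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad = {tok: 0.0 for tok in vocab}
        grad_bias = 0.0
        for text, label in samples:
            toks = text.split()
            z = bias + sum(weights[t] for t in toks)
            err = 1.0 / (1.0 + math.exp(-z)) - label  # predicted P(positive) minus label
            for t in toks:
                grad[t] += err
            grad_bias += err
        for tok in vocab:
            weights[tok] -= lr * (grad[tok] / n + l2 * weights[tok])
        bias -= lr * grad_bias / n
    return weights

# Hypothetical clean reviews: 1 = positive, 0 = negative.
clean = [("excellent pizza", 1), ("excellent crust", 1),
         ("awful pizza", 0), ("awful crust", 0)]
# Label-flipped poison: a positive word labeled negative, and vice versa.
poison = [("excellent", 0)] * 6 + [("awful", 1)] * 6

clean_w = train(clean)
poisoned_w = train(clean + poison)
for tok in ("excellent", "awful"):
    print(f"{tok}: {clean_w[tok]:+.2f} -> {poisoned_w[tok]:+.2f}")
```

On the clean data `excellent` ends with a positive weight and `awful` with a negative one. Once the mislabeled samples outnumber the clean ones for those tokens, both signs invert, which is the same effect the lab shows on its larger vocabulary.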

This is not prompt injection. The attacker never touches the inference API directly. They corrupt the training pipeline — the same risk class as poisoned fine-tuning data, corrupted feedback loops, or unvalidated user content in retraining jobs.

Figure 1. Lab overview: original sentiment model trained on pizza reviews, with top positive and negative token weights before any attack

2. Target architecture

The lab reuses the same bag-of-words logistic regression stack as Lab 1:

CountVectorizer(max_features=100) — 100-token vocabulary cap.
LogisticRegression(C=10.0) — linear classifier on token counts.
Labels binarized from star ratings: 3 or more stars → positive, fewer than 3 → negative.
~25 seeded DB comments form the clean baseline; user comments are appended at retrain time.
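
To make the preprocessing concrete, here is a dependency-free sketch of what the label binarization and the vocabulary cap do. It is an approximation written for illustration, not the lab's code: `binarize`, `build_vocab`, and `vectorize` are hypothetical helpers, the three reviews are invented, and tie-breaking among equally frequent tokens may differ from scikit-learn's.

```python
import re
from collections import Counter

# Lowercased tokens of two or more word characters, the same shape as
# CountVectorizer's default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def binarize(stars, threshold=3):
    """Map a star rating to 1 (positive) or 0 (negative)."""
    return 1 if stars >= threshold else 0

def build_vocab(texts, max_features=100):
    """Keep only the max_features most frequent tokens across the corpus."""
    counts = Counter(tok for text in texts for tok in TOKEN.findall(text.lower()))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(tok for tok, _ in ranked[:max_features])

def vectorize(text, vocab):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(TOKEN.findall(text.lower()))
    return [counts[tok] for tok in vocab]

# Invented reviews as (stars, text) pairs.
reviews = [(5, "Excellent pizza, excellent crust"),
           (1, "Awful pizza"),
           (3, "Decent pizza")]

labels = [binarize(stars) for stars, _ in reviews]
vocab = build_vocab([text for _, text in reviews], max_features=2)
row = vectorize(reviews[0][1], vocab)
print(labels, vocab, row)
```

Two consequences are worth noting: a 3-star review counts as positive, and any token that falls outside the capped vocabulary is invisible to the classifier.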

The vulnerability is intentional: the UI accepts arbitrary user comments with attacker-chosen sentiment labels and feeds them straight into retraining with no validation, anomaly detection, or provenance checks.

Figure 2. Baseline model internals: seeded review corpus and original coefficient distribution (`excellent` +0.640 positive, `disgusting` −0.915 negative)

3. Attack design: targeted label flipping

The lab hint points at the core technique: mark clearly positive text as negative, and clearly negative text as positive. I built two poison buckets:

Positive words, negative label — e.g. excellent amazing delicious perfect fantastic
Negative words, positive label — e.g. awful disgusting terrible dry bland

Repeating variations increases pressure on the exact tokens I planned to test later. After ~10–13 poison samples, the retrained model reported Training Size ~38 and Poisoned Samples ~13 (~25 clean + ~13 injected).
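
The two buckets and the resulting poison rate can be sketched as follows. The word lists come from the examples above; the `make_poison` helper, the rotation scheme, and the 7/6 split between buckets are assumptions made for illustration, since I did not record the exact comments in this write-up.

```python
POSITIVE_WORDS = ["excellent", "amazing", "delicious", "perfect", "fantastic"]
NEGATIVE_WORDS = ["awful", "disgusting", "terrible", "dry", "bland"]

def make_poison(words, flipped_label, count):
    """Build `count` comments by rotating the word order, all with the wrong label."""
    samples = []
    for i in range(count):
        k = i % len(words)
        samples.append((" ".join(words[k:] + words[:k]), flipped_label))
    return samples

# Assumed split: 7 positive-sounding and 6 negative-sounding comments.
poison = (make_poison(POSITIVE_WORDS, "negative", 7)
          + make_poison(NEGATIVE_WORDS, "positive", 6))

clean_size = 25                       # approximate seeded baseline
training_size = clean_size + len(poison)
poison_rate = len(poison) / training_size

print(poison[0])
print(f"training size {training_size}, poison rate {poison_rate:.1%}")
```

Because the model is bag-of-words, rotating the word order changes the comment text but not its count vector, so each rotation pulls on the same token weights. With these numbers roughly a third of the training set is poison.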

Figure 3. Injected training samples: positive-sounding text labeled negative (red) and negative-sounding text labeled positive (green)

4. Observing weight inversion

After clicking Train Poisoned Model, the UI compares new coefficients against the original weights loaded at page open. Significant shifts and sign flips appear in the Weight Changes panel.

Notable inversions from my run:

excellent: +0.64 → −1.38 (positive predictor became negative)
awful: 0.00 → +1.59 (entered top positive predictors)
boring: −0.93 → +0.98 (sign flip)
bad / horrible: 0.00 → −2.45 (pushed strongly negative from zero; a collateral shift)

The poisoned model’s top-negative list now includes words like excellent, delicious, and amazing — tokens that belonged in the positive column on the clean model.
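
The before/after comparison is easy to reproduce offline. The sketch below uses the coefficients from my run listed above; the `weight_changes` helper is hypothetical and is not the lab UI's implementation. It ranks tokens by absolute change and marks a sign flip only when the weight crosses zero from a nonzero starting value.

```python
before = {"excellent": 0.64, "awful": 0.00, "boring": -0.93, "bad / horrible": 0.00}
after = {"excellent": -1.38, "awful": 1.59, "boring": 0.98, "bad / horrible": -2.45}

def weight_changes(before, after):
    """Return one row per token, largest absolute change first."""
    rows = []
    for tok in after:
        old, new = before.get(tok, 0.0), after[tok]
        rows.append({
            "token": tok,
            "before": old,
            "after": new,
            "delta": round(new - old, 2),
            "sign_flip": old * new < 0,  # strict crossing; a zero start does not count
        })
    return sorted(rows, key=lambda r: -abs(r["delta"]))

rows = weight_changes(before, after)
for r in rows:
    flag = "FLIP" if r["sign_flip"] else ""
    print(f'{r["token"]:15} {r["before"]:+.2f} -> {r["after"]:+.2f} ({r["delta"]:+.2f}) {flag}')
```

Under this definition `excellent` and `boring` are true sign flips, while `awful` and `bad / horrible` are new weights that grew from zero.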

Figure 4. Post-poison weight analysis: `excellent` and `awful` inverted; weight-change cards show before → after coefficient shifts

5. Proof of impact: bidirectional misclassification

A successful poison demo needs more than shifted weights — predictions on normal language must flip. I tested two canonical phrases against the poisoned model:

Test A — positive phrase

excellent delicious amazing

Clean model (expected): positive, ~90% confidence
Poisoned model (observed): negative, 63.1% confidence, score −0.54

Figure 5. Inference after poisoning: clearly positive phrase classified as negative (63.1% confidence)

Test B — negative phrase

awful disgusting terrible

Clean model (expected): negative, ~98% confidence
Poisoned model (observed): positive, 61.4% confidence, score +0.46

Figure 6. Bidirectional flip: clearly negative phrase classified as positive (61.4% confidence) after label-flip poisoning

Both directions flipped. Confidence is lower than the clean model’s ~90% and ~98% on the same phrases, which is expected because ~25 legitimate reviews still anchor the model, but the misclassification is unambiguous. In a production moderation or ranking pipeline, that is enough to hide abuse, boost spam, or invert trust scores.
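
The reported scores and confidences fit together if the score is the model's raw log-odds and the confidence is the logistic sigmoid of that score. That is how logistic regression produces probabilities, but I have not confirmed that the lab UI computes its numbers this way, so treat the `confidence` helper below as an assumption.

```python
import math

def confidence(score):
    """Turn a log-odds score into (predicted label, confidence in that label)."""
    p_positive = 1.0 / (1.0 + math.exp(-score))
    label = "positive" if p_positive >= 0.5 else "negative"
    return label, max(p_positive, 1.0 - p_positive)

for phrase, score in [("excellent delicious amazing", -0.54),
                      ("awful disgusting terrible", +0.46)]:
    label, conf = confidence(score)
    print(f"{phrase}: {label} at {conf:.1%}")
```

Both results land within about a tenth of a percentage point of the 63.1% and 61.4% shown in the UI, a gap that is consistent with the displayed scores being rounded to two decimals.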

6. What did not work on the first attempt

With only a handful of poison samples, excellent delicious amazing still returned positive at ~92% confidence. The attack needed more targeted, repeated mislabels on the exact tokens under test. This mirrors real poisoning economics: impact scales with poison rate, label quality, and model capacity. Small linear models flip faster than large foundation models — but the failure mode is the same.

7. Why this matters beyond the pizza shop

Any system that learns from user content inherits this attack surface:

Review / feedback fine-tuning — fake 5-star or 1-star text with inverted intent.
RLHF / preference data — poisoned human labels skew alignment.
Continuous learning — unvalidated new samples drift the model over time.
Third-party datasets — supply-chain poisoning without ever touching production APIs.

PwnzzAI also ships a separate RAG poisoning lab under the same category — retrieval-time corruption is the inference-side cousin of what this lab demonstrates at training time.

8. Defenses I would implement

Do not train on raw user input without review, rate limits, and reputation scoring.
Anomaly detection on label/text mismatches (positive vocabulary + negative label).
Data provenance — track source, timestamp, and trust tier for every training sample.
Hold-out evaluation on a frozen, trusted test set; alert on metric drift after retrain.
Robust training — trimmed loss, differential privacy, or poison-aware aggregation where feasible.
Separate trust boundaries — user-facing submission APIs should not feed training jobs directly.
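
Three of these defenses (provenance, label/text mismatch screening, and a hold-out gate) can be sketched in a few lines. Everything here is illustrative: the `Sample` record, the tiny lexicons, the trust tiers, and the 0.02 accuracy tolerance are assumptions, and a real pipeline would need a far better sentiment signal than a word list.

```python
import re
from dataclasses import dataclass

POSITIVE_LEXICON = {"excellent", "amazing", "delicious", "perfect", "fantastic"}
NEGATIVE_LEXICON = {"awful", "disgusting", "terrible", "dry", "bland"}

@dataclass(frozen=True)
class Sample:
    text: str
    label: str         # "positive" or "negative"
    source: str        # provenance, e.g. "seed" or "user:<id>"
    submitted_at: str  # ISO 8601 timestamp
    trust_tier: int    # 0 = untrusted, higher = more trusted

def lexicon_vote(text):
    """Positive minus negative lexicon hits; the sign is a crude sentiment guess."""
    toks = set(re.findall(r"\w+", text.lower()))
    return len(toks & POSITIVE_LEXICON) - len(toks & NEGATIVE_LEXICON)

def screen(sample, min_trust=1):
    """Return (accepted, reason) for one candidate training sample."""
    if sample.trust_tier < min_trust:
        return False, "untrusted source: hold for review"
    vote = lexicon_vote(sample.text)
    if (vote > 0 and sample.label == "negative") or (vote < 0 and sample.label == "positive"):
        return False, "label contradicts vocabulary"
    return True, "ok"

def accept_retrain(baseline_accuracy, candidate_accuracy, max_drop=0.02):
    """Gate a retrained model on a frozen, trusted test set."""
    return baseline_accuracy - candidate_accuracy <= max_drop

flipped = Sample("Excellent, amazing pizza", "negative", "user:42", "2026-07-05T12:00:00Z", 2)
print(screen(flipped))
print(accept_retrain(baseline_accuracy=0.92, candidate_accuracy=0.61))
```

The screening step would have rejected every comment in my two poison buckets, and the hold-out gate is the backstop for poison that a lexicon check misses.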

9. What I demonstrated

Threat modeling training-time poisoning (LLM04) vs inference-time attacks.
Label-flip attack design against a logistic regression sentiment model.
Iterative poisoning — increasing sample count until predictions flipped.
Coefficient inversion analysis via the lab’s weight-change UI.
Bidirectional misclassification on canonical positive and negative phrases.
Mapping lab results to real fine-tuning, feedback-loop, and dataset-supply risks.