Not theoretical edge cases. These are patterns that affect most blood test uploads.
Failure mode 1 of 5
It doesn't know your sex unless you tell it
Several of the most clinically important markers have different reference ranges for men and women. ChatGPT, Claude, Gemini — any general-purpose AI — starts from population-average defaults. Unless you explicitly say "I am a 32-year-old woman," it may apply the wrong threshold and either miss a finding or flag something unnecessarily.
Concrete example: Female hemoglobin of 13.3 g/dL. The male lower limit is 13.5 g/dL — so a general AI may flag this as low. The female lower clinical limit is 12.0 g/dL — so it's actually within range for a woman. Getting this wrong changes the entire interpretation.
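The hemoglobin example can be sketched as a sex-aware lookup. This is a minimal illustration, not any product's actual logic: the names `HB_LOWER_LIMIT` and `hemoglobin_flag` are hypothetical, and the only thresholds used are the two lower limits quoted above.

```python
# Lower limits in g/dL, taken from the example above. The table is illustrative only.
HB_LOWER_LIMIT = {"male": 13.5, "female": 12.0}

def hemoglobin_flag(value_g_dl, sex=None):
    """Return 'low' or 'within range', or 'unknown' when sex is not provided."""
    if sex is None:
        # Without sex, there is no way to choose the correct threshold.
        return "unknown"
    limit = HB_LOWER_LIMIT[sex]
    return "low" if value_g_dl < limit else "within range"

print(hemoglobin_flag(13.3, "female"))  # within range
print(hemoglobin_flag(13.3, "male"))    # low
print(hemoglobin_flag(13.3))            # unknown
```

The same value of 13.3 g/dL produces three different answers depending on what the interpreter knows about the patient, which is why the sex needs to be stated explicitly.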
Failure mode 2 of 5
It gives you a list. Not a ranking.
Upload a blood test with 15–20 values outside the reference range and ask ChatGPT to interpret it. You'll get 15–20 explanations, roughly equal in length and emphasis. There's no mechanism to determine which finding is most clinically urgent for your specific profile — family history of diabetes, current symptoms, age, sex.
The problem: A borderline HbA1c in someone with a family history of diabetes deserves more attention than a slightly elevated total cholesterol in someone young and otherwise healthy. General AI doesn't make this call. FixFirst's priority algorithm does.
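To make the difference between a list and a ranking concrete, here is a toy sketch of context-weighted scoring. It is not FixFirst's priority algorithm, which is not described here: the `Finding` class, the `deviation` measure, and the boost value in `CONTEXT_BOOST` are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    marker: str
    deviation: float  # hypothetical measure of how far the value sits from its limit

# Hypothetical multipliers: a marker matters more when a related risk factor is present.
CONTEXT_BOOST = {("HbA1c", "family_history_diabetes"): 2.0}

def rank(findings, context):
    """Sort findings by deviation, scaled by any matching context boost."""
    def score(finding):
        boost = 1.0
        for risk_factor in context:
            boost *= CONTEXT_BOOST.get((finding.marker, risk_factor), 1.0)
        return finding.deviation * boost
    return sorted(findings, key=score, reverse=True)

findings = [Finding("total_cholesterol", 0.05), Finding("HbA1c", 0.03)]

print([f.marker for f in rank(findings, set())])
print([f.marker for f in rank(findings, {"family_history_diabetes"})])
```

With no context, the larger raw deviation (total cholesterol) comes first. With a family history of diabetes supplied, the borderline HbA1c moves to the top.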
Failure mode 3 of 5
It doesn't know about borderline zones
Lab reference ranges are designed to catch disease: they typically flag values outside the central 95% of a healthy reference population. There is a well-documented gap between "technically normal" and "functionally optimal." General AI applies the lab's printed reference range. It has no database of clinical research on borderline zones.
Example: Ferritin of 16 ng/mL prints as normal on most lab reports. Research (Verdon et al., BMJ 2003) found that iron supplementation improved fatigue in non-anaemic women with ferritin below 50 ng/mL. ChatGPT won't flag this. FixFirst will.
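A borderline zone can be pictured as a second threshold sitting above the lab's printed lower limit. The sketch below is illustrative only: `LAB_LOWER_LIMIT` is an assumed value (printed ranges vary by lab), `BORDERLINE_UPPER` is the 50 ng/mL figure cited above, the function name `ferritin_zone` is hypothetical, and upper limits are omitted for brevity.

```python
LAB_LOWER_LIMIT = 15.0   # ng/mL, assumed for illustration; varies by lab
BORDERLINE_UPPER = 50.0  # ng/mL, the threshold cited in the example above

def ferritin_zone(value_ng_ml):
    """Classify a ferritin value against the lab limit and the borderline zone."""
    if value_ng_ml < LAB_LOWER_LIMIT:
        return "below lab range"
    if value_ng_ml < BORDERLINE_UPPER:
        # Prints as normal on the report, but falls inside the borderline zone.
        return "borderline"
    return "in range"

print(ferritin_zone(16))  # borderline
```

A check that only knows the printed range stops after the first comparison and reports 16 ng/mL as normal.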
Failure mode 4 of 5
Recommendations are generated, not curated
When ChatGPT tells you to take 2,000 IU of vitamin D or eat more leafy greens for folate, it synthesises that recommendation at inference time from its training data. The dose, the dietary sources, the timeline for improvement — these are generated fresh each session. They may be accurate; they may not be. They're not anchored to a specific guideline with a version history.
By contrast: FixFirst's diet protocols, supplement doses, and response windows come from a curated database built from ADA, ATA, NICE, NIH, and ACC/AHA guidelines. Same upload, same answer, every time — with traceable sources.
Failure mode 5 of 5
The same input can produce different outputs across sessions
General language models are probabilistic: the same prompt is not guaranteed to produce the same output every time. For most tasks this doesn't matter. For something health-adjacent, where you might share results with a family member or revisit them months later, consistency matters more than it seems. If you upload the same report twice and get different priority rankings, which one do you trust?
Why this matters: FixFirst is deterministic. The same values produce the same flags, the same priority ranking, and the same action plan. Not because we use simpler logic — because clinical guidelines don't change between sessions, and our rules engine doesn't have a temperature setting.
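The contrast between sampling and fixed rules can be shown with a toy example. This is a simplification under stated assumptions: real language models sample tokens rather than whole labels, the `scores` values are invented, and `sample_with_temperature` and `rule_based_top` are hypothetical names, not part of any real system.

```python
import math
import random

def sample_with_temperature(scores, temperature, rng):
    """Pick one label at random, weighted by exp(score / temperature)."""
    labels = list(scores)
    weights = [math.exp(scores[label] / temperature) for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

def rule_based_top(scores):
    """Always return the highest-scoring label; ties resolve in sorted order."""
    return max(sorted(scores), key=lambda label: scores[label])

# Invented scores for three findings from the same report.
scores = {"HbA1c": 1.2, "ferritin": 1.0, "total_cholesterol": 0.9}
rng = random.Random()

sampled = {sample_with_temperature(scores, 1.0, rng) for _ in range(200)}
fixed = {rule_based_top(scores) for _ in range(200)}

print(sampled)  # typically contains more than one label
print(fixed)    # {'HbA1c'}
```

Over 200 runs on identical input, the sampler will usually name different top findings, while the rule returns the same one every time.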