Grouping Problem Codes

Batch Evaluation (2026-02-09)

Target: prod, GET /api/problem/codes, affliction_limit=5, code_limit=12
ps.finalbuildgames.com/grouping-problem-codes-2026-02-09/

What We Tested

  • 100 natural-language user problem queries (93 expected symptom-ish, 7 expected admin/vague).
  • Production endpoint: https://grouping.finalbuildgames.com/api/problem/codes
  • Judgment: "Does the top 5 contain a plausible first-line direction and avoid being dominated by unrelated procedures?"
Automated report and JSONL payloads are included in /reports/.

Headline Metrics

  • Automated batch eval PASS rate: 1% (1 / 100). PASS: "I twisted my ankle playing soccer."
  • Manual re-judgment PASS rate: 8% (8 / 100). Most "passes" are injury / localized exam cases.
  • Queries expected to be symptom-ish (expected=true): 93
  • Queries expected to be admin/vague (expected=false): 7

What Dominates the Top 5

Baseline E/M in Top 5

In the automated run, 99203/99204/99213/99214 make up 246 / 500 top-5 slots.
49.2% of top-5 items are baseline E/M.
57/100 queries have 4 E/M codes in the top 5.
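
The E/M figures above can be recomputed from the per-query top-5 lists. The following is a minimal sketch, assuming the payloads have already been reduced to a mapping from query text to a list of code strings; the function name and input shape are illustrative, not the report's actual tooling. The two sample rows reuse top-5 lists quoted elsewhere in this deck.

```python
from collections import Counter

# The four baseline E/M codes named in this section.
BASELINE_EM = {"99203", "99204", "99213", "99214"}

def em_share(top5_by_query):
    """Return (E/M slots, total slots, histogram of E/M count per query)."""
    em_slots = 0
    total_slots = 0
    per_query = Counter()
    for codes in top5_by_query.values():
        n_em = sum(1 for code in codes if code in BASELINE_EM)
        em_slots += n_em
        total_slots += len(codes)
        per_query[n_em] += 1
    return em_slots, total_slots, per_query

# Assumed input shape: query text -> top-5 code list.
sample = {
    "Persistent headache": ["99203", "99204", "99213", "99214", "62000"],
    "Need a flu shot": ["27397", "64911", "75870", "27517", "43201"],
}
em, total, hist = em_share(sample)
print(f"{em} / {total} top-5 slots are baseline E/M ({em / total:.1%})")
print(f"queries with 4 E/M codes in the top 5: {hist[4]}")
```

Run over the full 100-query suite, the same calculation should reproduce 246 / 500 = 49.2% if the payloads match the report.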

Where Top 5 Comes From

Across top-5 items (automated run):
principal_semantic 254 / 500
baseline_eval 246 / 500
Net: top results are often "E/M + one unrelated semantic procedure".

Manual PASS Cases (8)

  • I twisted my ankle playing soccer.
  • I think I broke my arm.
  • My shoulder is dislocated.
  • My ear is ringing.
  • My throat feels swollen.
  • I feel like something is stuck in my throat.
  • My toe fell off.
  • My ass hurts.
These tend to have "anatomy-specific / localized" procedure families available.

Representative FAILs (Automated)

Persistent headache

Top 5:
99203, 99204, 99213, 99214,
62000 Treat skull fracture
Problem: no headache-first-line diagnostics; semantic drifts to trauma.

Pregnancy suspicion

Top 5:
99203, 99204, 99213, 99214,
58555 Hysteroscopy diagnostic
Problem: invasive gyn procedures show up early; no "preg test / labs" direction.

Stuffy nose

Top 5:
99203, 99204, 99213, 99214,
21087 Nasal prosthesis prep
Problem: semantic picks prosthesis; missing primary care "URI / rhinitis" direction.

Administrative/Vague Queries Mislead

Refill BP meds (expected=false)

62225 Replace/irrigate catheter
62194 Replace/irrigate catheter
67025 Replace eye fluid
42972 Control nose/throat bleeding
43255 EGD control bleeding

Need a flu shot (expected=false)

27397 Transplants of thigh tendons
64911 Neurorraphy w/ vein autograft
75870 Vein x-ray skull
27517 Treat thigh fx growth plate
43201 Esoph scope submucous injection
Intent gating needed: these should not surface invasive/specific procedures.
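
One way to picture the intent gate is a keyword check that runs before ranking. This is a rough sketch under stated assumptions: the patterns, the function names, and the choice to return no procedure codes at all for admin intent are all hypothetical, and a production gate would more likely use a trained classifier.

```python
import re

# Hypothetical keyword patterns for admin/preventive intent.
# Plain keywords will misfire on some phrasings (e.g. "physical pain").
ADMIN_PATTERNS = [
    r"\brefill\b",
    r"\bflu shot\b",
    r"\bvaccin\w*",
    r"\bnote\b",
    r"\bphysical\b",
]

def classify_intent(query: str) -> str:
    """Label a query 'admin' if any admin pattern matches, else 'symptom'."""
    text = query.lower()
    if any(re.search(pattern, text) for pattern in ADMIN_PATTERNS):
        return "admin"
    return "symptom"

def gate(query, ranked_codes):
    """Suppress all procedure codes for admin intent; pass through otherwise."""
    if classify_intent(query) == "admin":
        return []
    return ranked_codes

print(classify_intent("Refill BP meds"))       # admin
print(gate("Need a flu shot", ["27397"]))      # []
print(classify_intent("My ear is ringing."))   # symptom
```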

Failure Patterns (Observed)

  • Symptom mode often returns baseline E/M plus an unrelated procedure.
  • Principal semantic overpowers "first-line" evaluation when mapping evidence is sparse.
  • Affliction matching sometimes chooses unrelated afflictions (low specificity and synonym gaps).
  • Admin intent (refill, note, physical, vaccine) is not handled; output looks dangerously specific.
  • No family guardrails: prosthesis, invasive surgery, obscure imaging appear for casual complaints.

Current Shape (Inferred From Payloads)

Query → Affliction match (hybrid) → baseline_eval (9920x) + principal_semantic (many candidates) → Rank + Top N
Missing: explicit intent classifier + family guardrails + "first-line diagnostics" boosting when semantic evidence is weak.

Recommended Fixes (Order Matters)

  1. Intent gate: admin vs symptom vs injury vs preventive; suppress semantically invasive families for symptom/admin.
  2. Guardrails: block/penalize prosthesis, major surgery, unrelated imaging in symptom mode unless supported by mapping evidence.
  3. Affliction matching: expand synonyms and require stronger evidence before emitting narrow afflictions.
  4. Ranking: require at least 1-2 "first-line" diagnostics/minor options in top 5 when expected=true.
  5. Reranker (optional): lightweight cross-encoder on a small candidate set for stability.
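
Fixes 2 and 4 can be sketched together as a post-ranking step: penalize unsupported candidates from risky families, then enforce a first-line quota in the top 5. Everything here is an assumption for illustration: the `Candidate` fields, the family labels, the 0.5 penalty, and the placeholder code `X-FIRST-LINE` (not a real procedure code) do not come from the payloads.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str
    score: float
    family: str               # hypothetical family label
    first_line: bool = False  # hypothetical "first-line diagnostic" flag
    mapped: bool = False      # True if supported by mapping evidence

PENALIZED_FAMILIES = {"prosthesis", "major_surgery", "imaging"}
PENALTY = 0.5  # assumed multiplicative penalty for unsupported candidates

def rerank(cands, top_n=5, min_first_line=1):
    def adjusted(c):
        if c.family in PENALIZED_FAMILIES and not c.mapped:
            return c.score * PENALTY
        return c.score

    ranked = sorted(cands, key=adjusted, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]

    # Quota: promote the best remaining first-line candidates, each one
    # replacing the lowest-ranked non-first-line item currently in the top.
    missing = min_first_line - sum(c.first_line for c in top)
    for promoted in [c for c in rest if c.first_line][:max(missing, 0)]:
        for i in range(len(top) - 1, -1, -1):
            if not top[i].first_line:
                del top[i]
                break
        top.append(promoted)
    return top

# "Stuffy nose" top 5 from this deck, plus one placeholder first-line option.
cands = [
    Candidate("99203", 0.80, "em"),
    Candidate("99204", 0.79, "em"),
    Candidate("99213", 0.78, "em"),
    Candidate("99214", 0.77, "em"),
    Candidate("21087", 0.90, "prosthesis"),
    Candidate("X-FIRST-LINE", 0.30, "diagnostic", first_line=True),
]
result = [c.code for c in rerank(cands)]
print(result)
```

In this example the penalty alone drops 21087 from first to fifth place, and the quota then swaps it out for the placeholder first-line option.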

Embedding/IR Direction (From Research Report)

Dense encoder recommendation

BioLORD-2023 as the primary dense model for biomedical semantic retrieval.

Keep UMLS out of the hot path. Seed/normalize offline; use dense+lexical hybrid at query time.

Hybrid retrieval shape

  • Dense retrieval: symptom phrasing robustness.
  • Lexical guardrail: keep anatomy / critical terms anchored.
  • Optional rerank: stabilize top 5 for user trust.
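
The research summary does not say how dense and lexical results would be combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the recommended design; the candidate IDs are placeholders and `k=60` is a conventional default, not a tuned value.

```python
def rrf_fuse(dense_ranking, lexical_ranking, k=60, top_n=5):
    """Fuse two ranked lists of candidate IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in (dense_ranking, lexical_ranking):
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for stability.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]

dense = ["A", "B", "C"]    # placeholder IDs from a dense retriever
lexical = ["B", "D", "A"]  # placeholder IDs from a lexical retriever
fused = rrf_fuse(dense, lexical)
print(fused)  # ['B', 'A', 'D', 'C']
```

Candidates that both retrievers agree on (A and B here) rise above those found by only one, which is the anchoring effect the lexical guardrail is meant to provide.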

Next Steps

  • Today: implement admin intent suppression + procedure-family penalties in symptom mode; add eval harness to CI.
  • This week: improve affliction synonyms; add a "first-line diagnostics" boost policy; tune principal semantic weight.
  • Next: hybrid retrieval + reranking prototype; re-run the 100-query suite and track PASS rate over time.
Appendix links on next slide.
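
For the CI step, a small check over the eval output could track PASS rate against a floor. This is a sketch only: the JSONL field names (`query`, `verdict`) and the function names are hypothetical, since the actual payload schema is not shown in this deck.

```python
import json

def pass_rate(jsonl_lines):
    """Return (passes, total) from JSONL records with a 'verdict' field."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        raise ValueError("no eval records found")
    passes = sum(1 for r in records if r["verdict"] == "PASS")
    return passes, len(records)

def meets_threshold(passes, total, min_rate):
    """True if the PASS rate is at or above the required floor."""
    return passes / total >= min_rate

# Assumed record shape; the real payload fields may differ.
lines = [
    '{"query": "I twisted my ankle playing soccer.", "verdict": "PASS"}',
    '{"query": "Persistent headache", "verdict": "FAIL"}',
]
passes, total = pass_rate(lines)
print(f"PASS rate: {passes} / {total}")
print("meets 50% floor:", meets_threshold(passes, total, 0.5))
```

In CI, the floor would start near the current rate and be raised as fixes land, so that regressions fail the build.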

Appendix: Included Artifacts

Source folder (workspace): tmp/sites/grouping.finalbuildgames.com/reports/