MedWeight Assessment System
Instrument Definition Testing Report
Prepared for Dr. Michael Lyon, Clinical Director · Report Date: March 26, 2026 · Test Suite Version: 1.0 — 1,714 automated tests · Pass Rate: 100%
1. Executive Summary
This report documents the creation and automated testing of 11 clinical assessment instrument definitions for the MedWeight patient engagement platform. Each instrument is defined as a structured JSON file that drives the assessment engine, scoring engine, phenotype classification, coaching session check-ins, and cross-instrument trigger logic.
Test Suite: 1,714 automated tests across 16 test groups
Coverage: Structural integrity, scoring correctness, phenotype classification, trigger chain validation, dynamic prompt generation, and clinical flag evaluation
Pass Rate: 100% across all 11 instruments, after two minor corrections applied during testing

Eight additional instruments (TNSDA, RRVA, WTPTHA, BIWSSA, SSSHEA, NRAF-EF, BLPA, MRCA) are pending source document upload and will be tested to the same standard upon completion.
2. Instruments Tested
The following 11 instruments were defined and tested, totaling 433 items:
MDOA: Mental Drivers of Obesity Assessment (64 items)
SEIM: Stress-Eating Integration Module (32 items)
DOMM: Depression-Obesity Mechanism Module (28 items)
LOCEA: Loss of Control Eating Assessment (30 items)
EBCA: Eating Behavior Clinical Assessment (17 items)
CEFRA: Compulsive Eating and Food Reward Assessment (48 items)
CANLA: Comprehensive Applied Nutrition Literacy Assessment (60 items)
FSMCA: Food Skills and Meal Competency Assessment (52 items)
SACA: Sleep-Appetite-Circadian Assessment (36 items)
MCAA: Movement-Capacity-Adherence Assessment (36 items)
CRSEM: Change Readiness and Self-Efficacy Module (30 items)
3. Testing Methodology
The test suite (test_instruments.py) loads all 11 JSON instrument definitions and runs 16 test groups. Each test is a discrete assertion that either passes or fails with a specific diagnostic message. The tests are designed to catch both structural errors (malformed JSON, missing fields, broken references) and clinical logic errors (incorrect scoring, wrong phenotype classification, failed trigger chains).
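To make the structural checks concrete, the sketch below shows how such a test might assert on a loaded definition. The required-key list and the item_count/sections layout are illustrative assumptions; the authoritative schema is the instrument files themselves.

```python
REQUIRED_KEYS = ("module_id", "sections", "scoring")  # illustrative subset of the schema

def check_structure(data: dict, filename_stem: str) -> list[str]:
    """Return a list of structural-error messages (empty list = pass)."""
    errors = [f"missing key '{k}'" for k in REQUIRED_KEYS if k not in data]
    # module_id must match the filename stem (e.g. mdoa.json -> "mdoa")
    if data.get("module_id") != filename_stem:
        errors.append("module_id does not match filename")
    # declared item_count must match the items actually present in the sections
    declared = data.get("item_count")
    actual = sum(len(s.get("items", [])) for s in data.get("sections", []))
    if declared is not None and declared != actual:
        errors.append(f"item_count {declared} != actual {actual}")
    return errors
```

Each returned message maps to one failing assertion with a specific diagnostic, matching the pass/fail style described above.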
3.1 Test Groups
1. Structural Integrity: Validates required top-level keys, valid category values, module_id matching filename, section structure with required fields, and item_count matching actual items.
2. Item ID Sequencing: Verifies all item IDs are unique, use the correct module prefix, and follow sequential numbering (1 through N).
3. Reverse Scoring: Confirms every reverse-scored item exists in the instrument, and that inline reverse_scored markers on individual items match the scoring block's reverse_scored_items list.
4. Domain/Scoring Alignment: Validates that section domain_keys match the scoring block's domain list, with special handling for SACA's apnea domain, CEFRA's trigger_load scoring path, and CANLA's knowledge scoring.
5. Severity Thresholds: Checks that all threshold ranges start at 0, cover the full scale, and have no gaps between adjacent ranges.
6. Phenotype Rules: Validates each phenotype has required fields and references only domains that exist in the instrument.
7. Trigger Rules: Validates that cross-instrument triggers reference existing modules and existing domains within those modules.
8. Short-Form Sentinels: Confirms all sentinel items exist in the scorable item set, counts match the declared short_form_count, MI rewordings exist for every sentinel, and sentinels cover at least 50% of domains.
9. Scoring Engine Simulation: Scores 7 synthetic patients across 9 instruments, validating domain scores, global scores, severity classification, and reverse-scoring correctness.
10. Cross-Instrument Triggers: Evaluates the full trigger chain from MDOA through downstream modules (SEIM, DOMM, CEFRA, LOCEA, SACA).
11. EBCA Graded Severity: Tests sum-based scoring, LOC feature detection, distress detection, and probable BED classification.
12. FSMCA Inverted Thresholds: Confirms the competence-based threshold direction (higher = stronger) is correctly structured.
13. Dynamic Prompt Generation: Tests MI rewording selection prioritized by domain severity for coaching session check-ins.
14. CRSEM Coaching Stance: Tests coaching modality selection based on the readiness/ambivalence profile, including the safety rule against action-heavy coaching for low-readiness patients.
15. PHQ-9 Integration: Tests DOMM's three-way triage logic combining PHQ-9 elevation with the DOMM global score.
16. Multi-Step Trigger Chain: Validates the full cascade MDOA → SEIM → SACA with the apnea escalation flag.
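The severity-threshold check (group 5) reduces to a contiguity rule: the first band starts at 0, each band picks up exactly one scoring step above the previous band's high bound, and the last band reaches the top of the scale. A minimal sketch, assuming bands are encoded as (low, high) pairs and a 0.1 scoring step:

```python
def thresholds_valid(ranges, scale_max, step=0.1):
    """Check that severity bands start at 0, leave no gaps, and reach scale_max.

    `ranges` is a list of (low, high) pairs; adjacent bands are contiguous when
    the next low bound is exactly one scoring step above the previous high
    bound (e.g. 0.9 -> 1.0). Float comparisons use a small tolerance.
    """
    ranges = sorted(ranges)
    if abs(ranges[0][0]) > 1e-9:
        return False  # first band must start at 0
    for (_, hi), (lo, _) in zip(ranges, ranges[1:]):
        if abs(lo - (hi + step)) > 1e-9:
            return False  # gap or overlap between adjacent bands
    return abs(ranges[-1][1] - scale_max) < 1e-9  # must cover the full scale
```

The FSMCA bands quoted in Section 4.8 (0.0–0.9 up through 3.0–4.0) satisfy this rule despite their inverted severity direction, since the check is purely about coverage, not labels.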
3.2 Synthetic Patient Profiles
Seven synthetic patients were constructed with clinically coherent response patterns to test scoring and phenotype classification under realistic conditions:
1. Patient A: Moderate stress-driven emotional eater with poor sleep (SEIM, SACA)
2. Patient B: Depression-driven with low food skills and low readiness (DOMM, FSMCA, CRSEM)
3. Patient C: Compulsive eater with severe LOC and BED features (LOCEA, CEFRA, EBCA)
4. Patient D: High nutrition knowledge / perfect CANLA score (CANLA)
5. Patient E: MDOA mood-driven phenotype triggering downstream assessments (MDOA)
6. Patient F: Severe mechanical barriers and fatigue-driven inactivity (MCAA)
7. Patient G: Apnea red flags requiring clinician escalation (SACA)
4. Instrument-by-Instrument Results
4.1 MDOA — Mental Drivers of Obesity Assessment
Structure
64 items across 8 sections: 6 Likert-scored sections (A–F, 8 items each), 1 categorical pattern section (G, 10 items), and 1 binary risk-flag section (H, 6 items). This is the master intake instrument and the source of all downstream trigger rules. The MDOA was developed to the highest precision by the medical director and serves as the structural template for all other definitions.
Scoring
Global score = mean of items 1–48 (Likert sections only). Six domain means: mood, reward, satiety, LOC, executive, shame. No reverse-scored items.
Phenotypes
8 phenotype rules including mood-driven, reward-dominant, satiety/glycemic, LOC/compulsive, executive/chaos, shame/rigidity, mixed high-complexity, and high-complexity.
Triggers
6 trigger rules fire downstream modules: mood ≥ 2.0 triggers SEIM and DOMM; LOC ≥ 2.0 triggers LOCEA; reward ≥ 2.0 triggers CEFRA; executive ≥ 2.5 triggers NRAF-EF; shame ≥ 2.5 triggers BIWSSA.
Test Results
All structural, scoring, phenotype, and trigger tests pass. Synthetic Patient E (mood=3.0) correctly classifies as mood_driven phenotype and fires SEIM, DOMM, and CEFRA triggers.
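The MDOA trigger rules above reduce to threshold comparisons over domain means. A minimal sketch, using the thresholds quoted in this section (the (domain, threshold, target) tuple encoding is illustrative, not the JSON schema):

```python
# Thresholds taken from the MDOA trigger rules described above.
MDOA_TRIGGERS = [
    ("mood", 2.0, "SEIM"), ("mood", 2.0, "DOMM"),
    ("reward", 2.0, "CEFRA"), ("loc", 2.0, "LOCEA"),
    ("executive", 2.5, "NRAF-EF"), ("shame", 2.5, "BIWSSA"),
]

def fire_triggers(domain_scores, rules):
    """Return the sorted set of downstream modules whose thresholds are met."""
    return sorted({target for domain, threshold, target in rules
                   if domain_scores.get(domain, 0.0) >= threshold})
```

Run against Patient E's profile from Section 5.1 (mood=3.0, reward=2.0, loc=1.0, executive=1.0, shame=2.0), this fires SEIM, DOMM, and CEFRA and correctly leaves LOCEA, NRAF-EF, and BIWSSA unfired.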
4.2 SEIM — Stress-Eating Integration Module
32 items total: 28 Likert (4 sections × 7 items) + 4 categorical context items. All item text transcribed verbatim from the Coaching Guidelines (lines 3193–3510). Short-form sentinel set uses the exact 8 items specified in the source (items 2, 5, 10, 12, 17, 19, 22, 25). MI rewordings are composed (the source specified the sentinel item numbers but did not provide MI rewordings for SEIM). No reverse-scored items per source.
Scoring
Four domain means: emotional_trigger, stress_physiology, glycemic_instability, reward_override. Global = mean of items 1–28.
Clinical Flags
Clinically significant when global ≥ 2.0 or 2 domains ≥ 2.5. Dominant target when global ≥ 2.5 or 1 domain ≥ 3.0 with recurrent overeating.
Test Results
All tests pass. Patient A correctly scores emotional_trigger=3.0 as highest domain, classifies as emotional_trigger_dominant phenotype. Trigger chain from SEIM stress_physiology=2.0 correctly fires SACA.
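The two SEIM flag rules can be sketched directly from the thresholds above. The keyword `recurrent_overeating` is an illustrative stand-in for however that contextual item is encoded in the definition:

```python
def seim_flags(global_score, domains, recurrent_overeating=False):
    """Evaluate the two SEIM clinical-flag rules described above.

    Returns (clinically_significant, dominant_target).
    """
    elevated = sum(1 for v in domains.values() if v >= 2.5)
    significant = global_score >= 2.0 or elevated >= 2
    dominant = global_score >= 2.5 or (
        any(v >= 3.0 for v in domains.values()) and recurrent_overeating)
    return significant, dominant
```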
4.3 DOMM — Depression-Obesity Mechanism Module
28 items total: 24 Likert (4 sections × 6 items) + 4 categorical context items. All item text transcribed verbatim from the Coaching Guidelines (lines 3746–4040). Short-form sentinel set uses the exact 8 items specified in the source (items 3, 5, 9, 10, 14, 16, 20, 24). MI rewordings are composed (same situation — source specified sentinel numbers only). No reverse-scored items per source. PHQ-9 integration triage logic is included in the scoring block.
Scoring
Four domain means: eating_impact, satiety_disruption, hedonic_compensation, adherence_impairment. Global = mean of items 1–24.
PHQ-9 Integration
Three triage paths: PHQ-9 elevated + DOMM low; PHQ-9 elevated + DOMM elevated; PHQ-9 modest + DOMM high (subthreshold depression clinically important).
Test Results
All tests pass. Patient B correctly scores adherence_impairment=3.0 as highest/severe, classifies as adherence_impairment phenotype. PHQ-9 integration triage correctly resolves all three paths.
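The three-way triage can be sketched as a pair of boolean cuts. The cutoff values and return labels below are illustrative assumptions (PHQ9_CUT = 10 follows common PHQ-9 usage); the actual values and naming live in the DOMM scoring block:

```python
PHQ9_CUT, DOMM_CUT = 10, 2.0  # illustrative cutoffs, not the deployed values

def domm_triage(phq9_total, domm_global):
    """Resolve DOMM's three-way PHQ-9 triage described above."""
    phq9_high = phq9_total >= PHQ9_CUT
    domm_high = domm_global >= DOMM_CUT
    if phq9_high and not domm_high:
        return "depression_without_eating_mechanism"   # PHQ-9 elevated, DOMM low
    if phq9_high and domm_high:
        return "depression_driven_eating"              # both elevated
    if domm_high:
        return "subthreshold_depression_eating_impact" # PHQ-9 modest, DOMM high
    return "no_triage_flag"
```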
4.4 LOCEA — Loss of Control Eating Assessment
30 Likert items across 6 sections (5 items each). All item text transcribed verbatim. One reverse-scored item: locea_25 ("My appetite feels more stable when I eat whole, balanced meals"). Short-form sentinel set uses the exact 8 items specified in the source (1, 3, 5, 9, 14, 19, 23, 30). MI rewordings are composed. Note the response scale uses "Almost always" at the top anchor (per source) rather than "Very often" — this is faithful to the Coaching Guidelines text.

Test Results: All tests pass. Reverse scoring on locea_25 correctly computes satiety_glycemic=2.2 for Patient C. One minor fix applied during testing: added inline reverse_scored:true marker on locea_25 (scoring was already correct, marker was missing for structural consistency).
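On the 0–4 Likert scale used here, reverse scoring maps a raw value r on a flagged item to 4 − r before domain means are computed. A minimal sketch:

```python
def apply_reverse(raw, reverse_ids, scale_max=4):
    """Reverse-score flagged items on a 0..scale_max Likert scale.

    `raw` maps item IDs to raw responses; items listed in `reverse_ids`
    (e.g. the scoring block's reverse_scored_items) become scale_max - r.
    """
    return {item_id: (scale_max - v if item_id in reverse_ids else v)
            for item_id, v in raw.items()}
```

For locea_25, a patient endorsing "My appetite feels more stable when I eat whole, balanced meals" at 1 contributes 3 to the satiety domain, reflecting instability.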
4.5 EBCA — Eating Behavior Clinical Assessment
17 items across 3 sections. Fundamentally different structure from the Likert instruments: Section A uses graded severity selection (A=0 through D=3, sum-scored to /30), Section B uses compensatory behavior frequency (flag-based), Section C uses restriction severity screening (flag-based). No short form (the source did not specify one, and the instrument is already brief at 17 items). All three phenotype rules carry escalation:true — these are eating disorder screening flags that route to clinician review. All item text transcribed verbatim.
Test Results: All tests pass. Sum scoring validated at 30/30 for all-severe responses. Probable BED classification correctly fires when score ≥ 19 + LOC features + distress. Bulimia and restrictive pattern screen flags validated structurally.
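The probable-BED rule is a conjunction of the Section A sum cutoff and the two feature flags. A minimal sketch of the classification step (return labels are illustrative):

```python
def ebca_classify(section_a_sum, loc_features, distress):
    """Apply the probable-BED rule described above (Section A sum /30)."""
    if section_a_sum >= 19 and loc_features and distress:
        return "probable_BED"  # escalation: routes to clinician review
    return "subthreshold"
```

Note that all three conditions must hold: a score of 22 without distress, or a score of 18 with both flags, stays subthreshold.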
4.6 CEFRA — Compulsive Eating and Food Reward Assessment
48 items across 8 sections. The most complex instrument: 32 Likert items (Sections A–F), 10 trigger-food mapping items (Section G with its own severity scale), and 6 clinical pattern items (Section H — 3 binary flags + 3 categorical). Core scoring uses items 1–32 only. Trigger Load Score computed separately from items 33–42. Addictive-pattern classification uses a multi-domain flag system per source. Short-form sentinel set uses the exact 8 items specified (1, 4, 8, 12, 14, 19, 23, 29). MI rewordings are composed. No reverse-scored items per source.

Test Results: All tests pass. Patient C correctly scores global=3.0 (severe) with all 6 core domains flagged. One minor fix applied during testing: changed Section G scoring_method from "mean" to "trigger_mean" to clearly distinguish trigger_load from the 6 core symptom domains in the scoring engine.
4.7 CANLA — Comprehensive Applied Nutrition Literacy Assessment
60 multiple-choice items across 6 domains (10 items each). instrument_type: "knowledge" with scoring_method: "knowledge" — fundamentally different from the Likert instruments. Every item has a correct_answer field. Scored as sum of correct answers (/60) with three-tier thresholds (Low 0–20, Moderate 21–40, High 41–60). web_form_enabled: true, conversational_enabled: false per the decision that CANLA is web-form only. No short form, no phenotype rules (this is a knowledge test, not a clinical profiler). All item text and answer options transcribed verbatim. Correct answers are derived — the source did not provide an explicit answer key, but every item has one unambiguously correct option.
Test Results
All tests pass. Patient D correctly scores 60/60 with perfect domain scores (10/10 across all 6 domains).
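The knowledge scoring path can be sketched as a straight sum of correct answers with the three-tier bands above (the response/answer-key dict layout is illustrative):

```python
def canla_score(responses, answer_key):
    """Score CANLA as a sum of correct answers (/60) with three-tier bands.

    Bands: Low 0-20, Moderate 21-40, High 41-60, as specified above.
    """
    total = sum(1 for item, ans in responses.items()
                if answer_key.get(item) == ans)
    tier = "Low" if total <= 20 else "Moderate" if total <= 40 else "High"
    return total, tier
```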
4.8 FSMCA — Food Skills and Meal Competency Assessment
52 items total: 48 Likert across 6 sections (8 items each) + 4 categorical pattern items. 9 reverse-scored items explicitly listed in the source (items 5, 13, 23, 29, 33, 34, 35, 37, 39) — each is marked both in the item definition (reverse_scored: true) and in the scoring block. The FSMCA scoring direction is inverted compared to the psychology instruments: higher scores = greater competence (strength), lower scores = impairment. The severity thresholds reflect this (3.0–4.0 = "Strong", 0.0–0.9 = "Severe impairment"). Short form uses the 12-item sentinel set specified in the source (items 1, 3, 7, 10, 13, 20, 25, 30, 33, 36, 42, 47) — not 8 items, because the source specified 12 for this instrument. MI rewordings are composed. All item text transcribed verbatim.

Test Results: All tests pass. Patient B correctly scores global=1.0 (limited) with all domains at 1.0 after reverse scoring. Inverted threshold structure validated: "strong" at 3.0–4.0, "severe_impairment" at 0.0–0.9.
4.9 SACA — Sleep-Appetite-Circadian Assessment
36 items across 6 sections (6 items each). All item text transcribed verbatim. Two response scales: Sections A–E use the standard likert_0_4 frequency scale, while Section F (Sleep Apnea Risk Screen) uses a differentiated apnea_severity_0_4 scale (No / Unsure-occasional / Mild / Frequent / Severe) as specified in the source. Global score excludes the apnea section (mean of items 1–30 only). Two clinical flags for apnea escalation: any single item ≥ 3 = high suspicion, domain mean ≥ 2.0 = screen further (STOP-BANG/sleep study). No reverse-scored items. Short form is composed (source specifies short form use at onboarding/follow-up but not specific items) — 6 sentinel items, one per domain. MI rewordings composed.
Test Results
All tests pass. Patient A correctly scores fatigue_eating=3.0 as highest domain. Patient G correctly triggers apnea escalation with domain=3.0 and multiple individual items ≥ 3.
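The two apnea escalation rules operate only on the Section F items. A minimal sketch, assuming the six apnea responses arrive as a list of 0–4 values:

```python
def apnea_flags(apnea_items):
    """Evaluate the two SACA apnea escalation rules described above."""
    mean = sum(apnea_items) / len(apnea_items)
    return {
        "high_suspicion": any(v >= 3 for v in apnea_items),  # any single item >= 3
        "screen_further": mean >= 2.0,  # domain mean >= 2.0: STOP-BANG / sleep study
    }
```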
4.10 MCAA — Movement-Capacity-Adherence Assessment
36 items across 6 sections (6 items each). All item text transcribed verbatim. Mixed scoring direction explicitly documented: Section A (Actual Activity) is a capability scale where higher = better, while Sections B–F are barrier/impairment scales where higher = worse. One reverse-scored item: mcaa_05 ("I avoid activity even when I could do it") within Section A. The scoring_direction field on each section makes this explicit for the scoring engine. Short form is composed — 6 sentinel items, one per domain. MI rewordings composed.

Test Results: All tests pass. Patient F correctly scores mechanical=4.0 (severe), activity=1.0 (low after reverse on item 5). Mechanical limitation phenotype correctly classified.
4.11 CRSEM — Change Readiness and Self-Efficacy Module
30 Likert items across 5 sections (6 items each). All item text transcribed verbatim. 5 reverse-scored items exactly as specified in the source (items 6, 12, 16, 24, 28) — each marked both inline and in the scoring block. The scoring thresholds use readiness-specific labels (Not ready / Low / Moderate / High) rather than the standard severity labels — faithful to source. The coaching_rule field captures the key clinical constraint: "Never give action-heavy coaching to a low-readiness, high-ambivalence patient." Each phenotype includes a coaching_stance field mapping to MI/CBT/ACT modality selection. Short form is composed — 6 items, one per domain plus a second recovery item. MI rewordings composed. Timeframe is "Current state" (not "Past 4 weeks") per the source.
Test Results
All tests pass. Patient B correctly scores confidence=1.0, ambivalence=3.0 after reverse scoring. Coaching stance selection correctly returns MI_exploration with the safety note against directive coaching. High-readiness patient correctly selects performance_coaching.
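The stance selection with its safety rule can be sketched as follows. The cutoff values and the fallback stance name are illustrative assumptions; only the MI_exploration and performance_coaching outcomes are quoted from the test results above:

```python
def select_stance(readiness, ambivalence, confidence):
    """Pick a coaching stance, honoring the low-readiness safety rule.

    Safety rule (from the CRSEM coaching_rule field): never give
    action-heavy coaching to a low-readiness, high-ambivalence patient.
    """
    if readiness < 2.0 and ambivalence >= 2.5:
        return "MI_exploration"  # no directive/action-heavy coaching here
    if readiness >= 3.0 and confidence >= 3.0:
        return "performance_coaching"
    return "skill_building"  # illustrative middle path (e.g. CBT/ACT work)
```

The important property is that the safety branch is checked first, so no score combination can route a low-readiness, high-ambivalence patient to an action-heavy stance.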
5. Cross-Instrument Integration Results
The most clinically important tests validate the trigger chain — the cascading assessment logic where one instrument's scores automatically determine which additional instruments should be administered.
5.1 MDOA Trigger Chain
When the MDOA is scored with mood=3.0, reward=2.0, satiety=2.0, LOC=1.0, executive=1.0, shame=2.0, the following triggers fire correctly:
SEIM triggered (mood ≥ 2.0, reward ≥ 2.0, satiety ≥ 2.0 — three independent trigger conditions)
DOMM triggered (mood ≥ 2.0)
CEFRA triggered (reward ≥ 2.0)
LOCEA NOT triggered (LOC = 1.0, threshold is 2.0) — correct negative
NRAF-EF NOT triggered (executive = 1.0, threshold is 2.5) — correct negative
5.2 Multi-Step Cascade
The test validates a three-step cascade: MDOA → SEIM → SACA. When the MDOA fires the SEIM trigger, the SEIM is scored, and if stress_physiology ≥ 2.0, the SACA is triggered in turn. If the SACA apnea screen is positive (domain ≥ 2.0 or any item ≥ 3), a clinician escalation flag is raised. This full chain was validated end-to-end.
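The cascade can be sketched as a sequential walk with the thresholds stated above (the flat argument list is illustrative; the engine evaluates each step from the scored instruments):

```python
def run_cascade(mdoa_mood, seim_stress_physiology, saca_apnea_items):
    """Walk the MDOA -> SEIM -> SACA cascade described above."""
    chain, escalate = ["MDOA"], False
    if mdoa_mood >= 2.0:                    # MDOA fires the SEIM trigger
        chain.append("SEIM")
        if seim_stress_physiology >= 2.0:   # SEIM fires the SACA trigger
            chain.append("SACA")
            mean = sum(saca_apnea_items) / len(saca_apnea_items)
            # positive apnea screen raises the clinician escalation flag
            escalate = mean >= 2.0 or any(v >= 3 for v in saca_apnea_items)
    return chain, escalate
```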
5.3 Dynamic Prompt Generation
The coaching session check-in system correctly selects MI rewordings prioritized by domain severity. For each instrument, the top 3 most elevated domains drive the session questions. This was validated for SEIM (emotional_trigger leads), DOMM (adherence_impairment leads), CRSEM (ambivalence leads), and FSMCA (distributed across equally-low domains).
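The selection step amounts to ranking domains by severity and taking the top three rewordings. A minimal sketch, assuming rewordings are keyed by domain (the real definitions key them by sentinel item):

```python
def session_prompts(domain_scores, mi_rewordings, top_n=3):
    """Select MI rewordings for the most elevated domains, highest first."""
    ranked = sorted(domain_scores, key=domain_scores.get, reverse=True)[:top_n]
    return [mi_rewordings[d] for d in ranked if d in mi_rewordings]
```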
6. Corrections Applied During Testing
Two minor corrections were applied during testing:
LOCEA (locea_25)
Added inline reverse_scored:true marker on the item object. The scoring block already listed locea_25 in reverse_scored_items and the scoring engine was applying the reverse correctly. The marker was missing only as an inline annotation for structural consistency with how FSMCA and CRSEM mark their reverse-scored items.
CEFRA (Section G)
Changed scoring_method from "mean" to "trigger_mean" for the Trigger Food Mapping section. The trigger_load domain is intentionally scored separately from the 6 core symptom domains (it uses a different scale and feeds into the Trigger Load Score, not the Core Total Mean). The original "mean" value caused the domain alignment test to flag trigger_load as a missing core domain. The fix makes the separation explicit in the schema.
7. Instruments Pending
Eight instruments are pending source document upload. All were delivered as Word document downloads in the original clinical development conversation and their full item text is not present in the Coaching Guidelines transcript. Domain architectures are documented in the transcript and will be used to validate the definitions once item text is available:
TNSDA: Trauma / Nervous System Dysregulation Assessment (6–7 domains, ~36 items)
RRVA: Relapse and Regain Vulnerability Assessment (6 domains, ~36 items)
WTPTHA: Weight Trajectory and Prior Treatment History (6 domains, mixed format with composite indices)
BIWSSA: Body Image, Weight Stigma, and Shame Assessment (6 domains, ~36 items)
SSSHEA: Social Support, Sabotage, and Household Environment Assessment (6 domains, ~36 items)
NRAF-EF: Neurobehavioral Regulation / ADHD-Executive Function (6 domains, ~36 items)
BLPA: Behavioral Learning Profile Assessment (coaching style profiler, non-severity scoring)
MRCA: Meal Rhythm and Chrononutrition Assessment (6 domains, ~30–36 items)
8. Conclusion
All 11 deployed instrument definitions pass comprehensive automated testing with a 100% pass rate across 1,714 tests. The JSON structures correctly support:
Scoring, reverse scoring, phenotype classification, and cross-instrument trigger logic
Clinical flag evaluation, coaching stance selection, PHQ-9 integration triage, and apnea escalation
Dynamic MI-based prompt generation and the full assessment cascade from MDOA through downstream instruments

The definitions are ready for integration with the assessment engine (assessment_engine.py) and the web-form rendering system (assessment.php). Upon receipt of the 8 pending source Word documents, the remaining definitions will be built to the same standard and subjected to the same test suite.