A federated system that finds patients with similar disease trajectories across institutions — without ever centralizing their data. One architecture, validated across two disease domains.
SOMA is a research initiative by Curadai, a company building AI infrastructure for non-profit and healthcare institutions. SOMA is not a product — it is the research foundation: a validated architecture for federated patient similarity that informs the clinical tools Curadai builds. The work is patent-pending, based on real clinical data, and conducted independently of the institutions whose data was used for validation.
Clinicians make better decisions when they can find patients with similar disease trajectories across institutions. But privacy regulations, data fragmentation, and missing temporal context have kept patient similarity networks out of the clinic.
Patient records siloed across institutions with incompatible EHR systems and clinical vocabularies.
HIPAA and GDPR restrict the centralization of protected health information across institutions.
Existing approaches treat patients as frozen in time, losing critical trajectory and progression information.
Complete multi-modal profiles are rare. Most patients have incomplete data across clinical, imaging, and genomic modalities.
Every claim below comes from real patient data across Alzheimer's disease and breast cancer cohorts. We tested seven core capabilities. Some held up. Some didn't. We report both.
The encoder compresses clinical, imaging, genomic, and molecular data into a compact mathematical representation that preserves clinically meaningful disease structure. In oncology, it correctly separates breast cancer molecular subtypes with 90% neighborhood accuracy. In neurology, it distinguishes Alzheimer's from healthy controls and identifies MCI patients likely to convert to dementia (p = 10⁻³⁹). The same architecture works for both — no domain-specific tuning required.
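The fusion idea can be sketched in a few lines. This is a minimal illustration, not SOMA's implementation: the modality names, input dimensions, random linear encoders, and fixed fusion weights are all assumptions (in the real system the encoders and weights are learned from data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality input sizes (illustrative, not SOMA's actual dimensions)
MODALITY_DIMS = {"clinical": 20, "imaging": 128, "genomic": 500}
EMBED_DIM = 64

# One linear encoder per modality, projecting into the shared 64-d space
encoders = {m: rng.normal(scale=0.1, size=(d, EMBED_DIM)) for m, d in MODALITY_DIMS.items()}
# Fusion weights, fixed here for illustration; learned per domain in practice
weights = {"clinical": 0.2, "imaging": 0.3, "genomic": 0.5}

def embed(patient: dict) -> np.ndarray:
    """Fuse a patient's modalities into one unit-norm 64-d vector."""
    parts = [weights[m] * (x @ encoders[m]) for m, x in patient.items()]
    z = np.sum(parts, axis=0)
    return z / np.linalg.norm(z)

patient = {m: rng.normal(size=d) for m, d in MODALITY_DIMS.items()}
z = embed(patient)
print(z.shape, round(float(np.linalg.norm(z)), 6))  # (64,) 1.0
```

Because the fusion weights are per-modality scalars, "learning which data types matter" reduces, in this toy view, to learning those weights per disease domain.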
Adding noise to individual patient records destroys all useful signal, a fundamental limitation we confirmed across every setting. SOMA's workaround is to aggregate population-level statistics across institutions, add noise there, and then use the protected aggregates to guide patient matching. In oncology, this preserves 99.8% of matching quality at strict privacy budgets. The key insight: privacy protection works when the underlying disease signal is strong.
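The aggregate-then-noise step can be sketched with the standard Gaussian mechanism. Everything below is illustrative, not SOMA's code: the sensitivity is a simplified per-patient bound for unit-norm embeddings, and the epsilon, delta, and cohort size are arbitrary choices. The point is that noise scales with 1/n, so a 500-patient aggregate barely moves.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_centroid(embeddings: np.ndarray, epsilon: float, delta: float = 1e-5) -> np.ndarray:
    """Release a noisy population-level mean of unit-norm embeddings.

    Simplified sensitivity bound: each patient contributes one unit-norm
    row, so the mean shifts by roughly 1/n when one record changes.
    Gaussian-mechanism scale: sigma = sqrt(2 ln(1.25/delta)) * s / epsilon.
    """
    n = len(embeddings)
    sensitivity = 1.0 / n
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    return embeddings.mean(axis=0) + rng.normal(scale=sigma, size=embeddings.shape[1])

# 500 patients: noise is added to the aggregate, never to individual records
emb = rng.normal(size=(500, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
noisy = dp_centroid(emb, epsilon=1.0)
err = float(np.linalg.norm(noisy - emb.mean(axis=0)))
print(round(err, 4))
```

At epsilon = 1 the perturbation of the 500-patient centroid is on the order of 0.08, tiny relative to typical inter-cluster distances on the unit sphere; the same noise applied per patient (n = 1) would swamp the signal, which is the limitation the text describes.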
SOMA matches patients by where their disease is heading, not just where it is now. Trajectory-aware twin retrieval correlates with future cognitive decline at r = 0.473 (p < 0.0001), more than double the correlation achieved by snapshot-based matching (r = 0.198). In oncology, aggressive breast cancer subtypes show 3–6× faster embedding velocity than indolent ones. The core mechanism works; what remains is isolating velocity from position in slowly progressing early-stage disease.
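A toy version of trajectory-aware matching, our own illustration rather than SOMA's retrieval code: represent each patient by current embedding position plus a finite-difference velocity across visits, and match on the concatenation. Snapshot matching would use position alone.

```python
import numpy as np

def trajectory_features(visits: np.ndarray) -> np.ndarray:
    """Concatenate current position and a finite-difference velocity.

    visits: (t, d) array of one patient's embeddings over successive visits.
    """
    position = visits[-1]
    velocity = visits[-1] - visits[-2]      # simplest estimate: the last step
    return np.concatenate([position, velocity])

def nearest_twin(query: np.ndarray, others: list[np.ndarray]) -> int:
    feats = trajectory_features(query)
    dists = [np.linalg.norm(feats - trajectory_features(o)) for o in others]
    return int(np.argmin(dists))

# Both candidates sit at the same position today; only velocity separates them.
query = np.array([[0.0, 0, 0, 0], [0.5, 0, 0, 0]])   # moving fast
slow  = np.array([[0.45, 0, 0, 0], [0.5, 0, 0, 0]])  # same position, barely moving
fast  = np.array([[0.0, 0, 0, 0], [0.5, 0, 0, 0]])   # same position, moving fast
print(nearest_twin(query, [slow, fast]))  # 1
```

Snapshot matching sees the two candidates as identical; the velocity term picks the one whose disease is heading the same way, which is exactly the distinction the early-stage caveat above is about.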
The textbook approach to combining data types — building separate similarity networks and averaging them — consistently made results worse. It dilutes strong signals with uninformative ones. SOMA's learned encoder solves this differently: it figures out which data types matter for each disease and weights them accordingly. In oncology, gene expression dominates; in neurology, clinical scores do. The architecture adapts without being told.
Real patients rarely have complete records. SOMA handles this gracefully: with half the data types missing, embedding quality stays above 96% in neurology. Trying to reconstruct missing data from other sources doesn't work — the information simply isn't there. But the encoder doesn't need complete data to produce useful patient representations.
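One way to express the no-imputation stance, shown purely as a sketch with made-up modality names: fuse only the modality embeddings a patient actually has and renormalize, so partial profiles stay on the unit sphere instead of being filled in with reconstructed values.

```python
import numpy as np

rng = np.random.default_rng(2)
EMBED_DIM = 64

def fuse(modal_embeddings: dict) -> np.ndarray:
    """Fuse only the modalities a patient actually has.

    Missing modalities are simply absent from the dict; nothing is imputed.
    Renormalizing keeps partial profiles comparable on the unit sphere.
    """
    z = np.sum(list(modal_embeddings.values()), axis=0)
    return z / np.linalg.norm(z)

full = {m: rng.normal(size=EMBED_DIM) for m in ("clinical", "imaging", "genomic", "csf")}
partial = {m: full[m] for m in ("clinical", "imaging")}   # half the modalities missing

cos = float(fuse(full) @ fuse(partial))
print(round(float(np.linalg.norm(fuse(partial))), 6))  # 1.0
```

The partial embedding remains a valid unit vector and stays positively aligned with the full one, which is the graceful-degradation behavior described above, in miniature.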
A model trained on one research cohort transfers to entirely different patient populations. An encoder trained on 981 cancer patients was applied to 15,565 patients from 100+ institutions with no retraining — and produced structured, meaningful patient groupings. This validates the deployment model: train once on a curated reference, deploy across the network.
When a new hospital joins the network, does the model forget what it learned from previous sites? In practice, forgetting is mild — well under 4% in realistic scenarios. Strong disease signals are naturally resistant to forgetting. Weaker signals benefit from targeted regularization that reduces forgetting 6×. SOMA detects signal strength and applies protection only where needed.
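The targeted-protection idea can be sketched as an EWC-style quadratic anchor penalty. This is a generic illustration, not SOMA's mechanism: the importance vector, learning rate, and penalty strength are invented values. Parameters flagged as important to earlier sites are pulled back toward their anchored values; unprotected ones follow the new site freely.

```python
import numpy as np

def regularized_update(weights, grad, anchor, importance, lr=0.1, lam=1.0):
    """One gradient step with an anchor penalty that resists forgetting.

    Adds lam * importance * (weights - anchor) to the task gradient, so
    parameters that mattered at earlier sites are pulled back toward the
    values learned there (an EWC-style quadratic penalty).
    """
    total_grad = grad + lam * importance * (weights - anchor)
    return weights - lr * total_grad

anchor = np.ones(3)                      # parameters after training on earlier sites
weights = anchor.copy()
grad = np.array([1.0, 1.0, 1.0])         # the new site pushes all parameters equally
importance = np.array([10.0, 0.0, 0.0])  # only the first parameter mattered before

for _ in range(100):
    weights = regularized_update(weights, grad, anchor, importance)

drift = np.abs(weights - anchor)
print(drift[0] < drift[1])  # protected parameter drifts far less: True
```

"Applying protection only where needed" corresponds here to setting the importance entries: near zero for robustly learned signals, large for fragile ones.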
Running the same architecture on neurology and oncology data exposed principles that no single-domain experiment could reveal.
Privacy, fusion, imputation — every downstream capability is bounded by how well the encoder captures disease structure in the first place. When embeddings are strong, privacy-preserving similarity works nearly losslessly. When they're weak, no amount of clever post-processing can compensate. Get the embedding right first — everything else follows.
In neurology, clinical scores drive patient similarity. In oncology, gene expression dominates and clinical features are nearly irrelevant. We didn't change the architecture between domains — the same encoder learned which data types carry signal and which are noise. A domain-agnostic design that adapts to whatever it sees.
Combining similarity networks with equal weight consistently made results worse — it dilutes strong signals with uninformative ones. SOMA's learned encoder produces high-quality patient representations on data where naive fusion produces random noise. The encoder learns to weight data types; mechanical averaging cannot.
Reconstructing one data type from another doesn't work — the information simply isn't there. But the encoder handles incomplete records gracefully: even with half the data types missing, quality stays above 96%. The practical takeaway: invest in data collection, not imputation algorithms.
When a new hospital joins the network, does the model forget earlier patients? It depends on signal strength. Strong disease signals are naturally robust to sequential training. Weaker ones benefit from targeted protection. SOMA detects this automatically and applies regularization only where it's needed.
The most important thing about SOMA is what the data becomes. Records go in. A 64-number vector comes out. That vector is the only thing that ever leaves the institution.
Raw patient records (clinical scores, MRI volumes, mutations, gene expression) are compressed through per-modality encoders into a shared 64-dimensional vector on the unit hypersphere. The shape of the data changes at every stage. By the end, a patient is a point on a sphere — and similar patients are nearby points.
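A small sketch of what that final artifact looks like, under the stated assumptions (64 dimensions, unit norm); the cohort size here is invented. The embedding is a few hundred bytes, and on the unit hypersphere cosine similarity reduces to a plain dot product, so cross-institution matching is nearest-neighbor search over dots.

```python
import numpy as np

rng = np.random.default_rng(3)

# The only artifact that leaves a hospital: a 64-float unit vector.
record_embedding = rng.normal(size=64)
record_embedding /= np.linalg.norm(record_embedding)

# On the unit hypersphere, cosine similarity is just a dot product,
# so matching against other sites' vectors is an argmax over dots.
other_sites = rng.normal(size=(1000, 64))
other_sites /= np.linalg.norm(other_sites, axis=1, keepdims=True)

sims = other_sites @ record_embedding
best = int(np.argmax(sims))
print(record_embedding.nbytes)  # 512 bytes at float64 (256 at float32)
```

Nothing in the vector is a field from the record; it is a learned coordinate, and proximity on the sphere is the only information it carries.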
Hospital nodes embed patients locally. The central brain matches trajectories across institutions. New hospitals onboard through the Blind Handshake protocol without exposing patient data.
Each institution embeds patients locally using clinical, imaging, genomic, and molecular data. Only compact mathematical representations leave the building — never patient records.
Matches patients by where their disease is heading, not just where it is now. Learns which data types matter for each disease domain automatically. Works across neurology and oncology with the same core architecture.
New institutions onboard through a four-stage protocol that calibrates their embedding space without exposing any patient data. To our knowledge, no existing work models federated onboarding as an adversarial setting with model-extraction mitigations.
Adding noise to individual records destroys useful signal. SOMA works around this by aggregating at the population level first, where privacy protection is nearly lossless. Validated at strict privacy budgets.
Validated on established research cohorts spanning Alzheimer's disease and breast cancer, across institutions on three continents.
3,692 subjects from the Alzheimer's Disease Neuroimaging Initiative. Five data types: cognitive assessments, brain MRI, genetics, spinal fluid biomarkers, and PET imaging. 76 acquisition sites across North America.
640 subjects from the Rush Memory and Aging Project. Brain tissue gene expression with cell-type atlas integration. Confirms neurology findings generalize beyond the primary cohort.
981 patients from The Cancer Genome Atlas. Three data types: clinical staging, somatic mutations, and gene expression. Molecular subtype classification and treatment stratification.
15,565 patients from AACR Project GENIE across 100+ contributing institutions worldwide. Cross-cohort transfer validation — a model trained on one cohort applied to an entirely different patient population with no retraining.
The core pipeline is validated across two disease domains. Next: a third domain, real multi-site deployment, and clinical partner pilots.
Extending validation to cardiovascular disease using ECG waveforms, cardiac cell atlases, and longitudinal clinical data. Testing whether the domain-agnostic claim holds for a third disease area.
Moving from research cohort validation to real multi-site deployment with clinical partners. Testing the Blind Handshake onboarding protocol with genuine institutional boundaries.
Detecting when patient populations shift over time and calibrating the system accordingly. Ensuring trajectory matching stays accurate as the network grows.
SOMA is the research foundation of Curadai — a company building AI infrastructure for non-profit and healthcare institutions. The architectural primitives validated here inform the clinical tools we build. SOMA itself is a research project, not a product; but the problems it solves are the ones our products address.