A federated system that finds patients with similar disease trajectories across institutions — without ever centralizing their data. One architecture, validated across two disease domains.
SOMA is a research initiative by Curadai, a company building AI infrastructure for non-profit and healthcare institutions. SOMA is not a product — it is the research foundation: a validated architecture for federated patient similarity that informs the clinical tools Curadai builds. The work is patent-pending, based on real clinical data, and conducted independently of the institutions whose data was used for validation.
Clinicians make better decisions when they can find patients with similar disease trajectories across institutions. But privacy regulations, data fragmentation, and missing temporal context have kept patient similarity networks out of the clinic.
Patient records siloed across institutions with incompatible EHR systems and clinical vocabularies.
HIPAA and GDPR tightly restrict sharing protected health information across institutions, making centralized datasets legally untenable.
Existing approaches treat patients as frozen in time, losing critical trajectory and progression information.
Complete multi-modal profiles are rare. Most patients have incomplete data across clinical, imaging, and genomic modalities.
Every claim below comes from real patient data across Alzheimer's disease and breast cancer cohorts. We tested seven core capabilities. All thresholds are anchored to published clinician inter-rater agreement (κ), not to arbitrary statistical conventions.
The encoder compresses clinical, imaging, genomic, and molecular data into a compact vector that preserves clinically meaningful disease structure. In oncology, PAM50 molecular subtypes achieve C@10 = 0.999 and AJCC staging reaches C@10 = 0.898. In neurology, CN vs Dementia separation hits C@10 = 0.849, and three-way classification (CN/MCI/Dementia) reaches C@10 = 0.576. The same architecture works for both — no domain-specific tuning required.
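The C@10 numbers above can be read as neighborhood concordance: the fraction of a patient's ten nearest embedding-space neighbors that share the patient's label. A minimal sketch of how such a metric could be computed (the function name and exact definition are our assumptions; the text does not spell the metric out):

```python
import numpy as np

def concordance_at_k(embeddings, labels, k=10):
    """Mean fraction of each point's k nearest neighbors
    (cosine similarity on unit-normalized vectors) sharing its label."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbors
    agree = labels[topk] == labels[:, None]
    return agree.mean()

# toy check: two well-separated clusters give concordance of 1.0
rng = np.random.default_rng(0)
a = rng.normal(0, 0.1, (20, 8)) + 5
b = rng.normal(0, 0.1, (20, 8)) - 5
X = np.vstack([a, b])
y = np.array([0] * 20 + [1] * 20)
print(concordance_at_k(X, y, k=10))  # -> 1.0
```

A C@10 of 0.999 for PAM50 subtypes means almost every patient's ten nearest neighbors share their molecular subtype.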
Adding noise to individual patient records destroys all useful signal — a fundamental limitation we confirmed across every setting. SOMA's two-tier workaround: aggregate population-level centroids across institutions, add differential privacy noise there, then use the protected aggregates to guide federated kNN matching. At a strict privacy budget of ε = 2, this retains 99.5–100% of matching quality across both domains. Privacy protection works when the underlying disease signal is strong.
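The two-tier scheme can be sketched as follows. This is an illustrative reading, assuming L2-clipped records, per-class centroids, and the standard Gaussian mechanism for the noise; the actual sensitivity analysis and mechanism are not specified here:

```python
import numpy as np

def private_centroids(site_embeddings, site_labels, n_classes,
                      eps=2.0, delta=1e-5, clip=1.0):
    """Tier 1: sites contribute per-class sums of L2-clipped embeddings.
    Tier 2: Gaussian-mechanism noise calibrated to (eps, delta) is added
    to the cross-site aggregate before it is shared. Record-level noise
    is never used -- that is the regime reported as signal-destroying."""
    d = site_embeddings[0].shape[1]
    sums = np.zeros((n_classes, d))
    counts = np.zeros(n_classes)
    for X, y in zip(site_embeddings, site_labels):
        norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
        Xc = X * np.minimum(1.0, clip / norms)   # bound each record's influence
        for c in range(n_classes):
            sums[c] += Xc[y == c].sum(axis=0)
            counts[c] += int((y == c).sum())
    sigma = clip * np.sqrt(2.0 * np.log(1.25 / delta)) / eps  # Gaussian mechanism
    noisy = sums + np.random.default_rng(0).normal(0.0, sigma, sums.shape)
    return noisy / np.maximum(counts[:, None], 1.0)
```

Because the noise is divided over many patients per class, the protected centroids stay close to the true ones, which could then anchor federated kNN matching (the exact matching flow is our assumption).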
SOMA matches patients by where their disease is heading, not just where it is now. Trajectory-aware twin retrieval identifies MCI converters with statistical significance (p = 0.012). In oncology, aggressive breast cancer subtypes show measurably faster embedding velocity than indolent ones. The core mechanism works — matching by trajectory, not just current state.
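One simple way to make retrieval trajectory-aware, consistent with the embedding-velocity framing above, is to append a finite-difference velocity to the current position. The construction below is a hypothetical sketch, not SOMA's exact feature:

```python
import numpy as np

def trajectory_feature(visits, times):
    """Concatenate current position with a finite-difference embedding
    velocity, so similarity search sees direction and speed of progression,
    not only the present state. visits: (T, d) embeddings; times: (T,)."""
    position = visits[-1]
    velocity = (visits[-1] - visits[0]) / (times[-1] - times[0])
    return np.concatenate([position, velocity])

# a fast progressor and a stable patient can share a current position
# yet separate cleanly in the velocity half of the feature
fast = trajectory_feature(np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([0.0, 1.0]))
stable = trajectory_feature(np.array([[0.9, 0.9], [1.0, 1.0]]), np.array([0.0, 1.0]))
print(np.linalg.norm(fast[2:]) > np.linalg.norm(stable[2:]))  # -> True
```

Matching on this concatenated feature is what lets aggressive subtypes (fast velocity) separate from indolent ones even when their current embeddings coincide.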
The textbook approach to combining data types — building separate similarity networks and averaging them — consistently made results worse. SOMA's learned gated fusion solves this: it figures out which data types matter for each disease and weights them accordingly. The result: +21% improvement in neurology and +79% in oncology over naive fusion. Gene expression dominates in oncology; clinical scores dominate in neurology. The architecture adapts without being told.
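The contrast between equal-weight averaging and gated fusion shows up even in a toy example. Here the gate weights are fixed by hand to stand in for a learned gating network (an assumption for illustration):

```python
import numpy as np

def fuse(modality_embs, gate_logits):
    """Softmax-gated weighted sum of per-modality embeddings,
    renormalized onto the unit sphere."""
    w = np.exp(gate_logits - gate_logits.max())
    w = w / w.sum()
    fused = sum(wi * e for wi, e in zip(w, modality_embs))
    return fused / np.linalg.norm(fused)

signal = np.array([1.0, 0.0, 0.0, 0.0])  # informative modality (e.g. expression)
noise = np.array([0.0, 1.0, 0.0, 0.0])   # uninformative, orthogonal modality

naive = fuse([signal, noise], np.array([0.0, 0.0]))   # equal-weight averaging
gated = fuse([signal, noise], np.array([3.0, -3.0]))  # gate favors the signal
print(signal @ gated > signal @ naive)  # -> True
```

Equal weighting pulls the fused vector halfway toward the uninformative direction; the gate keeps it aligned with the modality that carries disease signal, which is the mechanism behind the +21% / +79% gains over naive fusion.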
Real patients rarely have complete records. SOMA handles this gracefully: with 30% of modalities missing, degradation is just 4.1% in neurology and 8.2% in oncology. Zero-fill combined with gated attention means the encoder doesn't need complete data to produce useful patient representations. Trying to reconstruct missing data from other sources doesn't work — the information simply isn't there. But the encoder handles it anyway.
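Zero-fill plus gating can be sketched as a mask that removes absent modalities and renormalizes the gate over the ones that remain (the function name and exact renormalization are our assumptions):

```python
import numpy as np

def fuse_with_missing(modality_embs, gate_logits, present):
    """Zero-fill absent modalities and renormalize the gate over the
    modalities actually present, so missing inputs contribute nothing
    instead of dragging the fused vector toward zero."""
    present = np.asarray(present, dtype=float)
    w = np.exp(gate_logits) * present
    w = w / w.sum()                          # gate mass only on present modalities
    stacked = np.stack(modality_embs) * present[:, None]  # zero-fill
    fused = (w[:, None] * stacked).sum(axis=0)
    return fused / np.linalg.norm(fused)

clin = np.array([1.0, 0.0])
mri = np.array([0.0, 1.0])
# with MRI missing, the fused vector falls back to the clinical direction
out = fuse_with_missing([clin, mri], np.array([0.0, 0.0]), [1, 0])
print(out)  # -> [1. 0.]
```

No reconstruction is attempted: the representation simply leans on whatever modalities exist, which is why degradation stays in the single digits at 30% missingness.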
A model trained on one research cohort transfers to entirely different patient populations. TCGA→GENIE zero-shot transfer achieves C@10 = 0.949: a model trained on 985 patients, applied to 15,565 patients across 100+ institutions with no retraining. In neurology, ADNI→ROSMAP zero-shot transfer reaches C@10 = 0.653. This validates the deployment model: train once on a curated reference, deploy across the network.
When a new hospital joins the network, does the model forget what it learned from previous sites? In neurology, EWC regularization holds forgetting to just 6.1% — strong disease signals are naturally resistant. In oncology, forgetting reaches 22.1% — weaker cross-domain signals need more protection. SOMA detects signal strength and applies regularization where needed, but oncology continual learning remains an active area of improvement.
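EWC-style regularization adds a quadratic penalty that anchors the parameters that mattered for earlier sites. A minimal sketch, with a hypothetical signal-strength-adaptive weighting standing in for SOMA's detection logic (both the policy and all numbers below are illustrative):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam):
    """Elastic Weight Consolidation: a quadratic penalty that anchors
    parameters with high Fisher information (important to earlier sites)
    near their previous values, leaving unimportant ones free to adapt."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

def adaptive_lambda(signal_strength, lam_max=100.0):
    """Hypothetical policy: strong-signal domains (naturally resistant to
    forgetting, like the neurology case) get light regularization; weak-
    signal domains (like the oncology case) get heavier protection."""
    return lam_max * (1.0 - signal_strength)

theta_star = np.array([1.0, -2.0])  # weights after training on earlier sites
fisher = np.array([5.0, 0.1])       # first weight mattered most back then
theta = np.array([1.5, 0.0])        # weights drifting on the new site
print(round(ewc_penalty(theta, theta_star, fisher, adaptive_lambda(0.8)), 6))  # -> 16.5
```

The high-Fisher parameter is pulled back toward its old value while the low-Fisher one moves freely, which is how sequential onboarding avoids erasing earlier sites.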
Running the same architecture on neurology and oncology data exposed principles that no single-domain experiment could reveal.
Privacy, fusion, imputation — every downstream capability is bounded by how well the encoder captures disease structure in the first place. When embeddings are strong, privacy-preserving similarity works nearly losslessly. When they're weak, no amount of clever post-processing can compensate. Get the embedding right first — everything else follows.
In neurology, clinical scores drive patient similarity. In oncology, gene expression dominates and clinical features are nearly irrelevant. We didn't change the architecture between domains — the same encoder learned which data types carry signal and which are noise. A domain-agnostic design that adapts to whatever it sees.
Combining similarity networks with equal weight consistently made results worse — it dilutes strong signals with uninformative ones. SOMA's learned encoder produces high-quality patient representations on data where naive fusion produces random noise. The encoder learns to weight data types; mechanical averaging cannot.
Reconstructing one data type from another doesn't work — the information simply isn't there. But the encoder handles incomplete records gracefully: even with half the data types missing, quality stays above 96%. The practical takeaway: invest in data collection, not imputation algorithms.
When a new hospital joins the network, does the model forget earlier patients? It depends on signal strength. Strong disease signals are naturally robust to sequential training. Weaker ones benefit from targeted protection. SOMA detects this automatically and applies regularization only where it's needed.
The most important thing about SOMA is what the data becomes. Records go in. A compact vector (64 numbers in neurology, 128 in oncology) comes out. That vector is the only thing that ever leaves the institution.
Raw patient records (clinical scores, MRI volumes, mutations, gene expression) are compressed through per-modality encoders into a shared vector on the unit hypersphere — d=64 for neurology, d=128 for oncology. Differential privacy (ε=2) is applied to population-level aggregates at egress, and all transport is secured with mTLS. By the end, a patient is a point on a sphere — and similar patients are nearby points.
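The geometric endpoint is simple enough to show directly: after normalization a patient is a unit vector, and similarity is a dot product. A minimal sketch (the dimensions come from the text; everything else is illustrative):

```python
import numpy as np

def to_hypersphere(v):
    """L2-normalize so every patient is a point on the unit sphere;
    cosine similarity between patients then reduces to a dot product."""
    return v / np.linalg.norm(v)

d = 64                                   # neurology embedding width (oncology: 128)
rng = np.random.default_rng(0)
p1 = to_hypersphere(rng.normal(size=d))
p2 = to_hypersphere(rng.normal(size=d))
similarity = p1 @ p2                     # in [-1, 1]; nearby points = similar patients
print(round(float(np.linalg.norm(p1)), 6))  # -> 1.0
```

This is also why the egress surface is so small: a single d-dimensional unit vector per patient, never the underlying records.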
Hospital dendrite nodes embed patients locally. The Soma Core matches trajectories across institutions. New hospitals onboard through the Blind Handshake protocol without exposing patient data.
Each institution embeds patients locally using clinical, imaging, genomic, and molecular data. Only compact mathematical representations leave the building — never patient records. Named after the biological dendrites that feed signals to the cell body.
Matches patients by where their disease is heading, not just where it is now. Performs fusion, matching, and continual learning. Works across neurology and oncology with the same core architecture. Named after the biological cell body that integrates signals from dendrites.
New institutions onboard through a four-stage protocol that calibrates their embedding space without exposing any patient data. A novel onboarding protocol: to our knowledge, no existing work models federated onboarding as an adversarial setting with model-extraction mitigations.
Adding noise to individual records destroys useful signal. SOMA works around this by aggregating at the population level first, where privacy protection is nearly lossless. Validated at strict privacy budgets.
Validated on established research cohorts spanning Alzheimer's disease and breast cancer, across institutions on three continents.
2,909 subjects from the Alzheimer's Disease Neuroimaging Initiative. Five data types: cognitive assessments, brain MRI, genetics, spinal fluid biomarkers, and PET imaging. 76 acquisition sites across North America.
885 subjects from the Rush Memory and Aging Project. Brain tissue gene expression replication cohort. Confirms neurology findings generalize beyond the primary cohort.
985 patients from The Cancer Genome Atlas. Three data types: clinical staging, somatic mutations, and gene expression. Molecular subtype classification and treatment stratification.
15,565 patients from AACR Project GENIE across 100+ contributing institutions worldwide. Cross-cohort transfer validation — a model trained on one cohort applied to an entirely different patient population with no retraining.
The core pipeline is validated across two disease domains. Next: a third domain, real multi-site deployment, and clinical partner pilots.
Extending validation to cardiovascular disease using ECG waveforms, cardiac cell atlases, and longitudinal clinical data. Testing whether the domain-agnostic claim holds for a third disease area.
Moving from research cohort validation to real multi-site deployment with clinical partners. Testing the Blind Handshake onboarding protocol with genuine institutional boundaries.
Detecting when patient populations shift over time and calibrating the system accordingly. Ensuring trajectory matching stays accurate as the network grows.
SOMA is the research foundation of Curadai — a company building AI infrastructure for non-profit and healthcare institutions. The architectural primitives validated here inform the clinical tools we build. SOMA itself is a research project, not a product; but the problems it solves are the ones our products address.