A federated system that finds patients with similar disease trajectories across institutions — without ever centralizing their data. One architecture, validated across two disease domains.
SOMA is a research initiative by Curadai, a company building AI infrastructure for non-profit and healthcare institutions. SOMA is not a product — it is the research foundation: a validated architecture for federated patient similarity that informs the clinical tools Curadai builds. The work is patent-pending, based on real clinical data, and conducted independently of the institutions whose data was used for validation.
Clinicians make better decisions when they can find patients with similar disease trajectories across institutions. But privacy regulations, data fragmentation, and missing temporal context have kept patient similarity networks out of the clinic.
Patient records siloed across institutions with incompatible EHR systems and clinical vocabularies.
HIPAA and GDPR restrict the centralization of protected health information across institutions.
Existing approaches treat patients as frozen in time, losing critical trajectory and progression information.
Complete multi-modal profiles are rare. Most patients have incomplete data across clinical, imaging, and genomic modalities.
Every claim below comes from real patient data across Alzheimer's disease and breast cancer cohorts. We tested seven core capabilities. Some held up. Some didn't. We report both.
The encoder compresses clinical, imaging, genomic, and molecular data into a compact mathematical representation that preserves clinically meaningful disease structure. In oncology, it correctly separates breast cancer molecular subtypes with 90% neighborhood accuracy. In neurology, it distinguishes Alzheimer's from healthy controls and identifies MCI patients likely to convert to dementia (p = 10⁻³⁹). The same architecture works for both — no domain-specific tuning required.
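The fusion idea can be sketched in a few lines. This is a minimal illustration, not SOMA's implementation: the modality names, input dimensions, random linear encoders, and fixed fusion weights are all assumptions (in the real system the encoders and weights are learned from data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality input sizes (illustrative, not SOMA's actual dimensions)
MODALITY_DIMS = {"clinical": 20, "imaging": 128, "genomic": 500}
EMBED_DIM = 64

# One linear encoder per modality, projecting into the shared 64-d space
encoders = {m: rng.normal(scale=0.1, size=(d, EMBED_DIM)) for m, d in MODALITY_DIMS.items()}
# Fusion weights, fixed here for illustration; learned per domain in practice
weights = {"clinical": 0.2, "imaging": 0.3, "genomic": 0.5}

def embed(patient: dict) -> np.ndarray:
    """Fuse a patient's modalities into one unit-norm 64-d vector."""
    parts = [weights[m] * (x @ encoders[m]) for m, x in patient.items()]
    z = np.sum(parts, axis=0)
    return z / np.linalg.norm(z)

patient = {m: rng.normal(size=d) for m, d in MODALITY_DIMS.items()}
z = embed(patient)
print(z.shape, round(float(np.linalg.norm(z)), 6))  # (64,) 1.0
```

Because the fusion weights are per-modality scalars, "learning which data types matter" reduces, in this toy view, to learning those weights per disease domain.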
Adding noise to individual patient records destroys all useful signal, a fundamental limitation we confirmed across every setting. SOMA's workaround is to aggregate population-level statistics across institutions, add noise there, and then use the protected aggregates to guide patient matching. In oncology, this preserves 99.8% of matching quality at strict privacy budgets. The key insight: privacy protection works when the underlying disease signal is strong.
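The aggregate-then-noise step can be sketched with the standard Gaussian mechanism. Everything below is illustrative, not SOMA's code: the sensitivity is a simplified per-patient bound for unit-norm embeddings, and the epsilon, delta, and cohort size are arbitrary choices. The point is that noise scales with 1/n, so a 500-patient aggregate barely moves.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_centroid(embeddings: np.ndarray, epsilon: float, delta: float = 1e-5) -> np.ndarray:
    """Release a noisy population-level mean of unit-norm embeddings.

    Simplified sensitivity bound: each patient contributes one unit-norm
    row, so the mean shifts by roughly 1/n when one record changes.
    Gaussian-mechanism scale: sigma = sqrt(2 ln(1.25/delta)) * s / epsilon.
    """
    n = len(embeddings)
    sensitivity = 1.0 / n
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    return embeddings.mean(axis=0) + rng.normal(scale=sigma, size=embeddings.shape[1])

# 500 patients: noise is added to the aggregate, never to individual records
emb = rng.normal(size=(500, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
noisy = dp_centroid(emb, epsilon=1.0)
err = float(np.linalg.norm(noisy - emb.mean(axis=0)))
print(round(err, 4))
```

At epsilon = 1 the perturbation of the 500-patient centroid is on the order of 0.08, tiny relative to typical inter-cluster distances on the unit sphere; the same noise applied per patient (n = 1) would swamp the signal, which is the limitation the text describes.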
SOMA matches patients by where their disease is heading, not just where it is now. Trajectory-aware twin retrieval correlates with future cognitive decline at r = 0.473 (p < 0.0001), more than double the correlation achieved by snapshot-based matching (r = 0.198). In oncology, aggressive breast cancer subtypes show 3–6× faster embedding velocity than indolent ones. The core mechanism works; what remains is isolating velocity from position in slowly progressing early-stage disease.
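A toy version of trajectory-aware matching, our own illustration rather than SOMA's retrieval code: represent each patient by current embedding position plus a finite-difference velocity across visits, and match on the concatenation. Snapshot matching would use position alone.

```python
import numpy as np

def trajectory_features(visits: np.ndarray) -> np.ndarray:
    """Concatenate current position and a finite-difference velocity.

    visits: (t, d) array of one patient's embeddings over successive visits.
    """
    position = visits[-1]
    velocity = visits[-1] - visits[-2]      # simplest estimate: the last step
    return np.concatenate([position, velocity])

def nearest_twin(query: np.ndarray, others: list[np.ndarray]) -> int:
    feats = trajectory_features(query)
    dists = [np.linalg.norm(feats - trajectory_features(o)) for o in others]
    return int(np.argmin(dists))

# Both candidates sit at the same position today; only velocity separates them.
query = np.array([[0.0, 0, 0, 0], [0.5, 0, 0, 0]])   # moving fast
slow  = np.array([[0.45, 0, 0, 0], [0.5, 0, 0, 0]])  # same position, barely moving
fast  = np.array([[0.0, 0, 0, 0], [0.5, 0, 0, 0]])   # same position, moving fast
print(nearest_twin(query, [slow, fast]))  # 1
```

Snapshot matching sees the two candidates as identical; the velocity term picks the one whose disease is heading the same way, which is exactly the distinction the early-stage caveat above is about.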
The textbook approach to combining data types — building separate similarity networks and averaging them — consistently made results worse. It dilutes strong signals with uninformative ones. SOMA's learned encoder solves this differently: it figures out which data types matter for each disease and weights them accordingly. In oncology, gene expression dominates; in neurology, clinical scores do. The architecture adapts without being told.
Real patients rarely have complete records. SOMA handles this gracefully: with half the data types missing, embedding quality stays above 96% in neurology. Trying to reconstruct missing data from other sources doesn't work — the information simply isn't there. But the encoder doesn't need complete data to produce useful patient representations.
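One way to express the no-imputation stance, shown purely as a sketch with made-up modality names: fuse only the modality embeddings a patient actually has and renormalize, so partial profiles stay on the unit sphere instead of being filled in with reconstructed values.

```python
import numpy as np

rng = np.random.default_rng(2)
EMBED_DIM = 64

def fuse(modal_embeddings: dict) -> np.ndarray:
    """Fuse only the modalities a patient actually has.

    Missing modalities are simply absent from the dict; nothing is imputed.
    Renormalizing keeps partial profiles comparable on the unit sphere.
    """
    z = np.sum(list(modal_embeddings.values()), axis=0)
    return z / np.linalg.norm(z)

full = {m: rng.normal(size=EMBED_DIM) for m in ("clinical", "imaging", "genomic", "csf")}
partial = {m: full[m] for m in ("clinical", "imaging")}   # half the modalities missing

cos = float(fuse(full) @ fuse(partial))
print(round(float(np.linalg.norm(fuse(partial))), 6))  # 1.0
```

The partial embedding remains a valid unit vector and stays positively aligned with the full one, which is the graceful-degradation behavior described above, in miniature.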
A model trained on one research cohort transfers to entirely different patient populations. An encoder trained on 981 cancer patients was applied to 15,565 patients from 100+ institutions with no retraining — and produced structured, meaningful patient groupings. This validates the deployment model: train once on a curated reference, deploy across the network.
When a new hospital joins the network, does the model forget what it learned from previous sites? In practice, forgetting is mild — well under 4% in realistic scenarios. Strong disease signals are naturally resistant to forgetting. Weaker signals benefit from targeted regularization that reduces forgetting 6×. SOMA detects signal strength and applies protection only where needed.
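The targeted-protection idea can be sketched as an EWC-style quadratic anchor penalty. This is a generic illustration, not SOMA's mechanism: the importance vector, learning rate, and penalty strength are invented values. Parameters flagged as important to earlier sites are pulled back toward their anchored values; unprotected ones follow the new site freely.

```python
import numpy as np

def regularized_update(weights, grad, anchor, importance, lr=0.1, lam=1.0):
    """One gradient step with an anchor penalty that resists forgetting.

    Adds lam * importance * (weights - anchor) to the task gradient, so
    parameters that mattered at earlier sites are pulled back toward the
    values learned there (an EWC-style quadratic penalty).
    """
    total_grad = grad + lam * importance * (weights - anchor)
    return weights - lr * total_grad

anchor = np.ones(3)                      # parameters after training on earlier sites
weights = anchor.copy()
grad = np.array([1.0, 1.0, 1.0])         # the new site pushes all parameters equally
importance = np.array([10.0, 0.0, 0.0])  # only the first parameter mattered before

for _ in range(100):
    weights = regularized_update(weights, grad, anchor, importance)

drift = np.abs(weights - anchor)
print(drift[0] < drift[1])  # protected parameter drifts far less: True
```

"Applying protection only where needed" corresponds here to setting the importance entries: near zero for robustly learned signals, large for fragile ones.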
Running the same architecture on neurology and oncology data exposed principles that no single-domain experiment could reveal.
Privacy, fusion, imputation — every downstream capability is bounded by how well the encoder captures disease structure in the first place. When embeddings are strong, privacy-preserving similarity works nearly losslessly. When they're weak, no amount of clever post-processing can compensate. Get the embedding right first — everything else follows.
In neurology, clinical scores drive patient similarity. In oncology, gene expression dominates and clinical features are nearly irrelevant. We didn't change the architecture between domains — the same encoder learned which data types carry signal and which are noise. A domain-agnostic design that adapts to whatever it sees.
Combining similarity networks with equal weight consistently made results worse — it dilutes strong signals with uninformative ones. SOMA's learned encoder produces high-quality patient representations on data where naive fusion produces random noise. The encoder learns to weight data types; mechanical averaging cannot.
Reconstructing one data type from another doesn't work — the information simply isn't there. But the encoder handles incomplete records gracefully: even with half the data types missing, quality stays above 96%. The practical takeaway: invest in data collection, not imputation algorithms.
When a new hospital joins the network, does the model forget earlier patients? It depends on signal strength. Strong disease signals are naturally robust to sequential training. Weaker ones benefit from targeted protection. SOMA detects this automatically and applies regularization only where it's needed.
The most important thing about SOMA is what the data becomes. Records go in. A 64-number vector comes out. That vector is the only thing that ever leaves the institution.
Raw patient records (clinical scores, MRI volumes, mutations, gene expression) are compressed through per-modality encoders into a shared 64-dimensional vector on the unit hypersphere. The shape of the data changes at every stage. By the end, a patient is a point on a sphere — and similar patients are nearby points.
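A small sketch of what that final artifact looks like, under the stated assumptions (64 dimensions, unit norm); the cohort size here is invented. The embedding is a few hundred bytes, and on the unit hypersphere cosine similarity reduces to a plain dot product, so cross-institution matching is nearest-neighbor search over dots.

```python
import numpy as np

rng = np.random.default_rng(3)

# The only artifact that leaves a hospital: a 64-float unit vector.
record_embedding = rng.normal(size=64)
record_embedding /= np.linalg.norm(record_embedding)

# On the unit hypersphere, cosine similarity is just a dot product,
# so matching against other sites' vectors is an argmax over dots.
other_sites = rng.normal(size=(1000, 64))
other_sites /= np.linalg.norm(other_sites, axis=1, keepdims=True)

sims = other_sites @ record_embedding
best = int(np.argmax(sims))
print(record_embedding.nbytes)  # 512 bytes at float64 (256 at float32)
```

Nothing in the vector is a field from the record; it is a learned coordinate, and proximity on the sphere is the only information it carries.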
Hospital nodes embed patients locally. The central brain matches trajectories across institutions. New hospitals onboard through the Blind Handshake protocol without exposing patient data.
Each institution embeds patients locally using clinical, imaging, genomic, and molecular data. Only compact mathematical representations leave the building — never patient records.
Matches patients by where their disease is heading, not just where it is now. Learns which data types matter for each disease domain automatically. Works across neurology and oncology with the same core architecture.
New institutions onboard through a four-stage protocol that calibrates their embedding space without exposing any patient data. To our knowledge, no existing work models federated onboarding as an adversarial setting with model-extraction mitigations.
Adding noise to individual records destroys useful signal. SOMA works around this by aggregating at the population level first, where privacy protection is nearly lossless. Validated at strict privacy budgets.
Validated on established research cohorts spanning Alzheimer's disease and breast cancer, across institutions on three continents.
3,692 subjects from the Alzheimer's Disease Neuroimaging Initiative. Five data types: cognitive assessments, brain MRI, genetics, spinal fluid biomarkers, and PET imaging. 76 acquisition sites across North America.
640 subjects from the Rush Memory and Aging Project. Brain tissue gene expression with cell-type atlas integration. Confirms neurology findings generalize beyond the primary cohort.
981 patients from The Cancer Genome Atlas. Three data types: clinical staging, somatic mutations, and gene expression. Molecular subtype classification and treatment stratification.
15,565 patients from AACR Project GENIE across 100+ contributing institutions worldwide. Cross-cohort transfer validation — a model trained on one cohort applied to an entirely different patient population with no retraining.
The core pipeline is validated across two disease domains. Next: a third domain, real multi-site deployment, and clinical partner pilots.
Extending validation to cardiovascular disease using ECG waveforms, cardiac cell atlases, and longitudinal clinical data. Testing whether the domain-agnostic claim holds for a third disease area.
Moving from research cohort validation to real multi-site deployment with clinical partners. Testing the Blind Handshake onboarding protocol with genuine institutional boundaries.
Detecting when patient populations shift over time and calibrating the system accordingly. Ensuring trajectory matching stays accurate as the network grows.
SOMA is the research foundation of Curadai — a company building AI infrastructure for non-profit and healthcare institutions. The architectural primitives validated here inform the clinical tools we build. SOMA itself is a research project, not a product; but the problems it solves are the ones our products address.