SOMA

A federated system that finds patients with similar disease trajectories across institutions — without ever centralizing their data. One architecture, validated across two disease domains.

Patent Pending (U.S.) · Sebastian T. Muah · Priority 2025

20,000+ · Real patients validated (ADNI, TCGA-BRCA, GENIE cohorts)
2 · Disease domains (neurology & oncology)
100+ · Source institutions in validation cohort data
Zero · Patient records shared (only embeddings leave each site)

SOMA is a research initiative by Curadai, a company building AI infrastructure for non-profit and healthcare institutions. SOMA is not a product — it is the research foundation: a validated architecture for federated patient similarity that informs the clinical tools Curadai builds. The work is patent-pending, based on real clinical data, and conducted independently of the institutions whose data was used for validation.

The Problem

Finding your patient's clinical twin shouldn't require centralizing data.

Clinicians make better decisions when they can find patients with similar disease trajectories across institutions. But privacy regulations, data fragmentation, and missing temporal context have kept patient similarity networks out of the clinic.

Data Fragmentation

Patient records siloed across institutions with incompatible EHR systems and clinical vocabularies.

Privacy Regulation

HIPAA and GDPR tightly restrict centralization of protected health information across institutional boundaries.

Static Snapshots

Existing approaches treat patients as frozen in time, losing critical trajectory and progression information.

Missing Modalities

Complete multi-modal profiles are rare. Most patients have incomplete data across clinical, imaging, and genomic modalities.

Validated on Real Data

What we proved — and what we didn't.

Every claim below comes from real patient data across Alzheimer's disease and breast cancer cohorts. We tested seven core capabilities. Some held up. Some didn't. We report both.

Patient Embedding · Validated

The encoder compresses clinical, imaging, genomic, and molecular data into a compact mathematical representation that preserves clinically meaningful disease structure. In oncology, it correctly separates breast cancer molecular subtypes with 90% neighborhood accuracy. In neurology, it distinguishes Alzheimer's from healthy controls and identifies MCI patients likely to convert to dementia (p = 10⁻³⁹). The same architecture works for both — no domain-specific tuning required.

Privacy-Preserving Similarity · Validated

Adding noise to individual patient records destroys all useful signal — a fundamental limitation we confirmed across every setting. But SOMA's workaround works: aggregate population-level statistics across institutions, add noise there, then use the protected aggregates to guide patient matching. In oncology, this preserves 99.8% of matching quality at strict privacy budgets. The key insight: privacy protection works when the underlying disease signal is strong.
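The exact aggregation mechanism is covered by pending claims, but the stated idea can be sketched with a standard Laplace mechanism: noise is calibrated to a population-level statistic, whose sensitivity shrinks with cohort size, rather than to individual records. All names and numbers below are illustrative, not SOMA's implementation.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean(values, epsilon, rng, bound=1.0):
    """Differentially private mean of values bounded in [0, bound].
    The mean's sensitivity is bound / n, so for a large cohort the
    calibrated noise is tiny; the aggregate is protected nearly
    losslessly, unlike per-record noise, which swamps the signal."""
    n = len(values)
    scale = bound / (n * epsilon)
    return sum(values) / n + laplace_noise(scale, rng)

rng = random.Random(0)
cohort = [i / 999 for i in range(1000)]   # toy population statistic in [0, 1]
true_mean = sum(cohort) / len(cohort)
protected = private_mean(cohort, epsilon=1.0, rng=rng)
```

The same noise scale applied to a single record (n = 1) would be a thousand times larger, which is the "destroys all useful signal" failure mode described above.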

Trajectory Matching · Validated

SOMA matches patients by where their disease is heading, not just where it is now. Trajectory-aware twin retrieval correlates with future cognitive decline at r = 0.473 (p < 0.0001), more than double the correlation achieved by snapshot-based matching (r = 0.198). In oncology, aggressive breast cancer subtypes show 3–6× faster embedding velocity than indolent ones. The core mechanism works; what remains is isolating velocity from position in slowly progressing early-stage disease.
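A minimal sketch of trajectory-aware matching: embedding velocity as a finite difference across visits, blended with positional similarity. The 50/50 blend and the 2-D toy embeddings are illustrative assumptions, not SOMA's actual encoding.

```python
import math

def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else v

def cosine(a, b):
    return sum(x * y for x, y in zip(_unit(a), _unit(b)))

def velocity(emb_then, emb_now, dt):
    """Finite-difference velocity of a patient's embedding over time."""
    return [(b - a) / dt for a, b in zip(emb_then, emb_now)]

def trajectory_similarity(p, q, alpha=0.5):
    """Blend positional similarity (where the patient is) with
    directional similarity (where the disease is heading).
    The 50/50 weighting is purely illustrative."""
    return alpha * cosine(p["emb"], q["emb"]) + (1 - alpha) * cosine(p["vel"], q["vel"])

a = {"emb": [1.0, 0.0], "vel": velocity([0.9, 0.0], [1.0, 0.0], dt=1.0)}
b = {"emb": [1.0, 0.05], "vel": velocity([0.9, 0.04], [1.0, 0.05], dt=1.0)}  # same heading
c = {"emb": [1.0, 0.0], "vel": velocity([1.1, 0.0], [1.0, 0.0], dt=1.0)}     # opposite heading
```

Patients a and c sit at the same point (a snapshot match would call them identical twins), but their opposite headings pull the trajectory-aware score down.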

Multi-Modal Fusion · Reframed

The textbook approach to combining data types — building separate similarity networks and averaging them — consistently made results worse. It dilutes strong signals with uninformative ones. SOMA's learned encoder solves this differently: it figures out which data types matter for each disease and weights them accordingly. In oncology, gene expression dominates; in neurology, clinical scores do. The architecture adapts without being told.
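The dilution effect is easy to see at the similarity level. In SOMA the weighting is learned inside the encoder, not applied post hoc; the fixed weights and scores below are toy values chosen only to show why equal averaging hurts.

```python
# Per-modality similarity between a query patient and two candidates;
# candidate 0 is the true match.
informative = [0.95, 0.10]   # a strong modality (e.g. gene expression in oncology)
noisy = [0.13, 0.85]         # arbitrary values standing in for an uninformative modality

def fuse(weights, *modalities):
    """Weighted fusion of per-modality similarity scores."""
    return [sum(w * m[i] for w, m in zip(weights, modalities))
            for i in range(len(modalities[0]))]

equal = fuse([0.5, 0.5], informative, noisy)    # textbook equal averaging
learned = fuse([0.9, 0.1], informative, noisy)  # stand-in for learned down-weighting
```

Equal averaging shrinks the margin between the true match and the distractor; down-weighting the uninformative modality restores it.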

Missing Data Tolerance · Validated

Real patients rarely have complete records. SOMA handles this gracefully: with half the data types missing, embedding quality stays above 96% in neurology. Trying to reconstruct missing data from other sources doesn't work — the information simply isn't there. But the encoder doesn't need complete data to produce useful patient representations.
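One way to sketch this tolerance: fuse only the modalities that are present and renormalize, rather than imputing the absent ones. The mean-of-present fusion below is a simplification of the learned encoder, and all vectors are toy values.

```python
import math

def embed_with_missing(modalities, dim=4):
    """Fuse whichever modality vectors are present (None = missing).
    Missing modalities are skipped, not reconstructed; the fused
    vector is renormalized onto the unit sphere."""
    present = [v for v in modalities if v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    fused = [sum(v[i] for v in present) / len(present) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused]

full = embed_with_missing([[1.0, 0, 0, 0], [0.8, 0.2, 0, 0], [0.9, 0.1, 0, 0]])
partial = embed_with_missing([[1.0, 0, 0, 0], None, [0.9, 0.1, 0, 0]])
overlap = sum(a * b for a, b in zip(full, partial))  # cosine of two unit vectors
```

With one of three modalities dropped, the partial embedding still points almost exactly where the full one does.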

Cross-Cohort Transfer · Validated

A model trained on one research cohort transfers to entirely different patient populations. An encoder trained on 981 cancer patients was applied to 15,565 patients from 100+ institutions with no retraining — and produced structured, meaningful patient groupings. This validates the deployment model: train once on a curated reference, deploy across the network.

Continual Learning · Partial

When a new hospital joins the network, does the model forget what it learned from previous sites? In practice, forgetting is mild — well under 4% in realistic scenarios. Strong disease signals are naturally resistant to forgetting. Weaker signals benefit from targeted regularization that reduces forgetting 6×. SOMA detects signal strength and applies protection only where needed.
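SOMA's exact regularizer is not specified here; a standard instance of "targeted protection" is an EWC-style quadratic pull toward parameters learned on earlier sites, applied only to parameters flagged as important. The importance weights and numbers below are illustrative.

```python
def regularized_update(params, grads, anchor, importance, lr=0.1, lam=1.0):
    """One gradient step on the new site's loss, plus an EWC-style
    quadratic penalty pulling each parameter back toward the value
    learned on earlier sites, scaled by its importance weight."""
    return [p - lr * (g + 2.0 * lam * w * (p - a))
            for p, g, a, w in zip(params, grads, anchor, importance)]

# Both parameters drifted to 1.2 and feel the same new-site gradient,
# but only the first is flagged as important to earlier sites.
anchor = [1.0, 1.0]
importance = [5.0, 0.0]   # illustrative stand-in for detected signal strength
updated = regularized_update([1.2, 1.2], [-1.0, -1.0], anchor, importance)
```

The protected parameter is held near the earlier sites' optimum while the unprotected one follows the new site's gradient freely, matching the "apply protection only where needed" behavior described above.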

What We Learned

Testing across two disease domains changed how we build.

Running the same architecture on neurology and oncology data exposed principles that no single-domain experiment could reveal.

Embedding quality is the multiplier.

Privacy, fusion, imputation — every downstream capability is bounded by how well the encoder captures disease structure in the first place. When embeddings are strong, privacy-preserving similarity works nearly losslessly. When they're weak, no amount of clever post-processing can compensate. Get the embedding right first — everything else follows.

The encoder figures out what matters. We don't tell it.

In neurology, clinical scores drive patient similarity. In oncology, gene expression dominates and clinical features are nearly irrelevant. We didn't change the architecture between domains — the same encoder learned which data types carry signal and which are noise. A domain-agnostic design that adapts to whatever it sees.

Simple averaging destroys signal. Learned fusion preserves it.

Combining similarity networks with equal weight consistently made results worse — it dilutes strong signals with uninformative ones. SOMA's learned encoder produces high-quality patient representations on data where naive fusion produces random noise. The encoder learns to weight data types; mechanical averaging cannot.

Collect the right data. Don't try to hallucinate what's missing.

Reconstructing one data type from another doesn't work — the information simply isn't there. But the encoder handles incomplete records gracefully: even with half the data types missing, quality stays above 96%. The practical takeaway: invest in data collection, not imputation algorithms.

Strong signals don't forget. Weak ones need a safety net.

When a new hospital joins the network, does the model forget earlier patients? It depends on signal strength. Strong disease signals are naturally robust to sequential training. Weaker ones benefit from targeted protection. SOMA detects this automatically and applies regularization only where it's needed.

Architecture

What happens to patient data at each stage.

The most important thing about SOMA is what the data becomes. Records go in. A 64-number vector comes out. That vector is the only thing that ever leaves the institution.

Patient Records (N × D raw) → Modality Encoders (per-type) → Bottleneck Fusion (N × 64) → Unit Hypersphere (S⁶³) → Similarity Matching (cosine)

Raw patient records (clinical scores, MRI volumes, mutations, gene expression) are compressed through per-modality encoders into a shared 64-dimensional vector on the unit hypersphere. The shape of the data changes at every stage. By the end, a patient is a point on a sphere — and similar patients are nearby points.
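The last two stages can be sketched directly: a 64-dimensional vector normalized onto the unit sphere, where matching reduces to a dot product. The toy encoder below is purely illustrative; only the 64-wide bottleneck, the hypersphere projection, and cosine matching come from the architecture above.

```python
import math

DIM = 64  # bottleneck width from the architecture above

def normalize(v):
    """Project a vector onto the unit hypersphere S^(DIM-1)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def toy_encoder(record, dim=DIM):
    """Stand-in for the per-modality encoders and fusion bottleneck:
    a fixed fold-into-dim-slots projection, purely illustrative."""
    out = [0.0] * dim
    for i, x in enumerate(record):
        out[i % dim] += x
    return normalize(out)

def match(a, b):
    """On the unit sphere, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

e1 = toy_encoder([1.0, 2.0, 3.0])    # toy "patient record"
e2 = toy_encoder([1.1, 2.0, 2.9])    # a near twin
e3 = toy_encoder([-3.0, 5.0, -1.0])  # a dissimilar patient
```

Similar raw records land at nearby points on the sphere, so the twin scores higher than the dissimilar patient; and since only e1, e2, e3 ever leave a site, the raw records stay behind.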

Network Topology

Hospital nodes embed patients locally. The central brain matches trajectories across institutions. New hospitals onboard through the Blind Handshake protocol without exposing patient data.

[Network diagram: the SOMA Central Brain (learned fusion, velocity encoding, trajectory matching) connects to edge nodes at Hospital A, Hospital B, Hospital C, a Research Lab, and a Genomics Center. Only embeddings cross the connections; zero patient data is shared. A new institution joins via the Blind Handshake onboarding protocol.]

Edge Nodes

Each institution embeds patients locally using clinical, imaging, genomic, and molecular data. Only compact mathematical representations leave the building — never patient records.

Central Brain

Matches patients by where their disease is heading, not just where it is now. Learns which data types matter for each disease domain automatically. Works across neurology and oncology with the same core architecture.

Blind Handshake

New institutions onboard through a four-stage protocol that calibrates their embedding space without exposing any patient data. A novel protocol; to our knowledge, no prior art exists.

Privacy Architecture

Adding noise to individual records destroys useful signal. SOMA works around this by aggregating at the population level first, where privacy protection is nearly lossless. Validated at strict privacy budgets.

Validation Data

Real patients. Real clinical data. Two disease domains.

Validated on established research cohorts spanning Alzheimer's disease and breast cancer, across institutions on three continents.

Alzheimer's Disease

3,692 subjects from the Alzheimer's Disease Neuroimaging Initiative. Five data types: cognitive assessments, brain MRI, genetics, spinal fluid biomarkers, and PET imaging. 76 acquisition sites across North America.

Independent Replication

640 subjects from the Rush Memory and Aging Project. Brain tissue gene expression with cell-type atlas integration. Confirms neurology findings generalize beyond the primary cohort.

Breast Cancer

981 patients from The Cancer Genome Atlas. Three data types: clinical staging, somatic mutations, and gene expression. Molecular subtype classification and treatment stratification.

Multi-Site Transfer

15,565 patients from AACR Project GENIE across 100+ contributing institutions worldwide. Cross-cohort transfer validation — a model trained on one cohort applied to an entirely different patient population with no retraining.

Patent pending (U.S.) · Priority date 2025. The federated embedding architecture, Blind Handshake onboarding protocol, trajectory velocity encoding, and privacy-preserving aggregation mechanism are covered by pending patent claims. Institutions considering engagement should contact us for licensing terms.

What's Next

From validation to clinical utility.

The core pipeline is validated across two disease domains. Next: a third domain, real multi-site deployment, and clinical partner pilots.

Third Domain: Cardiology

Extending validation to cardiovascular disease using ECG waveforms, cardiac cell atlases, and longitudinal clinical data. Testing whether the domain-agnostic claim holds for a third disease area.

Clinical Partner Pilots

Moving from research cohort validation to real multi-site deployment with clinical partners. Testing the Blind Handshake onboarding protocol with genuine institutional boundaries.

Longitudinal Monitoring

Detecting when patient populations shift over time and calibrating the system accordingly. Ensuring trajectory matching stays accurate as the network grows.

A Curadai Research Initiative

SOMA is the research foundation of Curadai — a company building AI infrastructure for non-profit and healthcare institutions. The architectural primitives validated here inform the clinical tools we build. SOMA itself is a research project, not a product; but the problems it solves are the ones our products address.
