ExposureGuard-DCPG-Encoder

A PHI exposure graph is not a bag of records. It has structure. A patient's name appears in a clinical note. The same name appears in an ASR transcript 20 minutes later. A matching date shows up in an imaging header. A voice profile correlates with the ASR content. Each of these connections is an edge. Each modality is a node. The risk of re-identification depends on how that graph is connected, not on any single record in isolation.

This model encodes that graph.


What it produces

A 16-dimensional patient embedding capturing the full cross-modal PHI exposure topology, plus a scalar risk score. Both come from a two-layer graph attention network that runs directly over the DCPG structure. No transformers, no external ML framework, no dependencies beyond Python stdlib. The whole thing is 22KB.

The embedding feeds downstream into PolicyNet for masking policy decisions and SynthRewrite-T5 for synthetic text generation. The risk score feeds into FedCRDT-Distill when operating in a federated setting.


Why graph attention specifically

Standard PHI de-identification aggregates per-record features. This model treats the exposure history as a graph and runs attention over it, which means nodes with high risk entropy pull more weight during pooling. A text node carrying a name, date, and MRN gets more influence over the final embedding than a waveform node carrying only a timestamp. That weighting is learned from the graph structure, not hand-coded.

Cross-modal edges matter here. The attention mechanism propagates information across modality boundaries before pooling, so the final embedding reflects not just what each modality contains but how they link to each other.
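
The entropy-weighted pooling step can be sketched in pure stdlib Python. This is an illustrative sketch, not the released implementation: the actual attention scoring inside the model is not published in this card, so a plain softmax over each node's risk_entropy stands in for it here.

```python
import math

def attention_pool(node_embeddings, risk_entropy):
    """Pool per-node embeddings into one patient vector,
    weighting each node by softmax(risk_entropy).

    High-entropy nodes (e.g. a text node carrying name, date,
    and MRN) dominate the pooled result; low-entropy nodes
    (e.g. a waveform node with only a timestamp) contribute less.
    """
    ids = list(node_embeddings)
    exps = [math.exp(risk_entropy[n]) for n in ids]
    z = sum(exps)
    dim = len(node_embeddings[ids[0]])
    pooled = [0.0] * dim
    for n, e in zip(ids, exps):
        for j in range(dim):
            pooled[j] += (e / z) * node_embeddings[n][j]
    # L2-normalize, matching the documented output embedding
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]

emb = {"n1": [1.0, 0.0], "n2": [0.0, 1.0]}
ent = {"n1": 0.72, "n2": 0.61}
vec = attention_pool(emb, ent)
```

With these toy values, n1's higher entropy gives it the larger softmax weight, so the pooled vector leans toward n1's embedding.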


Architecture

Input graph (nodes + edges)
        |
  Layer 1: GAT  [19 -> 32]
        |
  Layer 2: GAT  [32 -> 16]
        |
  Attention pool (weighted by risk_entropy)
        |
  patient_embedding [16]  +  risk_score [0,1]
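
A single GAT-style layer over an edge-weighted graph can be sketched as follows. This is an assumption-laden illustration: the real model's attention scoring, weight initialization, and activation choices are not documented here, so this sketch uses a softmax over raw edge weights and a ReLU, with self-loops added so every node attends to itself.

```python
import math
import random

def gat_layer(features, edges, w_in, out_dim):
    """One simplified graph-attention layer.

    features: {node_id: [float]}    node feature vectors
    edges:    [(src, dst, weight)]  weighted, treated as undirected
    w_in:     [[float]]             projection matrix, in_dim x out_dim
    """
    # Linear projection of every node's features
    proj = {n: [sum(f[i] * w_in[i][j] for i in range(len(f)))
                for j in range(out_dim)]
            for n, f in features.items()}
    # Collect weighted neighbours, including a self-loop per node
    nbrs = {n: [(n, 1.0)] for n in features}
    for s, d, w in edges:
        nbrs[d].append((s, w))
        nbrs[s].append((d, w))
    out = {}
    for n, lst in nbrs.items():
        # Softmax over edge weights stands in for learned attention
        exps = [math.exp(w) for _, w in lst]
        z = sum(exps)
        agg = [0.0] * out_dim
        for (m, _), e in zip(lst, exps):
            for j in range(out_dim):
                agg[j] += (e / z) * proj[m][j]
        out[n] = [max(0.0, v) for v in agg]  # ReLU
    return out

# Toy forward pass matching the first layer's 19 -> 32 shape
random.seed(0)
feats = {"a": [random.random() for _ in range(19)],
         "b": [random.random() for _ in range(19)]}
W = [[random.gauss(0, 0.1) for _ in range(32)] for _ in range(19)]
h = gat_layer(feats, [("a", "b", 0.71)], W, 32)
```

Stacking two such layers (19 -> 32, then 32 -> 16) and applying the pooling step yields the 16-dimensional patient embedding.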

Node features (19 dims)

Group              Dims  Content
Modality one-hot      8  text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown
PHI type one-hot      8  NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown
Scalars               3  risk_entropy, context_confidence, pseudonym_version_norm
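
Assembling that 19-dim vector might look like the sketch below. The vocabularies come from the table above; the normalization of pseudonym_version is an assumption (clamped linear scaling against a hypothetical max_version), since the card only names the feature.

```python
MODALITIES = ["text", "asr", "image_proxy", "waveform_proxy",
              "audio_proxy", "image_link", "audio_link"]
PHI_TYPES = ["NAME_DATE_MRN_FACILITY", "NAME_DATE_MRN", "FACE_IMAGE",
             "WAVEFORM_HEADER", "VOICE", "FACE_LINK", "VOICE_LINK"]

def one_hot(value, vocab):
    """8-slot one-hot: 7 known categories plus a trailing 'unknown' slot."""
    vec = [0.0] * (len(vocab) + 1)
    vec[vocab.index(value) if value in vocab else len(vocab)] = 1.0
    return vec

def node_features(node, max_version=10):
    """8 modality dims + 8 PHI-type dims + 3 scalars = 19 dims.

    max_version and the clamped scaling are illustrative assumptions.
    """
    version_norm = min(node.get("pseudonym_version", 0), max_version) / max_version
    return (one_hot(node.get("modality", "unknown"), MODALITIES)
            + one_hot(node.get("phi_type", "unknown"), PHI_TYPES)
            + [node.get("risk_entropy", 0.0),
               node.get("context_confidence", 0.0),
               version_norm])

feats = node_features({"modality": "text",
                       "phi_type": "NAME_DATE_MRN",
                       "risk_entropy": 0.72,
                       "context_confidence": 0.9,
                       "pseudonym_version": 1})
print(len(feats))  # 19
```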

Edge weights from DCPGEdge:

w = 0.30*f_temporal + 0.30*f_semantic + 0.25*f_modality + 0.15*f_trust

Temporal and semantic similarity carry equal weight. Modality match matters less. Trust is a small correction term.
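
The combination rule above translates directly into code. Assuming each feature lies in [0, 1], the coefficients sum to 1.0, so the resulting weight does too:

```python
def edge_weight(f_temporal, f_semantic, f_modality, f_trust):
    """Combine DCPGEdge similarity features into a scalar weight.

    Coefficients follow the formula above; with all features
    in [0, 1], the result is also in [0, 1].
    """
    return (0.30 * f_temporal
            + 0.30 * f_semantic
            + 0.25 * f_modality
            + 0.15 * f_trust)

# Strong temporal/semantic match, same modality, moderate trust:
# 0.27 + 0.24 + 0.25 + 0.075 = 0.835
w = edge_weight(0.9, 0.8, 1.0, 0.5)
```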


Usage

from dcpg_encoder import encode_patient

result = encode_patient(graph_summary)

result["patient_embedding"]  # List[float], dim=16, L2-normalized
result["node_embeddings"]    # Dict[node_id, List[float]]
result["risk_score"]         # float in [0, 1]
result["embed_dim"]          # 16

From a CRDT federated graph after a device merge:

result = encode_patient(crdt_summary, source="crdt")

Batch encoding:

from inference import predict_batch
results = predict_batch([summary_a, summary_b])

Input

{
  "nodes": [
    {
      "node_id": "patient_1::text::NAME_DATE_MRN_FACILITY",
      "modality": "text",
      "phi_type": "NAME_DATE_MRN_FACILITY",
      "risk_entropy": 0.72,
      "context_confidence": 0.9,
      "pseudonym_version": 1
    },
    {
      "node_id": "patient_1::asr::NAME_DATE_MRN",
      "modality": "asr",
      "phi_type": "NAME_DATE_MRN",
      "risk_entropy": 0.61,
      "context_confidence": 0.7,
      "pseudonym_version": 1
    }
  ],
  "edges": [
    {
      "source": "patient_1::text::NAME_DATE_MRN_FACILITY",
      "target": "patient_1::asr::NAME_DATE_MRN",
      "type": "co_occurrence",
      "weight": 0.71
    }
  ]
}
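
A summary shaped like the JSON above maps naturally onto an undirected weighted adjacency list, which is what the attention layers traverse. A minimal sketch (the helper name `adjacency` is illustrative, not part of the released API):

```python
def adjacency(summary):
    """Build an undirected weighted adjacency list from a
    graph summary shaped like the input JSON above."""
    adj = {n["node_id"]: [] for n in summary["nodes"]}
    for e in summary["edges"]:
        adj[e["source"]].append((e["target"], e["weight"]))
        adj[e["target"]].append((e["source"], e["weight"]))
    return adj

summary = {
    "nodes": [{"node_id": "patient_1::text::NAME_DATE_MRN_FACILITY"},
              {"node_id": "patient_1::asr::NAME_DATE_MRN"}],
    "edges": [{"source": "patient_1::text::NAME_DATE_MRN_FACILITY",
               "target": "patient_1::asr::NAME_DATE_MRN",
               "type": "co_occurrence",
               "weight": 0.71}],
}
adj = adjacency(summary)
```

Each co_occurrence edge contributes a neighbour entry in both directions, so the text and ASR nodes each see the other during message passing.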

Output

{
  "patient_embedding": [0.0, 0.189, 0.0, 0.095, ...],
  "node_embeddings": {
    "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
  },
  "risk_score": 0.429,
  "embed_dim": 16
}

Where it fits in the pipeline

DCPGAdapter.graph_summary()
        |
DCPGEncoder.encode()
        |
    +---+----------------------+
    |                          |
patient_embedding          risk_score
    |                          |
PolicyNet              FedCRDT-Distill
(masking policy)       (federated merge)

The graph summary comes from DCPGAdapter.graph_summary() in the main system or from CRDTGraph.summary() when operating in a federated deployment where two edge devices have merged their graphs.



Citation

@software{exposureguard_dcpg_encoder,
  title  = {ExposureGuard-DCPG-Encoder: Graph Attention Encoder for Cross-Modal PHI Exposure Graphs},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
  note   = {US Provisional Patent filed 2025-07-05}
}

MIT License. All development and testing used fully synthetic data.
