bruAristimunha committed on
Commit aaeda3a · verified · 1 Parent(s): 3fe80ce

Replace with clean markdown card

Files changed (1)
  1. README.md +26 -240

README.md CHANGED
@@ -9,19 +9,17 @@ tags:
  - neuroscience
  - braindecode
  - foundation-model
- - convolutional
  - transformer
  ---

  # PBT

- Patched Brain Transformer (PBT) model from Klein et al (2025) .

- > **Architecture-only repository.** This repo documents the
  > `braindecode.models.PBT` class. **No pretrained weights are
- > distributed here** instantiate the model and train it on your own
- > data, or fine-tune from a published foundation-model checkpoint
- > separately.

  ## Quick start

@@ -40,257 +38,45 @@ model = PBT(
  )
  ```

- The signal-shape arguments above are example defaults — adjust them
- to match your recording.

  ## Documentation
-
- - Full API reference (parameters, references, architecture figure):
- <https://braindecode.org/stable/generated/braindecode.models.PBT.html>
- - Interactive browser with live instantiation:
  <https://huggingface.co/spaces/braindecode/model-explorer>
  - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/patchedtransformer.py#L17>

- ## Architecture description
-
- The block below is the rendered class docstring (parameters,
- references, architecture figure where available).
-
- <div class='bd-doc'><main>
- <p>Patched Brain Transformer (PBT) model from Klein et al (2025) [pbt]_.</p>
- <span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#d9534f;color:white;font-size:11px;font-weight:600;margin-right:4px;">Foundation Model</span>
-
-
-
- This implementation was based in https://github.com/timonkl/PatchedBrainTransformer/
-
- .. figure:: https://raw.githubusercontent.com/timonkl/PatchedBrainTransformer/refs/heads/main/PBT_sketch.png
- :align: center
- :alt: Patched Brain Transformer Architecture
- :width: 680px
-
- PBT tokenizes EEG trials into per-channel patches, linearly projects each
- patch to a model embedding dimension, prepends a classification token and
- adds channel-aware positional embeddings. The token sequence is processed
- by a Transformer encoder stack and classification is performed from the
- classification token.
-
- .. rubric:: Macro Components
-
- - ``PBT.tokenization`` **(patch extraction)**
-
- *Operations.* The pre-processed EEG signal :math:`X \in \mathbb{R}^{C \times T}`
- (with :math:`C = \text{n_chans}` and :math:`T = \text{n_times}`) is divided into
- non-overlapping patches of size :math:`d_{\text{input}}` along the time axis.
- This process yields :math:`N` total patches, calculated as
- :math:`N = C \left\lfloor \frac{T}{D} \right\rfloor` (where :math:`D = d_{\text{input}}`).
- When time shifts are applied, :math:`N` decreases to
- :math:`N = C \left\lfloor \frac{T - T_{\text{aug}}}{D} \right\rfloor`.
-
- *Role.* Tokenizes EEG trials into fixed-size, per-channel patches so the model
- remains adaptive to different numbers of channels and recording lengths.
- Process is inspired by Vision Transformers [visualtransformer]_ and
- adapted for GPT context from [efficient-batchpacking]_.
-
- - ``PBT.patch_projection`` **(patch embedding)**
-
- *Operations.* The linear layer ``PBT.patch_projection`` maps the tokens from dimension
- :math:`d_{\text{input}}` to the Transformer embedding dimension :math:`d_{\text{model}}`.
- Patches :math:`X_P` are projected as :math:`X_E = X_P W_E^\top`, where
- :math:`W_E \in \mathbb{R}^{d_{\text{model}} \times D}`. In this configuration
- :math:`d_{\text{model}} = 2D` with :math:`D = d_{\text{input}}`.
-
- *Interpretability.* Learns periodic structures similar to frequency filters in
- the first convolutional layers of CNNs (for example :class:`~braindecode.models.EEGNet`).
- The learned filters frequently focus on the high-frequency range (20-40 Hz),
- which correlates with beta and gamma waves linked to higher concentration levels.
-
- - ``PBT.cls_token`` **(classification token)**
-
- *Operations.* A classification token :math:`[c_{\text{ls}}] \in \mathbb{R}^{1 \times d_{\text{model}}}`
- is prepended to the projected patch sequence :math:`X_E`. The CLS token can optionally
- be learnable (see ``learnable_cls``).
-
- *Role.* Acts as a dedicated readout token that aggregates information through the
- Transformer encoder stack.
-
- - ``PBT.pos_embedding`` **(positional embedding)**
-
- *Operations.* Positional indices are generated by ``PBT.linear_projection``, an instance
- of :class:`~braindecode.models.patchedtransformer._ChannelEncoding`, and mapped to vectors
- through :class:`~torch.nn.Embedding`. The embedding table
- :math:`W_{\text{pos}} \in \mathbb{R}^{(N+1) \times d_{\text{model}}}` is added to the token
- sequence, yielding :math:`X_{\text{pos}} = [c_{\text{ls}}, X_E] + W_{\text{pos}}`.
-
- *Role/Interpretability.* Introduces spatial and temporal dependence to counter the
- position invariance of the Transformer encoder. The learned positional embedding
- exposes spatial relationships, often revealing a symmetric pattern in central regions
- (C1-C6) associated with the motor cortex.
-
- - ``PBT.transformer_encoder`` **(sequence processing and attention)**
-
- *Operations.* The token sequence passes through :math:`n_{\text{blocks}}` Transformer
- encoder layers. Each block combines a Multi-Head Self-Attention (MHSA) module with
- ``num_heads`` attention heads and a Feed-Forward Network (FFN). Both MHSA
- and FFN use parallel residual connections with Layer Normalization inside the blocks
- and apply dropout (``drop_prob``) within the Transformer components.
-
- *Role/Robustness.* Self-attention enables every token to consider all others, capturing
- global temporal and spatial dependencies immediately and adaptively. This architecture
- accommodates arbitrary numbers of patches and channels, supporting pre-training across
- diverse datasets.
-
- - ``PBT.final_layer`` **(readout)**
-
- *Operations.* A linear layer operates on the processed CLS token only, and the model
- predicts class probabilities as :math:`y = \operatorname{softmax}([c_{\text{ls}}] W_{\text{class}}^\top + b_{\text{class}})`.
-
- *Role.* Performs the final classification from the information aggregated into the CLS
- token after the Transformer encoder stack.
-
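To make the Macro Components walkthrough above concrete, here is a minimal, self-contained sketch of the token pipeline (patching, projection, CLS token, positional embedding, encoder, readout). It uses stock PyTorch modules and illustrative sizes (22 channels, 1000 samples, D = 64), not the exact braindecode internals:

```python
import torch
from torch import nn

# Illustrative sizes, not braindecode defaults.
C, T, D = 22, 1000, 64           # channels, samples per trial, patch length d_input
d_model = 2 * D                  # embedding dim (d_model = 2D per the docstring)
n_blocks, n_heads, n_outputs = 4, 8, 4

# 1) Tokenization: N = C * floor(T / D) per-channel patches.
x = torch.randn(1, C, T)
n_per_chan = T // D                                                  # 15
patches = x[:, :, : n_per_chan * D].reshape(1, C * n_per_chan, D)    # (1, 330, 64)

# 2) Patch projection: X_E = X_P @ W_E.T.
tokens = nn.Linear(D, d_model)(patches)                              # (1, 330, 128)

# 3) Prepend the CLS token and add positional embeddings.
cls = torch.zeros(1, 1, d_model)
seq = torch.cat([cls, tokens], dim=1)                                # (1, N + 1, d_model)
pos = nn.Embedding(seq.shape[1], d_model)
seq = seq + pos(torch.arange(seq.shape[1]))

# 4) Transformer encoder stack, then 5) readout from the CLS token.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_blocks,
)
logits = nn.Linear(d_model, n_outputs)(encoder(seq)[:, 0])
print(logits.shape)                                                  # torch.Size([1, 4])
```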
- .. rubric:: Convolutional Details
-
- PBT omits convolutional layers; equivalent feature extraction is carried out by the patch
- pipeline and attention stack.
-
- * **Temporal.** Tokenization slices the EEG into fixed windows of size :math:`D = d_{\text{input}}`
- (for the default configuration, :math:`D=64` samples :math:`\approx 0.256\,\text{s}` at
- :math:`250\,\text{Hz}`), while ``PBT.patch_projection`` learns periodic patterns within each
- patch. The Transformer encoder then models long- and short-range temporal dependencies through
- self-attention.
-
- * **Spatial.** Patches are channel-specific, keeping the architecture adaptive to any electrode
- montage. Channel-aware positional encodings :math:`W_{\text{pos}}` capture relationships between
- nearby sensors; learned embeddings often form symmetric motifs across motor cortex electrodes
- (C1–C6), and self-attention propagates information across all channels jointly.
-
- * **Spectral.** ``PBT.patch_projection`` acts similarly to the first convolutional layer in
- :class:`~braindecode.models.EEGNet`, learning frequency-selective filters without an explicit
- Fourier transform. The highest-energy filters typically reside between :math:`20` and
- :math:`40\,\text{Hz}`, aligning with beta/gamma rhythms tied to focused motor imagery.
-
- .. rubric:: Attention / Sequential Modules
-
- * **Attention Details.** ``PBT.transformer_encoder`` stacks :math:`n_{\text{blocks}}` Transformer
- encoder layers with Multi-Head Self-Attention. Every token attends to all others, enabling
- immediate global integration across time and channels and supporting heterogeneous datasets.
- Attention rollout visualisations highlight strong activations over motor cortex electrodes
- (C3, C4, Cz) during motor imagery decoding.
-
- .. warning::
-
- **Important:** As the other Foundation Models in Braindecode, :class:`PBT` is
- designed for large-scale pre-training and fine-tuning. Training from
- scratch on small datasets may lead to suboptimal results. Cross-Dataset
- pre-training and subsequent fine-tuning is recommended to leverage the
- full potential of this architecture.
-
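In line with the warning above, the intended workflow is to fine-tune from an existing checkpoint rather than train from scratch. A minimal sketch, assuming a checkpoint is available on the Hub under a placeholder id (`someone/pbt-pretrained` is hypothetical; no weights ship with this repo) and using the `from_pretrained` and `reset_head` helpers shown in the Hub section further down:

```python
from braindecode.models import PBT

# Hypothetical checkpoint id, for illustration only.
model = PBT.from_pretrained("someone/pbt-pretrained")

# Rebuild the classification head for the downstream task (e.g. 4 classes),
# then fine-tune with your usual training loop.
model.reset_head(n_outputs=4)
for p in model.parameters():
    p.requires_grad = True   # or freeze the encoder and train only the head
```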
- Parameters
- ----------
- d_input : int, optional
- Size (in samples) of each patch (token) extracted along the time axis.
- embed_dim : int, optional
- Transformer embedding dimensionality.
- num_layers : int, optional
- Number of Transformer encoder layers.
- num_heads : int, optional
- Number of attention heads.
- drop_prob : float, optional
- Dropout probability used in Transformer components.
- learnable_cls : bool, optional
- Whether the classification token is learnable.
- bias_transformer : bool, optional
- Whether to use bias in Transformer linear layers.
- activation : nn.Module, optional
- Activation function class to use in Transformer feed-forward layers.
-
- References
- ----------
- .. [pbt] Klein, T., Minakowski, P., & Sager, S. (2025).
- Flexible Patched Brain Transformer model for EEG decoding.
- Scientific Reports, 15(1), 1-12.
- https://www.nature.com/articles/s41598-025-86294-3
- .. [visualtransformer] Dosovitskiy, A., Beyer, L., Kolesnikov, A.,
- Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
- Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby,
- N. (2021). An Image is Worth 16x16 Words: Transformers for Image
- Recognition at Scale. International Conference on Learning
- Representations (ICLR).
- .. [efficient-batchpacking] Krell, M. M., Kosec, M., Perez, S. P., &
- Fitzgibbon, A. (2021). Efficient sequence packing without
- cross-contamination: Accelerating large language models without
- impacting performance. arXiv preprint arXiv:2107.02027.
-
- .. rubric:: Hugging Face Hub integration
-
- When the optional ``huggingface_hub`` package is installed, all models
- automatically gain the ability to be pushed to and loaded from the
- Hugging Face Hub. Install with::
-
- pip install braindecode[hub]
-
- **Pushing a model to the Hub:**
-
- .. code::
- from braindecode.models import PBT
-
- # Train your model
- model = PBT(n_chans=22, n_outputs=4, n_times=1000)
- # ... training code ...
-
- # Push to the Hub
- model.push_to_hub(
- repo_id="username/my-pbt-model",
- commit_message="Initial model upload",
- )
-
- **Loading a model from the Hub:**
-
- .. code::
- from braindecode.models import PBT
-
- # Load pretrained model
- model = PBT.from_pretrained("username/my-pbt-model")
-
- # Load with a different number of outputs (head is rebuilt automatically)
- model = PBT.from_pretrained("username/my-pbt-model", n_outputs=4)
-
- **Extracting features and replacing the head:**
-
- .. code::
- import torch
-
- x = torch.randn(1, model.n_chans, model.n_times)
- # Extract encoder features (consistent dict across all models)
- out = model(x, return_features=True)
- features = out["features"]
-
- # Replace the classification head
- model.reset_head(n_outputs=10)
-
- **Saving and restoring full configuration:**
-
- .. code::
- import json
-
- config = model.get_config() # all __init__ params
- with open("config.json", "w") as f:
- json.dump(config, f)
-
- model2 = PBT.from_config(config) # reconstruct (no weights)
-
- All model parameters (both EEG-specific and model-specific such as
- dropout rates, activation functions, number of filters) are automatically
- saved to the Hub and restored when loading.
-
- See :ref:`load-pretrained-models` for a complete tutorial.</main>
- </div>
 
  ## Citation

- Please cite both the original paper for this architecture (see the
- *References* section above) and braindecode:

  ```bibtex
  @article{aristimunha2025braindecode,

  - neuroscience
  - braindecode
  - foundation-model
  - transformer
  ---

  # PBT

+ Patched Brain Transformer (PBT) model from Klein et al (2025) [pbt].

+ > **Architecture-only repository.** Documents the
  > `braindecode.models.PBT` class. **No pretrained weights are
+ > distributed here.** Instantiate the model and train it on your own
+ > data.

  ## Quick start

@@ -40,257 +38,45 @@ model = PBT(
  )
  ```

+ The signal-shape arguments above are illustrative defaults — adjust to
+ match your recording.

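Since the quick-start snippet is truncated in this view, here is a hedged end-to-end sketch; the shape arguments (`n_chans=22`, `n_outputs=4`, `n_times=1000`) mirror the Hub examples elsewhere in this diff and are illustrative values, not canonical defaults:

```python
import torch
from braindecode.models import PBT

# Illustrative signal shape: 22 channels, 1000 samples per trial, 4 classes.
model = PBT(n_chans=22, n_outputs=4, n_times=1000)

x = torch.randn(8, 22, 1000)   # (batch, n_chans, n_times)
logits = model(x)
print(logits.shape)            # expected: torch.Size([8, 4])
```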
  ## Documentation
+ - Full API reference: <https://braindecode.org/stable/generated/braindecode.models.PBT.html>
+ - Interactive browser (live instantiation, parameter counts):
  <https://huggingface.co/spaces/braindecode/model-explorer>
  - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/patchedtransformer.py#L17>

+ ## Architecture

+ ![PBT architecture](https://raw.githubusercontent.com/timonkl/PatchedBrainTransformer/refs/heads/main/PBT_sketch.png)

+ ## Parameters

+ | Parameter | Type | Description |
+ |---|---|---|
+ | `d_input` | int, optional | Size (in samples) of each patch (token) extracted along the time axis. |
+ | `embed_dim` | int, optional | Transformer embedding dimensionality. |
+ | `num_layers` | int, optional | Number of Transformer encoder layers. |
+ | `num_heads` | int, optional | Number of attention heads. |
+ | `drop_prob` | float, optional | Dropout probability used in Transformer components. |
+ | `learnable_cls` | bool, optional | Whether the classification token is learnable. |
+ | `bias_transformer` | bool, optional | Whether to use bias in Transformer linear layers. |
+ | `activation` | nn.Module, optional | Activation function class to use in Transformer feed-forward layers. |

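For reference, a sketch showing how the keyword arguments from the table above might be passed explicitly; every value below is an assumption chosen for illustration, not a documented default:

```python
from torch import nn
from braindecode.models import PBT

model = PBT(
    n_chans=22,           # signal-shape arguments (illustrative)
    n_outputs=4,
    n_times=1000,
    d_input=64,           # architecture parameters from the table (illustrative values)
    embed_dim=128,
    num_layers=4,
    num_heads=8,
    drop_prob=0.1,
    learnable_cls=True,
    bias_transformer=False,
    activation=nn.GELU,
)
```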
+ ## References

+ 1. Klein, T., Minakowski, P., & Sager, S. (2025). Flexible Patched Brain Transformer model for EEG decoding. Scientific Reports, 15(1), 1-12. https://www.nature.com/articles/s41598-025-86294-3
+ 2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
+ 3. Krell, M. M., Kosec, M., Perez, S. P., & Fitzgibbon, A. (2021). Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. arXiv preprint arXiv:2107.02027.

  ## Citation

+ Cite the original architecture paper (see *References* above) and braindecode:

  ```bibtex
  @article{aristimunha2025braindecode,