bruAristimunha committed on
Commit a9acad8 · verified · 1 Parent(s): 62bf293

Replace with clean markdown card

Files changed (1):
  1. README.md +38 -475

README.md CHANGED
@@ -14,13 +14,12 @@ tags:

# EEGPT

- EEGPT: Pretrained Transformer for Universal and Reliable Representation of EEG Signals from Wang et al. (2024).
-
- > **Architecture-only repository.** This repo documents the
> `braindecode.models.EEGPT` class. **No pretrained weights are
- > distributed here** instantiate the model and train it on your own
- > data, or fine-tune from a published foundation-model checkpoint
- > separately.

## Quick start

@@ -39,486 +38,50 @@ model = EEGPT(
)
```

- The signal-shape arguments above are example defaults — adjust them
- to match your recording.

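For concreteness, a minimal instantiation-and-forward-pass sketch; the channel count, window length, and class count below are placeholder values, and the keywords assume braindecode's usual `n_chans`/`n_times`/`n_outputs` interface (depending on your version, `chs_info` may also be needed for channel mapping):

```python
# Placeholder shapes -- match them to your recording.
import torch

from braindecode.models import EEGPT

model = EEGPT(n_chans=58, n_times=1024, n_outputs=2)  # example values only

x = torch.randn(8, 58, 1024)  # (batch, n_chans, n_times)
logits = model(x)             # expected: (8, 2) classifier output
```
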
## Documentation

- - Full API reference (parameters, references, architecture figure):
-   <https://braindecode.org/stable/generated/braindecode.models.EEGPT.html>
- - Interactive browser with live instantiation:
    <https://huggingface.co/spaces/braindecode/model-explorer>
  - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/eegpt.py#L21>

- ## Architecture description
-
- The block below is the rendered class docstring (parameters,
- references, architecture figure where available).
-
- EEGPT: Pretrained Transformer for Universal and Reliable Representation of EEG Signals from Wang et al. (2024) [eegpt].
-
- *Foundation Model · Attention/Transformer*
-
- ![EEGPT Architecture](https://github.com/BINE022/EEGPT/raw/main/figures/EEGPT.jpg)
-
- *a) The EEGPT structure patches the input EEG signal into $p_{i,j}$ and masks it (50% of time patches and 80% of channel patches), creating a masked part $\mathcal{M}$ and an unmasked part $\bar{\mathcal{M}}$. b) Local spatio-temporal embedding maps patches to tokens. c) Dual self-supervised learning combines Spatio-Temporal Representation Alignment with Mask-based Reconstruction.*
-
- **EEGPT** is a pretrained transformer model designed for universal EEG feature extraction. It addresses challenges such as low SNR and inter-subject variability by employing a dual self-supervised learning method that combines **Spatio-Temporal Representation Alignment** and **Mask-based Reconstruction** [eegpt].
-
- **Model Overview (Layer-by-layer)**
-
- 1. **Patch embedding** (`_PatchEmbed` or `_PatchNormEmbed`): split each channel into `patch_size` time patches and project to `embed_dim`, yielding tokens with shape `(batch, n_patches, n_chans, embed_dim)`.
- 2. **Channel embedding** (`chan_embed`): add a learned embedding for each channel to preserve spatial identity before attention (steps 1 and 2 are sketched in code after this list).
- 3. **Transformer encoder blocks** (`_EEGTransformer.blocks`): for each patch group, append `embed_num` learned summary tokens and process the sequence with multi-head self-attention and MLP layers.
- 4. **Summary extraction**: keep only the summary tokens, apply `norm` if set, and reshape back to `(batch, n_patches, embed_num, embed_dim)`.
- 5. **Task head** (`final_layer`): flatten summary tokens across patches and map to `n_outputs`; if `return_encoder_output=True`, return the encoder features instead.
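Steps 1 and 2 can be illustrated with plain tensor operations. This is a shape sketch under the docstring defaults (`patch_size=64`, `embed_dim=512`) with placeholder input sizes, not braindecode's internal implementation:

```python
import torch

batch, n_chans, n_times = 8, 58, 1024   # placeholder input sizes
patch_size, embed_dim = 64, 512         # docstring defaults

x = torch.randn(batch, n_chans, n_times)

# Step 1 -- patch embedding: cut each channel into non-overlapping
# patch_size windows, then project every window to embed_dim.
patches = x.unfold(-1, patch_size, patch_size)      # (8, 58, 16, 64)
proj = torch.nn.Linear(patch_size, embed_dim)
tokens = proj(patches).permute(0, 2, 1, 3)          # (8, 16, 58, 512)

# Step 2 -- channel embedding: add one learned vector per channel so
# spatial identity survives the attention layers.
chan_embed = torch.nn.Parameter(torch.zeros(n_chans, embed_dim))
tokens = tokens + chan_embed                        # broadcast over patches
print(tokens.shape)  # (batch, n_patches, n_chans, embed_dim) = (8, 16, 58, 512)
```
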
-
- **Dual Self-Supervised Learning**
-
- EEGPT moves beyond simple masked reconstruction by introducing a representation alignment objective. The pretraining loss $\mathcal{L}$ is the sum of the alignment loss $\mathcal{L}_A$ and the reconstruction loss $\mathcal{L}_R$ (a code sketch of the combined objective follows this list):
-
- $$\mathcal{L} = \mathcal{L}_A + \mathcal{L}_R$$
-
- 1. **Spatio-Temporal Representation Alignment** ($\mathcal{L}_A$): aligns the predicted features of masked regions with global features extracted by a momentum encoder. This forces the model to learn semantic, high-level representations rather than just signal waveform details.
-
-    $$\mathcal{L}_A = \frac{1}{N} \sum_{j=1}^{N} \left\lVert \mathrm{pred}_j - \mathrm{LN}(\mathrm{menc}_j) \right\rVert_2^2$$
-
-    where $\mathrm{pred}_j$ is the predictor output and $\mathrm{menc}_j$ is the momentum encoder output.
-
- 2. **Mask-based Reconstruction** ($\mathcal{L}_R$): the standard masked-autoencoder objective of reconstructing the raw EEG patches, which ensures local temporal fidelity.
-
-    $$\mathcal{L}_R = \frac{1}{\lvert \mathcal{M} \rvert} \sum_{(i,j) \in \mathcal{M}} \left\lVert \mathrm{rec}_{i,j} - \mathrm{LN}(p_{i,j}) \right\rVert_2^2$$
-
-    where $\mathrm{rec}_{i,j}$ is the reconstructed patch and $p_{i,j}$ is the original patch.
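A sketch of the combined objective, assuming both terms are computed as mean squared errors over tensors already gathered at the masked positions (which matches the formulas up to constant scaling) and that the momentum-encoder target is treated with stop-gradient, as is common practice:

```python
import torch.nn.functional as F
from torch import Tensor, nn

def eegpt_pretrain_loss(pred: Tensor, menc: Tensor,
                        rec: Tensor, patches: Tensor,
                        ln: nn.LayerNorm) -> Tensor:
    """L = L_A + L_R over features/patches gathered at masked positions."""
    # L_A: match the predictor output to layer-normalised momentum-encoder
    # features; detach() stops gradients flowing into the target branch.
    l_align = F.mse_loss(pred, ln(menc).detach())
    # L_R: reconstruct the layer-normalised raw patches.
    l_recon = F.mse_loss(rec, ln(patches))
    return l_align + l_recon
```
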
-
- **Macro Components**
-
- - `EEGPT.target_encoder` **(Universal Encoder)**
-   - *Operations.* A hierarchical backbone consisting of **Local Spatio-Temporal Embedding** followed by a standard Transformer encoder [eegpt].
-   - *Role.* Maps raw spatio-temporal EEG patches into a sequence of latent tokens $z$.
- - `EEGPT.chans_id` **(Channel Identification)**
-   - *Operations.* A buffer containing channel indices mapped from the standard channel names provided in `chs_info` [eegpt].
-   - *Role.* Provides the spatial identity of each input channel, letting the model look up the correct channel embedding vector $\varsigma_i$.
- - **Local Spatio-Temporal Embedding** (Input Processing)
-   - *Operations.* The input signal $X$ is chunked into patches $p_{i,j}$; each patch is linearly projected and summed with its channel embedding: $\mathrm{token}_{i,j} = \mathrm{Embed}(p_{i,j}) + \varsigma_i$ [eegpt].
-   - *Role.* Converts the 2D EEG grid (channels × time) into a unified sequence of tokens that preserves both channel identity and temporal order.
-
- **How the information is encoded temporally, spatially, and spectrally**
-
- - **Temporal.** The model segments continuous EEG signals into small, non-overlapping patches (e.g., 250 ms windows with `patch_size=64`, i.e., 64 samples at a 256 Hz sampling rate) [eegpt]. This **patching** mechanism captures short-term local temporal structure, while the subsequent Transformer encoder captures long-range temporal dependencies across the entire window.
- - **Spatial.** Unlike convolutional models that may rely on a fixed spatial order, EEGPT uses **channel embeddings** $\varsigma_i$ [eegpt]. Each channel's data is treated as a distinct sequence of tokens tagged with its spatial identity, which lets the model flexibly handle different montages and missing channels by simply mapping channel names to their corresponding learnable embeddings (see the lookup sketch after this list).
- - **Spectral.** Spectral information is learned implicitly through the **Mask-based Reconstruction** objective ($\mathcal{L}_R$) [eegpt]. By forcing the model to reconstruct raw waveforms (including phase and amplitude) from masked inputs, it learns to encode the frequency-specific patterns needed for reconstruction; the alignment objective ($\mathcal{L}_A$) refines this by encouraging those spectral features to align with robust, high-level semantic representations.
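The montage flexibility described under *Spatial* amounts to a name-to-index lookup in front of the embedding table; the montage list and sizes below are illustrative, not the exact internals behind `chans_id`:

```python
import torch

# Hypothetical standard montage order fixed at pre-training time.
standard_montage = ["FP1", "FP2", "F7", "F3", "FZ", "F4", "F8", "C3", "CZ", "C4"]

# A downstream recording with fewer channels, in a different order.
recording_chs = ["C3", "CZ", "FP1"]

chans_id = torch.tensor([standard_montage.index(ch) for ch in recording_chs])
chan_embed = torch.nn.Embedding(len(standard_montage), 512)
identity_vectors = chan_embed(chans_id)  # (3, 512): one vector per present channel
```
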
-
- **Pretrained Weights**
-
- Weights are available on the [Hugging Face Hub](https://huggingface.co/braindecode/eegpt-pretrained).
-
- > **Important.** This model has pre-trained weights on the Hugging Face Hub
- > (<https://huggingface.co/braindecode/eegpt-pretrained>). Loading them, or
- > pushing your own trained model to the Hub, requires installing
- > `braindecode[hub]` for Hub integration.
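A loading sketch, assuming the Hub integration exposes a `from_pretrained` classmethod in the style of the `huggingface_hub` model mixins; check the braindecode documentation for the exact call:

```python
# Hypothetical API -- requires `pip install braindecode[hub]`.
from braindecode.models import EEGPT

model = EEGPT.from_pretrained("braindecode/eegpt-pretrained")
```
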
-
- **Usage**
-
- The model can be initialized for specific downstream tasks (e.g., classification) by specifying `n_outputs`, `chs_info`, and `n_times`.
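For example (the channel dictionaries below are a simplified stand-in for the MNE-style `chs_info` entries braindecode expects, which carry more fields in practice):

```python
from braindecode.models import EEGPT

# Simplified stand-in for MNE-style channel info.
chs_info = [{"ch_name": name} for name in ("C3", "CZ", "C4")]

model = EEGPT(n_outputs=4, n_times=1024, chs_info=chs_info)
```
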
-
- ### Parameters
-
- - `return_encoder_output` (bool, default=False): whether to return the encoder output or the classifier output.
- - `patch_size` (int, default=64): size of the patches for the transformer.
- - `patch_stride` (int, default=32): stride of the patches for the transformer.
- - `embed_num` (int, default=4): number of summary tokens used for the global representation.
- - `embed_dim` (int, default=512): dimension of the embeddings.
- - `depth` (int, default=8): number of transformer layers.
- - `num_heads` (int, default=8): number of attention heads.
- - `mlp_ratio` (float, default=4.0): ratio of the MLP hidden dimension to the embedding dimension.
- - `drop_prob` (float, default=0.0): dropout probability.
- - `attn_drop_rate` (float, default=0.0): attention dropout rate.
- - `drop_path_rate` (float, default=0.0): drop path rate.
- - `init_std` (float, default=0.02): standard deviation for weight initialization.
- - `qkv_bias` (bool, default=True): whether to use bias in the QKV projection.
- - `norm_layer` (torch.nn.Module, default=None): normalization layer; if None, defaults to `nn.LayerNorm` with epsilon `layer_norm_eps`.
- - `layer_norm_eps` (float, default=1e-6): epsilon value for the normalization layer.
-
- ### References
-
- [eegpt] Wang, G., Liu, W., He, Y., Xu, C., Ma, L., & Li, H. (2024). EEGPT: Pretrained transformer for universal and reliable representation of EEG signals. *Advances in Neural Information Processing Systems*, 37, 39249-39280. <https://proceedings.neurips.cc/paper_files/paper/2024/file/4540d267eeec4e5dbd9dae9448f0b739-Paper-Conference.pdf>
-
- ### Notes
-
- When loading pretrained weights from the original EEGPT checkpoint (e.g., for fine-tuning), you may encounter "unexpected keys" related to the `predictor` and `reconstructor` modules (e.g., `predictor.mask_token`, `reconstructor.time_embed`). These components are used only during the self-supervised pre-training phase (masked auto-encoding) and are not part of this encoder-only model used for downstream tasks. It is safe to ignore them, as in the sketch below.
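Concretely, load with `strict=False` and verify that the leftover keys are confined to those modules (the checkpoint path is illustrative, `model` is an `EEGPT` instance, and the file is assumed to hold a plain state dict):

```python
import torch

state = torch.load("eegpt_original_checkpoint.pt", map_location="cpu")
missing, unexpected = model.load_state_dict(state, strict=False)

# Only pre-training-only modules should show up as unexpected.
leftovers = [k for k in unexpected
             if not k.startswith(("predictor.", "reconstructor."))]
assert not leftovers, f"genuinely unexpected keys: {leftovers}"
```
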
-
- **Hugging Face Hub integration**
-
- When the optional `huggingface_hub` package is installed, all models automatically gain the ability to be pushed to and loaded from the Hugging Face Hub. Install with `pip install braindecode[hub]`.
-
- The integration covers pushing a model to the Hub, loading a model from the Hub, extracting features and replacing the head, and saving and restoring the full configuration: all model parameters (both EEG-specific and model-specific, such as dropout rates, activation functions, and number of filters) are automatically saved to the Hub and restored when loading. See the braindecode `load-pretrained-models` tutorial for a complete walk-through.
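A push/load round trip, again assuming mixin-style `push_to_hub`/`from_pretrained` methods and a hypothetical repository id; a Hugging Face access token is required for pushing:

```python
# Hypothetical repository id.
model.push_to_hub("your-username/eegpt-finetuned")

# Later, or on another machine:
from braindecode.models import EEGPT

restored = EEGPT.from_pretrained("your-username/eegpt-finetuned")
```
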

## Citation

- Please cite both the original paper for this architecture (see the
- *References* section above) and braindecode:

```bibtex
@article{aristimunha2025braindecode,

# EEGPT

+ EEGPT: Pretrained Transformer for Universal and Reliable Representation of EEG Signals from Wang et al. (2024) [eegpt].

+ > **Architecture-only repository.** Documents the
> `braindecode.models.EEGPT` class. **No pretrained weights are
+ > distributed here.** Instantiate the model and train it on your own
+ > data.

## Quick start

)
```

+ The signal-shape arguments above are illustrative defaults — adjust to
+ match your recording.

## Documentation
+ - Full API reference: <https://braindecode.org/stable/generated/braindecode.models.EEGPT.html>
+ - Interactive browser (live instantiation, parameter counts):
    <https://huggingface.co/spaces/braindecode/model-explorer>
  - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/eegpt.py#L21>

+ ## Architecture
+
+ ![EEGPT architecture](https://github.com/BINE022/EEGPT/raw/main/figures/EEGPT.jpg)
+
+ ## Parameters
+
+ | Parameter | Type | Description |
+ |---|---|---|
+ | `return_encoder_output` | bool, default=False | Whether to return the encoder output or the classifier output. |
+ | `patch_size` | int, default=64 | Size of the patches for the transformer. |
+ | `patch_stride` | int, default=32 | Stride of the patches for the transformer. |
+ | `embed_num` | int, default=4 | Number of summary tokens used for the global representation. |
+ | `embed_dim` | int, default=512 | Dimension of the embeddings. |
+ | `depth` | int, default=8 | Number of transformer layers. |
+ | `num_heads` | int, default=8 | Number of attention heads. |
+ | `mlp_ratio` | float, default=4.0 | Ratio of the MLP hidden dimension to the embedding dimension. |
+ | `drop_prob` | float, default=0.0 | Dropout probability. |
+ | `attn_drop_rate` | float, default=0.0 | Attention dropout rate. |
+ | `drop_path_rate` | float, default=0.0 | Drop path rate. |
+ | `init_std` | float, default=0.02 | Standard deviation for weight initialization. |
+ | `qkv_bias` | bool, default=True | Whether to use bias in the QKV projection. |
+ | `norm_layer` | torch.nn.Module, default=None | Normalization layer. If None, defaults to `nn.LayerNorm` with epsilon `layer_norm_eps`. |
+ | `layer_norm_eps` | float, default=1e-6 | Epsilon value for the normalization layer. |
+
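The defaults in the table can be overridden at construction time; the smaller variant below is an arbitrary example for quick experiments, not a published configuration:

```python
from braindecode.models import EEGPT

model = EEGPT(
    n_chans=32, n_times=512, n_outputs=2,  # placeholder signal shape
    embed_dim=256, depth=4, num_heads=4,   # shrink the transformer
    drop_prob=0.1, attn_drop_rate=0.1,     # add some regularisation
)
```
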
+ ## References
+
+ 1. Wang, G., Liu, W., He, Y., Xu, C., Ma, L., & Li, H. (2024). EEGPT: Pretrained transformer for universal and reliable representation of EEG signals. *Advances in Neural Information Processing Systems*, 37, 39249-39280. <https://proceedings.neurips.cc/paper_files/paper/2024/file/4540d267eeec4e5dbd9dae9448f0b739-Paper-Conference.pdf>

## Citation

+ Cite the original architecture paper (see *References* above) and braindecode:

```bibtex
@article{aristimunha2025braindecode,