lhallee committed
Commit d705ed7 · verified · 1 Parent(s): 5ca2b37

Upload README.md with huggingface_hub

Files changed (1): README.md (+6 −1)

README.md CHANGED
@@ -10,6 +10,11 @@ The GitHub with the implementation and requirements.txt can be found [here](http
 [ESM++](https://github.com/Synthyra/ESMplusplus) is a faithful implementation of [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) ([license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement)) that allows for batching and standard Huggingface compatibility without requiring the ESM Python package.
 The small version corresponds to the 300 million parameter version of ESMC.
 
+## Attention backend defaults
+Flex Attention with a block mask that ignores pad tokens is the default attention backend. If Flex Attention is unavailable, ESM++ falls back to native PyTorch attention.
+
+For throughput and memory efficiency, `torch.compile(...)` is heavily recommended, especially when using Flex Attention.
+
 
 ## Use with 🤗 transformers
 ```python
@@ -119,7 +124,7 @@ For a more thorough example of fine-tuning, check out our example script [here]
 
 
 ## Returning attention maps
-Usually F.scaled_dot_product_attention is used for the attention calculations, which is much faster than native PyTorch. However, it cannot return attention maps.
+Flex Attention with a pad-token block mask is used by default for attention calculations, and native PyTorch attention is the fallback. Optimized attention paths do not return attention maps directly.
 ESM++ has the option to `output_attentions`, which will calculate attention manually. This is much slower, so do not use it unless you need the attention maps.
 
 ```python
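The `torch.compile` recommendation in the added section amounts to wrapping the loaded model in a single call. A minimal sketch with a hypothetical stand-in module (`TinyEncoder` is not part of ESM++); `backend="eager"` keeps the sketch portable, while the default inductor backend is what actually delivers the speedups:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the loaded ESM++ model; only the
# compile pattern is the point here.
class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    def forward(self, x):
        return self.proj(x)

model = TinyEncoder().eval()
# Same call you would make on the real model; drop backend="eager"
# to use the default inductor backend for real speedups.
compiled = torch.compile(model, backend="eager")

x = torch.randn(2, 4, 8)
with torch.no_grad():
    y_eager = model(x)
    y_compiled = compiled(x)

print(torch.allclose(y_eager, y_compiled))  # True
```

The compiled module is a drop-in replacement for the original: same inputs, same outputs, with the first call paying the compilation cost.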
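The tradeoff behind `output_attentions` can be illustrated outside ESM++ itself. A minimal sketch (not ESM++ internals): the manual attention path materializes the attention map as an explicit tensor, while the fused `F.scaled_dot_product_attention` kernel returns only the output and never exposes the map:

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 2, 5, 8)
k = torch.randn(1, 2, 5, 8)
v = torch.randn(1, 2, 5, 8)

# Manual path: the attention map is an explicit intermediate tensor.
scale = q.shape[-1] ** -0.5
attn_map = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
manual_out = attn_map @ v

# Fused path: faster and more memory-efficient, but the map is
# never materialized, so it cannot be returned.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(attn_map.shape)                                    # torch.Size([1, 2, 5, 5])
print(torch.allclose(manual_out, fused_out, atol=1e-5))  # True
```

Both paths produce the same output; only the manual path also yields the (seq_len × seq_len) map per head, which is why requesting attention maps forces the slower computation.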