---
library_name: transformers
tags:
- chemistry
- molecule
license: mit
---

# Model Card for Roberta Zinc Compression Encoder

### Model Description

`roberta_zinc_compression_encoder` contains several MLP-style compression heads trained to compress
molecule embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m)
model from its native dimension of 768 to smaller dimensions: 512, 256, 128, 64, and 32.

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.
```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel

transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")

roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

compression_encoder = AutoModel.from_pretrained("entropy/roberta_zinc_compression_encoder",
                                                trust_remote_code=True)
# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([5, 768])

compressed_embeddings = compression_encoder.compress(embeddings.cpu(),
                                                     compression_sizes=[32, 64, 128, 256, 512])

for k, v in compressed_embeddings.items():
    print(k, v.shape)

# 32 torch.Size([5, 32])
# 64 torch.Size([5, 64])
# 128 torch.Size([5, 128])
# 256 torch.Size([5, 256])
# 512 torch.Size([5, 512])
```
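
Since the compression heads are trained to preserve pairwise similarity structure, the compressed embeddings can substitute for the full 768-dimensional embeddings in similarity search. Below is a minimal sketch (continuing from the example above; names are illustrative) that checks how well each compressed size preserves the cosine similarity structure of the full embeddings:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(x):
    # pairwise cosine similarity via normalized matrix product
    x = F.normalize(x, dim=-1)
    return x @ x.T

full_sims = cosine_similarity_matrix(embeddings.cpu())

for size, comp in compressed_embeddings.items():
    comp_sims = cosine_similarity_matrix(comp)
    # correlation between the flattened similarity matrices
    agreement = torch.corrcoef(torch.stack([full_sims.flatten(), comp_sims.flatten()]))[0, 1]
    print(size, round(agreement.item(), 3))
```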

### Training Procedure

#### Preprocessing

A dataset of 30M SMILES strings was assembled from the [ZINC Database](https://zinc.docking.org/)
and the [Enamine](https://enamine.net/) REAL space. SMILES were canonicalized and embedded with the
[roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model.
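
For reference, a minimal canonicalization sketch using RDKit (RDKit is not a dependency of this model and the exact preprocessing code is not part of this card; this is only an illustration):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    # parse the SMILES and re-emit it in RDKit's canonical form; None if parsing fails
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

print(canonicalize("C1=CC=CC=C1Br"))  # canonical form of bromobenzene
```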

#### Training Hyperparameters

The model was trained for 1 epoch with a learning rate of 1e-3, cosine scheduling, weight decay of 0.01
and 10% warmup.
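
A minimal sketch of an equivalent setup (the optimizer is not named in this card; AdamW, the stand-in module, and the step count below are assumptions for illustration):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# stand-in module; the real compression heads live in this repository
model = torch.nn.Linear(768, 512)

# from the card: lr 1e-3, weight decay 0.01, cosine schedule, 10% warmup, 1 epoch
num_training_steps = 10_000  # illustrative; depends on dataset and batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)
```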

#### Training Loss

For training, the input batch of embeddings is compressed to all compression sizes via
the encoder layers, then reconstructed via the decoder layers.

For the encoder, we compute the pairwise similarities of the compressed embeddings and
compare them to the pairwise similarities of the input embeddings using row-wise Pearson correlation.

For the decoder, we compute the cosine similarity of the reconstructed embeddings to the inputs.
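
A minimal sketch of these two objectives (illustrative only; cosine similarity is assumed for the pairwise similarity matrices, and the exact loss weighting is not specified in this card):

```python
import torch
import torch.nn.functional as F

def row_pearson(a, b):
    # row-wise Pearson correlation between two (batch, batch) similarity matrices
    a = a - a.mean(dim=1, keepdim=True)
    b = b - b.mean(dim=1, keepdim=True)
    return F.cosine_similarity(a, b, dim=1)

def compression_losses(inputs, compressed, reconstructed):
    # encoder term: the compressed space should preserve the pairwise similarity
    # structure of the input embeddings (cosine similarity assumed here)
    input_sims = F.normalize(inputs, dim=-1) @ F.normalize(inputs, dim=-1).T
    comp_sims = F.normalize(compressed, dim=-1) @ F.normalize(compressed, dim=-1).T
    encoder_loss = 1 - row_pearson(input_sims, comp_sims).mean()

    # decoder term: reconstructed embeddings should match the direction of the inputs
    decoder_loss = 1 - F.cosine_similarity(reconstructed, inputs, dim=-1).mean()
    return encoder_loss, decoder_loss
```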

## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz