from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
|
|
doc = Document()
|
|
style = doc.styles['Normal']
font = style.font
font.name = 'Arial'
font.size = Pt(11)
|
|
title = doc.add_heading('LLM Support for 7B Models (OLMo) in DeepChem', level=1)
title.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
|
p = doc.add_paragraph()
p.alignment = WD_ALIGN_PARAGRAPH.CENTER
run = p.add_run('V SaraVivek (Vivek Varikuti)')
run.bold = True
run.font.size = Pt(12)
p.add_run('\nvivekvarikuti22@gmail.com | github.com/vivekvar-dl | vivekvari.dev')
|
|
doc.add_paragraph('')
|
|
doc.add_heading('Introduction', level=2)
|
|
doc.add_paragraph(
    'DeepChem\'s HuggingFaceModel wrapper does a solid job with encoder models. '
    'ChemBERTa, MoLFormer, masked language modeling, classification, regression: all of that works. '
    'But there\'s a fundamental gap. It has no support for decoder-only causal language models. '
    'No GPT-style models, no text generation, nothing autoregressive.'
)
|
|
doc.add_paragraph(
    'This project adds OLMo-2 (Allen AI\'s open language model) to DeepChem. '
    'The goal isn\'t just to wrap another HuggingFace model, though. '
    'It\'s to make a causal LM genuinely useful for molecular science. '
    'That means continued pretraining on SMILES strings, finetuning for property prediction tasks, '
    'and, most importantly, generating new molecules.'
)
|
|
doc.add_paragraph(
    'Why does this matter for the community? ChemBERTa can tell you about a molecule, but it can\'t make new ones. '
    'It\'s an encoder; that\'s not what encoders do. A causal LM like OLMo can actually produce novel SMILES strings. '
    'For drug discovery and molecular design, that\'s a big deal. '
    'And because OLMo was pretrained on the Dolma corpus (which includes a large amount of scientific text), '
    'it already has some chemistry knowledge baked in before you even start finetuning on molecular data.'
)
|
|
doc.add_paragraph(
    'I picked OLMo specifically because it\'s fully open: weights, training data, code, all of it. '
    'No license restrictions like LLaMA. OLMo-2 is natively supported in HuggingFace transformers, so you '
    'don\'t need custom packages. And it has 1B and 7B variants, which is convenient for development versus production.'
)
|
|
doc.add_heading('What I already built', level=3)
|
|
doc.add_paragraph(
    'I cloned DeepChem and started building before writing this proposal. '
    'The first thing that happened: ChemBERTa wouldn\'t import. The module '
    'transformers.models.roberta.tokenization_roberta_fast got removed in transformers 5.x '
    'and nobody had caught it. I fixed it in PR #4913 and filed issue #4912 about the broader '
    'transformers 5.x compatibility problems. That\'s how I got into the codebase.'
)
|
|
doc.add_paragraph(
    'Then I started on the OLMo wrapper and hit an interesting problem right away: '
    'HuggingFace doesn\'t have OlmoForSequenceClassification. The class simply doesn\'t exist, '
    'so you can\'t do regression or classification with OLMo out of the box. '
    'I wrote one from scratch using last-token pooling (the same approach as LlamaForSequenceClassification). '
    'I also added causal_lm as a task type in HuggingFaceModel and built a generate() method.'
)
|
|
doc.add_paragraph(
    'I ran experiments on MoleculeNet to check that the implementation actually works:'
)
|
|
doc.add_paragraph(
    'BBBP (blood-brain barrier): ROC-AUC 0.67 with a tiny randomly initialized model, 200 samples, 3 epochs. '
    'Not impressive, but clearly above random chance, so the architecture learns.', style='List Bullet'
)
doc.add_paragraph(
    'ESOL (solubility): an R² of 0.37 under the same conditions.', style='List Bullet'
)
doc.add_paragraph(
    'SMILES generation: 0% valid molecules. Everything it generated was broken SMILES.', style='List Bullet'
)
|
|
doc.add_paragraph(
    'That 0% generation result is actually the most important finding. '
    'It tells you the wrapper works and training works, but without serious pretraining on '
    'a molecular corpus the model just outputs nonsense. That\'s the core problem this project needs to solve.'
)
|
|
doc.add_paragraph(
    'I also compared tokenization between OLMo and ChemBERTa on real drugs (aspirin, caffeine, penicillin, paclitaxel). '
    'OLMo uses fewer tokens per molecule overall (a ~100K vocabulary vs ChemBERTa\'s ~600) but it fragments chemical concepts: '
    '[C@@H], which is a single concept in chemistry (a stereocenter), gets split into 4 tokens. '
    'ChemBERTa learned better groupings, like (=O) for a carbonyl. '
    'This is something to address in the stretch goals.'
)
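
doc.add_paragraph(
    'For reference, that comparison boils down to a few lines with HuggingFace AutoTokenizer. '
    'The checkpoint names here are assumptions; substitute whichever OLMo-2 and ChemBERTa checkpoints are actually used:'
)
doc.add_paragraph(
    'from transformers import AutoTokenizer\n'
    '\n'
    "olmo_tok = AutoTokenizer.from_pretrained('allenai/OLMo-2-1124-7B')\n"
    "chem_tok = AutoTokenizer.from_pretrained('seyonec/ChemBERTa-zinc-base-v1')\n"
    '\n'
    "smiles = 'CC(=O)Oc1ccccc1C(=O)O'  # aspirin\n"
    'print(olmo_tok.tokenize(smiles))  # general-purpose BPE splits\n'
    'print(chem_tok.tokenize(smiles))  # chemistry-aware merges like (=O)',
    style='No Spacing'
)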
|
|
doc.add_paragraph('8 unit tests written, all passing in 27 seconds on CPU.')
|
|
doc.add_heading('Relevant Experience and Interest', level=2)
|
|
doc.add_paragraph(
    'I\'m Vivek, and I just finished my B.Tech in AI & ML at Usha Rama College of Engineering (2021-2025). '
    'I\'ve been working as an AI Engineer for the past year building production systems.'
)
|
|
doc.add_heading('Work experience', level=3)
|
|
doc.add_paragraph(
    'Right now I\'m working with the Andhra Pradesh Police on a government AI initiative (AI4AP). '
    'I built an AI legal compliance system for POCSO cases: a RAG pipeline over 1000+ legal documents, '
    'a FastAPI backend, and vector embeddings for citation-backed responses. It\'s a real production system that '
    'investigating officers use every day. This gave me solid experience with large-scale NLP, '
    'document processing, and building things that need to work reliably.'
)
|
|
doc.add_paragraph(
    'Before that I did an ML internship at GGS Information Services, where I worked on '
    '3D model compression (custom GANs, STEP file processing) and optimized inference pipelines '
    'using CUDA kernels and model quantization, achieving a 30% latency reduction and a 50% memory decrease. '
    'The CUDA and quantization experience is directly relevant to working with large language models.'
)
|
|
doc.add_heading('Relevant projects', level=3)
|
|
doc.add_paragraph(
    'Parameter Golf (OpenAI, March 2026): Train the best LM that fits in 16MB. '
    'I designed the transformer architecture from scratch: 11 layers, 512-dim hidden size, 8 heads. '
    'Custom SentencePiece tokenizers, GPTQ quantization, flash attention. '
    'This isn\'t tutorial-level work; I understand how transformers train from the ground up.'
)
|
|
doc.add_paragraph(
    'GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Implemented the GSPO algorithm from '
    'the Qwen team, reaching 60% accuracy on ZebraLogic and 75.8% on math, '
    'beating PPO and GRPO baselines.'
)
|
|
doc.add_paragraph(
    'Dial 112 AI: Built a production system for AP Police processing 1000+ emergency calls daily. '
    'Speech-to-text, sentiment analysis, priority classification, geospatial dispatch optimization.'
)
|
|
doc.add_paragraph(
    'Published a paper on pose-guided image generation (PPAG): '
    'progressive pose attention for identity-preserving synthesis, with a 92% identity preservation score.'
)
|
|
doc.add_heading('Open source', level=3)
|
|
doc.add_paragraph(
    'PRs to HuggingFace transformers, Unsloth (aarch64 support), the Anthropic SDK (streaming performance fix), '
    'the OpenAI SDK, and Karpathy\'s nanochat (NaN loss fix for SFT training).'
)
|
|
doc.add_paragraph(
    'DeepChem contributions: PR #4913 (ChemBERTa import fix for transformers 5.x), '
    'Issue #4912 (transformers compatibility report).'
)
|
|
doc.add_heading('Why this project', level=3)
|
|
doc.add_paragraph(
    '[FILL THIS IN YOURSELF: what got you interested in molecular ML? '
    'Why DeepChem specifically? Be honest here, mentors can tell when someone is just saying '
    'what they think you want to hear. Talk about what actually excites you.]'
)
|
|
p = doc.add_paragraph()
p.add_run('GitHub: ').bold = True
p.add_run('https://github.com/vivekvar-dl')
p = doc.add_paragraph()
p.add_run('LinkedIn: ').bold = True
p.add_run('https://linkedin.com/in/vivekvar')
p = doc.add_paragraph()
p.add_run('Bug fix PR: ').bold = True
p.add_run('https://github.com/deepchem/deepchem/pull/4913')
p = doc.add_paragraph()
p.add_run('Compat issue: ').bold = True
p.add_run('https://github.com/deepchem/deepchem/issues/4912')
|
|
doc.add_heading('Work Plan', level=2)
|
|
doc.add_paragraph(
    'The project breaks down into four components that build on each other: '
    'first the base class needs to learn about causal LMs, then the OLMo-specific pieces are built on top, '
    'then pretraining and benchmarks, then generation and documentation.'
)
|
|
doc.add_heading('Design and Pseudocode', level=3)
|
|
doc.add_paragraph().add_run('Component 1: HuggingFaceModel base class changes').bold = True
doc.add_paragraph(
    'The wrapper currently doesn\'t know causal LMs exist. This needs "causal_lm" added as a task type '
    'with DataCollatorForLanguageModeling(mlm=False), an AutoModelForCausalLM branch in '
    'load_from_pretrained(), a generate() method wrapping HF\'s model.generate() API, and '
    '_prepare_batch() handling where labels = input_ids for next-token prediction.'
)
doc.add_paragraph(
    'Note: PR #4907 from another contributor also adds a generate() method. '
    'My work is different in scope: building a complete model wrapper, not just generation. '
    'I will coordinate with them to avoid overlap.'
)
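
doc.add_paragraph(
    'A minimal sketch of the causal_lm branch in _prepare_batch(), assuming the usual '
    '(inputs, labels, weights) return contract of the existing wrapper; exact names may shift during review:'
)
doc.add_paragraph(
    "tokens = self.tokenizer(smiles_batch, padding=True,\n"
    "                        truncation=True, return_tensors='pt')\n"
    "inputs = {'input_ids': tokens['input_ids'],\n"
    "          'attention_mask': tokens['attention_mask'],\n"
    "          # next-token prediction: the model shifts labels internally,\n"
    "          # so labels are just a copy of input_ids\n"
    "          'labels': tokens['input_ids'].clone()}\n"
    "return inputs, [], []",
    style='No Spacing'
)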
|
|
doc.add_paragraph().add_run('Component 2: Olmo2ForSequenceClassification').bold = True
doc.add_paragraph(
    'This class doesn\'t exist in HuggingFace; I already built it. '
    'It takes the Olmo2Model output, grabs the hidden state of the last non-padded token (last-token pooling), '
    'and projects it through a linear layer. Loss computation depends on problem_type: '
    'MSELoss for regression, CrossEntropyLoss for single-label, BCEWithLogitsLoss for multi-label. '
    'It is about 100 lines and follows the same pattern as LlamaForSequenceClassification.'
)
|
|
doc.add_paragraph('Forward pass pseudocode:')
doc.add_paragraph(
    'hidden = base_model(input_ids, attention_mask)\n'
    'seq_lengths = (input_ids != pad_id).sum(-1) - 1\n'
    'pooled = hidden[range(batch_size), seq_lengths]\n'
    'logits = linear_head(pooled)\n'
    'loss = compute_loss(logits, labels, problem_type)',
    style='No Spacing'
)
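
doc.add_paragraph(
    'The compute_loss step follows the standard HuggingFace problem_type convention '
    '(a sketch with torch.nn losses; shape handling is settled in the actual class):'
)
doc.add_paragraph(
    'if problem_type == "regression":\n'
    '    loss = nn.MSELoss()(logits.squeeze(-1), labels.float())\n'
    'elif problem_type == "single_label_classification":\n'
    '    loss = nn.CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))\n'
    'else:  # multi_label_classification\n'
    '    loss = nn.BCEWithLogitsLoss()(logits, labels.float())',
    style='No Spacing'
)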
|
|
doc.add_paragraph().add_run('Component 3: OLMo wrapper class').bold = True
doc.add_paragraph(
    'User-facing class extending HuggingFaceModel. Same pattern as ChemBERTa and MoLFormer.'
)
doc.add_paragraph(
    'OLMo.__init__(task, model_name, n_tasks, config)\n'
    ' task: causal_lm | regression | classification | mtr\n'
    ' Loads tokenizer, sets pad_token = eos_token\n'
    ' Syncs vocab_size with tokenizer\n'
    ' Creates right model class based on task\n\n'
    'OLMo._prepare_batch(batch)\n'
    ' causal_lm: labels = input_ids clone\n'
    ' regression: float labels\n'
    ' classification: long (single) or float (multi)',
    style='No Spacing'
)
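
doc.add_paragraph(
    'From the user side, the intended workflow would look roughly like this. The checkpoint name and the '
    'load_from_pretrained() signature are assumptions modeled on the existing HuggingFaceModel wrappers:'
)
doc.add_paragraph(
    'from deepchem.models import OLMo\n'
    '\n'
    '# continued pretraining with the causal LM objective on a SMILES dataset\n'
    "model = OLMo(task='causal_lm', model_name='allenai/OLMo-2-0425-1B', model_dir='olmo_smiles')\n"
    'model.fit(smiles_dataset, nb_epoch=1)\n'
    '\n'
    '# sample candidate molecules from a seed fragment\n'
    "candidates = model.generate(['CC(=O)O'], max_new_tokens=64)\n"
    '\n'
    '# reuse the pretrained weights for property prediction\n'
    "regressor = OLMo(task='regression', model_name='allenai/OLMo-2-0425-1B', n_tasks=1)\n"
    "regressor.load_from_pretrained(model_dir='olmo_smiles')\n"
    'regressor.fit(train_dataset, nb_epoch=10)',
    style='No Spacing'
)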
|
|
doc.add_paragraph().add_run('Component 4: Tokenization strategy').bold = True
doc.add_paragraph(
    'Phase 1: use OLMo\'s tokenizer directly on SMILES. This works: '
    'my experiments show 0.67 AUC and an R² of 0.37 from random init, which means learning is happening. '
    'It is not perfect with stereocenters, but it is functional.'
)
doc.add_paragraph(
    'Stretch goal: extend the vocabulary with chemistry tokens such as [C@@H], [nH], (=O), and aromatic ring patterns, '
    'or retrain BPE on a mixed English + SMILES corpus. This could help generation quality a lot.'
)
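
doc.add_paragraph(
    'The lighter-weight version of that stretch goal only needs the standard HuggingFace '
    'vocabulary-extension calls (a sketch; the real token list would come from frequency analysis of the corpus):'
)
doc.add_paragraph(
    "chem_tokens = ['[C@@H]', '[C@H]', '[nH]', '(=O)']\n"
    'num_added = tokenizer.add_tokens(chem_tokens)\n'
    '# grow the embedding matrix so the new token ids have rows\n'
    'model.resize_token_embeddings(len(tokenizer))',
    style='No Spacing'
)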
|
|
doc.add_heading('Testing Plan', level=3)
|
|
doc.add_paragraph('I already have 8 unit tests written and passing:')
|
|
tests = [
    ('test_olmo_causal_lm_pretraining', 'Trains with causal LM objective, loss is positive'),
    ('test_olmo_regression_finetuning', 'Regression training, predictions have correct shape, MAE score works'),
    ('test_olmo_classification', 'Binary classification on random labels'),
    ('test_olmo_multitask_regression', '2-task regression, output shape matches'),
    ('test_olmo_save_and_restore', 'Save checkpoint, load into new model, verify all weights match'),
    ('test_olmo_load_from_pretrained', 'Pretrain causal LM then load weights into regression model'),
    ('test_olmo_generate', 'Single string and batch generation, returns valid strings'),
    ('test_olmo_invalid_task', 'Bad task name raises ValueError'),
]
for name, desc in tests:
    doc.add_paragraph(f'{name}: {desc}', style='List Bullet')
|
|
doc.add_paragraph(
    'All tests use a tiny config (64 hidden, 2 layers, 2 heads). No model downloads. '
    'Runs in 27 seconds on CPU.'
)
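
doc.add_paragraph(
    'As an illustration of the tiny-config pattern the tests use (fixture and argument names are '
    'indicative, not final):'
)
doc.add_paragraph(
    'def test_olmo_generate(tmp_path):\n'
    "    config = {'hidden_size': 64, 'num_hidden_layers': 2, 'num_attention_heads': 2}\n"
    "    model = OLMo(task='causal_lm', config=config, model_dir=str(tmp_path))\n"
    "    outputs = model.generate(['CCO'], max_new_tokens=8)\n"
    '    assert isinstance(outputs, list)\n'
    '    assert all(isinstance(o, str) for o in outputs)',
    style='No Spacing'
)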
|
|
doc.add_paragraph('Integration tests to add during GSoC:')
doc.add_paragraph('MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity', style='List Bullet')
doc.add_paragraph('SMILES generation validity checking with RDKit (sketched below)', style='List Bullet')
doc.add_paragraph('Pretraining convergence curves on ZINC subsets', style='List Bullet')
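
doc.add_paragraph(
    'Validity checking itself is straightforward with RDKit (generated_smiles stands in for whatever '
    'the generation test produces):'
)
doc.add_paragraph(
    'from rdkit import Chem\n'
    '\n'
    'valid = [s for s in generated_smiles if Chem.MolFromSmiles(s) is not None]\n'
    'validity_rate = len(valid) / len(generated_smiles)',
    style='No Spacing'
)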
|
|
doc.add_heading('Sources of Risk', level=3)
|
|
doc.add_paragraph().add_run('Generation quality:').bold = True
doc.add_paragraph(
    'This is the biggest one. 0% validity from random init is expected, but getting to '
    'something useful like 50%+ valid SMILES needs real pretraining on a molecular corpus. '
    'I\'m allocating 3 full weeks for pretraining experiments using ZINC-250K as training data. '
    'If ZINC isn\'t enough, the backup plan is PubChem subsets or combining multiple datasets.'
)
|
|
doc.add_paragraph().add_run('GPU memory for 7B model:').bold = True
doc.add_paragraph(
    'OLMo-7B needs roughly 14GB of VRAM just for inference (7B parameters at 16-bit precision, about 2 bytes each). '
    'CI tests use tiny configs, so no GPU is needed there. Demos use OLMo-1B. '
    '7B experiments may need a multi-GPU setup, which I will work out with my mentor.'
)
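
doc.add_paragraph(
    'If memory gets tight, loading in bfloat16 with automatic device placement is the first lever to pull '
    '(the exact OLMo-2 7B repo id shown here is an assumption):'
)
doc.add_paragraph(
    'import torch\n'
    'from transformers import AutoModelForCausalLM\n'
    '\n'
    'model = AutoModelForCausalLM.from_pretrained(\n'
    "    'allenai/OLMo-2-1124-7B',\n"
    '    torch_dtype=torch.bfloat16,   # roughly halves memory vs float32\n'
    "    device_map='auto')            # shards across available GPUs via accelerate",
    style='No Spacing'
)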
|
|
doc.add_paragraph().add_run('Olmo2ForSequenceClassification not upstream:').bold = True
doc.add_paragraph(
    'I wrote this class myself since HuggingFace doesn\'t have it. '
    'If they add one later, we can simply swap ours out. It follows their patterns exactly, so the risk is low.'
)
|
|
doc.add_paragraph().add_run('Transformers version compatibility:').bold = True
doc.add_paragraph(
    'I already found and fixed one such issue (PR #4913). '
    'I will use top-level imports everywhere going forward and test against both 4.x and 5.x.'
)
|
|
doc.add_heading('Milestones', level=3)
|
|
doc.add_paragraph().add_run('Milestone 1 (end of week 3): Core wrapper working').bold = True
doc.add_paragraph(
    '"from deepchem.models import OLMo" works. All task modes are functional. '
    'Tests passing. Base class changes merged.'
)
|
|
doc.add_paragraph().add_run('Milestone 2 (end of week 6): Pretraining pipeline done').bold = True
doc.add_paragraph(
    'Can load SMILES data, pretrain OLMo with causal LM objective, save checkpoints. '
    'Tutorial notebook showing how to pretrain on custom data. '
    'First generation results with validity numbers.'
)
|
|
doc.add_paragraph().add_run('Milestone 3 (end of week 12): Everything shipped').bold = True
doc.add_paragraph(
    'MoleculeNet benchmark results published. Generation tutorial with RDKit validation. '
    'Full documentation. All PRs merged.'
)
|
|
doc.add_heading('Timeline', level=3)
|
|
doc.add_paragraph('12 weeks, Medium size (175 hours):')
|
|
weeks = [
    ('Week 1', 'Set up dev environment, submit base class PR (causal_lm task support, ~40 lines). Start on generate() method.'),
    ('Week 2', 'Submit generate() PR (~50 lines). Start Olmo2ForSequenceClassification. Discuss size with mentor since it\'s ~100 lines.'),
    ('Week 3', 'Submit OLMo wrapper PR and unit test PR. Get through review. Target: all core PRs merged by end of week.'),
    ('Week 4', 'Start pretraining pipeline. Data loading for ZINC-250K. Figure out training hyperparameters on OLMo-1B.'),
    ('Week 5', 'Run pretraining experiments. Monitor convergence. First generation attempts, checking validity with RDKit.'),
    ('Week 6', 'Finish pretraining PR and tutorial notebook. Submit for review. Save best checkpoint.'),
    ('Week 7', 'Start finetuning experiments. BBBP and Tox21 classification with pretrained model.'),
    ('Week 8', 'ESOL, FreeSolv, Lipophilicity regression. Build benchmark comparison table vs ChemBERTa.'),
    ('Week 9', 'Submit benchmark PR and finetuning tutorials. Respond to review feedback.'),
    ('Week 10', 'SMILES generation experiments. Validity rate analysis. Different sampling strategies (temperature, top-k, nucleus).'),
    ('Week 11', 'Generation tutorial notebook. Documentation: numpydoc for all classes/methods, API reference updates.'),
    ('Week 12', 'Final review rounds. Clean up any open PRs. Stretch: tokenizer extension experiments if time allows.'),
]
for week, desc in weeks:
    p = doc.add_paragraph()
    p.add_run(f'{week}: ').bold = True
    p.add_run(desc)
|
|
doc.add_heading('Pull Requests', level=3)
|
|
doc.add_paragraph(
    'Following DeepChem\'s guidelines for new contributors, I will keep PRs small, especially at the start. '
    'Here\'s the planned breakdown:'
)
|
|
prs = [
    ('PR 1 (Week 1)', 'HuggingFaceModel causal_lm support', '~40 lines'),
    ('PR 2 (Week 2)', 'generate() method', '~50 lines'),
    ('PR 3 (Week 2-3)', 'Olmo2ForSequenceClassification', '~100 lines (will discuss with mentor)'),
    ('PR 4 (Week 3)', 'OLMo wrapper class', '~80 lines'),
    ('PR 5 (Week 3)', 'Unit tests', '~180 lines'),
    ('PR 6 (Week 6)', 'Pretraining pipeline + data utils', 'TBD'),
    ('PR 7 (Week 6)', 'Pretraining tutorial notebook', 'notebook'),
    ('PR 8 (Week 9)', 'Finetuning tutorials (classification + regression)', 'notebooks'),
    ('PR 9 (Week 9)', 'Benchmark results', 'TBD'),
    ('PR 10 (Week 11)', 'Generation tutorial', 'notebook'),
    ('PR 11 (Week 12)', 'Documentation updates', 'TBD'),
]
for pr, desc, size in prs:
    p = doc.add_paragraph()
    p.add_run(f'{pr}: ').bold = True
    p.add_run(f'{desc} ({size})')
|
|
doc.add_paragraph(
    'Each PR goes through review at office hours, and the bigger ones might need 2-3 rounds. '
    'I have buffer built into the timeline for this; I am not assuming everything merges on the first try.'
)
|
|
doc.add_heading('Community', level=2)
|
|
doc.add_paragraph('What I\'ve done so far:')
doc.add_paragraph('PR #4913: fixed the ChemBERTa import crash for transformers 5.x', style='List Bullet')
doc.add_paragraph('Issue #4912: reported broader transformers 5.x compatibility problems', style='List Bullet')
doc.add_paragraph('Built and tested the OLMo prototype locally against DeepChem\'s codebase', style='List Bullet')
|
|
doc.add_paragraph(
    'I can commit to attending at least 2 office hour sessions per week (MWF 9am PST). '
    'I will also be active on Discord for async discussion and will post weekly progress updates.'
)
|
|
doc.add_paragraph(
    '[NOTE: mention which mentors you have talked to once you connect on Discord. '
    'Riya and Harindhar are listed as mentors for this project.]'
)
|
|
doc.add_heading('Resources Required', level=2)
|
|
doc.add_paragraph(
    'I have GPU access through Azure, which should handle OLMo-1B training and OLMo-7B inference. '
    'For CI and unit tests, everything runs on CPU with tiny model configs, so no special compute is needed there.'
)
|
|
doc.add_paragraph(
    'For full OLMo-7B training we might need a multi-GPU setup. '
    'It would be good to discuss with the mentor what compute DeepChem can provide, '
    'or whether Colab Pro or cloud credits would work. '
    'The pretraining experiments on OLMo-1B should be doable on my current setup.'
)
|
|
doc.add_heading('References', level=2)
|
|
refs = [
    'Groeneveld et al. (2024). OLMo: Accelerating the Science of Language Models. arXiv:2402.00838',
    'Chithrananda et al. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv:2010.09885',
    'Ross et al. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. Nature Machine Intelligence',
    'Weininger (1988). SMILES, a chemical language and information system. J Chem Inf Comput Sci',
    'Wu et al. (2018). MoleculeNet: A Benchmark for Molecular Machine Learning. Chemical Science',
]
for i, ref in enumerate(refs, 1):
    doc.add_paragraph(f'{i}. {ref}')
|
|
output_path = '/home/azureuser/GSoC_2026_Proposal_Vivek_OLMo.docx'
doc.save(output_path)
print(f'Done: {output_path}')
|
|