from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
|
|
doc = Document()
|
|
style = doc.styles['Normal']
font = style.font
font.name = 'Arial'
font.size = Pt(11)
|
|
title = doc.add_heading('LLM Support for 7B Models (OLMo) in DeepChem', level=1)
title.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
|
p = doc.add_paragraph()
p.alignment = WD_ALIGN_PARAGRAPH.CENTER
run = p.add_run('V SaraVivek (Vivek Varikuti)')
run.bold = True
run.font.size = Pt(12)
p.add_run('\nvivekvarikuti22@gmail.com | github.com/vivekvar-dl | vivekvari.dev')
|
|
doc.add_paragraph('')
|
|
doc.add_heading('Introduction', level=2)
|
|
doc.add_paragraph(
    'DeepChem\'s HuggingFaceModel wrapper does a solid job with encoder models. '
    'ChemBERTa, MoLFormer, masked language modeling, classification, regression: all of that works. '
    'But there\'s a fundamental gap. It has no support for decoder-only causal language models. '
    'No GPT-style models, no text generation, nothing autoregressive.'
)
|
|
doc.add_paragraph(
    'This project adds OLMo-2 (Allen AI\'s open language model) to DeepChem. '
    'The goal isn\'t just to wrap another HuggingFace model, though. '
    'It\'s to make a causal LM genuinely useful for molecular science. '
    'That means continued pretraining on SMILES strings, finetuning for property prediction tasks, '
    'and, most importantly, generating new molecules.'
)
|
|
doc.add_paragraph(
    'Why does this matter for the community? ChemBERTa can tell you about a molecule, but it can\'t make new ones. '
    'It\'s an encoder; that\'s not what encoders do. A causal LM like OLMo can actually produce novel SMILES strings. '
    'For drug discovery and molecular design, that\'s a big deal. '
    'And because OLMo was pretrained on the Dolma corpus (which includes a large amount of scientific text), '
    'it already has some chemistry knowledge baked in before you even start finetuning on molecular data.'
)
|
|
doc.add_paragraph(
    'I picked OLMo specifically because it\'s fully open: weights, training data, code, all of it. '
    'No license restrictions like LLaMA. OLMo-2 is natively supported in HuggingFace transformers, so you '
    'don\'t need custom packages. And it has 1B and 7B variants, which is convenient for development versus production.'
)
|
|
doc.add_heading('What I already built', level=3)
|
|
doc.add_paragraph(
    'I cloned DeepChem and started building before writing this proposal. '
    'The first thing that happened: ChemBERTa wouldn\'t import. The module '
    'transformers.models.roberta.tokenization_roberta_fast got removed in transformers 5.x '
    'and nobody had caught it. I fixed it in PR #4913 and filed issue #4912 about the broader '
    'transformers 5.x compatibility problems. That\'s how I got into the codebase.'
)
|
|
doc.add_paragraph(
    'Then I started on the OLMo wrapper and hit an interesting problem right away: '
    'HuggingFace doesn\'t have OlmoForSequenceClassification. The class simply doesn\'t exist, '
    'so you can\'t do regression or classification with OLMo out of the box. '
    'I wrote one from scratch using last-token pooling (the same approach as LlamaForSequenceClassification). '
    'I also added causal_lm as a task type in HuggingFaceModel and built a generate() method.'
)
|
|
doc.add_paragraph(
    'I ran experiments on MoleculeNet to check that the implementation actually works:'
)
|
|
doc.add_paragraph(
    'BBBP (blood-brain barrier): ROC-AUC 0.67 with a tiny randomly initialized model, 200 samples, 3 epochs. '
    'Not impressive, but clearly above random chance, so the architecture learns.', style='List Bullet'
)
doc.add_paragraph(
    'ESOL (solubility): an R² of 0.37 under the same conditions.', style='List Bullet'
)
doc.add_paragraph(
    'SMILES generation: 0% valid molecules. Everything it generated was broken SMILES.', style='List Bullet'
)
|
|
doc.add_paragraph(
    'That 0% generation result is actually the most important finding. '
    'It tells you the wrapper works and training works, but without serious pretraining on '
    'a molecular corpus the model just outputs nonsense. That\'s the core problem this project needs to solve.'
)
|
|
doc.add_paragraph(
    'I also compared tokenization between OLMo and ChemBERTa on real drugs (aspirin, caffeine, penicillin, paclitaxel). '
    'OLMo uses fewer tokens per molecule overall (a ~100K vocabulary vs ChemBERTa\'s ~600) but it fragments chemical concepts: '
    '[C@@H], which is a single concept in chemistry (a stereocenter), gets split into 4 tokens. '
    'ChemBERTa learned better groupings, like (=O) for a carbonyl. '
    'This is something to address in the stretch goals.'
)
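
doc.add_paragraph(
    'For reference, that comparison boils down to a few lines with HuggingFace AutoTokenizer. '
    'The checkpoint names here are assumptions; substitute whichever OLMo-2 and ChemBERTa checkpoints are actually used:'
)
doc.add_paragraph(
    'from transformers import AutoTokenizer\n'
    '\n'
    "olmo_tok = AutoTokenizer.from_pretrained('allenai/OLMo-2-1124-7B')\n"
    "chem_tok = AutoTokenizer.from_pretrained('seyonec/ChemBERTa-zinc-base-v1')\n"
    '\n'
    "smiles = 'CC(=O)Oc1ccccc1C(=O)O'  # aspirin\n"
    'print(olmo_tok.tokenize(smiles))  # general-purpose BPE splits\n'
    'print(chem_tok.tokenize(smiles))  # chemistry-aware merges like (=O)',
    style='No Spacing'
)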
|
|
doc.add_paragraph('8 unit tests written, all passing in 27 seconds on CPU.')
|
|
doc.add_heading('Relevant Experience and Interest', level=2)
|
|
doc.add_paragraph(
    'I\'m Vivek, and I just finished my B.Tech in AI & ML at Usha Rama College of Engineering (2021-2025). '
    'I\'ve been working as an AI Engineer for the past year building production systems.'
)
|
|
doc.add_heading('Work experience', level=3)
|
|
doc.add_paragraph(
    'Right now I\'m working with the Andhra Pradesh Police on a government AI initiative (AI4AP). '
    'I built an AI legal compliance system for POCSO cases: a RAG pipeline over 1000+ legal documents, '
    'a FastAPI backend, and vector embeddings for citation-backed responses. It\'s a real production system that '
    'investigating officers use every day. This gave me solid experience with large-scale NLP, '
    'document processing, and building things that need to work reliably.'
)
|
|
doc.add_paragraph(
    'Before that I did an ML internship at GGS Information Services, where I worked on '
    '3D model compression (custom GANs, STEP file processing) and optimized inference pipelines '
    'using CUDA kernels and model quantization, achieving a 30% latency reduction and a 50% memory decrease. '
    'The CUDA and quantization experience is directly relevant to working with large language models.'
)
|
|
doc.add_heading('Relevant projects', level=3)
|
|
doc.add_paragraph(
    'Parameter Golf (OpenAI, March 2026): Train the best LM that fits in 16MB. '
    'I designed the transformer architecture from scratch: 11 layers, 512-dim hidden size, 8 heads. '
    'Custom SentencePiece tokenizers, GPTQ quantization, flash attention. '
    'This isn\'t tutorial-level work; I understand how transformers train from the ground up.'
)
|
|
doc.add_paragraph(
    'GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Implemented the GSPO algorithm from '
    'the Qwen team, reaching 60% accuracy on ZebraLogic and 75.8% on math, '
    'beating PPO and GRPO baselines.'
)
|
|
doc.add_paragraph(
    'Dial 112 AI: Built a production system for AP Police processing 1000+ emergency calls daily. '
    'Speech-to-text, sentiment analysis, priority classification, geospatial dispatch optimization.'
)
|
|
doc.add_paragraph(
    'Published a paper on pose-guided image generation (PPAG): '
    'progressive pose attention for identity-preserving synthesis, with a 92% identity preservation score.'
)
|
|
doc.add_heading('Open source', level=3)
|
|
doc.add_paragraph(
    'PRs to HuggingFace transformers, Unsloth (aarch64 support), the Anthropic SDK (streaming performance fix), '
    'the OpenAI SDK, and Karpathy\'s nanochat (NaN loss fix for SFT training).'
)
|
|
doc.add_paragraph(
    'DeepChem contributions: PR #4913 (ChemBERTa import fix for transformers 5.x), '
    'Issue #4912 (transformers compatibility report).'
)
|
|
doc.add_heading('Why this project', level=3)
|
|
doc.add_paragraph(
    '[FILL THIS IN YOURSELF: what got you interested in molecular ML? '
    'Why DeepChem specifically? Be honest here, mentors can tell when someone is just saying '
    'what they think you want to hear. Talk about what actually excites you.]'
)
|
|
p = doc.add_paragraph()
p.add_run('GitHub: ').bold = True
p.add_run('https://github.com/vivekvar-dl')
p = doc.add_paragraph()
p.add_run('LinkedIn: ').bold = True
p.add_run('https://linkedin.com/in/vivekvar')
p = doc.add_paragraph()
p.add_run('Bug fix PR: ').bold = True
p.add_run('https://github.com/deepchem/deepchem/pull/4913')
p = doc.add_paragraph()
p.add_run('Compat issue: ').bold = True
p.add_run('https://github.com/deepchem/deepchem/issues/4912')
|
|
doc.add_heading('Work Plan', level=2)
|
|
doc.add_paragraph(
    'The project breaks down into four components that build on each other: '
    'first the base class needs to learn about causal LMs, then the OLMo-specific pieces are built on top, '
    'then pretraining and benchmarks, then generation and documentation.'
)
|
|
doc.add_heading('Design and Pseudocode', level=3)
|
|
doc.add_paragraph().add_run('Component 1: HuggingFaceModel base class changes').bold = True
doc.add_paragraph(
    'The wrapper currently doesn\'t know causal LMs exist. This needs "causal_lm" added as a task type '
    'with DataCollatorForLanguageModeling(mlm=False), an AutoModelForCausalLM branch in '
    'load_from_pretrained(), a generate() method wrapping HF\'s model.generate() API, and '
    '_prepare_batch() handling where labels = input_ids for next-token prediction.'
)
doc.add_paragraph(
    'Note: PR #4907 from another contributor also adds a generate() method. '
    'My work is different in scope: building a complete model wrapper, not just generation. '
    'I will coordinate with them to avoid overlap.'
)
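
doc.add_paragraph(
    'A minimal sketch of the causal_lm branch in _prepare_batch(), assuming the usual '
    '(inputs, labels, weights) return contract of the existing wrapper; exact names may shift during review:'
)
doc.add_paragraph(
    "tokens = self.tokenizer(smiles_batch, padding=True,\n"
    "                        truncation=True, return_tensors='pt')\n"
    "inputs = {'input_ids': tokens['input_ids'],\n"
    "          'attention_mask': tokens['attention_mask'],\n"
    "          # next-token prediction: the model shifts labels internally,\n"
    "          # so labels are just a copy of input_ids\n"
    "          'labels': tokens['input_ids'].clone()}\n"
    "return inputs, [], []",
    style='No Spacing'
)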
|
|
doc.add_paragraph().add_run('Component 2: Olmo2ForSequenceClassification').bold = True
doc.add_paragraph(
    'This class doesn\'t exist in HuggingFace; I already built it. '
    'It takes the Olmo2Model output, grabs the hidden state of the last non-padded token (last-token pooling), '
    'and projects it through a linear layer. Loss computation depends on problem_type: '
    'MSELoss for regression, CrossEntropyLoss for single-label, BCEWithLogitsLoss for multi-label. '
    'It is about 100 lines and follows the same pattern as LlamaForSequenceClassification.'
)
|
|
doc.add_paragraph('Forward pass pseudocode:')
doc.add_paragraph(
    'hidden = base_model(input_ids, attention_mask)\n'
    'seq_lengths = (input_ids != pad_id).sum(-1) - 1\n'
    'pooled = hidden[range(batch_size), seq_lengths]\n'
    'logits = linear_head(pooled)\n'
    'loss = compute_loss(logits, labels, problem_type)',
    style='No Spacing'
)
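
doc.add_paragraph(
    'The compute_loss step follows the standard HuggingFace problem_type convention '
    '(a sketch with torch.nn losses; shape handling is settled in the actual class):'
)
doc.add_paragraph(
    'if problem_type == "regression":\n'
    '    loss = nn.MSELoss()(logits.squeeze(-1), labels.float())\n'
    'elif problem_type == "single_label_classification":\n'
    '    loss = nn.CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))\n'
    'else:  # multi_label_classification\n'
    '    loss = nn.BCEWithLogitsLoss()(logits, labels.float())',
    style='No Spacing'
)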
|
|
doc.add_paragraph().add_run('Component 3: OLMo wrapper class').bold = True
doc.add_paragraph(
    'User-facing class extending HuggingFaceModel. Same pattern as ChemBERTa and MoLFormer.'
)
doc.add_paragraph(
    'OLMo.__init__(task, model_name, n_tasks, config)\n'
    ' task: causal_lm | regression | classification | mtr\n'
    ' Loads tokenizer, sets pad_token = eos_token\n'
    ' Syncs vocab_size with tokenizer\n'
    ' Creates right model class based on task\n\n'
    'OLMo._prepare_batch(batch)\n'
    ' causal_lm: labels = input_ids clone\n'
    ' regression: float labels\n'
    ' classification: long (single) or float (multi)',
    style='No Spacing'
)
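
doc.add_paragraph(
    'From the user side, the intended workflow would look roughly like this. The checkpoint name and the '
    'load_from_pretrained() signature are assumptions modeled on the existing HuggingFaceModel wrappers:'
)
doc.add_paragraph(
    'from deepchem.models import OLMo\n'
    '\n'
    '# continued pretraining with the causal LM objective on a SMILES dataset\n'
    "model = OLMo(task='causal_lm', model_name='allenai/OLMo-2-0425-1B', model_dir='olmo_smiles')\n"
    'model.fit(smiles_dataset, nb_epoch=1)\n'
    '\n'
    '# sample candidate molecules from a seed fragment\n'
    "candidates = model.generate(['CC(=O)O'], max_new_tokens=64)\n"
    '\n'
    '# reuse the pretrained weights for property prediction\n'
    "regressor = OLMo(task='regression', model_name='allenai/OLMo-2-0425-1B', n_tasks=1)\n"
    "regressor.load_from_pretrained(model_dir='olmo_smiles')\n"
    'regressor.fit(train_dataset, nb_epoch=10)',
    style='No Spacing'
)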
|
|
doc.add_paragraph().add_run('Component 4: Tokenization strategy').bold = True
doc.add_paragraph(
    'Phase 1: use OLMo\'s tokenizer directly on SMILES. This works: '
    'my experiments show 0.67 AUC and an R² of 0.37 from random init, which means learning is happening. '
    'It is not perfect with stereocenters, but it is functional.'
)
doc.add_paragraph(
    'Stretch goal: extend the vocabulary with chemistry tokens such as [C@@H], [nH], (=O), and aromatic ring patterns, '
    'or retrain BPE on a mixed English + SMILES corpus. This could help generation quality a lot.'
)
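
doc.add_paragraph(
    'The lighter-weight version of that stretch goal only needs the standard HuggingFace '
    'vocabulary-extension calls (a sketch; the real token list would come from frequency analysis of the corpus):'
)
doc.add_paragraph(
    "chem_tokens = ['[C@@H]', '[C@H]', '[nH]', '(=O)']\n"
    'num_added = tokenizer.add_tokens(chem_tokens)\n'
    '# grow the embedding matrix so the new token ids have rows\n'
    'model.resize_token_embeddings(len(tokenizer))',
    style='No Spacing'
)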
|
|
doc.add_heading('Testing Plan', level=3)
|
|
doc.add_paragraph('I already have 8 unit tests written and passing:')
|
|
tests = [
    ('test_olmo_causal_lm_pretraining', 'Trains with causal LM objective, loss is positive'),
    ('test_olmo_regression_finetuning', 'Regression training, predictions have correct shape, MAE score works'),
    ('test_olmo_classification', 'Binary classification on random labels'),
    ('test_olmo_multitask_regression', '2-task regression, output shape matches'),
    ('test_olmo_save_and_restore', 'Save checkpoint, load into new model, verify all weights match'),
    ('test_olmo_load_from_pretrained', 'Pretrain causal LM then load weights into regression model'),
    ('test_olmo_generate', 'Single string and batch generation, returns valid strings'),
    ('test_olmo_invalid_task', 'Bad task name raises ValueError'),
]
for name, desc in tests:
    doc.add_paragraph(f'{name}: {desc}', style='List Bullet')
|
|
doc.add_paragraph(
    'All tests use a tiny config (64 hidden, 2 layers, 2 heads). No model downloads. '
    'Runs in 27 seconds on CPU.'
)
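
doc.add_paragraph(
    'As an illustration of the tiny-config pattern the tests use (fixture and argument names are '
    'indicative, not final):'
)
doc.add_paragraph(
    'def test_olmo_generate(tmp_path):\n'
    "    config = {'hidden_size': 64, 'num_hidden_layers': 2, 'num_attention_heads': 2}\n"
    "    model = OLMo(task='causal_lm', config=config, model_dir=str(tmp_path))\n"
    "    outputs = model.generate(['CCO'], max_new_tokens=8)\n"
    '    assert isinstance(outputs, list)\n'
    '    assert all(isinstance(o, str) for o in outputs)',
    style='No Spacing'
)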
|
|
doc.add_paragraph('Integration tests to add during GSoC:')
doc.add_paragraph('MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity', style='List Bullet')
doc.add_paragraph('SMILES generation validity checking with RDKit (sketched below)', style='List Bullet')
doc.add_paragraph('Pretraining convergence curves on ZINC subsets', style='List Bullet')
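
doc.add_paragraph(
    'Validity checking itself is straightforward with RDKit (generated_smiles stands in for whatever '
    'the generation test produces):'
)
doc.add_paragraph(
    'from rdkit import Chem\n'
    '\n'
    'valid = [s for s in generated_smiles if Chem.MolFromSmiles(s) is not None]\n'
    'validity_rate = len(valid) / len(generated_smiles)',
    style='No Spacing'
)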
|
|
doc.add_heading('Sources of Risk', level=3)
|
|
doc.add_paragraph().add_run('Generation quality:').bold = True
doc.add_paragraph(
    'This is the biggest one. 0% validity from random init is expected, but getting to '
    'something useful like 50%+ valid SMILES needs real pretraining on a molecular corpus. '
    'I\'m allocating 3 full weeks for pretraining experiments using ZINC-250K as training data. '
    'If ZINC isn\'t enough, the backup plan is PubChem subsets or combining multiple datasets.'
)
|
|
doc.add_paragraph().add_run('GPU memory for 7B model:').bold = True
doc.add_paragraph(
    'OLMo-7B needs roughly 14GB of VRAM just for inference (7B parameters at 16-bit precision, about 2 bytes each). '
    'CI tests use tiny configs, so no GPU is needed there. Demos use OLMo-1B. '
    '7B experiments may need a multi-GPU setup, which I will work out with my mentor.'
)
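
doc.add_paragraph(
    'If memory gets tight, loading in bfloat16 with automatic device placement is the first lever to pull '
    '(the exact OLMo-2 7B repo id shown here is an assumption):'
)
doc.add_paragraph(
    'import torch\n'
    'from transformers import AutoModelForCausalLM\n'
    '\n'
    'model = AutoModelForCausalLM.from_pretrained(\n'
    "    'allenai/OLMo-2-1124-7B',\n"
    '    torch_dtype=torch.bfloat16,   # roughly halves memory vs float32\n'
    "    device_map='auto')            # shards across available GPUs via accelerate",
    style='No Spacing'
)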
|
|
doc.add_paragraph().add_run('Olmo2ForSequenceClassification not upstream:').bold = True
doc.add_paragraph(
    'I wrote this class myself since HuggingFace doesn\'t have it. '
    'If they add one later, we can simply swap ours out. It follows their patterns exactly, so the risk is low.'
)
|
|
doc.add_paragraph().add_run('Transformers version compatibility:').bold = True
doc.add_paragraph(
    'I already found and fixed one such issue (PR #4913). '
    'I will use top-level imports everywhere going forward and test against both 4.x and 5.x.'
)
|
|
doc.add_heading('Milestones', level=3)
|
|
doc.add_paragraph().add_run('Milestone 1 (end of week 3): Core wrapper working').bold = True
doc.add_paragraph(
    '"from deepchem.models import OLMo" works. All task modes are functional. '
    'Tests passing. Base class changes merged.'
)
|
|
doc.add_paragraph().add_run('Milestone 2 (end of week 6): Pretraining pipeline done').bold = True
doc.add_paragraph(
    'Can load SMILES data, pretrain OLMo with causal LM objective, save checkpoints. '
    'Tutorial notebook showing how to pretrain on custom data. '
    'First generation results with validity numbers.'
)
|
|
doc.add_paragraph().add_run('Milestone 3 (end of week 12): Everything shipped').bold = True
doc.add_paragraph(
    'MoleculeNet benchmark results published. Generation tutorial with RDKit validation. '
    'Full documentation. All PRs merged.'
)
|
|
doc.add_heading('Timeline', level=3)
|
|
doc.add_paragraph('12 weeks, Medium size (175 hours):')
|
|
weeks = [
    ('Week 1', 'Set up dev environment, submit base class PR (causal_lm task support, ~40 lines). Start on generate() method.'),
    ('Week 2', 'Submit generate() PR (~50 lines). Start Olmo2ForSequenceClassification. Discuss size with mentor since it\'s ~100 lines.'),
    ('Week 3', 'Submit OLMo wrapper PR and unit test PR. Get through review. Target: all core PRs merged by end of week.'),
    ('Week 4', 'Start pretraining pipeline. Data loading for ZINC-250K. Figure out training hyperparameters on OLMo-1B.'),
    ('Week 5', 'Run pretraining experiments. Monitor convergence. First generation attempts, checking validity with RDKit.'),
    ('Week 6', 'Finish pretraining PR and tutorial notebook. Submit for review. Save best checkpoint.'),
    ('Week 7', 'Start finetuning experiments. BBBP and Tox21 classification with pretrained model.'),
    ('Week 8', 'ESOL, FreeSolv, Lipophilicity regression. Build benchmark comparison table vs ChemBERTa.'),
    ('Week 9', 'Submit benchmark PR and finetuning tutorials. Respond to review feedback.'),
    ('Week 10', 'SMILES generation experiments. Validity rate analysis. Different sampling strategies (temperature, top-k, nucleus).'),
    ('Week 11', 'Generation tutorial notebook. Documentation: numpydoc for all classes/methods, API reference updates.'),
    ('Week 12', 'Final review rounds. Clean up any open PRs. Stretch: tokenizer extension experiments if time allows.'),
]
for week, desc in weeks:
    p = doc.add_paragraph()
    p.add_run(f'{week}: ').bold = True
    p.add_run(desc)
|
|
doc.add_heading('Pull Requests', level=3)
|
|
doc.add_paragraph(
    'Following DeepChem\'s guidelines for new contributors, I will keep PRs small, especially at the start. '
    'Here\'s the planned breakdown:'
)
|
|
prs = [
    ('PR 1 (Week 1)', 'HuggingFaceModel causal_lm support', '~40 lines'),
    ('PR 2 (Week 2)', 'generate() method', '~50 lines'),
    ('PR 3 (Week 2-3)', 'Olmo2ForSequenceClassification', '~100 lines (will discuss with mentor)'),
    ('PR 4 (Week 3)', 'OLMo wrapper class', '~80 lines'),
    ('PR 5 (Week 3)', 'Unit tests', '~180 lines'),
    ('PR 6 (Week 6)', 'Pretraining pipeline + data utils', 'TBD'),
    ('PR 7 (Week 6)', 'Pretraining tutorial notebook', 'notebook'),
    ('PR 8 (Week 9)', 'Finetuning tutorials (classification + regression)', 'notebooks'),
    ('PR 9 (Week 9)', 'Benchmark results', 'TBD'),
    ('PR 10 (Week 11)', 'Generation tutorial', 'notebook'),
    ('PR 11 (Week 12)', 'Documentation updates', 'TBD'),
]
for pr, desc, size in prs:
    p = doc.add_paragraph()
    p.add_run(f'{pr}: ').bold = True
    p.add_run(f'{desc} ({size})')
|
|
doc.add_paragraph(
    'Each PR goes through review at office hours, and the bigger ones might need 2-3 rounds. '
    'I have buffer built into the timeline for this; I am not assuming everything merges on the first try.'
)
|
|
doc.add_heading('Community', level=2)
|
|
doc.add_paragraph('What I\'ve done so far:')
doc.add_paragraph('PR #4913: fixed the ChemBERTa import crash for transformers 5.x', style='List Bullet')
doc.add_paragraph('Issue #4912: reported broader transformers 5.x compatibility problems', style='List Bullet')
doc.add_paragraph('Built and tested the OLMo prototype locally against DeepChem\'s codebase', style='List Bullet')
|
|
doc.add_paragraph(
    'I can commit to attending at least 2 office hour sessions per week (MWF 9am PST). '
    'I will also be active on Discord for async discussion and will post weekly progress updates.'
)
|
|
doc.add_paragraph(
    '[NOTE: mention which mentors you have talked to once you connect on Discord. '
    'Riya and Harindhar are listed as mentors for this project.]'
)
|
|
doc.add_heading('Resources Required', level=2)
|
|
doc.add_paragraph(
    'I have GPU access through Azure, which should handle OLMo-1B training and OLMo-7B inference. '
    'For CI and unit tests, everything runs on CPU with tiny model configs, so no special compute is needed there.'
)
|
|
doc.add_paragraph(
    'For full OLMo-7B training we might need a multi-GPU setup. '
    'It would be good to discuss with the mentor what compute DeepChem can provide, '
    'or whether Colab Pro or cloud credits would work. '
    'The pretraining experiments on OLMo-1B should be doable on my current setup.'
)
|
|
doc.add_heading('References', level=2)
|
|
refs = [
    'Groeneveld et al. (2024). OLMo: Accelerating the Science of Language Models. arXiv:2402.00838',
    'Chithrananda et al. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv:2010.09885',
    'Ross et al. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. Nature Machine Intelligence',
    'Weininger (1988). SMILES, a chemical language and information system. J Chem Inf Comput Sci',
    'Wu et al. (2018). MoleculeNet: A Benchmark for Molecular Machine Learning. Chemical Science',
]
for i, ref in enumerate(refs, 1):
    doc.add_paragraph(f'{i}. {ref}')
|
|
output_path = '/home/azureuser/GSoC_2026_Proposal_Vivek_OLMo.docx'
doc.save(output_path)
print(f'Done: {output_path}')
|
|