I appreciate your time. A couple more quick questions:
- Would you consider adding https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu to the pre-training mixture in a future pre-training run?
- For SmolLM2, you reported experiments with different FineWeb-Edu/DCLM ratios (e.g., 60%/40%, and later 40%/60%), which were determined from the ablations you ran.
But in the SmolLM3 config (https://github.com/huggingface/smollm/blob/main/text/pretraining/smollm3/stage1_8T.yaml), FineWeb-Edu is at 33% and DCLM at 37% across all stages. Were these weights also arrived at through ablations, or determined some other way?