new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 1

Hidden Licensing Risks in the LLMware Ecosystem

Large Language Models (LLMs) are increasingly integrated into software systems, giving rise to a new class of systems referred to as LLMware. Beyond traditional source-code components, LLMware embeds or interacts with LLMs that depend on other models and datasets, forming complex supply chains across open-source software (OSS), models, and datasets. However, licensing issues emerging from these intertwined dependencies remain largely unexplored. Leveraging GitHub and Hugging Face, we curate a large-scale dataset capturing LLMware supply chains, including 12,180 OSS repositories, 3,988 LLMs, and 708 datasets. Our analysis reveals that license distributions in LLMware differ substantially from traditional OSS ecosystems. We further examine license-related discussions and find that license selection and maintenance are the dominant concerns, accounting for 84% of cases. To understand incompatibility risks, we analyze license conflicts along supply chains and evaluate state-of-the-art detection approaches, which achieve only 58% and 76% F1 scores in this setting. Motivated by these limitations, we propose LiAgent, an LLM-based agent framework for ecosystem-level license compatibility analysis. LiAgent achieves an F1 score of 87%, improving performance by 14 percentage points over prior methods. We reported 60 incompatibility issues detected by LiAgent, 11 of which have been confirmed by developers. Notably, two conflicted LLMs have over 107 million and 5 million downloads on Hugging Face, respectively, indicating potentially widespread downstream impact. We conclude with implications and recommendations to support the sustainable growth of the LLMware ecosystem.

  • 8 authors
·
Feb 11

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.

  • 8 authors
·
Nov 24, 2025