arxiv:2606.12385

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Published on Jun 10 · Submitted by Haoxiang Sun on Jun 11

Abstract

ModSleuth is an agentic system that recursively reconstructs large-scale dependency graphs for LLM development by analyzing public artifacts and resolving inconsistencies in documentation and artifact identities.

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.

Community

Paper submitter

We introduce ModSleuth, an agentic system that recursively traces LLM dependencies from public artifacts. A key finding from building it is that extraction itself is no longer the hard part. The real challenges are defining what counts as a dependency in modern pipelines (where models serve as judges, filters, OCR systems, data generators, and more) and reconciling artifact identities across inconsistent documentation. We address both through a distinction between direct and indirect dependencies, operation-centered relationship representations, and an identity lattice for resolving ambiguous references.
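
To make the three ideas above concrete, here is a minimal Python sketch. It is not ModSleuth's implementation. Every name in it (`Dependency`, `ALIASES`, `resolve`, `build_graph`, `direct`, `indirect`, and the example artifact IDs) is hypothetical. It also makes two simplifying assumptions: a flat alias table stands in for the identity lattice, and "indirect" is taken to mean "reachable upstream but not a direct dependency", which may differ from the paper's formal definition.

```python
from dataclasses import dataclass

# Hypothetical alias table: several surface names map to one canonical ID.
# A flat lookup is a simplification of the identity lattice described above.
ALIASES = {
    "examplelm-7b": "example-org/examplelm-7b",
    "examplelm 7b v1": "example-org/examplelm-7b",
    "example-org/examplelm-7b": "example-org/examplelm-7b",
}


def resolve(name: str) -> str:
    """Map a surface reference to a canonical ID; unknown names pass through."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


@dataclass(frozen=True)
class Dependency:
    target: str     # artifact being built
    source: str     # upstream artifact it relies on
    operation: str  # pipeline role, e.g. "judge", "filter", "ocr", "generate"
    evidence: str   # pointer to the public source that supports the edge


def build_graph(deps):
    """Group edges by canonical target: target -> {(source, operation), ...}."""
    graph = {}
    for d in deps:
        edge = (resolve(d.source), d.operation)
        graph.setdefault(resolve(d.target), set()).add(edge)
    return graph


def direct(graph, node):
    """Artifacts used directly when building `node`."""
    return {src for src, _ in graph.get(resolve(node), set())}


def indirect(graph, node):
    """Artifacts reachable upstream of `node` that are not direct dependencies."""
    start = resolve(node)
    seen = set()
    stack = list(direct(graph, start))
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(direct(graph, cur))
    return seen - direct(graph, start) - {start}
```

Keeping the operation on the edge rather than on the node lets one upstream model appear in several roles (for example, as both a judge and a data generator) without duplicating the node.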

Applying ModSleuth to four LLM releases, we recover large-scale dependency graphs that surface issues difficult to find manually: license-relevant multi-hop paths where upstream terms propagate silently through synthetic data, structural train-eval coupling where benchmarks appear on both sides of the pipeline, and documentation gaps where code reveals dependencies the paper never mentions. We'd love to hear your feedback on this project!
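
As a rough illustration of how two of these checks could run over a recovered graph, the following Python sketch enumerates upstream paths of two or more hops that end at an artifact with a flagged license, and intersects the artifacts seen on the training and evaluation sides. The function names, the graph layout (artifact mapped to a list of upstream artifacts), and the license labels are all assumptions made for this example, not ModSleuth's actual data model.

```python
def upstream_paths(graph, start):
    """Yield every simple path from `start` to an upstream artifact."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for src in graph.get(node, ()):
            if src in path:  # skip cycles so every path stays simple
                continue
            new_path = path + [src]
            yield new_path
            stack.append((src, new_path))


def multi_hop_license_paths(graph, licenses, start, flagged):
    """Paths of at least two hops whose endpoint carries a flagged license."""
    return [
        path
        for path in upstream_paths(graph, start)
        if len(path) > 2 and licenses.get(path[-1]) in flagged
    ]


def train_eval_overlap(train_artifacts, eval_artifacts):
    """Artifacts that appear on both the training and the evaluation side."""
    return set(train_artifacts) & set(eval_artifacts)
```

A path such as model → synthetic dataset → teacher model is the case of interest here: the teacher's terms never appear in the model's direct dependency list, so they only surface when the graph is walked past the first hop.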

Code: https://github.com/cal-data-audit/modsleuth
Demo: https://modsleuth.cal-data-audit.org
