arxiv:2603.10377

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Published on Mar 11
Submitted by Noor Islam S. Mohammad on Mar 12
Abstract

Causal Concept Graphs leverage sparse autoencoders and differentiable structure learning to identify causal relationships between concepts in language models, demonstrating superior causal fidelity compared to alternative methods.

AI-generated summary

Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. Learned graphs are sparse (5–6% edge density), domain-specific, and stable across seeds.

Community

Short Daily Paper Discussion (posted by the paper submitter)

This paper proposes Causal Concept Graphs (CCG) to better understand how concepts interact inside large language models. The authors combine task-conditioned sparse autoencoders (SAEs) for discovering interpretable concepts with DAGMA-style differentiable structure learning to recover a directed acyclic graph (DAG) representing causal relations among those concepts.
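The key technical ingredient in the graph-recovery step is DAGMA's log-determinant acyclicity characterization, which lets a DAG over concept features be learned by continuous optimization. The sketch below is not the authors' code; it only illustrates the published DAGMA penalty h(W) = -logdet(sI - W∘W) + d·log(s), which is zero exactly when the weighted adjacency matrix W is acyclic:

```python
import numpy as np

def dagma_acyclicity(W, s=1.0):
    """DAGMA-style acyclicity penalty: h(W) = -logdet(sI - W*W) + d*log(s).
    Returns ~0 when the weighted graph W is acyclic, > 0 when it has cycles."""
    d = W.shape[0]
    M = s * np.eye(d) - W * W  # elementwise square keeps entries nonnegative
    _, logdet = np.linalg.slogdet(M)
    return -logdet + d * np.log(s)

# Acyclic 3-concept chain: concept 0 -> concept 1 -> concept 2
W_dag = np.array([[0.0, 0.8, 0.0],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])
print(round(dagma_acyclicity(W_dag), 6))  # close to 0 for an acyclic graph

# Adding an edge 2 -> 0 creates a cycle; the penalty becomes positive
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.7
print(dagma_acyclicity(W_cyc) > 0)
```

In a full pipeline, this penalty would be added (with a multiplier) to a reconstruction loss over the SAE concept activations and minimized by gradient descent, driving the learned graph toward a DAG.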

A key contribution is the Causal Fidelity Score (CFS), which measures whether interventions guided by the learned graph produce stronger downstream effects than random interventions. This metric aims to test whether the discovered structure actually reflects causal influence rather than simple correlations.
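The reported baseline value near 1.0 for random interventions suggests CFS behaves like a ratio of effect magnitudes. The exact definition is in the paper; the following is only a plausible minimal sketch of that ratio form, with made-up effect values:

```python
import numpy as np

def causal_fidelity_score(guided_effects, random_effects, eps=1e-8):
    """One plausible form of CFS: ratio of mean downstream effect size for
    graph-guided interventions vs random interventions.
    CFS ~ 1 means the graph gives no advantage over random targeting."""
    return float(np.mean(guided_effects) / (np.mean(random_effects) + eps))

# Hypothetical effect magnitudes (e.g., output-distribution shifts per
# intervention); these numbers are illustrative, not from the paper.
rng = np.random.default_rng(0)
guided = rng.gamma(5.0, 1.0, size=200)   # graph-guided interventions
random_ = rng.gamma(1.0, 1.0, size=200)  # random interventions
cfs = causal_fidelity_score(guided, random_)
print(cfs > 1.0)
```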

Experiments were conducted on ARC-Challenge, StrategyQA, and LogiQA using GPT-2 Medium. Across five random seeds (15 paired runs), CCG achieved CFS = 5.654 ± 0.625, significantly outperforming baselines such as ROME-style causal tracing, SAE-only ranking, and random interventions (all with p < 0.0001 after Bonferroni correction).
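The Bonferroni correction used for these comparisons is the standard family-wise adjustment: each raw p-value is multiplied by the number of comparisons. A minimal sketch (the raw p-values below are illustrative, not taken from the paper):

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each raw p-value by the number of
    comparisons in the family, capping the result at 1.0."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# Three baseline comparisons (ROME-style tracing, SAE-only ranking, random);
# illustrative raw p-values only.
raw = [1e-5, 2e-5, 3e-6]
adj = bonferroni(raw)
print(all(p < 1e-4 for p in adj))  # all remain significant after correction
```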

The learned graphs were sparse (≈5–6% edge density), domain-specific, and stable across seeds, suggesting the approach can recover consistent causal structures.

Discussion point:
While the results indicate strong causal alignment, an open question is whether CFS truly captures causal correctness or mainly reflects intervention sensitivity. Future work might test this method on larger models or cross-domain tasks to verify generalization and interpretability.

