arxiv:2602.10622

How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Published on Feb 11 · Submitted by Jiahao Yuan on Feb 12

Abstract

AI-generated summary: This paper investigates the impact of different attention masking strategies on user-embedding quality in decoder-only language models, proposing a gradient-guided soft-masking technique to improve training stability and representation quality for user behavior analysis.

Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.

Community

Paper author · Paper submitter

🎉 How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Decoder-only LLMs have demonstrated remarkable generative capabilities, but how well do they understand users when repurposed for representation learning? In our latest work, we revisit attention masking in decoder-only architectures and uncover a critical yet overlooked challenge: the instability that arises when transitioning from causal to bidirectional attention. By systematically studying masking strategies and proposing a gradient-guided soft masking approach, we aim to bridge the gap between autoregressive pretraining and high-quality user representation learning.

Paper author · Paper submitter

🎉 How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Hi everyone!

We're excited to share our latest academic work on adapting decoder-only LLMs for high-quality user representation learning, with a focus on solving training instability when transitioning from causal to bidirectional attention masking. Our paper is now publicly available, and we'd love to hear your feedback!

✨ Key Contributions

  1. Unified Masking Strategy
    We systematically evaluate three attention masking recipes (Causal, Hybrid, Bidirectional) under a contrastive learning framework, using 9 anonymized real-world user modeling benchmarks covering user behavior prediction, preference understanding, and marketing sensitivity tasks (see the contrastive-objective sketch after this list).

  2. Critical Training Transition Insight
    We find that the transition path from causal to bidirectional attention matters as much as the final mask design: abrupt switching disrupts the model's pretrained inductive bias, leading to suboptimal performance and convergence issues.

  3. Gradient-Guided Soft Masking (GG-SM)
    We propose a two-stage training approach to mitigate instability:

    • Gradient Warm-up: Dynamically assign attention weights to future tokens during early training using gradient norms, enabling the model to prioritize informative context gradually.
    • Linear Scheduler: Smoothly transition from the gradient-calibrated soft mask to full bidirectional attention, preserving pretrained model knowledge while adapting to bidirectional modeling (a minimal sketch of this schedule follows below).
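
Here is a minimal PyTorch sketch of the mask schedule, simplified for illustration and not the code in our repo: the gradient-norm warm-up is collapsed into a single `warmup_weight` scalar, whereas the full method calibrates future-token weights from gradient norms during early training. All names here are illustrative.

```python
import torch

def soft_attention_bias(seq_len: int,
                        step: int,
                        total_steps: int,
                        warmup_weight: float = 0.0,
                        eps: float = 1e-9) -> torch.Tensor:
    """Additive attention bias that linearly opens future positions.

    step = 0           -> (near-)causal: future entries start at
                          `warmup_weight` (gradient-calibrated in the
                          full method).
    step = total_steps -> fully bidirectional: every entry weighs 1.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len))     # past + self
    future = torch.triu(torch.ones(seq_len, seq_len), 1)  # strictly future

    alpha = min(step / max(total_steps, 1), 1.0)          # linear schedule
    # Interpolate future weights from `warmup_weight` up to 1.0.
    mask = causal + future * (warmup_weight + alpha * (1.0 - warmup_weight))

    # Log-space bias: 0 for fully open entries, a large negative number
    # for (near-)closed ones; added to attention logits before softmax.
    return torch.log(mask.clamp_min(eps))

# Usage: sequence length 16, a quarter of the way through the schedule.
scores = torch.randn(1, 8, 16, 16)  # (batch, heads, queries, keys)
bias = soft_attention_bias(16, step=250, total_steps=1000, warmup_weight=0.05)
probs = torch.softmax(scores + bias, dim=-1)
```

Setting `warmup_weight=0.0` at `step=0` recovers the standard causal mask, and the end of the schedule recovers full bidirectional attention, so the same code path covers both endpoints of the transition.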
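
The masks above plug into the contrastive framework from item 1. Below is a minimal sketch assuming mean-pooled hidden states and an in-batch InfoNCE loss; these are common choices used here purely for illustration, and the exact pooling and objective are described in the paper.

```python
import torch
import torch.nn.functional as F

def user_embedding(hidden_states: torch.Tensor,
                   attention_mask: torch.Tensor) -> torch.Tensor:
    # Mean-pool last-layer hidden states over non-padding positions
    # to get one vector per user behavior sequence.
    mask = attention_mask.unsqueeze(-1).float()   # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (B, H)
    return summed / mask.sum(dim=1).clamp_min(1.0)

def info_nce(anchors: torch.Tensor,
             positives: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    # In-batch negatives: row i's positive is column i; every other
    # column in the batch serves as a negative.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```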

💬 Let's Discuss!

We'd love to engage with the community on:

  1. Have you encountered training instability when adapting decoder-only LLMs for non-generative tasks like representation learning?

  2. What tradeoffs do you prioritize between autoregressive compatibility and representational completeness in user modeling?

Feel free to drop your questions, suggestions, or reproducibility feedback below; we're happy to collaborate on further improvements!

Thanks for reading! 🚀
