arxiv:2602.10622

How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Published on Feb 11 · Submitted by Jiahao Yuan on Feb 12

Abstract

AI-generated summary: This paper investigates the impact of different attention masking strategies on user-embedding quality in decoder-only language models, proposing a gradient-guided soft-masking technique to improve training stability and representation quality for user behavior analysis.

Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.

Community

Paper author · Paper submitter

🎉 How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Decoder-only LLMs have demonstrated remarkable generative capabilities, but how well do they understand users when repurposed for representation learning? In our latest work, we revisit attention masking in decoder-only architectures and uncover a critical yet overlooked challenge: the instability that arises when transitioning from causal to bidirectional attention. By systematically studying masking strategies and proposing a gradient-guided soft masking approach, we aim to bridge the gap between autoregressive pretraining and high-quality user representation learning.

Paper author · Paper submitter

🎉 How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Hi everyone!

We're excited to share our latest academic work on adapting decoder-only LLMs for high-quality user representation learning, with a focus on solving training instability when transitioning from causal to bidirectional attention masking. Our paper is now publicly available, and we'd love to hear your feedback!

✨ Key Contributions

  1. Unified Masking Strategy
    We systematically evaluate three attention masking recipes (Causal, Hybrid, Bidirectional) under a contrastive learning framework, using 9 anonymized real-world user modeling benchmarks covering user behavior prediction, preference understanding, and marketing sensitivity tasks (see the contrastive-objective sketch after this list).

  2. Critical Training Transition Insight
    We find that the transition path from causal to bidirectional attention matters as much as the final mask design: abrupt switching disrupts the model's pretrained inductive bias, leading to suboptimal performance and convergence issues.

  3. Gradient-Guided Soft Masking (GG-SM)
    We propose a two-stage training approach to mitigate instability:

    • Gradient Warm-up: Dynamically assign attention weights to future tokens during early training using gradient norms, enabling the model to prioritize informative context gradually.
    • Linear Scheduler: Smoothly transition from the gradient-calibrated soft mask to full bidirectional attention, preserving pretrained model knowledge while adapting to bidirectional modeling (a minimal sketch of this schedule follows below).
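
Here is a minimal PyTorch sketch of the mask schedule, simplified for illustration and not the code in our repo: the gradient-norm warm-up is collapsed into a single `warmup_weight` scalar, whereas the full method calibrates future-token weights from gradient norms during early training. All names here are illustrative.

```python
import torch

def soft_attention_bias(seq_len: int,
                        step: int,
                        total_steps: int,
                        warmup_weight: float = 0.0,
                        eps: float = 1e-9) -> torch.Tensor:
    """Additive attention bias that linearly opens future positions.

    step = 0           -> (near-)causal: future entries start at
                          `warmup_weight` (gradient-calibrated in the
                          full method).
    step = total_steps -> fully bidirectional: every entry weighs 1.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len))     # past + self
    future = torch.triu(torch.ones(seq_len, seq_len), 1)  # strictly future

    alpha = min(step / max(total_steps, 1), 1.0)          # linear schedule
    # Interpolate future weights from `warmup_weight` up to 1.0.
    mask = causal + future * (warmup_weight + alpha * (1.0 - warmup_weight))

    # Log-space bias: 0 for fully open entries, a large negative number
    # for (near-)closed ones; added to attention logits before softmax.
    return torch.log(mask.clamp_min(eps))

# Usage: sequence length 16, a quarter of the way through the schedule.
scores = torch.randn(1, 8, 16, 16)  # (batch, heads, queries, keys)
bias = soft_attention_bias(16, step=250, total_steps=1000, warmup_weight=0.05)
probs = torch.softmax(scores + bias, dim=-1)
```

Setting `warmup_weight=0.0` at `step=0` recovers the standard causal mask, and the end of the schedule recovers full bidirectional attention, so the same code path covers both endpoints of the transition.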
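
The masks above plug into the contrastive framework from item 1. Below is a minimal sketch assuming mean-pooled hidden states and an in-batch InfoNCE loss; these are common choices used here purely for illustration, and the exact pooling and objective are described in the paper.

```python
import torch
import torch.nn.functional as F

def user_embedding(hidden_states: torch.Tensor,
                   attention_mask: torch.Tensor) -> torch.Tensor:
    # Mean-pool last-layer hidden states over non-padding positions
    # to get one vector per user behavior sequence.
    mask = attention_mask.unsqueeze(-1).float()   # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (B, H)
    return summed / mask.sum(dim=1).clamp_min(1.0)

def info_nce(anchors: torch.Tensor,
             positives: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    # In-batch negatives: row i's positive is column i; every other
    # column in the batch serves as a negative.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```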

💬 Let's Discuss!

We'd love to engage with the community on:

  1. Have you encountered training instability when adapting decoder-only LLMs for non-generative tasks like representation learning?

  2. What tradeoffs do you prioritize between autoregressive compatibility and representational completeness in user modeling?

Feel free to drop your questions, suggestions, or reproducibility feedback below; we're happy to collaborate on further improvements!

Thanks for reading! 🚀
