Adaptive Exploration for Latent-State Bandits

Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang
In Thirty-second ACM SIGKDD Conference on Knowledge Discovery and Data Mining · 2026

Abstract

We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner’s actions. The optimal arm can change even though the learner observes only past actions and rewards. We propose algorithms that feed LinUCB with two summaries of the hidden state: a lagged action-reward pair and, when available, a probe fingerprint formed from rewards of multiple arms. The adaptive variants refresh the fingerprint using residual, margin, and staleness tests. In synthetic stress tests over state count, transition rate, noise, and horizon, these methods reduce dynamic regret relative to standard, adversarial, and non-stationary bandit baselines when the summaries distinguish states and are updated often enough. Ablations and misspecification tests identify the main failure modes: weak fingerprint separation, high noise, and state changes during sequential probes.

Cite

@article{jin2026adaptive,
  title={Adaptive Exploration for Latent-State Bandits},
  author={Jin, Jikai and Hung, Kenneth and Krishnamurthy, Sanath Kumar and Shi, Baoyi and Zhang, Congshan},
  journal={arXiv preprint arXiv:2602.05139},
  year={2026}
}