Adversarial Bandits With Multi-User Delayed Feedback: Theory and Application

Yandi Li, Jianxiong Guo, Yupeng Li, Tian Wang, Weijia Jia

Research output: Contribution to journalJournal articlepeer-review

Abstract

The multi-armed bandit (MAB) models have attracted significant research attention due to their applicability and effectiveness in various real-world scenarios such as resource allocation in uncertain environments, online advertising, and dynamic pricing. As an important branch, the adversarial multi-armed bandit problems with delayed feedback have been proposed and studied by many researchers recently where a conceptual adversary strategically selects the reward distributions associated with each arm to challenge the learning algorithm and the agent experiences a bunch of delays in receiving the corresponding reward feedback from different users after taking an action on them. However, the existing models restrict the feedback to being generated from only one user, which makes models inapplicable to the prevailing scenarios of multiple users (e.g., ad recommendation for a group of users). In this paper, we consider that the delayed feedback results are from multiple users and are unrestricted on internal distribution while the feedback delay is arbitrary and unknown to the player in advance. Also, for different users in a round, the delays in feedback have no assumption of latent correlation. Thus, we formulate an adversarial multi-armed bandit problem with multi-user delayed feedback and design a modified EXP3 algorithm named MUD-EXP3, which makes a decision at each round by considering the importance-weighted estimator of the received feedback from different users. On the premise of known terminal round index TT, the number of users MM, the number of arms NN, and upper bound of delay d-{max} dmax, we prove a regret of mathcal {O}(sqrt{TM{2}ln {N}(Nmathrm{e}+4d-{max})})O(T M2lnN(Ne+4 dmax)). Furthermore, for the more common case of unknown TT, an adaptive algorithm named AMUD-EXP3 is proposed with a sublinear regret concerning TT. Finally, extensive experiments are conducted to indicate the correctness and effectiveness of our algorithms in dynamic environments.

Original languageEnglish
Pages (from-to)9383-9397
Number of pages15
JournalIEEE Transactions on Mobile Computing
Volume23
Issue number10
Early online date5 Feb 2024
DOIs
Publication statusPublished - Oct 2024

Scopus Subject Areas

  • Software
  • Computer Networks and Communications
  • Electrical and Electronic Engineering

User-Defined Keywords

  • Adversarial bandit
  • multi-user delayed feedback
  • EXP3
  • regret analysis
  • online learning
  • applications

Fingerprint

Dive into the research topics of 'Adversarial Bandits With Multi-User Delayed Feedback: Theory and Application'. Together they form a unique fingerprint.

Cite this