Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System
Yusuke Miyake [1][2] and Tsunenori Mine [1]
[1] Kyushu University  [2] Pepabo R&D Institute, GMO Pepabo, Inc.
The 2023 IEEE Conference on Systems, Man, and Cybernetics, October 1-4, 2023
Previous studies
• The reduction of opportunity loss is treated as a multi-armed bandit (MAB) problem.
• Many meta-recommender systems (meta-RSs) that use multi-armed bandit policies to select the most efficient recommendation method have been proposed.
[Figure: a multi-armed bandit selects the best recommendation method to serve items to users.]
Contextual problem setting
• The probability distribution of rewards differs by context: the best arm for context = 0 may differ from the best arm for context = 1.
[Figure: a player estimates the reward distributions of Arm0-Arm2 and selects the best arm for each context.]

Nonstationary problem setting
• The probability distribution of rewards changes over time: the best arm at t = 99 may differ from the best arm at t = 199.
[Figure: the reward distributions of Arm0-Arm2 switch at t = 100; the player must track the change.]
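The two problem settings above can be sketched as a toy environment. This is a minimal illustration with made-up reward means and a single change point, not the paper's data:

```python
import numpy as np

# Toy environment (illustrative values, not from the paper): three arms whose
# mean reward depends on a binary context AND switches at t = 100.
rng = np.random.default_rng(42)

MEANS_EARLY = {0: [0.8, 0.2], 1: [0.3, 0.7], 2: [0.1, 0.4]}  # t < 100
MEANS_LATE = {0: [0.2, 0.6], 1: [0.9, 0.1], 2: [0.4, 0.3]}   # t >= 100

def reward(arm, context, t):
    means = MEANS_EARLY if t < 100 else MEANS_LATE
    return rng.normal(means[arm][context], 0.1)

# For context = 0, the best arm is arm 0 before the change and arm 1 after it.
early = [np.mean([reward(a, 0, 50) for _ in range(200)]) for a in range(3)]
late = [np.mean([reward(a, 0, 150) for _ in range(200)]) for a in range(3)]
print(int(np.argmax(early)), int(np.argmax(late)))  # 0 1
```

A policy for this setting must both condition its arm choice on the context and keep adapting after t = 100.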
Policies of meta-RSs

Meta-recommender systems (meta-RSs)
• Previous studies have used meta-RSs to select the best recommendation method based on evaluation in the actual environment.
• However, these studies did not employ a contextual and nonstationary policy.

Existing policies
• Existing contextual and nonstationary policies cannot satisfy one requirement, response time, because their mechanisms for tracking change are time-consuming.
• By contrast, a policy that satisfies the response-time requirement with a lightweight mechanism cannot satisfy the other requirement, tracking temporal change.
Modeling reward using the LGSS model
• The proposed policy assumes a contextual and nonstationary reward model given by the following linear Gaussian state space (LGSS) model:

    r_t = Z_t \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),
    \alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau,
    \alpha_1 \sim N_d(\mu_1, P_1).

• The model represents that the state \alpha_t follows the previous state \alpha_{t-1} and the reward r_t follows the state \alpha_t.

Advantages
• The model can naturally represent the temporal change of the state.
• The design of the coefficient matrices (Z, T, R) provides flexible representations for the contextual and nonstationary problem setting.
• A lightweight estimation algorithm does not affect the response time.
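As a minimal sketch of the generic LGSS reward model (not the authors' code; the dimension, matrices, and noise levels below are assumed for illustration):

```python
import numpy as np

# Simulate r_t = Z a_t + eps_t, a_{t+1} = T a_t + R eta_t with fixed Z, T, R.
# All concrete values here are assumptions for illustration.
rng = np.random.default_rng(0)
d = 3
sigma_eps, sigma_eta = 0.1, 0.05
Z = np.ones((1, d)) / d   # observation matrix
T = np.eye(d)             # state transition: random walk
R = np.eye(d)             # noise-selection matrix

alpha = rng.normal(size=d)  # alpha_1 ~ N_d(mu_1, P_1) with mu_1 = 0, P_1 = I
rewards = []
for t in range(100):
    r_t = float(Z @ alpha) + rng.normal(0.0, sigma_eps)  # reward follows the state
    rewards.append(r_t)
    alpha = T @ alpha + R @ rng.normal(0.0, sigma_eta, size=d)  # state drifts

print(len(rewards))  # 100
```

Because the state follows a random walk, the expected reward drifts over time, which is exactly the nonstationarity the policy has to track.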
Context-aware coefficient matrices
• The designed matrices handle time-varying contexts in the LGSS model.
• These matrices vary according to the context x_t \in \{0, 1\}^m at each time point t:

    Z_t = x_t^\top, \quad R_t = x_t, \quad T_t = I_m.

• The error \eta_t is added only to the elements of the state \alpha_t that correspond to the context x_t.
• This design enables the modeling of state transitions according to the frequency of occurrence of context elements, which is expected to stabilize the accuracy of predictions.
• When these matrices are applied to the model:

    r_t = x_t^\top \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),
    \alpha_{t+1} = \alpha_t + \eta_t x_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau.
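A small numerical sketch of these context-aware matrices for a single time step (the values of m, x_t, alpha_t, and eta_t are illustrative):

```python
import numpy as np

# Z_t = x_t^T, R_t = x_t, T_t = I_m, with context x_t in {0,1}^m.
m = 4
x_t = np.array([1.0, 0.0, 0.0, 1.0])   # context at time t

Z_t = x_t.reshape(1, m)                 # observation row vector
R_t = x_t.reshape(m, 1)                 # routes the state noise to active elements
T_t = np.eye(m)                         # identity transition

alpha_t = np.array([0.2, 0.5, 0.1, 0.3])
eta_t = 0.05                            # one draw of the scalar state noise
r_mean = float(Z_t @ alpha_t)           # x_t^T alpha_t = 0.2 + 0.3 = 0.5
alpha_next = T_t @ alpha_t + (R_t * eta_t).ravel()  # alpha_t + eta_t * x_t

print(r_mean)        # 0.5
print(alpha_next)    # only elements 0 and 3 changed
```

Note how the unobserved elements of the state (here indices 1 and 2) are left untouched by both the observation and the transition noise.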
Context-aware coefficient matrices (structural time-series form)
• The design is based on the structural time-series model using an LGSS model.
• For example, if a trend can be assumed for the state transition, it can be extended to the following matrices, where the first m elements of the state represent the level, \alpha_t[0:m] = \theta_t, and x_t \in \{0, 1\}^m:

    Z_t = (x_t^\top, 0_{1 \times m}),
    R_t = \begin{pmatrix} x_t & 0_{m \times 1} \\ 0_{m \times 1} & x_t \end{pmatrix}, \quad
    T_t = \begin{pmatrix} I_m & \mathrm{diag}(x_t) \\ 0_{m \times m} & I_m \end{pmatrix}.
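A sketch of the trend-extended block matrices (m and x are illustrative), showing that the transition adds the trend to the level only for the active context elements:

```python
import numpy as np

# The state stacks the level theta_t (first m elements) on top of the trend;
# T moves each active level by its trend, Z reads out only the active levels.
m = 3
x = np.array([0.0, 1.0, 1.0])

Z = np.hstack([x, np.zeros(m)]).reshape(1, 2 * m)       # (x^T, 0_{1 x m})
R = np.block([[x.reshape(m, 1), np.zeros((m, 1))],
              [np.zeros((m, 1)), x.reshape(m, 1)]])     # (2m x 2)
T = np.block([[np.eye(m), np.diag(x)],
              [np.zeros((m, m)), np.eye(m)]])           # (2m x 2m)

alpha = np.arange(2 * m, dtype=float) * 0.1  # level = [0, .1, .2], trend = [.3, .4, .5]
alpha_next = T @ alpha
print(alpha_next[:m])   # active levels advanced by their trend: 0, 0.5, 0.7
print(alpha_next[m:])   # trend unchanged: 0.3, 0.4, 0.5
```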
Estimating the state using the linear Kalman filter
• The Kalman filter continuously estimates the state by repeating one-step ahead prediction and filtering.
• The state \alpha follows a multivariate normal distribution with mean \mu and variance-covariance P.

Cycle of state updates in the linear Kalman filter (t = t + 1):

    Prediction of observation:   \hat{r}_t = Z_t \mu_t
    Error of observation:        v_t = r_t - \hat{r}_t, \quad F_t = Z_t P_t Z_t^\top + \sigma_\epsilon^2
    Filtering:                   \mu_{t|t} = \mu_t + K v_t = \mu_t + (P_t Z_t^\top F_t^{-1}) v_t, \quad P_{t|t} = P_t - K F_t K^\top
    One-step ahead prediction:   \mu_{t+1} = T_t \mu_{t|t}, \quad P_{t+1} = T_t P_{t|t} T_t^\top + \sigma_\eta^2 R_t R_t^\top
Estimating the state using the linear Kalman filter (continued)
• Filtering reduces the values of P, whereas one-step ahead prediction increases the values of P.
Virtual exploration using missing observations
• The Kalman filter can handle missing observations appropriately: when the observation is missing, filtering keeps the values of \mu and P unchanged,

    \mu_{t|t} = \mu_t, \quad P_{t|t} = P_t,

  while one-step ahead prediction still increases the values of P.
• The proposed policy incorporates this process as an update operation for the evaluation of the unselected arms in the MAB problem.
• By introducing this behavior into the policy, exploration of the unselected arms is encouraged through the mechanism of the probability matching method.
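A sketch of this virtual exploration (probability matching is assumed here to be Thompson-style posterior sampling; all names and values are illustrative): unselected arms skip the filtering step, so their covariance P keeps growing and they are eventually re-sampled.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 2
x = np.ones(m)              # context at this step (illustrative)
sigma_eta2 = 0.01

arms = [{"mu": np.zeros(m), "P": np.eye(m)} for _ in range(3)]

def predict_only(arm):
    # Missing observation: mu_{t|t} = mu_t, P_{t|t} = P_t, then prediction
    arm["P"] = arm["P"] + sigma_eta2 * np.outer(x, x)

def select(arms):
    # Probability matching: sample a state per arm, pick the best sampled reward
    samples = [float(x @ rng.multivariate_normal(a["mu"], a["P"])) for a in arms]
    return int(np.argmax(samples))

chosen = select(arms)
for i, arm in enumerate(arms):
    if i != chosen:
        predict_only(arm)   # the unselected arms' uncertainty grows
print(arms[(chosen + 1) % 3]["P"][0, 0])  # 1.01 > 1.0: wider posterior next round
```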
Evaluation setting
• The data were collected from an actual EC site.
• Each product is assigned to one of 18 categories.
• Four recommendation methods are used.
• Counts of the recommendations of each method and the clicks for each recommendation in each product category were collected hourly from June 20, 2019, through July 22, 2019.
• The total number of recommendations is 1,488,626.
[Fig. 2. Change of CTR for each recommendation method (Browsing path, Demographic, LLR, Similar image): the highest CTR switches over time, and typical characteristics of each recommendation method appear.]
Effect of context and temporal change
[Figure: cumulative regret and best-arm rate of each policy (adaptive Thompson sampling, decay LinUCB, dynamic LinUCB, LGSS bandits, linear Thompson sampling, LinUCB, neural linear, select best arm at first, select random arm).]
• The proposed LGSS bandits achieved the highest cumulative reward, 201,330.4.
• LGSS bandits and adaptive Thompson sampling reduce overall opportunity loss, not only during best arm switching but also during stationary periods.
Effect of the response time
[Figure: elapsed time per arm selection (ms) and per evaluation update over 2,665 updates (sec, log scale) for each policy, with 18 dimensions and 4 arms.]
• We measured the time elapsed for the arm selection and the evaluation updates of each policy.
• For arm selection, the LGSS bandits had an elapsed time of 0.24 ms.
• All measurements were less than 0.5 ms and are considered to show sufficient performance without significant impact.
• For the evaluation updates, the elapsed time was sufficiently short compared with the unit time (1 h).
• Some policies are not scalable to an increase in the number of trials per unit time.
Conclusion
• We proposed an LGSS-based MAB policy for meta-RSs that continuously and quickly optimizes the selection of recommendation methods according to the context and temporal change, using the Kalman filter.
• The experimental results demonstrated that the proposed policy can effectively reduce opportunity loss during evaluation and increase the cumulative number of clicks compared with baseline policies.
• The results also revealed that the proposed policy adds no overhead to the response time of the meta-RS.
• In the future, we will develop general-purpose policies that can quickly handle nonlinearity.