Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System
Yusuke Miyake [1][2] and Tsunenori Mine [1]
[1] Kyushu University  [2] Pepabo R&D Institute, GMO Pepabo, Inc.
The 2023 IEEE Conference on Systems, Man, and Cybernetics, October 1-4, 2023
Previous studies
• The reduction of opportunity loss is treated as a multi-armed bandit (MAB) problem.
• Many meta-recommender systems (meta-RSs) that use multi-armed bandit policies to select the most efficient recommendation method have been proposed.
[Figure: a multi-armed bandit selects the best recommendation method to serve items to users.]
Contextual problem setting
• The probability distribution of rewards differs by context: the best arm for context = 0 may differ from the best arm for context = 1.
[Figure: a player estimates the reward distributions of Arm0-Arm2 and selects the best arm for each context.]

Nonstationary problem setting
• The probability distribution of rewards changes over time: the best arm at t = 99 may differ from the best arm at t = 199.
[Figure: the reward distributions of Arm0-Arm2 switch at t = 100; the player must track the change.]
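The two problem settings above can be sketched as a toy environment. This is a minimal illustration with made-up reward means and a single change point, not the paper's data:

```python
import numpy as np

# Toy environment (illustrative values, not from the paper): three arms whose
# mean reward depends on a binary context AND switches at t = 100.
rng = np.random.default_rng(42)

MEANS_EARLY = {0: [0.8, 0.2], 1: [0.3, 0.7], 2: [0.1, 0.4]}  # t < 100
MEANS_LATE = {0: [0.2, 0.6], 1: [0.9, 0.1], 2: [0.4, 0.3]}   # t >= 100

def reward(arm, context, t):
    means = MEANS_EARLY if t < 100 else MEANS_LATE
    return rng.normal(means[arm][context], 0.1)

# For context = 0, the best arm is arm 0 before the change and arm 1 after it.
early = [np.mean([reward(a, 0, 50) for _ in range(200)]) for a in range(3)]
late = [np.mean([reward(a, 0, 150) for _ in range(200)]) for a in range(3)]
print(int(np.argmax(early)), int(np.argmax(late)))  # 0 1
```

A policy for this setting must both condition its arm choice on the context and keep adapting after t = 100.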
Policies of meta-RSs

Meta-recommender systems (meta-RSs)
• Previous studies have used meta-RSs to select the best recommendation method based on evaluation in the actual environment.
• However, these studies did not employ a contextual and nonstationary policy.

Existing policies
• Existing contextual and nonstationary policies cannot satisfy one requirement, response time, because their mechanisms for tracking change are time-consuming.
• By contrast, a policy that satisfies the response-time requirement with a lightweight mechanism cannot satisfy the other requirement, tracking temporal change.
Modeling reward using the LGSS model
• The proposed policy assumes a contextual and nonstationary reward model given by the following linear Gaussian state space (LGSS) model:

    r_t = Z_t \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),
    \alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau,
    \alpha_1 \sim N_d(\mu_1, P_1).

• The model represents that the state \alpha_t follows the previous state \alpha_{t-1} and the reward r_t follows the state \alpha_t.

Advantages
• The model can naturally represent the temporal change of the state.
• The design of the coefficient matrices (Z, T, R) provides flexible representations for the contextual and nonstationary problem setting.
• A lightweight estimation algorithm does not affect the response time.
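As a minimal sketch of the generic LGSS reward model (not the authors' code; the dimension, matrices, and noise levels below are assumed for illustration):

```python
import numpy as np

# Simulate r_t = Z a_t + eps_t, a_{t+1} = T a_t + R eta_t with fixed Z, T, R.
# All concrete values here are assumptions for illustration.
rng = np.random.default_rng(0)
d = 3
sigma_eps, sigma_eta = 0.1, 0.05
Z = np.ones((1, d)) / d   # observation matrix
T = np.eye(d)             # state transition: random walk
R = np.eye(d)             # noise-selection matrix

alpha = rng.normal(size=d)  # alpha_1 ~ N_d(mu_1, P_1) with mu_1 = 0, P_1 = I
rewards = []
for t in range(100):
    r_t = float(Z @ alpha) + rng.normal(0.0, sigma_eps)  # reward follows the state
    rewards.append(r_t)
    alpha = T @ alpha + R @ rng.normal(0.0, sigma_eta, size=d)  # state drifts

print(len(rewards))  # 100
```

Because the state follows a random walk, the expected reward drifts over time, which is exactly the nonstationarity the policy has to track.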
Context-aware coefficient matrices
• The designed matrices handle time-varying contexts in the LGSS model.
• These matrices vary according to the context x_t \in \{0, 1\}^m at each time point t:

    Z_t = x_t^\top, \quad R_t = x_t, \quad T_t = I_m.

• The error \eta_t is added only to the elements of the state \alpha_t that correspond to the context x_t.
• This design enables the modeling of state transitions according to the frequency of occurrence of context elements, which is expected to stabilize the accuracy of predictions.
• When these matrices are applied to the model:

    r_t = x_t^\top \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),
    \alpha_{t+1} = \alpha_t + \eta_t x_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau.
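A small numerical sketch of these context-aware matrices for a single time step (the values of m, x_t, alpha_t, and eta_t are illustrative):

```python
import numpy as np

# Z_t = x_t^T, R_t = x_t, T_t = I_m, with context x_t in {0,1}^m.
m = 4
x_t = np.array([1.0, 0.0, 0.0, 1.0])   # context at time t

Z_t = x_t.reshape(1, m)                 # observation row vector
R_t = x_t.reshape(m, 1)                 # routes the state noise to active elements
T_t = np.eye(m)                         # identity transition

alpha_t = np.array([0.2, 0.5, 0.1, 0.3])
eta_t = 0.05                            # one draw of the scalar state noise
r_mean = float(Z_t @ alpha_t)           # x_t^T alpha_t = 0.2 + 0.3 = 0.5
alpha_next = T_t @ alpha_t + (R_t * eta_t).ravel()  # alpha_t + eta_t * x_t

print(r_mean)        # 0.5
print(alpha_next)    # only elements 0 and 3 changed
```

Note how the unobserved elements of the state (here indices 1 and 2) are left untouched by both the observation and the transition noise.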
Context-aware coefficient matrices (structural time-series form)
• The design is based on the structural time-series model using an LGSS model.
• For example, if a trend can be assumed for the state transition, it can be extended to the following matrices, where the first m elements of the state represent the level, \alpha_t[0:m] = \theta_t, and x_t \in \{0, 1\}^m:

    Z_t = (x_t^\top, 0_{1 \times m}),
    R_t = \begin{pmatrix} x_t & 0_{m \times 1} \\ 0_{m \times 1} & x_t \end{pmatrix}, \quad
    T_t = \begin{pmatrix} I_m & \mathrm{diag}(x_t) \\ 0_{m \times m} & I_m \end{pmatrix}.
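A sketch of the trend-extended block matrices (m and x are illustrative), showing that the transition adds the trend to the level only for the active context elements:

```python
import numpy as np

# The state stacks the level theta_t (first m elements) on top of the trend;
# T moves each active level by its trend, Z reads out only the active levels.
m = 3
x = np.array([0.0, 1.0, 1.0])

Z = np.hstack([x, np.zeros(m)]).reshape(1, 2 * m)       # (x^T, 0_{1 x m})
R = np.block([[x.reshape(m, 1), np.zeros((m, 1))],
              [np.zeros((m, 1)), x.reshape(m, 1)]])     # (2m x 2)
T = np.block([[np.eye(m), np.diag(x)],
              [np.zeros((m, m)), np.eye(m)]])           # (2m x 2m)

alpha = np.arange(2 * m, dtype=float) * 0.1  # level = [0, .1, .2], trend = [.3, .4, .5]
alpha_next = T @ alpha
print(alpha_next[:m])   # active levels advanced by their trend: 0, 0.5, 0.7
print(alpha_next[m:])   # trend unchanged: 0.3, 0.4, 0.5
```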
Estimating the state using the linear Kalman filter
• The Kalman filter continuously estimates the state by repeating one-step ahead prediction and filtering.
• The state \alpha follows a multivariate normal distribution with mean \mu and variance-covariance P.

Cycle of state updates in the linear Kalman filter (t = t + 1):

    Prediction of observation:   \hat{r}_t = Z_t \mu_t
    Error of observation:        v_t = r_t - \hat{r}_t, \quad F_t = Z_t P_t Z_t^\top + \sigma_\epsilon^2
    Filtering:                   \mu_{t|t} = \mu_t + K v_t = \mu_t + (P_t Z_t^\top F_t^{-1}) v_t, \quad P_{t|t} = P_t - K F_t K^\top
    One-step ahead prediction:   \mu_{t+1} = T_t \mu_{t|t}, \quad P_{t+1} = T_t P_{t|t} T_t^\top + \sigma_\eta^2 R_t R_t^\top
Estimating the state using the linear Kalman filter (continued)
• Filtering reduces the values of P, whereas one-step ahead prediction increases the values of P.
Virtual exploration using missing observations
• The Kalman filter can handle missing observations appropriately: when the observation is missing, filtering keeps the values of \mu and P unchanged,

    \mu_{t|t} = \mu_t, \quad P_{t|t} = P_t,

  while one-step ahead prediction still increases the values of P.
• The proposed policy incorporates this process as an update operation for the evaluation of the unselected arms in the MAB problem.
• By introducing this behavior into the policy, exploration of the unselected arms is encouraged through the mechanism of the probability matching method.
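A sketch of this virtual exploration (probability matching is assumed here to be Thompson-style posterior sampling; all names and values are illustrative): unselected arms skip the filtering step, so their covariance P keeps growing and they are eventually re-sampled.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 2
x = np.ones(m)              # context at this step (illustrative)
sigma_eta2 = 0.01

arms = [{"mu": np.zeros(m), "P": np.eye(m)} for _ in range(3)]

def predict_only(arm):
    # Missing observation: mu_{t|t} = mu_t, P_{t|t} = P_t, then prediction
    arm["P"] = arm["P"] + sigma_eta2 * np.outer(x, x)

def select(arms):
    # Probability matching: sample a state per arm, pick the best sampled reward
    samples = [float(x @ rng.multivariate_normal(a["mu"], a["P"])) for a in arms]
    return int(np.argmax(samples))

chosen = select(arms)
for i, arm in enumerate(arms):
    if i != chosen:
        predict_only(arm)   # the unselected arms' uncertainty grows
print(arms[(chosen + 1) % 3]["P"][0, 0])  # 1.01 > 1.0: wider posterior next round
```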
Evaluation setting
• The data were collected from an actual EC site.
• Each product is assigned to one of 18 categories.
• Four recommendation methods are used.
• Counts of the recommendations of each method and the clicks for each recommendation in each product category were collected hourly from June 20, 2019, through July 22, 2019.
• The total number of recommendations is 1,488,626.
[Fig. 2. Change of CTR for each recommendation method (Browsing path, Demographic, LLR, Similar image): the highest CTR switches over time, and typical characteristics of each recommendation method appear.]
Effect of context and temporal change
[Figure: cumulative regret and best-arm rate of each policy (adaptive Thompson sampling, decay LinUCB, dynamic LinUCB, LGSS bandits, linear Thompson sampling, LinUCB, neural linear, select best arm at first, select random arm).]
• The proposed LGSS bandits achieved the highest cumulative reward, 201,330.4.
• LGSS bandits and adaptive Thompson sampling reduce overall opportunity loss, not only during best arm switching but also during stationary periods.
Effect of the response time
[Figure: elapsed time per arm selection (ms) and per evaluation update over 2,665 updates (sec, log scale) for each policy, with 18 dimensions and 4 arms.]
• We measured the time elapsed for the arm selection and the evaluation updates of each policy.
• For arm selection, the LGSS bandits had an elapsed time of 0.24 ms.
• All measurements were less than 0.5 ms and are considered to show sufficient performance without significant impact.
• For the evaluation updates, the elapsed time was sufficiently short compared with the unit time (1 h).
• Some policies are not scalable to an increase in the number of trials per unit time.
Conclusion
• We proposed an LGSS-based MAB policy for meta-RSs that continuously and quickly optimizes the selection of recommendation methods according to the context and temporal change, using the Kalman filter.
• The experimental results demonstrated that the proposed policy can effectively reduce opportunity loss during evaluation and increase the cumulative number of clicks compared with baseline policies.
• The results also revealed that the proposed policy adds no overhead to the response time of the meta-RS.
• In the future, we will develop general-purpose policies that can quickly handle nonlinearity.