
Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System

IEEE SMC 2023
https://ieeesmc2023.org/
October 1-4, 2023, Honolulu, Oahu, Hawaii, USA

monochromegane

October 09, 2023


Transcript

  1. Yusuke Miyake [1][2] and Tsunenori Mine [1] / [1] Kyushu University, [2] Pepabo R&D Institute, GMO Pepabo, Inc.


    October 1-4, 2023. The 2023 IEEE International Conference on Systems, Man, and Cybernetics
    Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System


  2. Agenda

    1. Introduction
    2. Related works
    3. Proposal
    4. Empirical evaluation on real data and results
    5. Conclusion

  3. 1. Introduction

  4. Background

    • Selecting the optimal recommendation method from among many candidates is crucial for an EC site.

    • Their effectiveness cannot be known in advance.

    • Continuous comparative evaluation in an actual environment is essential.

    • However, such evaluation results in opportunity loss.

    [Diagram: users, items, and candidate recommendation methods, with the goal of delivering the best recommendation.]

  5. Previous studies

    • The reduction of opportunity loss can be formulated as a multi-armed bandit (MAB) problem.

    • Many meta-recommender systems (meta-RSs) that use multi-armed bandit policies to select the most effective recommendation method have been proposed.

    [Diagram: a multi-armed bandit selects the best recommendation method for delivering items to users.]

  6. Issue in previous studies

    • The MAB policy used by meta-RSs must take into account the following characteristics of recommendation methods:

    1. The effectiveness of a recommendation method changes over time.

    2. The performance of a recommendation method depends on the context.

    3. The performance also depends on response time.

    • No policy that satisfies all three requirements based on these characteristics exists yet.

  7. Proposed MAB policy for the meta-RS

    Linear Gaussian state space bandits (LGSS bandits)

    • The policy addresses all three factors causing opportunity loss in the selection of recommendation methods:

    1. For temporal change, changes in effectiveness are modeled with an LGSS model.

    2. For context, a novel context-aware coefficient matrix for the LGSS model is designed.

    3. For response time, a linear Kalman filter is used to estimate the state.

    • A probability matching method from the bandit literature is used to find the best recommendation method based on the estimated state.

  8. 2. Related works

  9. Multi-armed bandit (MAB) problem

    • The player must select one arm from multiple arms to maximize the reward.

    • The reward is given stochastically based on the chosen arm.

    • To find the optimal arm, the player should balance exploitation and exploration.

    • Two types of extended problems have been developed:

    • The contextual MAB problem

    • The nonstationary MAB problem

  10. Basic problem setting

    [Diagram: a player selects among Arm0, Arm1, and Arm2; each arm has a true probability distribution of rewards and an estimated distribution; the player aims to find the best arm.]

  11. Contextual problem setting

    [Diagram: with context = 0, the player selects the best arm for context = 0; each arm's reward distribution depends on the context.]

  12. Contextual problem setting (cont.)

    [Diagram: with context = 1, the best arm differs from the best arm for context = 0.]

  13. Nonstationary problem setting

    [Diagram: the arms' reward distributions change between t = 0-99 and t = 100 onward; at t = 99 the player selects the best arm for the current period.]

  14. Nonstationary problem setting (cont.)

    [Diagram: at t = 199 the best arm has switched, because the reward distributions changed at t = 100.]

  15. Meta-recommender systems (meta-RSs)

    Policies of meta-RSs

    • Studies have used meta-RSs to select the best recommendation method based on evaluation in the actual environment.

    • However, these studies did not employ a contextual and nonstationary policy.

    Existing policies

    • Existing contextual and nonstationary policies cannot satisfy one requirement, response time, because their mechanisms for tracking change are time-consuming.

    • By contrast, a policy that satisfies the response-time requirement with a lightweight mechanism cannot satisfy the other requirement, tracking temporal change.

  16. 3. Proposal

  17. Modeling reward using the LGSS model

    • The proposed policy assumes the contextual and nonstationary reward model given by the following linear Gaussian state space (LGSS) model:

    $$r_t = Z_t \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),$$
    $$\alpha_{t+1} = T_t \alpha_t + \eta_t R_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau,$$
    $$\alpha_1 \sim N_d(\mu_1, P_1)$$

    • The model represents that the state $\alpha_t$ follows the previous state $\alpha_{t-1}$, and that the reward $r_t$ follows the state $\alpha_t$.

    [Diagram: graphical model in which $\alpha_t$ evolves to $\alpha_{t+1}$ through $T_t$ and $\eta_t R_t$, and emits the reward $r_t$ through $Z_t$ and $\epsilon_t$.]

  18. Modeling reward using the LGSS model

    • The proposed policy assumes the contextual and nonstationary reward model given by the LGSS model shown on the previous slide.

    Advantages

    • The model can naturally represent the temporal change of the state.

    • The design of the coefficient matrices ($Z$, $T$, $R$) provides flexible representations for the contextual and nonstationary problem setting.

    • A lightweight estimation algorithm does not affect response time.

    (A minimal simulation of this model is sketched below.)
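    A minimal NumPy sketch of the LGSS reward model above, for illustration only: the dimensions, noise scales, and the placeholder random $Z_t$ are assumptions, not the authors' implementation.

        import numpy as np

        rng = np.random.default_rng(0)
        m, tau = 3, 100                    # state dimension and horizon (assumed)
        sigma_eps, sigma_eta = 0.1, 0.05   # noise standard deviations (assumed)

        alpha = rng.normal(size=m)         # alpha_1 ~ N_d(mu_1, P_1); here mu_1 = 0, P_1 = I
        rewards = []
        for t in range(tau):
            Z = rng.normal(size=m)         # placeholder coefficient vector Z_t
            T, R = np.eye(m), np.ones(m)   # T_t = I (random walk); R_t spreads the state noise
            r = Z @ alpha + rng.normal(0.0, sigma_eps)          # r_t = Z_t alpha_t + eps_t
            rewards.append(r)
            alpha = T @ alpha + rng.normal(0.0, sigma_eta) * R  # alpha_{t+1} = T_t alpha_t + eta_t R_t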

  19. Context-aware coefficient matrices for the LGSS model

  20. Context-aware coefficient matrix

    • The designed matrices handle time-varying contexts in the LGSS model.

    • These matrices vary according to the context $x_t$ at each time point $t$:

    $$Z_t = x_t^\top, \quad R_t = x_t, \quad T_t = I_m, \qquad x_t \in \{0, 1\}^m$$

  21. Context-aware coefficient matrix

    • The designed matrices handle time-varying contexts in the LGSS model, varying according to the context $x_t$ at each time point $t$.

    • When these matrices are applied to the model:

    $$r_t = x_t^\top \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),$$
    $$\alpha_{t+1} = \alpha_t + \eta_t x_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau$$

    • This equation is consistent with the reward formulation in the contextual MAB.

    • It enables the estimated state $\alpha_t$ (= $\theta_t$) to be used to solve the MAB.

  22. Context-aware coefficient matrix

    • In the transition $\alpha_{t+1} = \alpha_t + \eta_t x_t$, the error $\eta_t$ is added only to the elements of the state $\alpha_t$ that correspond to the context $x_t$.

    • This design enables the modeling of state transitions according to the frequency of occurrence of context elements, which is expected to stabilize the accuracy of predictions. (A one-step sketch follows below.)
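    A minimal sketch of one step under these context-aware coefficients, assuming a one-hot context and hypothetical noise scales (illustrative, not the paper's code):

        import numpy as np

        rng = np.random.default_rng(1)
        m = 18                             # number of contexts (product categories)
        alpha = rng.normal(size=m)         # current state alpha_t (= theta_t)
        sigma_eps, sigma_eta = 0.1, 0.05   # assumed noise standard deviations

        x = np.zeros(m); x[4] = 1.0        # one-hot context x_t in {0, 1}^m
        Z, R, T = x, x, np.eye(m)          # Z_t = x_t^T, R_t = x_t, T_t = I_m

        r = Z @ alpha + rng.normal(0.0, sigma_eps)          # r_t = x_t^T alpha_t + eps_t
        alpha = T @ alpha + rng.normal(0.0, sigma_eta) * R  # noise enters only the active element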

  23. Context-aware coefficient matrix

    • The design is based on the structural time-series model using an LGSS model.

    • For example, if a trend can be assumed for the state transition, the design can be extended to the following matrices (a construction sketch follows below):

    $$Z_t = (x_t^\top, 0_{1 \times m}), \quad
    R_t = \begin{pmatrix} x_t & 0_{m \times 1} \\ 0_{m \times 1} & x_t \end{pmatrix}, \quad
    T_t = \begin{pmatrix} I_m & \mathrm{diag}(x_t) \\ 0_{m \times m} & I_m \end{pmatrix},
    \qquad x_t \in \{0, 1\}^m$$

    • The first $m$ elements of the augmented state remain the reward parameters: $\alpha_t[0{:}m] = \theta_t$.
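    A sketch of how the trend-extended coefficients could be built with NumPy block matrices (dimensions assumed; illustrative only):

        import numpy as np

        m = 18
        x = np.zeros(m); x[4] = 1.0                     # one-hot context x_t

        Z = np.concatenate([x, np.zeros(m)])            # Z_t = (x_t^T, 0_{1 x m})
        R = np.block([[x[:, None],       np.zeros((m, 1))],
                      [np.zeros((m, 1)), x[:, None]]])  # R_t: 2m x 2 noise selector
        T = np.block([[np.eye(m),        np.diag(x)],
                      [np.zeros((m, m)), np.eye(m)]])   # T_t adds the trend to the level
        # The first m elements of the augmented state remain the reward
        # parameters: alpha_t[0:m] = theta_t.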

  24. MAB policy using the Kalman filter

  25. State estimation using the linear Kalman filter

    • The Kalman filter continuously estimates the state by repeating one-step-ahead prediction and filtering.

    • The state $\alpha$ follows a multivariate normal distribution $\mathcal{N}_m$ with mean $\mu$ and variance-covariance $P$.

    Cycle of state updates in the linear Kalman filter (then $t = t+1$); a code sketch follows below:

    Prediction of observation:
    $$\hat{r}_t = Z_t \mu_t$$

    Error of observation:
    $$v_t = r_t - \hat{r}_t, \quad F_t = Z_t P_t Z_t^\top + \sigma_\epsilon^2$$

    Filtering:
    $$\mu_{t|t} = \mu_t + K v_t = \mu_t + (P_t Z_t^\top F_t^{-1}) v_t, \quad P_{t|t} = P_t - K F_t K^\top$$

    One-step-ahead prediction:
    $$\mu_{t+1} = T_t \mu_{t|t}, \quad P_{t+1} = T_t P_{t|t} T_t^\top + \sigma_\eta^2 R_t R_t^\top$$
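    A minimal sketch of one Kalman cycle for a scalar reward, following the equations above; shapes and names are assumptions (state mean mu is (m,), covariance P is (m, m), Z and R are vectors as in the context-aware design):

        import numpy as np

        def kalman_step(mu, P, Z, T, R, r, sigma_eps2, sigma_eta2):
            # Prediction of observation and its error
            r_hat = Z @ mu                 # r_hat_t = Z_t mu_t
            v = r - r_hat                  # v_t = r_t - r_hat_t
            F = Z @ P @ Z + sigma_eps2     # F_t = Z_t P_t Z_t^T + sigma_eps^2 (scalar)
            # Filtering
            K = (P @ Z) / F                # Kalman gain K = P_t Z_t^T F_t^{-1}
            mu_f = mu + K * v              # mu_{t|t} = mu_t + K v_t
            P_f = P - np.outer(K, K) * F   # P_{t|t} = P_t - K F_t K^T
            # One-step-ahead prediction
            mu_next = T @ mu_f             # mu_{t+1} = T_t mu_{t|t}
            P_next = T @ P_f @ T.T + sigma_eta2 * np.outer(R, R)
            return mu_next, P_next         # P_{t+1} = T_t P_{t|t} T_t^T + sigma_eta^2 R_t R_t^T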

  26. Arm selection by the probability matching method

    The probability matching method uses the state estimates from the Kalman filter (a selection sketch follows below):

    1. The mean $\mu_t$ and variance-covariance $P_t$ of the state $\alpha_t$ (= $\theta_t$) are estimated for each arm.

    2. The parameters $\tilde{\theta}_t^{(k)}$ for each arm $k$ are sampled from $\mathcal{N}_m$ using these estimates.

    3. The arm with the largest inner product between the sampled parameter $\tilde{\theta}_t^{(k)}$ and the context $x_t$ is selected.
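    A sketch of this selection step, assuming per-arm Gaussian state estimates (names and shapes are illustrative):

        import numpy as np

        def select_arm(mus, Ps, x, rng):
            """mus: per-arm state means (m,); Ps: per-arm covariances (m, m); x: context."""
            scores = []
            for mu, P in zip(mus, Ps):
                theta = rng.multivariate_normal(mu, P)  # theta^(k) ~ N_m(mu_t, P_t)
                scores.append(theta @ x)                # sampled reward for this context
            return int(np.argmax(scores))               # arm with the largest inner product

        rng = np.random.default_rng(2)
        m, n_arms = 18, 4
        mus = [np.zeros(m) for _ in range(n_arms)]
        Ps = [np.eye(m) for _ in range(n_arms)]
        x = np.zeros(m); x[0] = 1.0
        chosen = select_arm(mus, Ps, x, rng)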

  27. Virtual exploration using missing observations

  28. State estimation using the linear Kalman filter (again)

    • The Kalman filter continuously estimates the state by repeating one-step-ahead prediction and filtering (the cycle shown on slide 25).

    • The state $\alpha$ follows a multivariate normal distribution $\mathcal{N}_m$ with mean $\mu$ and variance-covariance $P$.

    • Within the cycle, filtering reduces the values of $P$, whereas one-step-ahead prediction increases the values of $P$.

  29. Virtual exploration using missing observations

    • The Kalman filter can handle missing observations appropriately: when no reward is observed, filtering keeps the values of $\mu$ and $P$ unchanged,

    $$\mu_{t|t} = \mu_t, \quad P_{t|t} = P_t,$$

    while the one-step-ahead prediction still increases the values of $P$.

    • The proposed policy incorporates this process as the update operation for the unselected arms in the MAB problem.

  30. Virtual exploration using missing observations

    • By introducing this behavior into the policy, exploration of the unselected arms is encouraged through the mechanism of the probability matching method: the covariance $P$ of an unselected arm keeps growing until that arm is tried again. (A sketch of this update loop follows below.)
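    A sketch of the full update loop with virtual exploration: the chosen arm runs the filtering step, unselected arms skip it (mu_{t|t} = mu_t, P_{t|t} = P_t), and the prediction step then inflates their P. Shapes and names are assumptions:

        import numpy as np

        def predict(mu, P, T, R, sigma_eta2):
            # One-step-ahead prediction (increases the values of P)
            return T @ mu, T @ P @ T.T + sigma_eta2 * np.outer(R, R)

        def filter_obs(mu, P, Z, r, sigma_eps2):
            # Filtering with an observed reward (reduces the values of P)
            v = r - Z @ mu
            F = Z @ P @ Z + sigma_eps2
            K = (P @ Z) / F
            return mu + K * v, P - np.outer(K, K) * F

        def update_all_arms(mus, Ps, chosen, x, r, sigma_eps2, sigma_eta2):
            Z, R, T = x, x, np.eye(len(x))  # context-aware coefficients
            for k in range(len(mus)):
                if k == chosen:
                    mu_f, P_f = filter_obs(mus[k], Ps[k], Z, r, sigma_eps2)
                else:
                    mu_f, P_f = mus[k], Ps[k]  # missing observation: keep mu and P
                mus[k], Ps[k] = predict(mu_f, P_f, T, R, sigma_eta2)
            return mus, Ps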

  31. 4. Empirical evaluation on real data and results

  32. Evaluation setting

    • The data were collected from an actual EC site.

    • Each product is assigned to one of 18 categories.

    • Four recommendation methods are used: browsing path, demographic, LLR, and similar image.

    • The data comprise hourly counts of the recommendations of each method and of the clicks on each recommendation in each product category, from June 20, 2019, through July 22, 2019.

    • The total number of recommendations is 1,488,626.

    [Fig. 2. Change of CTR for each recommendation method over roughly 600 hours: the method with the highest CTR switches over time, and the typical characteristics of each recommendation method appear.]

  33. Evaluation setting

    • We simulated the number of clicks obtained by the selected recommendation method during the period in which the data were acquired.

    • Assumptions (a simulation sketch follows below):

    • User clicks on recommendations follow a Bernoulli distribution with the CTR as its parameter, where the recommendation methods are selected by the MAB policy.

    • Each recommendation method $k$ has an 18-dimensional parameter $\tilde{\theta}_t^{(k)}$, matching the number of product categories.

    • The CTR is calculated as the inner product of $\tilde{\theta}_t^{(k)}$ and the context $x_t$.

    • The context $x_t$ is represented as a 1-hot vector of the product category that the user is browsing at time $t$.
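    A minimal simulation sketch of this setting, with a random policy as a stand-in for the bandit (all parameter values are hypothetical; the study itself uses logged data from an actual EC site):

        import numpy as np

        rng = np.random.default_rng(3)
        n_categories, n_arms, horizon = 18, 4, 1000

        # Each arm k has an 18-dimensional CTR parameter theta^(k) (static here)
        theta = rng.uniform(0.02, 0.15, size=(n_arms, n_categories))

        clicks = 0
        for t in range(horizon):
            x = np.zeros(n_categories)
            x[rng.integers(n_categories)] = 1.0  # 1-hot browsing-category context
            k = rng.integers(n_arms)             # stand-in for the MAB policy
            ctr = theta[k] @ x                   # CTR = inner product of theta and x
            clicks += rng.binomial(1, ctr)       # click ~ Bernoulli(CTR)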

  34. Effect of context and temporal change

    [Figure: cumulative regret and best arm rate over 600 hours for adaptive Thompson sampling, decay LinUCB, dynamic LinUCB, LGSS bandits, linear Thompson sampling, LinUCB, neural linear, select-best-arm-at-first, and select-random-arm.]

    • The proposed LGSS bandits achieved the highest cumulative reward, 201,330.4.

    • LGSS bandits and adaptive Thompson sampling reduce overall opportunity loss, not only while the best arm is switching but also during stationary periods.

  35. Effect of context and temporal change

    [Figure: change of CTR and best arm rate for the Plants and Stationery categories over 600 hours, comparing adaptive Thompson sampling, dynamic LinUCB, LGSS bandits, and linear Thompson sampling, with best arm rates also shown for three sub-periods.]

    • We compared the ratio of optimal arm selections between categories where the recommendation method with the highest CTR switches and categories where it does not.

    • The proposed policy was shown to continuously select the best arm in both situations.

  36. Effect of the response time

    Elapsed time for 2,665 trials with 18 dimensions and 4 arms:

    Policy                     | Select (ms per time) | Update (s per 2,665 times)
    adaptive thompson sampling | 0.48                 | 1.49
    decay linucb               | 0.19                 | 0.04
    dynamic linucb             | 0.04                 | 0.06
    LGSS bandits               | 0.24                 | 0.09
    linear thompson sampling   | 0.29                 | 0.04
    linucb                     | 0.04                 | 0.05
    neural linear              | 0.27                 | 1.82

    • We measured the elapsed time of arm selection and of evaluation updates for each policy.

    • For arm selection, the LGSS bandits had an elapsed time of 0.24 ms.

    • All policies took less than 0.5 ms per selection, which is considered sufficient performance without significant impact.

    • For evaluation updates, the elapsed time was sufficiently short compared with the unit time (1 h).

    • However, some policies are not scalable to an increase in the number of trials per unit time.

  37. 5. Conclusion

  38. Conclusion

    • We proposed an LGSS-based MAB policy for meta-RSs that uses the Kalman filter to optimize the selection of recommendation methods continuously and quickly according to context and temporal change.

    • The experimental results demonstrated that the proposed policy can effectively reduce opportunity loss during evaluation and increase the cumulative number of clicks compared with baseline policies.

    • The results also revealed that the proposed policy adds no overhead to the response time of the meta-RS.

    • In the future, we will develop general-purpose policies that can quickly handle nonlinearity.
