Building Resilient Elixir Systems

Presented at GigCity Elixir - 2018

This was my attempt at describing a methodology for building systems in Elixir that can handle failures at all levels. It touches on technology solutions as well as how to engage humans in those solutions.

Chris Keathley

October 27, 2018

Transcript

  1. Building resilient
    systems with stacking
    Chris Keathley / @ChrisKeathley / [email protected]

    View Slide

  2. Breaking resilient
    systems with stacking
    Chris Keathley / @ChrisKeathley / [email protected]

    View Slide

  3. Purely functional data
    structures explained
    Chris Keathley / @ChrisKeathley / [email protected]

    View Slide

  4. How to build reliable
    systems with your face
    (and not on your face)
    Chris Keathley / @ChrisKeathley / [email protected]

    View Slide

  5. How to boot your
    apps correctly
    Chris Keathley / @ChrisKeathley / [email protected]

    View Slide

  6. Scaling

    View Slide

  7. Scaling

    View Slide

  8. Scaling
    BEAM

    View Slide

  9. Resilience
    an ability to recover from or adjust easily to
    misfortune or change
    /ri-ˈzil-yən(t)s/

    View Slide

  10. View Slide

  11. Complex systems run in degraded mode.
    “…complex systems run as broken systems. The system continues to function because it contains
    so many redundancies and because people can make it function, despite the presence of many
    flaws… System operations are dynamic, with components (organizational, human, technical) failing
    and being replaced continuously.”

    View Slide

  12. System
    A group of interacting, interrelated, or
    interdependent elements forming a complex whole.
    /ˈsistəm/

    View Slide

  13. Systems have
    dependencies

    View Slide

  14. Systems

    View Slide

  15. Our App
    Systems

    View Slide

  16. Our App
    Webserver
    Systems

    View Slide

  17. Our App
    Webserver
    DB
    Systems

    View Slide

  18. Our App
    Webserver
    DB
    Redis
    Systems

    View Slide

  19. Our App
    Webserver
    DB
    Redis
    Kafka
    Systems

    View Slide

  20. Our App
    Systems

    View Slide

  21. Our App
    Systems

    View Slide

  22. Systems

    View Slide

  23. Systems
    Our App

    View Slide

  24. Systems
    Our App
    Other
    Service
    Other
    Service
    Other
    Service
    Other
    Service
    Other
    Service
    Other
    Service

    View Slide

  25. Scaling is a problem
    of handling failure

    View Slide

  26. Our App
    Systems
    Other
    Service
    Client

    View Slide

  27. Our App
    Systems
    Other
    Service
    Client

    View Slide

  28. Our App
    Systems
    Other
    Service
    Client

    View Slide

  29. Our App
    Systems
    Other
    Service
    Client

    View Slide

  30. Our App
    Systems
    Other
    Service
    Client

    View Slide

  31. Our App
    Systems
    Other
    Service
    Client

    View Slide

  32. Our App
    Systems
    Other
    Service
    Client

    View Slide

  33. Our App
    Systems
    Other
    Service
    Client

    View Slide

  34. Our App
    Systems
    Other
    Service
    Client

    View Slide

  35. Our App
    Systems
    Other
    Service
    Client

    View Slide

  36. Dependencies are
    more than other
    systems

    View Slide

  37. Systems
    Our App

    View Slide

  38. Systems
    Our App
    Humans!

    View Slide

  39. Handle failures gracefully
    Provide feedback to other systems
    Give insight to operators
    Systems Should…

    View Slide

  40. Our App
    Webserver
    DB
    Redis
    Kafka

    View Slide

  41. Our App
    Webserver
    DB
    Redis
    Kafka
    Stacked
    Design

    View Slide

  42. Let's talk about…

    View Slide

  43. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  44. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  45. Server

    View Slide

  46. Kubernetes

    View Slide

  47. Kubernetes
    Release

    View Slide

  48. Our App
    Webserver
    DB
    Redis
    Kafka

    View Slide

  49. Our App

    View Slide

  50. Releases are the unit
    of deployment in
    Erlang/Elixir

    View Slide

  51. What has to be here
    to start our
    application?

    View Slide

  52. App Boot

    View Slide

  53. App Boot
    Read in system
    configuration

    View Slide

  54. App Boot
    Read in system
    configuration Start the BEAM

    View Slide

  55. App Boot
    Read in system
    configuration Start the BEAM Start the App

    View Slide

  56. App Boot
    Start the App
    Read runtime
    configuration

    View Slide

  57. App Boot
    Start the App
    Read runtime
    configuration
    Proceed to next
    level

    View Slide

  58. App Boot
    Start the App
    Read runtime
    configuration
    Proceed to next
    level

    View Slide

  59. Mix config
    vs.
    runtime config

    View Slide

  60. View Slide

  61. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        children = []
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  62. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        config = [
          port: "PORT",
          db_url: "DB_URL"
        ]
        children = []
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  63. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        # Maps each config key to the environment variable that supplies it.
        config = [
          port: "PORT",
          db_url: "DB_URL"
        ]
        children = [
          {Jenga.Config, config}
        ]
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  64. defmodule Jenga.Config do
    end

    View Slide

  65. defmodule Jenga.Config do
      use GenServer
      def start_link(desired_config) do
        GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
      end
    end

    View Slide

  66. defmodule Jenga.Config do
      use GenServer
      def start_link(desired_config) do
        GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
      end
      def init(desired) do
        :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])
      end
    end

    View Slide

  67. defmodule Jenga.Config do
      use GenServer
      def start_link(desired_config) do
        GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
      end
      def init(desired) do
        :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])
        case load_config(:jenga_config, desired) do
          :ok ->
            {:ok, %{table: :jenga_config, desired: desired}}
        end
      end
    end

    View Slide

  68. defmodule Jenga.Config do
      use GenServer
      def start_link(desired_config) do
        GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
      end
      def init(desired) do
        :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])
        case load_config(:jenga_config, desired) do
          :ok ->
            {:ok, %{table: :jenga_config, desired: desired}}
          :error ->
            {:stop, :could_not_load_config}
        end
      end
    end

    View Slide

  69. defmodule Jenga.Config do
      use GenServer
      def start_link(desired_config) do
        GenServer.start_link(__MODULE__, desired_config, name: __MODULE__)
      end
      def init(desired) do
        :jenga_config = :ets.new(:jenga_config, [:set, :protected, :named_table])
        case load_config(:jenga_config, desired) do
          :ok ->
            {:ok, %{table: :jenga_config, desired: desired}}
          :error ->
            # Stopping here fails the supervisor start, so the app refuses to boot.
            {:stop, :could_not_load_config}
        end
      end
      # Copies each environment variable into ETS; gives up after 10 failed reads.
      defp load_config(table, config, retry_count \\ 0)
      defp load_config(_table, [], _), do: :ok
      defp load_config(_table, _, 10), do: :error
      defp load_config(table, [{k, v} | tail], retry_count) do
        case System.get_env(v) do
          nil ->
            load_config(table, [{k, v} | tail], retry_count + 1)
          value ->
            :ets.insert(table, {k, value})
            load_config(table, tail, retry_count)
        end
      end
    end

    View Slide

  70. ** (Mix) Could not start application jenga:
    Jenga.Application.start(:normal, [])
    returned an error: shutdown: failed to start child: Jenga.Config
    ** (EXIT) :could_not_load_config

    View Slide

  71. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  72. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  73. App

    View Slide

  74. App
    Load Balancer /up

    View Slide

  75. App
    Load Balancer /up
    Operators alarms

    View Slide

  76. App

    View Slide

  77. App

    View Slide

  78. App
    Phoenix

    View Slide

  79. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        config = [
          port: "PORT",
          db_url: "DB_URL"
        ]
        children = [
          {Jenga.Config, config}
        ]
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  80. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        config = [
          port: "PORT",
          db_url: "DB_URL"
        ]
        # Children start in order, so the endpoint boots only after config has loaded.
        children = [
          {Jenga.Config, config},
          JengaWeb.Endpoint
        ]
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  81. defmodule JengaWeb.Endpoint do
      use Phoenix.Endpoint, otp_app: :jenga
      def init(_key, config) do
        port = Jenga.Config.get(:port)
        {:ok, Keyword.put(config, :http, [:inet6, port: port])}
      end
    end

    View Slide
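
The endpoint above calls `Jenga.Config.get(:port)`, but the slides never show that function. A minimal sketch of how such a lookup might work, assuming the values live in a named ETS table as in the `Jenga.Config` slides. The module name `ConfigSketch`, the table name, and the shape of `get/1` are assumptions for illustration, not the talk's actual code:

```elixir
defmodule ConfigSketch do
  @moduledoc "Hypothetical stand-in for Jenga.Config; only the lookup path is sketched."
  use GenServer

  @table :config_sketch

  def start_link(values) do
    GenServer.start_link(__MODULE__, values, name: __MODULE__)
  end

  # Reads straight from ETS in the caller's process, so lookups never
  # queue behind the GenServer that owns the table.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> value
      [] -> nil
    end
  end

  @impl true
  def init(values) do
    # :protected lets any process read while only the owner can write.
    @table = :ets.new(@table, [:set, :protected, :named_table])
    true = :ets.insert(@table, values)
    {:ok, %{table: @table}}
  end
end
```

With this in place, `ConfigSketch.start_link(port: "4000")` followed by `ConfigSketch.get(:port)` returns the stored string, and an unknown key returns `nil`.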

  82. defmodule JengaWeb.UpController do
      use JengaWeb, :controller
      def up(conn, _params) do
        {code, message} = status()
        conn
        |> Plug.Conn.put_status(code)
        |> json(message)
      end
      defp status do
        {500, %{status: "LOADING"}}
      end
    end

    View Slide

  83. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  84. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  85. App
    Phoenix

    View Slide

  86. App
    Phoenix Database

    View Slide

  87. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  88. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn
    Disconnected

    View Slide

  89. Supervisors are
    about guarantees
    -“Friend of the show” Fred Hebert

    View Slide

  90. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  91. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  92. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  93. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  94. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  95. defmodule Jenga.DemoConnection do
      use GenServer
    end

    View Slide

  96. defmodule Jenga.DemoConnection do
      use GenServer
      def init(opts) do
        wait_for = 3_000 + backoff() + jitter()
        Process.send_after(self(), {:try_connect, opts}, wait_for)
        {:ok, %{state: :disconnected}}
      end
    end

    View Slide

  97. defmodule Jenga.DemoConnection do
      use GenServer
      def init(opts) do
        wait_for = 3_000 + backoff() + jitter()
        Process.send_after(self(), {:try_connect, opts}, wait_for)
        {:ok, %{state: :disconnected}}
      end
      def handle_info({:try_connect, opts}, state) do
        do_connect(opts)
        {:noreply, state}
      end
    end

    View Slide

  98. defmodule Jenga.DemoConnection do
      use GenServer
      def init(opts) do
        # Start disconnected and connect later, so a down database cannot fail boot.
        wait_for = 3_000 + backoff() + jitter()
        Process.send_after(self(), {:try_connect, opts}, wait_for)
        {:ok, %{state: :disconnected}}
      end
      def handle_info({:try_connect, opts}, state) do
        case do_connect(opts) do
          :ok ->
            {:noreply, %{state | state: :connected}}
          :error ->
            # Schedule another attempt instead of crashing the process.
            wait_for = 3_000 + backoff() + jitter()
            Process.send_after(self(), {:try_connect, opts}, wait_for)
            {:noreply, state}
        end
      end
    end

    View Slide
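
The connection slides call `backoff()` and `jitter()` without defining them. One possible shape for those helpers, sketched under assumptions: the module name `BackoffSketch`, the attempt-count argument, and every constant except the 3,000 ms base are hypothetical and do not come from the talk.

```elixir
defmodule BackoffSketch do
  import Bitwise, only: [bsl: 2]

  @base_ms 3_000          # fixed delay used on the slides
  @max_backoff_ms 30_000  # assumed cap on the extra delay
  @max_jitter_ms 1_000    # assumed upper bound for random jitter

  # Doubles the extra delay with each failed attempt (100, 200, 400, ...),
  # capped so a long outage does not push retries arbitrarily far apart.
  def backoff(attempt) when is_integer(attempt) and attempt >= 0 do
    min(@max_backoff_ms, 100 * bsl(1, min(attempt, 10)))
  end

  # Random 1..@max_jitter_ms so many connections do not retry in lockstep.
  def jitter, do: :rand.uniform(@max_jitter_ms)

  # Total wait in milliseconds before the next connection attempt.
  def wait_for(attempt), do: @base_ms + backoff(attempt) + jitter()
end
```

A connection process could keep the attempt count in its state and pass the result of `BackoffSketch.wait_for(attempt)` to `Process.send_after/3`, resetting the count once a connection succeeds.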

  99. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  100. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  101. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  102. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  103. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  104. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  105. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  106. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn
    Load Balancer

    View Slide

  107. defmodule JengaWeb.UpController do
      use JengaWeb, :controller
      def up(conn, _params) do
        {code, message} = status()
        conn
        |> Plug.Conn.put_status(code)
        |> json(message)
      end
      defp status do
        {500, %{status: "LOADING"}}
      end
    end

    View Slide

  108. defmodule JengaWeb.UpController do
      use JengaWeb, :controller
      def up(conn, _params) do
        {code, message} = status()
        conn
        |> Plug.Conn.put_status(code)
        |> json(message)
      end
      # Report healthy to the load balancer only once the database is reachable.
      defp status do
        case Jenga.Database.check_status() do
          :ok ->
            {200, %{status: "OK"}}
          _ ->
            {500, %{status: "LOADING"}}
        end
      end
    end

    View Slide

  109. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn
    Load Balancer

    View Slide

  110. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn

    View Slide

  111. App
    Phoenix
    Pool
    Supervisor
    Conn Conn Conn
    Operators alarms

    View Slide

  112. App
    Phoenix
    Pool
    supervisor
    Operators alarms
    db_supervisor
    Watchdog

    View Slide

  113. Watchdog

    View Slide

  114. Watchdog
    Good Bad
    Check DB Status

    View Slide

  115. Watchdog
    Good Bad
    Check DB Status Open alarm

    View Slide

  116. Watchdog
    Good Bad
    Check DB Status
    Close alarm
    Open alarm

    View Slide

  117. defmodule Jenga.Database.Watchdog do
      use GenServer
    end

    View Slide

  118. defmodule Jenga.Database.Watchdog do
      use GenServer
      def init(:ok) do
        schedule_check()
        {:ok, %{status: :degraded, passing_checks: 0}}
      end
    end

    View Slide

  119. defmodule Jenga.Database.Watchdog do
      use GenServer
      def init(:ok) do
        schedule_check()
        {:ok, %{status: :degraded, passing_checks: 0}}
      end
      def handle_info(:check_db, state) do
        status = Jenga.Database.check_status()
        state = change_state(status, state)
        schedule_check()
        {:noreply, state}
      end
    end

    View Slide

  120. defmodule Jenga.Database.Watchdog do
      use GenServer
      def init(:ok) do
        schedule_check()
        {:ok, %{status: :degraded, passing_checks: 0}}
      end
      def handle_info(:check_db, state) do
        status = Jenga.Database.check_status()
        state = change_state(status, state)
        schedule_check()
        {:noreply, state}
      end
      defp change_state(result, %{status: status, passing_checks: count}) do
      end
    end

    View Slide

  121. defmodule Jenga.Database.Watchdog do
      use GenServer
      def init(:ok) do
        schedule_check()
        {:ok, %{status: :degraded, passing_checks: 0}}
      end
      def handle_info(:check_db, state) do
        status = Jenga.Database.check_status()
        state = change_state(status, state)
        schedule_check()
        {:noreply, state}
      end
      defp change_state(result, %{status: status, passing_checks: count}) do
        case {result, status, count} do
          {:ok, :connected, count} ->
            if count == 3 do
              :alarm_handler.clear_alarm(@alarm_id)
            end
            %{status: :connected, passing_checks: count + 1}
          {:ok, :degraded, _} ->
            %{status: :connected, passing_checks: 0}
        end
      end
    end

    View Slide

  122. defmodule Jenga.Database.Watchdog do
      use GenServer
      def init(:ok) do
        schedule_check()
        {:ok, %{status: :degraded, passing_checks: 0}}
      end
      def handle_info(:check_db, state) do
        status = Jenga.Database.check_status()
        state = change_state(status, state)
        schedule_check()
        {:noreply, state}
      end
      defp change_state(result, %{status: status, passing_checks: count}) do
        case {result, status, count} do
          # Clear the alarm only after several consecutive passing checks.
          {:ok, :connected, count} ->
            if count == 3 do
              :alarm_handler.clear_alarm(@alarm_id)
            end
            %{status: :connected, passing_checks: count + 1}
          {:ok, :degraded, _} ->
            %{status: :connected, passing_checks: 0}
          # Raise the alarm on the first failure after being connected.
          {:error, :connected, _} ->
            :alarm_handler.set_alarm({@alarm_id, "We cannot connect to the database"})
            %{status: :degraded, passing_checks: 0}
          {:error, :degraded, _} ->
            %{status: :degraded, passing_checks: 0}
        end
      end
    end

    View Slide

  123. :alarm_handler.clear_alarm(@alarm_id)
    :alarm_handler.set_alarm({@alarm_id, "We cannot connect to the database"})

    View Slide

  124. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        config = [
          port: "PORT",
          db_url: "DB_URL"
        ]
        children = [
          {Jenga.Config, config},
          JengaWeb.Endpoint,
          Jenga.Database.Supervisor
        ]
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  125. defmodule Jenga.Application do
      use Application
      def start(_type, _args) do
        config = [
          port: "PORT",
          db_url: "DB_URL"
        ]
        # Replace the default SASL alarm handler with our own before starting children.
        :gen_event.swap_handler(
          :alarm_handler,
          {:alarm_handler, :swap},
          {Jenga.AlarmHandler, :ok}
        )
        children = [
          {Jenga.Config, config},
          JengaWeb.Endpoint,
          Jenga.Database.Supervisor
        ]
        opts = [strategy: :one_for_one, name: Jenga.Supervisor]
        Supervisor.start_link(children, opts)
      end
    end

    View Slide

  126. defmodule Jenga.AlarmHandler do
      require Logger
      def init({:ok, {:alarm_handler, _old_alarms}}) do
        Logger.info("Installing alarm handler")
        {:ok, %{}}
      end
      # set_alarm/1 takes {alarm_id, description}, so the event carries that tuple.
      def handle_event({:set_alarm, {:database_disconnected, _description}}, alarms) do
        send_alert_to_slack(database_alarm())
        {:ok, alarms}
      end
      # clear_alarm/1 takes only the alarm id.
      def handle_event({:clear_alarm, :database_disconnected}, alarms) do
        send_recovery_to_slack(database_alarm())
        {:ok, alarms}
      end
      def handle_event(event, state) do
        Logger.info("Unhandled alarm event: #{inspect(event)}")
        {:ok, state}
      end
    end

    View Slide
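
To see which events the SASL `:alarm_handler` actually delivers, a throwaway handler that echoes every event to a process can help. This is only a sketch: the module name `AlarmEchoSketch` and the idea of passing a pid as the handler argument are assumptions for illustration, and it presumes the `:sasl` application is running.

```elixir
defmodule AlarmEchoSketch do
  @behaviour :gen_event

  # When installed with :gen_event.swap_handler/3, init/1 receives
  # {args, result_of_old_handler_terminate}. Here args is the pid to notify.
  @impl true
  def init({notify_pid, _from_old_handler}), do: {:ok, notify_pid}

  # Forward every alarm event untouched so its exact shape can be inspected.
  @impl true
  def handle_event(event, notify_pid) do
    send(notify_pid, {:alarm_event, event})
    {:ok, notify_pid}
  end

  @impl true
  def handle_call(_request, notify_pid), do: {:ok, :ok, notify_pid}
end
```

After swapping it in with `:gen_event.swap_handler(:alarm_handler, {:alarm_handler, :swap}, {AlarmEchoSketch, self()})`, calling `:alarm_handler.set_alarm({:database_disconnected, "down"})` delivers `{:set_alarm, {:database_disconnected, "down"}}`, while `:alarm_handler.clear_alarm(:database_disconnected)` delivers `{:clear_alarm, :database_disconnected}`.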

  127. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  128. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  129. App
    Other
    Service
    Client
    External Services

    View Slide

  130. App
    Other
    Service
    Client
    External Services

    View Slide

  131. App
    Other
    Service
    Client
    External Services

    View Slide

  132. App
    Other
    Service
    Client
    External Services

    View Slide

  133. App
    Other
    Service
    Client
    External Services

    View Slide

  134. App
    Other
    Service
    Client
    External Services

    View Slide

  135. App
    Other
    Service
    Client
    External Services

    View Slide

  136. Circuit
    Breakers

    View Slide

  137. defmodule Jenga.ExternalService do
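      def fetch(params) do
        # Ask the circuit breaker first; skip the call entirely if the fuse is blown.
        with :ok <- :fuse.ask(@fuse, :async_dirty),
             {:ok, result} <- make_call(params) do
          {:ok, result}
        else
          {:error, e} ->
            # Record the failure so repeated errors eventually blow the fuse.
            :ok = :fuse.melt(@fuse)
            {:error, e}
          :blown ->
            {:error, :service_is_down}
        end
      end
    end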

    View Slide

  138. App
    Other
    Service
    Client
    External Services

    View Slide

  139. App
    Other
    Service
    Client
    External Services

    View Slide

  140. App
    Other
    Service
    Client
    External Services
    ETS

    View Slide

  141. App
    Other
    Service
    Client
    External Services
    ETS

    View Slide

  142. App
    Other
    Service
    Client
    External Services
    ETS

    View Slide

  143. App
    Other
    Service
    Client
    External Services
    ETS

    View Slide

  144. Circuit
    Breakers

    View Slide

  145. Additive Increase
    Multiplicative Decrease

    View Slide
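
The slide names additive increase / multiplicative decrease without showing code. A minimal sketch of the rule, assuming it is applied to a concurrency limit for calls to an external service. The module name `AimdSketch`, the bounds, the step of one, and the halving factor are all hypothetical choices, not values from the talk:

```elixir
defmodule AimdSketch do
  @min_limit 1
  @max_limit 100

  # Additive increase: each success raises the limit by one, up to @max_limit.
  def on_success(limit), do: min(@max_limit, limit + 1)

  # Multiplicative decrease: each failure or timeout halves the limit,
  # never dropping below @min_limit.
  def on_failure(limit), do: max(@min_limit, div(limit, 2))
end
```

Starting from a limit of 10, a success moves it to 11 and a failure moves it to 5, so the limit backs off quickly under trouble and recovers gradually.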

  146. Let's talk about…
    Booting the runtime & Configuration
    Starting dependencies
    Connecting to external systems
    Alarms and feedback
    Communicating with services we don’t control

    View Slide

  147. We booted our
    application!

    View Slide

  148. Now what?

    View Slide

  149. Handle failures gracefully
    Provide feedback to other systems
    Give insight to operators
    Systems Should…

    View Slide

  150. Handle failures gracefully
    Provide feedback to other systems
    Give insight to operators
    Systems Should…

    View Slide

  151. Handle failures gracefully
    Provide feedback to other systems
    Give insight to operators
    Systems Should…

    View Slide

  152. Handle failures gracefully
    Provide feedback to other systems
    Give insight to operators
    Systems Should…

    View Slide

  153. Handle failures gracefully
    Provide feedback to other systems
    Give insight to operators
    Systems Should…

    View Slide

  154. We have powerful tools in our runtime

    View Slide

  155. Take advantage of them to build more
    robust systems

    View Slide

  156. Thanks
    Chris Keathley / @ChrisKeathley / keathey.io

    View Slide