
Deconstructing an Abstraction to Reconstruct an Outage (SREcon EMEA 2023 edition)


We all rely on abstractions to build the applications we use day-to-day. It's easy for those abstractions to feel like impenetrable walls, hiding scary low-level parts of the system - especially for a complex piece of software like a database. That needn't be the case!

In this talk, we'll explore the aftermath of a complex outage in a Postgres cluster. We'll retrace the steps we took to reliably reproduce the failure in a local environment and pull out lessons about debugging complex systems along the way. At one point, we'll dive into the depths of how Postgres represents data on disk and realise that even unfamiliar layers of a system don't need to be scary.

Chris Sinjakli

October 12, 2023

Transcript

  1. Deconstructing an Abstraction to Reconstruct an Outage
    sinjo.dev


  2. A familiar


    story 📚


  3. (no text on this slide)

  4. (no text on this slide)

  5. (no text on this slide)

  6. (no text on this slide)

  7. [Chart: API response status (2xx and 5xx) as a percentage over time]


  8. DB::ConnectionFailure - could not connect to server: Connection refused 💥


  9. (no text on this slide)

  10. Hi


  11. sinjo.dev


  12. sinjo.dev


  13. Infra Engineer


  14. Databases &


    Distributed Systems


    😍


  15. (no text on this slide)

  16. (no text on this slide)

  17. Deconstructing an Abstraction to Reconstruct an Outage
    sinjo.dev


  18. First:


    Our cluster setup


  19. Postgres
    API backend


  20. Postgres
    Postgres
    Postgres
    Repl Repl
    API backend


  21. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    API backend


  22. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  23. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  24. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  25. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    API backend
    VIP


  26. Postgres
    Postgres
    Postgres
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker
    API backend


  27. Postgres
    Postgres
    Postgres
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker
    API backend


  28. Note:


    One replica

    always synchronous


  29. So...


  30. So...
    Unfortunately...


  31. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  32. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  33. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  34. Except it


    didn't


  35. Our API


    was down


  36. Fallback:


    fully manual setup


  37. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend
    👩💻


  38. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend
    👩💻


  39. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend
    👩💻
    Repl


  40. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    👩💻
    Repl
    API backend


  41. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    👩💻
    Repl
    API backend
    Repl


  42. We're safe,


    for now...


  43. But only one


    failure away


    from downtime


  44. Mission:


    Recreate the outage


  45. There's a lot


    We'll go step-by-step


  46. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  47. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  48. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  49. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  50. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  51. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive
    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180

    Suspicious log on synchronous replica


  52. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  53. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  54. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  55. Everyone's
    favourite fault-
    injection tool


  56. You know


    it well...


  57. KILL(1)                General Commands Manual                KILL(1)

    NAME
        kill – terminate or signal a process

    SYNOPSIS
        kill [-s signal_name] pid ...
        kill -l [exit_status]
        kill -signal_name pid ...
        kill -signal_number pid ...

    DESCRIPTION
        The kill utility sends a signal to the processes specified by the
        pid operands.

        Only the super-user may send signals to other users' processes.

        The options are as follows:


  58. (no text on this slide)

  59. # on primary - hard kill


    kill -SIGKILL


    # on synchronous replica - subprocess crash


    kill -SIGABRT


  60. # on primary - hard kill


    kill -SIGKILL


    # on synchronous replica - subprocess crash


    kill -SIGABRT

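The two commands on the slide leave out the target pid. As a rough sketch of the same kind of fault injection driven from Ruby, assuming a POSIX system where a `sleep` process stands in for Postgres (the helper name `signal_child` is made up for this example):

```ruby
# Spawn a throwaway child process, send it a signal, and report how it died.
# `sleep` is only a stand-in for a Postgres process in this illustration.
def signal_child(signal_name)
  pid = Process.spawn("sleep", "60")
  Process.kill(signal_name, pid)    # same effect as `kill -SIGNAME pid`
  _, status = Process.wait2(pid)    # reap the child and collect its exit status
  status
end

status = signal_child("KILL")
puts "signaled=#{status.signaled?} termsig=#{status.termsig}"
```

Checking `termsig` after the fact is a cheap way to confirm the process really died from the injected signal rather than exiting on its own.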

  61. We kept our


    expectations


    low...


  62. ...which was


    the right


    choice


  63. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  64. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  65. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive
    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180

    Suspicious log on synchronous replica


  66. What do we
    mean by "log"?


  67. [2023-02-26 23:02:37Z] GET / - 200


    [2023-02-26 23:02:49Z] GET /favicon.ico - 200


    [2023-02-26 23:02:52Z] POST /login - 200


    [2023-02-26 23:33:52Z] POST /posts - 201


    [2023-02-26 23:33:57Z] GET /posts/binary-logs—talk - 200
    What we normally mean by logs


  68. A different kind
    of log:


    binary logs


  69. INSERT INTO users VALUES ('codd');


    INSERT INTO users VALUES ('lovelace');


    INSERT INTO users VALUES ('turing');
    Some extremely boring SQL


  70. (no text on this slide)

  71. (no text on this slide)

  72. Warning:


    simplifying lie ahead


  73. INSERT INTO users VALUES ('codd');


    INSERT INTO users VALUES ('lovelace');


    INSERT INTO users VALUES ('turing');





    Wrote 'codd' into table 'users'


    Wrote 'lovelace' into table 'users'


    Wrote 'turing' into table 'users'
    A different kind of logs

    (if they were textual)


  74. Postgres calls these
    "Write Ahead Logs"


    (WALs)


  75. But why bother
    doing that?


  76. Crash safety


  77. Index Table
    id username
    1 codd
    2 lovelace
    id
    1
    2


  78. Index Table
    id username
    1 codd
    2 lovelace
    3 turing
    id
    1
    2


  79. Index Table
    id
    1
    2
    💥 .
    id username
    1 codd
    2 lovelace
    3 turing


  80. Index Table
    id
    1
    2
    ???
    id username
    1 codd
    2 lovelace
    3 turing


  81. We can replay this operation
    INSERT INTO users VALUES ('codd');


    INSERT INTO users VALUES ('lovelace');


    INSERT INTO users VALUES ('turing');





    Wrote 'codd' into table 'users'


    Wrote 'lovelace' into table 'users'


    Wrote 'turing' into table 'users'


  82. Index Table
    id
    1
    2
    ???
    id username
    1 codd
    2 lovelace
    3 turing


  83. Index Table
    id
    1
    2
    3
    id username
    1 codd
    2 lovelace
    3 turing

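The crash-and-replay sequence in the last few slides can be sketched as a toy model. This is a deliberately simplified illustration in the same spirit as the "textual" log above, not how Postgres stores or replays WAL; the class and method names are invented for the example.

```ruby
# A toy write-ahead log: every change is appended to the log before it is
# applied to the table and the index, so a crash between the two writes can
# be repaired by replaying the log.
class ToyDatabase
  attr_reader :wal, :table, :index

  def initialize
    @wal = []     # ordered list of [id, username] records
    @table = {}   # id => username
    @index = []   # ids present in the index
  end

  # crash_before_index simulates the process dying after the table write.
  def insert(id, username, crash_before_index: false)
    @wal << [id, username]
    @table[id] = username
    return if crash_before_index
    @index << id
  end

  # Re-apply every logged record; applying a record twice is harmless.
  def replay
    @wal.each do |id, username|
      @table[id] = username
      @index << id unless @index.include?(id)
    end
  end
end

db = ToyDatabase.new
db.insert(1, "codd")
db.insert(2, "lovelace")
db.insert(3, "turing", crash_before_index: true)
puts db.index.inspect   # the index is missing the last row
db.replay
puts db.index.inspect   # the index matches the table again
```

The important property is that replay is idempotent: records that were already applied before the crash can be applied again without changing the result.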

  84. Also:


    replication


  85. Postgres
    Postgres
    Postgres
    Repl Repl
    API backend


  86. Postgres
    Postgres
    Postgres
    Repl Repl
    API backend
    WALs


  87. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive
    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180

    Suspicious log on synchronous replica


  88. Primary
    WAL archival
    Replica
    archive_command
    restore_command

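The archived file name and the position in the error message are related: the position falls inside the file that was just restored. A minimal sketch of that mapping, assuming the default 16 MiB WAL segment size and reading timeline 2 off the first eight hex digits of the file name (`wal_segment_name` is a hypothetical helper, not a Postgres API):

```ruby
# Assumed default WAL segment size of 16 MiB; a custom build may differ.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOG_ID = 0x1_0000_0000 / WAL_SEGMENT_SIZE

# Convert a position such as "0/3000180" plus a timeline ID into the name of
# the WAL segment file that contains it.
def wal_segment_name(timeline, lsn)
  high, low = lsn.split("/").map { |part| part.to_i(16) }
  position = (high << 32) | low
  segment_number = position / WAL_SEGMENT_SIZE
  format("%08X%08X%08X",
         timeline,
         segment_number / SEGMENTS_PER_XLOG_ID,
         segment_number % SEGMENTS_PER_XLOG_ID)
end

puts wal_segment_name(2, "0/3000180")
```

With those assumptions the helper returns the same 24-character name that appears in the "restored log file" line.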

  89. Issue restoring WAL





    Cause of failure to
    promote replica?


  90. We already had
    those writes!


  91. Just because something
    shouldn't happen


    doesn't mean it


    didn't happen


  92. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive
    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180

    Suspicious log on synchronous replica


  93. I had zero experience


    working with


    binary


    formats


  94. None of it


    is magic


  95. We can cheat:


    Postgres is


    open source


  96. But!


  97. These techniques
    also work on closed
    source software


  98. We just call that
    reverse engineering


  99. $ git checkout REL9_4_26 # we were running 9.4
    $ git grep -n "invalid record length"
    src/backend/access/transam/xlogreader.c:295: [...]
    src/backend/access/transam/xlogreader.c:604: [...]
    src/backend/access/transam/xlogreader.c:678: [...]

    Let's find the error


  100. src/backend/access/transam/xlogreader.c:291-300:

        {
            /* XXX: more validation should be done here */
            if (total_len < SizeOfXLogRecord)
            {
                report_invalid_record(state, "invalid record length at %X/%X",
                                      (uint32) (RecPtr >> 32), (uint32) RecPtr);
                goto err;
            }
            gotheader = false;
        }

    Let's find the error


  101. src/backend/access/transam/xlogreader.c:291-300:

        {
            /* XXX: more validation should be done here */
            if (total_len < SizeOfXLogRecord)
            {
                report_invalid_record(state, "invalid record length at %X/%X",
                                      (uint32) (RecPtr >> 32), (uint32) RecPtr);
                goto err;
            }
            gotheader = false;
        }

    Let's find the error


  102. src/backend/access/transam/xlogreader.c:291-300:

        {
            /* XXX: more validation should be done here */
            if (total_len < SizeOfXLogRecord)
            {
                report_invalid_record(state, "invalid record length at %X/%X",
                                      (uint32) (RecPtr >> 32), (uint32) RecPtr);
                goto err;
            }
            gotheader = false;
        }

    Let's find the error


  103. src/include/access/xlog.h:58:

        #define SizeOfXLogRecord MAXALIGN(sizeof(XLogRecord))

    Let's find the error


  104. Wouldn't it be convenient
    if we could make
    total_len == 0?


  105. src/backend/access/transam/xlogreader.c:272-273:

        record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
        total_len = record->xl_tot_len;

    Let's find the error


  106. src/include/access/xlog.h:41-56:

        typedef struct XLogRecord
        {
            uint32        xl_tot_len;   /* total len of entire record */
            TransactionId xl_xid;       /* xact id */
            uint32        xl_len;       /* total len of rmgr data */
            uint8         xl_info;      /* flag bits, see below */
            RmgrId        xl_rmid;      /* resource manager for this record */
            /* 2 bytes of padding here, initialize to zero */
            XLogRecPtr    xl_prev;      /* ptr to previous record in log */
            pg_crc32      xl_crc;       /* CRC for this record */

            /* If MAXALIGN==8, there are 4 wasted bytes here */

            /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */

        } XLogRecord;

    Let's find the error


  107. src/include/access/xlog.h:41-56:

        typedef struct XLogRecord
        {
            uint32        xl_tot_len;   /* total len of entire record */
            TransactionId xl_xid;       /* xact id */
            uint32        xl_len;       /* total len of rmgr data */
            uint8         xl_info;      /* flag bits, see below */
            RmgrId        xl_rmid;      /* resource manager for this record */
            /* 2 bytes of padding here, initialize to zero */
            XLogRecPtr    xl_prev;      /* ptr to previous record in log */
            pg_crc32      xl_crc;       /* CRC for this record */

            /* If MAXALIGN==8, there are 4 wasted bytes here */

            /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */

        } XLogRecord;

    Let's find the error


  108. (no text on this slide)

  109. src/backend/access/transam/xlogreader.c:291-300:

        {
            /* XXX: more validation should be done here */
            if (total_len < SizeOfXLogRecord)
            {
                report_invalid_record(state, "invalid record length at %X/%X",
                                      (uint32) (RecPtr >> 32), (uint32) RecPtr);
                goto err;
            }
            gotheader = false;
        }

    What was that check doing?


  110. What was that check doing?

    src/backend/access/transam/xlogreader.c:291-300:

        {
            /* XXX: more validation should be done here */
            if (total_len < SizeOfXLogRecord)
            {
                report_invalid_record(state, "invalid record length at %X/%X",
                                      (uint32) (RecPtr >> 32), (uint32) RecPtr);
                goto err;
            }
            gotheader = false;
        }

    total_len: size the record says it is
    SizeOfXLogRecord: smallest possible size it can be

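The same comparison can be sketched outside of Postgres. The example below assumes little-endian byte order and a minimum size of 32 bytes (the 28 bytes of fields in the XLogRecord struct shown earlier, rounded up to an 8-byte boundary); both are assumptions about the build, and `record_length_valid?` is a made-up name.

```ruby
# Assumption: the XLogRecord header fields add up to 28 bytes, which
# MAXALIGN rounds up to 32 on a build with 8-byte alignment.
SIZE_OF_XLOG_RECORD = 32

# Read the first field of a record (xl_tot_len, a uint32) and apply the same
# comparison as the C check. Assumes the bytes were written little-endian.
def record_length_valid?(record_bytes)
  total_len = record_bytes.unpack1("V")
  total_len >= SIZE_OF_XLOG_RECORD
end

healthy = [63].pack("V") + ("\x00".b * 59)   # claims to be 63 bytes long
zeroed  = [0].pack("V") + ("\x00".b * 59)    # claims to be 0 bytes long
puts record_length_valid?(healthy)
puts record_length_valid?(zeroed)
```

A record whose length field reads as zero fails the check, which is the condition behind the "invalid record length" message.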

  111. INSERT INTO users VALUES ('codd');


    INSERT INTO users VALUES ('lovelace');


    INSERT INTO users VALUES ('turing');





    Wrote 'codd' into table 'users'


    Wrote 'lovelace' into table 'users'


    Wrote 'turing' into table 'users'
    A different kind of logs

    (if they were textual)


  112. Let's see what they
    look like in
    practice


  113. INSERT INTO users VALUES ('codd');


    INSERT INTO users VALUES ('lovelace');


    INSERT INTO users VALUES ('turing');
    Some extremely boring SQL


  114. Grab the binary log file, and...


  115. A barely comprehensible

    wall of data 😅


  116. A barely comprehensible

    wall of data 😅
    Hex ASCII

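The slide shows the file in a hex viewer, with hex bytes on the left and ASCII on the right. A minimal sketch of how such a view is produced, using a made-up byte string rather than a real WAL file (`hex_ascii_rows` is a hypothetical helper):

```ruby
# Render binary data as rows of offset, hex bytes, and printable ASCII,
# in the spirit of `hexdump -C`. Non-printable bytes are shown as ".".
def hex_ascii_rows(data, width: 16)
  data.bytes.each_slice(width).with_index.map do |row, i|
    hex = row.map { |b| format("%02x", b) }.join(" ")
    ascii = row.map { |b| (0x20..0x7e).cover?(b) ? b.chr : "." }.join
    format("%08x  %-#{width * 3 - 1}s  |%s|", i * width, hex, ascii)
  end
end

puts hex_ascii_rows("\x3f\x00\x00\x00codd\x00lovelace\x00turing".b)
```

Both columns describe the same bytes; the ASCII column simply makes any embedded text easy to spot.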

  117. Same data, rendered differently
    Decimal Hexadecimal Character
    62 3E >
    63 3F ?
    64 40 @
    65 41 A
    66 42 B

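The table above can be regenerated in a couple of lines, which is a handy way to look up other byte values while reading a hex dump (`ascii_row` is just an illustrative helper name):

```ruby
# Format one byte value as decimal, hexadecimal, and its ASCII character.
def ascii_row(code)
  format("%-7d %-11s %s", code, format("%02X", code), code.chr)
end

puts "Decimal Hexadecimal Character"
(62..66).each { |code| puts ascii_row(code) }
```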

  118. Decimal Hexadecimal Character
    62 3E >
    63 3F ?
    64 40 @
    65 41 A
    66 42 B
    Same data, rendered differently


  119. Hex ASCII
    Some good news


  120. Some good news
    We can see our users!!


  121. How can we find xl_tot_len?


  122. INSERT INTO repro VALUES ('A');


    INSERT INTO repro VALUES ('AB');


    INSERT INTO repro VALUES ('ABC');


    INSERT INTO repro VALUES ('ABCD');


    INSERT INTO repro VALUES ('ABCDE');


    ...
    Some even more boring SQL


  123. Look for a field increasing by 1

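The "insert values that grow by one byte, then look for a field that grows by one" trick can be sketched on synthetic data. The record layout below is invented for illustration; only the length value of 63 (0x3f) for 'ABC' is taken from the talk, and `offsets_increasing_by_one` is a hypothetical helper.

```ruby
# Given records whose payloads grow by one byte each, return the byte offsets
# whose value also grows by exactly one from each record to the next.
def offsets_increasing_by_one(records)
  shortest = records.map(&:bytesize).min
  (0...shortest).select do |offset|
    values = records.map { |r| r.getbyte(offset) }
    values.each_cons(2).all? { |a, b| b == a + 1 }
  end
end

# Synthetic stand-ins for WAL records: a 4-byte little-endian length, four
# bytes of made-up header, then the inserted string.
records = %w[A AB ABC].map do |value|
  [60 + value.bytesize, 0x11223344].pack("VV") + value
end

puts offsets_increasing_by_one(records).inspect
```

On real WAL data several offsets may qualify, so this only narrows down the candidates; the rest is the guesswork the next slides walk through.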

  124. Guesswork incoming!


  125. Guesswork incoming!
    The data we inserted


  126. A little help: ASCII codes
    Decimal Hexadecimal Character
    62 3E >
    63 3F ?
    64 40 @
    65 41 A
    66 42 B


  127. Notice anything?
    The data we inserted


  128. Notice anything?
    The data we inserted
    Familiar characters


  129. Notice anything?
    Decimal Hexadecimal Character
    63 3F ?
    64 40 @
    65 41 A
    The data we inserted
    Familiar characters


  130. Notice anything?
    The data we inserted
    Familiar characters


  131. Notice anything?
    The data we inserted
    Familiar characters


  132. Notice anything?
    The data we inserted
    Familiar characters
    Familiar characters (hex)
    The data we inserted (hex)


  133. Wouldn't it be convenient
    if we could make
    total_len == 0?


  134. We could import the
    Postgres structs and do
    this properly...


  135. ...or we could write a
    regex 🤔


  136. Let's write a regex


  137. (no text on this slide)

  138. Let's break this one


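  139. break_wal.rb:

        wal_file_name = ARGV[0]
        puts wal_file_name
        # Read the WAL segment as raw bytes and turn it into one long hex string
        wal_contents = IO.read(wal_file_name, encoding: "BINARY")
        hex = wal_contents.unpack("H*").first
        # Overwrite the 3f byte that precedes 'ABC' (41 42 43 00) with 00
        replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1")
        # Convert the hex string back to bytes and write a ".broken" copy
        bindata = [replaced].pack("H*")
        File.write(wal_file_name + ".broken", bindata)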


  140. break_wal.rb:

        wal_file_name = ARGV[0]
        puts wal_file_name
        # Read the WAL segment as raw bytes and turn it into one long hex string
        wal_contents = IO.read(wal_file_name, encoding: "BINARY")
        hex = wal_contents.unpack("H*").first
        replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1") # Replaces 'ABC' size
        # Convert the hex string back to bytes and write a ".broken" copy
        bindata = [replaced].pack("H*")
        File.write(wal_file_name + ".broken", bindata)

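To see what the substitution in break_wal.rb does without touching a real WAL file, the same steps can be run on an in-memory string. The record below is made up for illustration (a real WAL record has more fields), and `zero_abc_length` is a hypothetical wrapper around the script's regex.

```ruby
# Apply the same hex-level substitution as break_wal.rb to an in-memory string.
def zero_abc_length(bytes)
  hex = bytes.unpack("H*").first
  replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1")
  [replaced].pack("H*")
end

# Made-up record: length 0x3f as a little-endian uint32, two filler bytes,
# then 'ABC' and a zero byte.
record = [0x3f].pack("V") + "\x01\x02ABC\x00".b
broken = zero_abc_length(record)

puts record.unpack1("V")   # length field before the edit
puts broken.unpack1("V")   # length field after the edit
```

Because the pattern runs over hex digits rather than whole bytes, it could in principle match across a byte boundary; for a throwaway repro script that trade-off may well be acceptable.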

  141. Let's break this one


  142. Broken!!


  143. And if we give it to
    a Postgres

    replica?


  144. 2023-02-28 19:24:11 GMT LOG: restored log file "000000020000000000000003" from archive
    2023-02-28 19:24:11 GMT LOG: invalid record length at 0/3000148

    We reproduced the error!


  145. Success 😄


  146. Success, with a
    caveat...😔


  147. This wasn't enough
    to reproduce


    the outage


  148. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


  149. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


    6. ...


  150. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  151. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend
    Backup


    VIP


  152. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


    6. Backup VIP on synchronous replica


  153. We added it to
    the cluster


  154. Ran the repro
    script


  155. and...


  156. Success


    (no caveats)


    😄


  157. but...


    why?


  158. Background:


    how Pacemaker
    schedules resources


  159. 2 relevant settings


  160. By default:
    reschedule
    without penalty


  161. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    API backend


  162. Postgres
    Postgres
    Postgres
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker
    Repl
    API backend


  163. Setting:


    default-resource-stickiness


  164. By default:
    resources can run
    anywhere


  165. Setting:


    colocation


  166. default-resource-stickiness = 100


    &


    colocation -inf: BackupVIP Primary


  167. default-resource-stickiness = 100


    &


    colocation -inf: BackupVIP Primary


  168. default-resource-stickiness = 100


    &


    colocation -inf: BackupVIP Primary


  169. A very subtle
    semantic
    difference


  170. -1000 -inf


  171. -1000 -inf
    "Avoid
    scheduling
    these together"


  172. -1000 -inf
    "Avoid
    scheduling
    these together"
    "Literally never
    schedule these
    together"


  173. default-resource-stickiness = 100


    &


    colocation -inf: BackupVIP Primary


  174. default-resource-stickiness = 100


    &


    colocation -1000: BackupVIP Primary

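The difference between the two colocation scores can be illustrated with a toy model of score arithmetic. This is not Pacemaker's placement algorithm, only a sketch of the rule that an infinite negative score cannot be outweighed; the function names are invented, and the value 1,000,000 for infinity follows Pacemaker's documentation.

```ruby
# Toy model of Pacemaker-style score arithmetic, for illustration only.
INFINITY = 1_000_000  # scores at or beyond this magnitude count as infinite

# Anything combined with -INFINITY stays at -INFINITY; finite scores add up.
def combine_scores(a, b)
  return -INFINITY if a <= -INFINITY || b <= -INFINITY
  return INFINITY if a >= INFINITY || b >= INFINITY
  (a + b).clamp(-INFINITY, INFINITY)
end

# A placement with a -INFINITY score is never allowed.
def placement_allowed?(score)
  score > -INFINITY
end

stickiness = 100
puts placement_allowed?(combine_scores(stickiness, -INFINITY))  # colocation -inf
puts placement_allowed?(combine_scores(stickiness, -1000))      # colocation -1000
```

With -1000 the combined score is merely unattractive (-900 here), so the placement remains possible when nothing better exists; with -inf no amount of stickiness can make it eligible.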

  175. Failover works


    properly


  176. P.S. The WAL error was a red herring


  177. Sorry


  178. I know it was the most interesting part


  179. and it would have been kinda cool


  180. but it was part of the debugging process


  181. 💖


  182. 1. RAID array loses disks


    2. Kernel sets filesystem read-only


    3. Pacemaker detects primary failure


    4. Synchronous replica crash


    5. Suspicious log on synchronous replica


    6. Backup VIP on synchronous replica


  183. What can we


    learn?


  184. None of the


    stack


    is magic


  185. None of the


    stack


    is magic
    😁


  186. None of the


    stack


    is magic
    😁
    😰


  187. "It's just someone else's
    computer"


  188. "It's just someone else's
    abstraction"


  189. Read


    other people's


    code...


  190. ...and


    try to


    modify it


  191. Automation


    erodes


    knowledge


  192. Game days are a partial fix


  193. "What if we had to
    recover our database
    server manually?"


  194. Don't stop
    questioning
    your repro


  195. 1. No magic in the stack


    2. Automation erodes knowledge


    3. Always question the repro


  196. (no text on this slide)

  197. JSON


    over


    HTTP


  198. Binary


    formats


    are coming


    to web development


  199. Protobuf


    over


    HTTP/2


  200. Protobuf


    over


    HTTP/2
    (e.g. gRPC)


  201. It's


    worth


    getting


    familiar


  202. One last thing to
    ask of


    you


  203. Most computing
    happens
    successfully


  204. The


    0.00001%


    * not a real statistic


  205. Outsized


    negative


    impact


  206. It's a shame


    not to


    learn


  207. "We noticed a problem."

    "We fixed the problem."


    "We'll make sure the problem doesn't
    happen again."


  208. 3 good examples


  209. https://slack.engineering/slacks-outage-on-january-4th-2021/


  210. https://incident.io/blog/intermittent-downtime


  211. https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/


  212. Please


    Share the difficult stories too


  213. Thank you
    ✌❤
    @planetscaledata
    sinjo.dev


  214. (no text on this slide)

  215. Image credits
    • Programmer's Laptop - Wall Boat - Public Domain - https://www.flickr.com/photos/wallboat/36819065315/
    • Pouring Latte Art - Craft Coffee Spot - CC-BY - https://www.flickr.com/photos/195403219@N08/52200966448/
    • microscope - Milosz1 - CC-BY - https://www.flickr.com/photos/mikolski/3269906279
    • Hard Disk Guts - CC-BY - https://www.flickr.com/photos/mattandkim/97533589/
    • Corsair ForceGT 180GB - CC-BY - https://www.flickr.com/photos/ruocaled/8173124575/


  216. Image credits
    • Server - The Noun Project (via WikiMedia) - CC0 - https://commons.wikimedia.org/wiki/File:Server_-_The_Noun_Project.svg
    • Rope - Robo Android - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/
    • Stargazing - Max Delaquis - CC-BY - https://www.flickr.com/photos/115000114@N07/28861043652


  217. Questions?
    ✌❤
    @planetscaledata
    sinjo.dev
