Git: the NoSQL Database

Check out the video for this talk: https://vimeo.com/56645405

We all know that Git is amazing for storing code. It is fast, reliable, flexible, and it keeps our project history nuzzled safely in its object database while we sleep soundly at night.

But what about storing more than code? Why not data? Much flexibility is gained by ditching traditional databases, but at what cost?

Brandon Keepers

April 21, 2012

Transcript

  1. Git: the NoSQL database
    by Brandon Keepers

  2. 2 million years ago
    our ancestors started a revolution

  3. http://commons.wikimedia.org/wiki/File:Olduvai_stone_chopping_tool_at_British_Museum.jpg

  4. http://www.flickr.com/photos/birminghammag/6282945952

  5. Hi, I am Brandon
    @bkeepers
    github.com/bkeepers

  6. git is amazing at storing code…
    how well does it store data?

  7. github.com/bkeepers/gaskit

  8. disclaimer:
    NoSQL is marketing bollocks

  9. NoSQL
    non-relational and often schema-less.

  10. Relational: PostgreSQL, MySQL
    NoSQL
      Key/value: Riak, Redis, memcached
      Columnar: HBase, (Cassandra)
      Document: MongoDB, CouchDB
      Graph: Neo4J

  11. 1. git as a data store
    2. features
    3. anti-features

  12. using git
    as a data store

  13. $ man git

  14. if git is really a database then
    how do we store data in it?

  15. the naïve way

  16. the naïve way
    $ git init mydb && cd mydb
    Initialized empty Git repository in mydb/.git/

  17. the naïve way

  18. the naïve way
    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
    > 1.json

  19. the naïve way
    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
    > 1.json
    $ git add 1.json

  20. the naïve way
    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
    > 1.json
    $ git add 1.json
    $ git commit -m 'adding 1.json'
    [master (root-commit) f0e15a1] adding 1.json
    1 file changed, 1 insertion(+)
    create mode 100644 1.json

  21. the naïve way
    $ git show master:1.json
    {"name":"Brandon Keepers","company":"GitHub"}

  22. tada! a database
    if you call the filesystem a database

  23. git’s data model
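
Git's object database is content-addressed: in a default (SHA-1) repository, an object's ID is the SHA-1 of a short header plus the object's bytes, so identical content always gets the same ID. A minimal Ruby sketch of how a blob ID is derived; the helper name `blob_id` is made up for illustration, and the short IDs shown on the slides are illustrative rather than real hashes of this text.

```ruby
require 'digest'

# Compute the ID Git gives a blob with this content in a SHA-1 repository:
# SHA-1 over the header "blob <byte length>\0" followed by the raw bytes.
def blob_id(content)
  bytes = content.b
  Digest::SHA1.hexdigest("blob #{bytes.bytesize}\0".b + bytes)
end

stupid = "# Git\nThe stupid content tracker.\n"
dumb   = "# Git\nThe dumb content tracker.\n"

puts blob_id(stupid)                         # 40 hex characters
puts blob_id(stupid) == blob_id(stupid.dup)  # true: same bytes, same ID
puts blob_id(stupid) == blob_id(dumb)        # false: any edit yields a new ID
```

Because the ID depends only on the bytes, storing the same document twice costs nothing extra, and every edit produces a new object instead of overwriting the old one.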

  24. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
    blob 7041879e
      # Git
      The stupid content tracker.
    blob be1b57ea
      Git
    tree 2d21ba18
      blob: be1b57ea app.css
      blob: 049fd918 reset.css
    reference master
      c67d5118
    .git/objects/

  25. tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
    blob 7041879e
      # Git
      The stupid content tracker.
    blob be1b57ea
      Git
    .git/objects/

  26. tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
    blob 7041879e
      # Git
      The stupid content tracker.
    blob be1b57ea
      Git
    .git/objects/

  27. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    reference master
      c67d5118
    .git/objects/

  28. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    reference master
      c67d5118
    .git/objects/

  29. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
    blob 7041879e
      # Git
      The stupid content tracker.
    blob be1b57ea
      Git
    tree 2d21ba18
      blob: be1b57ea app.css
      blob: 049fd918 reset.css
    reference master
      c67d5118
    .git/objects/

  30. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    commit c816ef7e
      tree: 1002d7b0
      parent: c67d5118
      author: Brandon
      message: Initial commit
    tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    reference master
      c816ef7e
    tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
    blob be1b57ea
      Git
    blob 7041879e
      # Git
      The stupid content tracker.
    tree 2d21ba18
    blob bc912988
      # Git
      The dumb content tracker.

  31. tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    blob 7041879e
      # Git
      The stupid content tracker.
    blob bc912988
      # Git
      The dumb content tracker.

  32. commit c816ef7e
      tree: 1002d7b0
      parent: c67d5118
      author: Brandon
      message: Initial commit
    tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    blob 7041879e
      # Git
      The stupid content tracker.
    blob bc912988
      # Git
      The dumb content tracker.

  33. tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
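
Both root trees on this slide point at the same `public` tree (0662dca7), because nothing under `public` changed between the two commits. A toy content-addressed store shows the same effect. This is a sketch that assumes a simplified text format for trees rather than Git's real binary tree encoding; `TinyStore`, its methods, and the CSS file contents are hypothetical.

```ruby
require 'digest'

# Toy stand-in for .git/objects/: a Hash from object ID to object body.
class TinyStore
  attr_reader :objects

  def initialize
    @objects = {}
  end

  # Store body under the hash of its type, size, and content; return the ID.
  def put(type, body)
    id = Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0#{body}")
    @objects[id] = body
    id
  end

  # Store a nested Hash (directories) of Strings (files); return the tree ID.
  def write_tree(dir)
    lines = dir.sort.map do |name, value|
      if value.is_a?(Hash)
        "tree #{write_tree(value)} #{name}"
      else
        "blob #{put('blob', value)} #{name}"
      end
    end
    put('tree', lines.join("\n"))
  end
end

store = TinyStore.new
v1 = {
  'README.md' => "# Git\nThe stupid content tracker.\n",
  'public'    => { 'index.html' => "Git\n",
                   'css' => { 'app.css' => "body {}\n", 'reset.css' => "* {}\n" } }
}
v2 = v1.merge('README.md' => "# Git\nThe dumb content tracker.\n")

root1  = store.write_tree(v1)
before = store.objects.size
root2  = store.write_tree(v2)

puts root1 == root2               # false: the README change reaches the root
puts store.objects.size - before  # 2: one new blob plus one new root tree
```

Writing the second version adds only the changed blob and the trees on the path to it; everything else is reused by ID.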

  34. commit c816ef7e
      tree: 1002d7b0
      parent: c67d5118
      author: Brandon
      message: Initial commit
    tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    reference master
      c816ef7e

  35. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    commit c816ef7e
      tree: 1002d7b0
      parent: c67d5118
      author: Brandon
      message: Initial commit
    tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    reference master
      c816ef7e

  36. commit c67d5118
      tree: be1b57ea
      parent: nil
      author: Brandon
      message: Initial commit
    tree be1b57ea
      blob: 7041879e README.md
      tree: 0662dca7 public
    commit c816ef7e
      tree: 1002d7b0
      parent: c67d5118
      author: Brandon
      message: Initial commit
    tree 1002d7b0
      blob: bc912988 README.md
      tree: 0662dca7 public
    reference master
      c816ef7e
    tree 0662dca7
      blob: be1b57ea index.html
      tree: 2d21ba18 css
    blob be1b57ea
      Git
    blob 7041879e
      # Git
      The stupid content tracker.
    tree 2d21ba18
    blob bc912988
      # Git
      The dumb content tracker.

  37. talking to git,
    programmatically.

  38. a few libraries
    Grit
    https://github.com/mojombo/grit
    libgit2 (Ruby, .NET, PHP, Python, etc)
    https://github.com/libgit2/libgit2

  39. grit
    require 'grit'
    repo = Grit::Repo.new('.')

  40. writing

  41. writing
    # Get a new index that we can modify
    index = repo.index

  42. writing
    # Get a new index that we can modify
    index = repo.index
    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

  43. writing
    # Get a new index that we can modify
    index = repo.index
    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
    # Make our changes
    index.add('1.json', '{"name": "Brandon"}')
    index.commit('Add user 1',
                 :parents => [head.commit], :head => 'master')

  44. reading
    head = repo.get_head('master')
    blob = head.commit.tree / '1.json'
    blob.data

  45. that seems like too much work.
    where’s my ORM?

  46. a few libraries
    Toystore + adapter-git
    https://github.com/bkeepers/adapter-git
    GitModel
    https://github.com/pauldowman/gitmodel

  47. Toystore + adapter-git
    class Issue
      include Toy::Store
      adapter :git, Grit::Repo.new(GIT_ROOT)
      attribute :description, String
      attribute :state, String, :default => 'open'
    end

  48. Toystore + adapter-git
    Issue.create(:description => 'Store in Git')
    issue = Issue.get(id)
    issue.update_attributes(:state => 'in_progress')
    issue.destroy

  49. features

  50. versioning

  51. diffs

  52. hooks
    update cache, alternate formats, or full-text indexes
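
A hook can keep a derived cache or index in step with each commit by looking at which paths the commit touched. The sketch below parses the `--name-status` output of `git diff-tree` and applies it to an in-memory cache. The helper names, the Hash-based cache, and the sample data are assumptions for illustration; in a real `post-commit` hook the input would come from running the command shown in the comment.

```ruby
# Parse the output of
#   git diff-tree --no-commit-id --name-status -r --root HEAD
# into [status, path] pairs, e.g. ["A", "1.json"].
def parse_name_status(output)
  output.each_line.map { |line| line.chomp.split("\t", 2) }
        .reject { |pair| pair.size < 2 }
end

# Apply the changes to a derived cache (here a plain Hash keyed by path).
# read_blob is any callable that returns the current content for a path.
def refresh_cache(cache, changes, read_blob)
  changes.each do |status, path|
    if status == 'D'
      cache.delete(path)
    else
      cache[path] = read_blob.call(path)
    end
  end
  cache
end

sample = "A\t1.json\nM\t2.json\nD\t3.json\n"
files  = { '1.json' => '{"name":"Brandon"}', '2.json' => '{"state":"open"}' }
cache  = { '2.json' => '{"state":"closed"}', '3.json' => '{}' }

refresh_cache(cache, parse_name_status(sample), files.method(:fetch))
p cache.keys.sort   # ["1.json", "2.json"]
```

The same loop could feed a full-text indexer or render an alternate format instead of filling a Hash.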

  53. non-relational
    question everything about relational data design

  54. no BDUF
    optimize storage based on usage patterns

  55. schema-less
    easily change data as the application evolves

  56. class User
      # …
      attribute :first_name, String
      attribute :last_name, String
    end

  57. class User
      # …
      # attribute :first_name, String
      # attribute :last_name, String
      attribute :name, String
      def name
        super || "#{self[:first_name]} #{self[:last_name]}"
      end
    end
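
The same read-time upgrade works on plain JSON documents, without any ORM. A minimal sketch, assuming documents are stored as JSON objects with either the old `first_name`/`last_name` keys or the new `name` key; the helper `user_name` is a hypothetical name.

```ruby
require 'json'

# Accept documents written under either schema: prefer the new "name"
# attribute and fall back to joining the old first and last names.
def user_name(doc)
  doc['name'] || [doc['first_name'], doc['last_name']].compact.join(' ')
end

old_doc = JSON.parse('{"first_name":"Brandon","last_name":"Keepers"}')
new_doc = JSON.parse('{"name":"Brandon Keepers"}')

puts user_name(old_doc)   # Brandon Keepers
puts user_name(new_doc)   # Brandon Keepers
```

Old documents never need a bulk migration; they are upgraded the next time they are read and saved.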

  58. transactions
    a commit can contain many changes

  59. long-lived transactions
    $ git checkout -b transaction

    $ git checkout master
    $ git merge transaction

  60. replication
    every clone contains a full copy

  61. add replica
    $ git remote add replica1 [email protected]:app.git
    $ cat .git/hooks/post-commit
    #!/bin/sh
    git push replica1

  62. anti-
    features

  63. all the features that make a great DB
    yeah, git doesn’t have those.

  64. querying
    you can just find it yourself
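
"Finding it yourself" means reading every document and filtering in application code. A minimal sketch, assuming one JSON document per file in a checked-out working directory; the helper `where` and the sample documents are hypothetical.

```ruby
require 'json'
require 'tmpdir'

# Scan every *.json file under dir and keep the documents whose
# attributes equal all of the given criteria. Cost grows with the data set.
def where(dir, criteria)
  Dir.glob(File.join(dir, '**', '*.json')).sort
     .map { |path| JSON.parse(File.read(path)) }
     .select { |doc| criteria.all? { |key, value| doc[key] == value } }
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, '1.json'),
             JSON.generate('description' => 'Store in Git', 'state' => 'open'))
  File.write(File.join(dir, '2.json'),
             JSON.generate('description' => 'Add querying', 'state' => 'closed'))

  p where(dir, 'state' => 'open').map { |doc| doc['description'] }
  # ["Store in Git"]
end
```

There is no index here, so every query is a full scan; anything faster has to be built and maintained outside Git, for example from a hook.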

  65. concurrency
    why would you want…oooh

  66. concurrency

  67. concurrency
    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

  68. concurrency
    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
    # Nobody changed anything, right?

  69. concurrency
    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
    # Nobody changed anything, right?
    index.commit('...',
                 :parents => [head.commit], :head => 'master')

  70. concurrency
    Lockfile.new('refs/heads/master.lock').lock do
      index = repo.index
      head = repo.get_head('master')
      index.current_tree = head.commit.tree
      index.commit('...',
                   :parents => [head.commit], :head => 'master')
    end
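
The idea behind a `.lock` file can be sketched with the standard library alone: create the file exclusively, do the read-modify-write, then remove it. This is an illustration of the technique, not the `Lockfile` class used on the slide; `with_lockfile`, its retry parameters, and the temporary path are assumptions.

```ruby
require 'tmpdir'

# Hold an exclusive lock by creating lock_path with O_EXCL, which fails
# if the file already exists. Retry a few times, then give up.
def with_lockfile(lock_path, retries: 50, delay: 0.01)
  attempts = 0
  begin
    file = File.open(lock_path, File::WRONLY | File::CREAT | File::EXCL)
  rescue Errno::EEXIST
    attempts += 1
    raise "could not acquire #{lock_path}" if attempts > retries
    sleep delay
    retry
  end

  begin
    yield
  ensure
    file.close
    File.delete(lock_path)   # releasing the lock lets the next writer in
  end
end

Dir.mktmpdir do |dir|
  lock = File.join(dir, 'master.lock')
  with_lockfile(lock) do
    # read the current head, build the new tree, and commit in here
    puts File.exist?(lock)   # true while the block runs
  end
  puts File.exist?(lock)     # false once the lock is released
end
```

Everything that reads the head and then writes a commit has to happen inside the block, otherwise another writer can still slip in between the read and the write.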

  71. merge conflicts
    $ git merge branch
    Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
    CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
    Automatic merge failed; fix conflicts and then commit the result.

  72. git is not web scale

  73. hard write limit
    $ ruby commits_per_second.rb
    97.6538648174529 Commits/Second

  74. paths matter
    $ ruby commits_per_second.rb --keys 1000
    14.083 Commits/Second
    $ ls | head -n 2
    00411460f7c92d2124a67ea0f4cb5f85
    006f52e9102a8d3be2fe5614f42ba989
    $ ls | wc -l
    1000

  75. Nest files in directories
    $ ruby commits_per_second.rb --keys 1000 --type nested
    67.117 Commits/Second
    $ tree
    ├── 0
    │   ├── 0
    │   │   ├── 411460f7c92d2124a67ea0f4cb5f85
    │   │   ├── 6f52e9102a8d3be2fe5614f42ba989
    │   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
    │   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
    │   ├── 1
    │   │   ├── 161aaa0b6d1345dd8fe4e481144d84
    │   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
    │   │   ├── 3a006f03dbc5392effeb8f18fda755
    │   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
    │   │   ├── 882513d5fa7c329e940dda99b12147
    │   │   ├── 9d385eb67632a7e958e23f24bd07d7
    │   │   └── f78be6f7cad02658508fe4616098a9
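
The nested layout turns a flat key such as 00411460f7c92d2124a67ea0f4cb5f85 into the path 0/0/411460f7c92d2124a67ea0f4cb5f85. A minimal sketch of that mapping; the helper name `nested_path` and its `depth` parameter are hypothetical.

```ruby
# Use each of the first `depth` characters of the key as a directory name
# and the rest of the key as the file name.
def nested_path(key, depth = 2)
  (key[0, depth].chars << key[depth..-1]).join('/')
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
# 0/0/411460f7c92d2124a67ea0f4cb5f85
```

One likely reason this helps: every commit writes a new tree object for each directory on the path to the changed file. With 1000 keys in one directory that is a single 1000-entry tree per commit; with two hex levels it is three small trees.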

  76. large repositories
    with long and storied histories.

  77. git at Facebook
    From: Joshua Redstone <…@fb.com>
    Subject: Git performance results on a large repository
    Date: 2012-02-03 14:20:06 GMT
    Hi Git folks,
    We (Facebook) have been investigating source control systems to meet our
    growing needs. We already use git fairly widely, but have noticed it
    getting slower as we grow, and we want to make sure we have a good story
    going forward. We're debating how to proceed and would like to solicit
    people's thoughts.
    To better understand git scalability, I've built up a large, synthetic
    repository and measured a few git operations on it. I summarize the
    results here.
    The test repo has 4 million commits, linear history and about 1.3 million
    files. The size of the .git directory is about 15GB, and has been
    repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
    --window=250'. This repack took about 2 days on a beefy machine (I.e.,
    lots of ram and flash). The size of the index file is 191 MB. I can share
    the script that generated it if people are interested - It basically picks
    2-5 files, modifies a line or two and adds a few lines at the end
    consisting of random dictionary words, occasionally creates a new file,
    commits all the modifications and repeats.
    I timed a few common operations with both a warm OS file cache and a cold
    cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
    the operation in question a few times (first timing is the cold timing,
    the next few are the warm timings). The following results are on a server
    with average hard drive (I.e., not flash) and > 10GB of ram.
    http://thread.gmane.org/gmane.comp.version-control.git/189776

  78. git at Facebook
    4 million commits
    1.3 million files
    15 GB
    http://thread.gmane.org/gmane.comp.version-control.git/189776

  79. git at Facebook
    4 million commits
    1.3 million files
    15 GB
    http://thread.gmane.org/gmane.comp.version-control.git/189776
    git add: 7 seconds
    git status: 39 minutes
    git commit: 41 minutes

  80. if git doesn’t scale, then how does
    GitHub Scale?

  82. smoke
    grit, in the cloud.

  83. github.com
    router
    file servers
    rpc

  84. write limit per repo
    but we have many repos on many discs.
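
Because the write limit applies to each repository separately, total throughput can grow by spreading repositories across machines. The sketch below shows one simple way to route a repository name to a file server. It is purely illustrative and not a description of GitHub's router: the server names, the `server_for` helper, and the CRC32-modulo scheme are all assumptions.

```ruby
require 'zlib'

# Hypothetical file server names; not GitHub's real topology.
FILE_SERVERS = %w[fs1 fs2 fs3 fs4].freeze

# Map a repository to one server, always the same one for the same name.
def server_for(repo, servers = FILE_SERVERS)
  servers[Zlib.crc32(repo) % servers.size]
end

%w[bkeepers/gaskit github/gollum technoweenie/madrox].each do |repo|
  puts "#{repo} -> #{server_for(repo)}"
end
```

A plain modulo reshuffles most repositories whenever the server list changes, so a real deployment would more likely keep an explicit lookup table of where each repository lives.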

  85. Invocation

  86. use cases
    where git would make a good database.

  87. content heavy
    CMS, translations, wikis

  88. partitionable
    GitHub, project management

  89. offline

  90. some examples:
    madrox
    github.com/technoweenie/madrox
    gollum
    github.com/github/gollum
    gaskit
    github.com/bkeepers/gaskit

  91. abuse your tools
    and imagine how to make them better

  92. credits & references
    Talk by Rick Olson
    http://git-nosql-rubyconf.heroku.com
    Peepcode: Git Internals
    https://peepcode.com/products/git-internals-pdf

  96. questions?
    @bkeepers
    github.com/bkeepers
    speakerdeck.com/bkeepers/git-the-no-sql-database
