Git: the NoSQL Database

by Brandon Keepers


2 million years ago
our ancestors started a revolution


http://commons.wikimedia.org/wiki/File:Olduvai_stone_chopping_tool_at_British_Museum.jpg


http://www.flickr.com/photos/birminghammag/6282945952


Hi, I am Brandon
@bkeepers
github.com/bkeepers


git is amazing at storing code…
how well does it store data?


github.com/bkeepers/gaskit


disclaimer:
NoSQL is marketing bollocks


NoSQL
non-relational and often schema-less.


Relational: PostgreSQL, MySQL
NoSQL
  key/value: Riak, Redis, memcached
  Columnar: HBase, (Cassandra)
  Document: MongoDB, CouchDB
  Graph: Neo4j


1. git as a data store
2. features
3. anti-features


using git
as a data store


$ man git


if git is really a database then
how do we store data in it?


the naïve way
$ git init mydb && cd mydb
Initialized empty Git repository in mydb/.git/
$ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
  > 1.json
$ git add 1.json
$ git commit -m 'adding 1.json'
[master (root-commit) f0e15a1] adding 1.json
1 file changed, 1 insertion(+)
create mode 100644 1.json
$ git show master:1.json
{"name":"Brandon Keepers","company":"GitHub"}


tada! a database
if you call the filesystem a database


git’s data model


.git/objects/
commit c67d5118
  tree: be1b57ea
  parent: nil
  author: Brandon
  message: Initial commit
tree be1b57ea
  blob: 7041879e README.md
  tree: 0662dca7 public
tree 0662dca7
  blob: be1b57ea index.html
  tree: 2d21ba18 css
tree 2d21ba18
  blob: be1b57ea app.css
  blob: 049fd918 reset.css
blob 7041879e
  # Git
  The stupid content tracker.
blob be1b57ea
  Git
  …
…
reference master
  c67d5118
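
The object IDs in the diagram are shortened placeholders. As a rough sketch of where real IDs come from: by default Git names an object with the SHA-1 of a small header ("type size" followed by a NUL byte) plus the content, and stores loose objects zlib-compressed under .git/objects/, split into a two-character directory and the remaining characters. The helper names below (git_object_id, write_loose_object, read_loose_object) are invented for this illustration, the sketch ignores packfiles, and it uses only the Ruby standard library rather than Grit.

```ruby
require 'digest/sha1'
require 'zlib'
require 'fileutils'
require 'tmpdir'

# Git hashes "<type> <size>\0<content>", not the bare content.
def git_object_id(type, content)
  Digest::SHA1.hexdigest("#{type} #{content.bytesize}\0#{content}")
end

# Write a zlib-compressed loose object below objects_dir and return its ID.
def write_loose_object(objects_dir, type, content)
  id = git_object_id(type, content)
  path = File.join(objects_dir, id[0, 2], id[2..-1])
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, Zlib::Deflate.deflate("#{type} #{content.bytesize}\0#{content}"))
  id
end

# Read a loose object back and return [type, content].
def read_loose_object(objects_dir, id)
  raw = Zlib::Inflate.inflate(File.binread(File.join(objects_dir, id[0, 2], id[2..-1])))
  header, content = raw.split("\0", 2)
  type, _size = header.split(' ', 2)
  [type, content]
end

# A temporary directory stands in for .git/objects/ here.
Dir.mktmpdir do |objects_dir|
  id = write_loose_object(objects_dir, 'blob', "test content\n")
  puts id                               # d670460b4b4aece5915caf5c68d12f560a9fe3e4
  p read_loose_object(objects_dir, id)  # ["blob", "test content\n"]
end
```

Because the ID depends only on type and content, two files with identical content share one blob, which is why unchanged objects can be reused from commit to commit.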


commit c67d5118
  tree: be1b57ea
  parent: nil
  author: Brandon
  message: Initial commit
commit c816ef7e
  tree: 1002d7b0
  parent: c67d5118
  author: Brandon
  message: Initial commit
tree 1002d7b0
  blob: bc912988 README.md
  tree: 0662dca7 public
blob bc912988
  # Git
  The dumb content tracker.
reference master
  c816ef7e
still in .git/objects/ from the first commit:
  tree be1b57ea, tree 0662dca7, tree 2d21ba18, blob 7041879e, blob be1b57ea


talking to git,
programmatically.


a few libraries
Grit
https://github.com/mojombo/grit
libgit2 (Ruby, .NET, PHP, Python, etc)
https://github.com/libgit2/libgit2


grit
require 'grit'
repo = Grit::Repo.new('.')


writing
# Get a new index that we can modify
index = repo.index
# Get the current tree
head = repo.get_head('master')
index.current_tree = head.commit.tree
# Make our changes
index.add('1.json', '{"name": "Brandon"}')
index.commit('Add user 1',
  :parents => [head.commit], :head => 'master')


reading
blob = head.commit.tree / '1.json'
blob.data


that seems like too much work.
where’s my ORM?


a few libraries
Toystore + adapter-git
https://github.com/bkeepers/adapter-git
GitModel
https://github.com/pauldowman/gitmodel


class Issue
  include Toy::Store
  adapter :git, Grit::Repo.new(GIT_ROOT)
  attribute :description, String
  attribute :state, String, :default => 'open'
end


Issue.create(:description => 'Store in Git')
issue = Issue.get(id)
issue.update_attributes(:state => 'in_progress')
issue.destroy


features


versioning


diffs


hooks
update cache, alternate formats, or full-text indexes


non-relational
question everything about relational data design


no BDUF
optimize storage based on usage patterns


schema-less
easily change data as the application evolves


class User
  # …
  attribute :first_name, String
  attribute :last_name, String
end


class User
  # …
  # attribute :first_name, String
  # attribute :last_name, String
  attribute :name, String
  def name
    super || "#{self[:first_name]} #{self[:last_name]}"
  end
end
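
The same read-time fallback can be sketched without Toystore. This is only an illustration of the idea, assuming each user is stored as a JSON document like 1.json earlier; user_name is a hypothetical helper, not part of any library mentioned in the talk.

```ruby
require 'json'

# Hypothetical helper: read a user document written under either schema.
# Newer documents carry "name"; older ones carry "first_name" and "last_name".
def user_name(json)
  doc = JSON.parse(json)
  doc['name'] || [doc['first_name'], doc['last_name']].compact.join(' ')
end

puts user_name('{"first_name":"Brandon","last_name":"Keepers"}') # Brandon Keepers
puts user_name('{"name":"Brandon Keepers","company":"GitHub"}')  # Brandon Keepers
```

Old documents stay untouched in history and are only rewritten in the new shape the next time they are saved.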


transactions
a commit can contain many changes


long-lived transactions
$ git checkout -b transaction
…
$ git checkout master
$ git merge transaction


replication
every clone contains a full copy


add replica
$ git remote add replica1 [email protected]:app.git
$ cat .git/hooks/post-commit
#!/bin/sh
git push replica1


anti-features


all the features that make a great DB
yeah, git doesn’t have those.


querying
you can just find it yourself
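
What "find it yourself" amounts to is a full scan: with no query language or secondary indexes, any lookup that is not by path has to read every document. A minimal sketch, assuming the documents are JSON files checked out in a working directory; find_documents is a hypothetical helper and the sample data is made up.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: return { path => document } for every JSON file
# below dir whose parsed content makes the block return true.
def find_documents(dir)
  Dir.glob(File.join(dir, '**', '*.json')).sort.each_with_object({}) do |path, found|
    doc = JSON.parse(File.read(path))
    found[path] = doc if yield(doc)
  end
end

Dir.mktmpdir do |db|
  File.write(File.join(db, '1.json'), '{"name":"Brandon Keepers","company":"GitHub"}')
  File.write(File.join(db, '2.json'), '{"name":"Someone Else","company":"Elsewhere"}')
  matches = find_documents(db) { |doc| doc['company'] == 'GitHub' }
  puts matches.values.map { |doc| doc['name'] } # Brandon Keepers
end
```

The cost grows with the number of documents, which is one reason the hooks slide suggests maintaining caches or full-text indexes on the side.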


concurrency
why would you want…oooh


concurrency
index = repo.index
# Nobody changed anything, right?
index.commit('...',

concurrency
Lockfile.new('refs/heads/master.lock').lock do
  index = repo.index
  index.commit('...',
end
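
The locking idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the Lockfile gem or Grit: with_ref_lock and update_ref are hypothetical helpers, and the ref is modelled as a plain file holding a commit ID. Creating the .lock file with File::EXCL fails if it already exists, so only one writer proceeds, and the ref is moved only if it still points where the writer expected.

```ruby
require 'tmpdir'

# Run the block while holding "<ref>.lock". File::EXCL makes the open fail
# with Errno::EEXIST when another writer already holds the lock.
def with_ref_lock(ref_path)
  lock_path = "#{ref_path}.lock"
  lock = File.open(lock_path, File::WRONLY | File::CREAT | File::EXCL)
  begin
    yield
  ensure
    lock.close
    File.delete(lock_path) if File.exist?(lock_path)
  end
end

# Compare-and-swap: move the ref to new_id only if it still holds expected_id.
def update_ref(ref_path, expected_id, new_id)
  with_ref_lock(ref_path) do
    current = File.exist?(ref_path) ? File.read(ref_path).strip : nil
    return false unless current == expected_id
    File.write(ref_path, "#{new_id}\n")
    true
  end
end

Dir.mktmpdir do |dir|
  ref = File.join(dir, 'master')
  File.write(ref, "c67d5118\n")
  puts update_ref(ref, 'c67d5118', 'c816ef7e') # true: nobody changed anything
  puts update_ref(ref, 'c67d5118', 'deadbeef') # false: master moved since we read it
end
```

A caller that gets false (or Errno::EEXIST) would re-read the head, rebuild its change on top of it, and try again.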


merge conflicts
$ git merge branch
Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
Automatic merge failed; fix conflicts and then commit the result.


git is not web scale


hard write limit
$ ruby commits_per_second.rb
97.6538648174529 Commits/Second


paths matter
$ ruby commits_per_second.rb --keys 1000
$ ls | head -n 2
00411460f7c92d2124a67ea0f4cb5f85
006f52e9102a8d3be2fe5614f42ba989
$ ls | wc -l
1000


Nest files in directories
$ ruby commits_per_second.rb --keys 1000 --type nested
$ tree
├── 0
│   ├── 0
│   │   ├── 411460f7c92d2124a67ea0f4cb5f85
│   │   ├── 6f52e9102a8d3be2fe5614f42ba989
│   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
│   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
│   ├── 1
│   │   ├── 161aaa0b6d1345dd8fe4e481144d84
│   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
│   │   ├── 3a006f03dbc5392effeb8f18fda755
│   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
│   │   ├── 882513d5fa7c329e940dda99b12147
│   │   ├── 9d385eb67632a7e958e23f24bd07d7
│   │   └── f78be6f7cad02658508fe4616098a9
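
The nested layout in the listing can be produced by turning the first characters of each key into directory levels. A minimal sketch; nested_path is a hypothetical helper (the talk's commits_per_second.rb is not shown), and the default of two levels is read off the listing.

```ruby
# Hypothetical helper: map a flat key to a nested path by using each of the
# first `depth` characters as one directory level.
def nested_path(key, depth = 2)
  raise ArgumentError, 'key too short' if key.length <= depth
  File.join(*key[0, depth].chars, key[depth..-1])
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
# 0/0/411460f7c92d2124a67ea0f4cb5f85
```

Each commit has to write a new tree object for every directory on the path to a changed file, so spreading keys over nested directories keeps each rewritten tree small instead of rewriting one tree with every key in it.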


large repositories
with long and storied histories.


git at Facebook
From: Joshua Redstone <…@fb.com>
Subject: Git performance results on a large repository
Date: 2012-02-03 14:20:06 GMT

Hi Git folks,

We (Facebook) have been investigating source control systems to meet our
growing needs. We already use git fairly widely, but have noticed it
getting slower as we grow, and we want to make sure we have a good story
going forward. We're debating how to proceed and would like to solicit
people's thoughts.

To better understand git scalability, I've built up a large, synthetic
repository and measured a few git operations on it. I summarize the
results here.

The test repo has 4 million commits, linear history and about 1.3 million
files. The size of the .git directory is about 15GB, and has been
repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
--window=250'. This repack took about 2 days on a beefy machine (I.e.,
lots of ram and flash). The size of the index file is 191 MB. I can share
the script that generated it if people are interested - It basically picks
2-5 files, modifies a line or two and adds a few lines at the end
consisting of random dictionary words, occasionally creates a new file,
commits all the modifications and repeats.

I timed a few common operations with both a warm OS file cache and a cold
cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
the operation in question a few times (first timing is the cold timing,
the next few are the warm timings). The following results are on a server
with average hard drive (I.e., not flash) and > 10GB of ram.
http://thread.gmane.org/gmane.comp.version-control.git/189776


git at Facebook
4 million commits
1.3 million files
15 GB
git add: 7 seconds
git status: 39 minutes
git commit: 41 minutes


if git doesn’t scale, then how does
GitHub Scale?


smoke
grit, in the cloud.


github.com
router
file servers
rpc


write limit per repo
but we have many repos on many discs.


Invocation


use cases
where git would make a good database.


content heavy
CMS, translations, wikis


partitionable
GitHub, project management


offline


some examples:
madrox
github.com/technoweenie/madrox
gollum
github.com/github/gollum
gaskit
github.com/bkeepers/gaskit


abuse your tools
and imagine how to make them better


credits & references
Talk by Rick Olson
http://git-nosql-rubyconf.heroku.com
Peepcode: Git Internals
https://peepcode.com/products/git-internals-pdf


questions?
@bkeepers
github.com/bkeepers
speakerdeck.com/bkeepers/git-the-no-sql-database
