Do we think of Git commits as diffs, snapshots, and/or histories? (jvns.ca)
189 points by soheilpro on Jan 6, 2024 | hide | past | favorite | 210 comments



One issue with the "diff" model is that in Git (and almost all VCSes these days) there is no "one true answer" to the question "how did this old file change to become this new one?".

Instead, what is stored is "what is the smallest/fastest/simplest way to create new file from old file".

(depends on tradeoff)

Which is not the same as what you may have actually done to change the old file into the new one.

One result is that the textual representation you see in "diff" is just one interpretation of how the file changed - since it is reconstructing it from first principles.

I mention this because the author mentions the following positives of the diff mental model: "most of the time I’m concerned with the change I’m making – if I’m just changing 1 line of code, obviously I’m mostly thinking about just that 1 line of code and not the entire current state of the codebase

when you click on a Git commit on GitHub or use git show, you see the diff, so it’s just what I’m used to seeing"

Of course, you see "a diff", not "the diff", because there is no "the diff". And when you change 1 line of code, the diff may or may not show it as having changed one line of code.

It is likely to in most cases, because that usually corresponds to the longest common subsequence that common diff algorithms optimize for. But not always.

Which leads to interesting results when you were thinking about a change one way, but the diff doesn't show what you actually did to the code in your editor :)
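A toy illustration of that ambiguity, using Python's difflib rather than Git's Myers implementation: when identical lines repeat, several diffs are equally valid, and the algorithm just commits to one of them.

```python
import difflib

old = ["foo()", "foo()", "foo()"]
new = ["foo()", "foo()"]

# Deleting the first, second, or third "foo()" would all be correct
# answers; the tool simply picks one of them for display.
print("\n".join(difflib.unified_diff(old, new, lineterm="")))
```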


Does that really matter all that much, though? Often enough, I make a bunch of changes to a repo, but want to split those changes across multiple commits. When I'm staging changes before I make a commit, I'm doing so by looking at the diff (via 'git add -p'). By that point I don't really have any memory of how I made those changes; I don't recall details like "I deleted this line, and then typed this other line in its place", or "I deleted these characters and replaced them with these others", or "I added these characters to an existing line". Ultimately, when I'm looking at it, the diff that git generates is the change, to me.

And yes, I get that there's no "the" diff; git could change its algorithm between versions, and the diffs between commits might change (even though the end results of those diffs would be the same). But ultimately I don't really care about that. My overall thoughts about the change are qualitative, and are represented by what I put in the commit message. Whether the diff is represented as "3 deleted lines, then 4 added lines, and finally 1 deleted line", or "2 deleted lines, then 4 added lines, and finally 2 deleted lines" (or whatever) is pretty irrelevant to me.


I think tracking the "how" of a change can help with review. Currently I see 40 lines deleted in this file and 40 added in another. If this was a "cut and paste" then I may not need to review the code as deeply as it's not new, but I'd focus on if it works in the new location.


> Which is not the same as what you may have actually done to change old file into new one.

I’ve seen this with methods: I add a method, so I expect to see the whole new method in the diff. But actually the diff looks as if I’ve inserted something just before the closing brace of the existing method above the new one.

I’ve learned to detect this when reading diffs but a nice display would be, err, nice.


Git has a few (configurable) diff algorithms; 'histogram' is more readable (IMO, but that's its design goal - or rather that of 'patience', which it extends and speeds up) than the default Myers algorithm, but IME it does still suffer from the issue you describe.

Paper comparing Myers and histogram: https://link.springer.com/article/10.1007/s10664-019-09772-z

Naïvely it seems like it shouldn't be that hard to fix (be less greedy?), or even that it's an example of a case where the algorithm should be less smart: it's such a common case, hardcode paren chars for some kind of exception or second pass. But that is a naïve take; I've never looked at the algorithm or source code at all.


This is sometimes called a “slider”. Difftastic has special logic to try to deal with sliders and get closer to what you want.


Oh yes, I need to remember to show diffs using difftastic.


You'll probably enjoy the patience or histogram diff algorithms. See https://luppeng.wordpress.com/2020/10/10/when-to-use-each-of... for more.

    [diff]
            algorithm = "histogram"
You can also teach git to always show e.g. function names in the diff hunk heading. For example, to always show the previous Markdown heading:

~/.config/git/config

    [diff "markdown"]
            xfuncname = "^(#+\\s*.*)"
~/.config/git/attributes

    *.md diff=markdown whitespace=-blank-at-eol


Thank you; I have configured histogram and we will see.


I think you want the 'patience' option...

    --patience
        Generate a diff using the "patience diff" algorithm.


Git stores states of the repo across time. The diff is something created for the user. Internally, Git might or might not store things in an optimized way; that's an implementation detail.


Ok so I think I at least understand now that “git diff” isn’t just showing me some stored information in the repo’s metadata.

But still, I can’t reason about how so many changes to a repo can be recorded in a way that’s so efficient that it’s almost imperceptible in time cost. And I’ve yet to see a satisfactory answer to this question. I need concrete examples.


Git is basically an object store. Objects in Git are blobs, trees, and commits, each identified by the hash of its contents. For example, a commit points to a directory tree; a tree points to all the files and subdirectories it contains.

With this foundation, Git is already quite efficient in common scenarios since most commits create only a few new objects: the commit, the new versions of the files, and all directory nodes all the way to the top.

Git could store all of these in the filesystem. Often, they are stored in indexed and compressed pack files though. But the underlying principle remains the same.
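Concretely, an object's id is just a hash of its bytes plus a small type header; for blobs this matches what `git hash-object` prints:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the content with SHA-1
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_id(b"hello\n"))
# → ce013625030ba8dba906f756967f9e9ca394464a,
# the same id that `echo hello | git hash-object --stdin` prints
```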


Git does not store changes; it stores a tree with the whole state of the repo for every commit. This tree has references to the files. When a file is changed, a new copy is always added, and the new commit references a new tree that points to the new file.

When you ask git for a diff it compares the trees and when there is a difference it compares the files the trees reference and it shows you a diff.

Git does not store diffs, it stores file trees.


This random stackexchange answer is one of my favorite explanations of Git's storage model: https://cs.stackexchange.com/a/149300/153403


If you do

  git add . && git commit -m wip
then git first checks which files have changed, taking the shortcut of checking whether each file's mtime is newer than the one recorded in the index.

This is basically the speed of

  find .
For files that have changed, it creates new blobs and trees and stores them in the object store; this is basically cp + SHA-1.

Then it takes all that metadata, combines it into one tree, attaches it to a commit message, and it's done.

These steps don't take a lot of time, though they will for very large files.

Going the other direction is fast too: reverse-look up all that metadata and copy the contents back into the working directory.


Since I haven't seen any other responses actually say this:

Git logically stores file contents as blobs in essentially a key-value store. However, the physical storage writes many objects into a single "packfile", and for each object it uses heuristics to look for "likely similar" object candidates already in the packfile, computes deltas based on them, choosing the smallest possible delta or falling back to storing the object as-is if it can't find a good candidate.


This is the best, step-by-step explanation of how git works: https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...


Simplest answer:

Start with a filesystem like this. It has a root object with an id, pointing to a tree. Each tree entry is (name, type, pointer to data):

  root object 1: 
    dir1 directory  2
    dir2 directory  3
    file1 file 4
    file2 file 5
You make a change to file1, and commit it.

What it stores is the following:

  root object 6:
    dir1 directory  2
    dir2 directory  3
    file1 file 7
    file2 file 5
Note that our new root object represents a complete copy of the filesystem, but all we actually stored is the following:

1. The new root object, which is a bunch of pointers

2. the new file data

That is very fast and space-efficient to store, and just as efficient to retrieve. Accessing old data is no slower than accessing new data.

The data is all immutable, all you are storing is pointers to which data is named what.

If you had changed a file in dir1 instead, you would update the root object and dir1. dir2 would be untouched, so you would still store only about as much as above.

This is how git works at a high level.

In practice, most commits change a small number of files. So you will update the root object, and all the directory objects on the way to your new file, but that's it.

Most source code trees are, let's say, 10 directories deep max. So even if you have a thousand directories, you are updating only the directory objects that are parents of your file's path. So maybe 10 directory objects get updated. All others are untouched.

So O(10) changed directory objects to store, plus the root object, plus the new file data. Everything else is reused.

This is very fast.

The worst case for something like the above is where you change a file in every directory. It will then have to rewrite every directory object, even for small file changes.

Or you have flat directories with huge numbers of small files. You then pay the cost of rewriting the directory objects all the time.

But for most source code layouts, the above works very fast.

Even without compression, the total space usage of your repo is O(total amount of changed data) rather than O(total amount of data represented by all root objects)

1 million revisions of the kernel, at 4 GB of data each, would be 4 PB if you were storing complete copies at every revision.

Instead it's like 9 gigabytes or something, total.
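The scheme above can be sketched in a few lines of Python (hypothetical serialization, not Git's actual tree format): a content-addressed store where a "tree" is just a list of (name, child id) pairs, so changing one file creates only the new blob plus the trees on the path to the root.

```python
import hashlib

objects: dict[str, bytes] = {}  # content-addressed store: id -> data

def put(data: bytes) -> str:
    oid = hashlib.sha1(data).hexdigest()
    objects[oid] = data  # storing identical content twice is a no-op
    return oid

def put_tree(entries: dict[str, str]) -> str:
    # A "tree" is just its entries serialized as "name id" lines
    return put("\n".join(f"{n} {i}" for n, i in sorted(entries.items())).encode())

file1 = put(b"v1 of file1")
file2 = put(b"v1 of file2")
dir1 = put_tree({"file1": file1})
root_a = put_tree({"dir1": dir1, "file2": file2})

before = len(objects)
# Change dir1/file1 and build a new root: only the new blob, the new dir1
# tree, and the new root object get stored; file2's blob is shared.
root_b = put_tree({"dir1": put_tree({"file1": put(b"v2 of file1")}),
                   "file2": file2})
print(len(objects) - before)  # → 3
```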


> But still, I can’t reason about how so many changes to a repo can be recorded in a way that’s so efficient that it’s almost imperceptible in time cost.

It's because git does heavy deduplication, in the same way that functional, persistent data structures do heavy deduplication.

For example, if you have a tree data structure and one tree (let's call it tree 1) like this

      A
    /   \
    C    B
And a tree 2 like this

      A
    /   \
    B    D
Where A is a root node, and B, C and D are subtrees (which may be huge; maybe many megabytes each), what actually gets stored for tree 1 is something like this:

      A_tree_1
    /   \
    C    B
And for tree 2:

      A_tree_2
    /   \
    B    D
And ultimately when you store both, you have this:

        A_tree_1    A_tree_2
      /        \     /     \
     /          \   /       \
    /            \ /         \
    C             B          D
Ok that's not the best drawing but I think I got the point across: we managed to have two different trees (tree 1 and tree 2) that each share an identical subtree (the B subtree), and as such B needs to be stored only once.

(But note that the "spine" of the tree - the path from the possibly deduplicated subtrees until the root - will not itself be deduplicated. That is, even if A is identical between tree 1 and tree 2, you still need different A_tree_1 and A_tree_2, because one of them contains pointers to C and B, and the other contains pointers to B and D)

That's how persistent data structures work in functional programming (for example, in Haskell, if you are careful with sharing, you can have two different trees share a subtree rather than making a deep copy of it, lowering memory usage).

And that's how objects in git work (which include commits, filesystem trees, and file contents, that are called blobs). Two different objects that share subtrees will also share storage. That's not a diff, that's just how things are stored in Git. But ultimately, whenever you need to diff, you only need to consider the diff between C and D; B is common between A_tree_1 and A_tree_2 and doesn't need to be diffed.

And the reason this works is that in Git, objects are immutable: you don't modify an existing commit when you do git commit --amend for example, you create an entirely new commit (and the old commit is still in the repository, until you run git gc to get rid of it - but it probably occupies almost no space, because its filesystem tree is probably quite similar to the other commit)

(That's how it works in functional programming too: persistent data structures are commonly used there because in functional programming, you don't modify existing trees, you create new ones whenever you want to do a "functional update"; but you deduplicate the trees so that this update can still be somewhat efficient)

And the mechanism by which this deduplication is implemented in Git is content addressing: every object is identified by the SHA-1 hash of its contents, and every time you try to store an object with a given SHA-1 hash, Git first checks whether that object already exists; if so, it reuses the existing object.

(This means that a "pointer" to a subtree, in git, is just its sha1 hash. So the reason that A_tree_1 and A_tree_2, in Git, need to be different objects, is that inside them there are different hashes: one has C and B, and the other, B and D)

(Deduplication in Haskell's data structures wouldn't use content addressing, but instead you manually tell that a given tree will be built with a subtree from another tree, and then it just works because no tree can be modified)

Note: content addressing is also how deduplication in filesystems like btrfs work. In those filesystems, the kernel will hash a piece of a file (an extent) and check in a hash table whether that particular extent is already stored somewhere. If yes, you don't need to store again. (you can disable online deduplication too, and do it offline, with a batch job)


Agreed, that's my point: the diff model isn't helpful to think about at all.


> Agreed, that's my point: the diff model isn't helpful to think about at all.

Similar to the particle-wave duality in physics, a commit is both a snapshot and a diff, and similar to physics, which way to look at it depends on context.

For example, if you are rebasing branch A onto branch B, then you will want to look at the current state of branch B (the snapshot model) plus the diffs of the relevant commits from branch A (the diff model).

You will get nowhere fast if you try to look at branch A using the snapshot model. (And looking at branch B using the diff model will drive you insane.)

How git stores the data is orthogonal to that, I kind of know in the back of my head but I don't really care too much. It's enough that git gives me access to both the snapshot and the diff aspect of a commit.


Rebasing is nothing more than doing operations on a graph of snapshots, most notably cherry-picking (which is defined in terms of a given commit and its parent). Snapshots are not "how git stores the data", it's not a technical detail; they are the central data structure you're operating on conceptually - also when what you end up with is a calculated diff (which you can easily calculate between any two commits, even those topologically unrelated).


Git could just as well store commits as diffs from other commits, and 99% of users would never notice.

It uses snapshots because that is efficient, but from a user perspective all common git operations look more like they are operating on diffs than snapshots.

When you cherry-pick a commit, the diff of the old and new commit will be similar, but the snapshots are normally completely different; that's the point. The same goes for rebasing, which is like applying a set of patches. Heck, if it goes wrong, you get merge conflicts, which wouldn't happen if you were just manipulating pointers between snapshots.

Git actually makes it quite difficult to manipulate commits as pointers to snapshots instead of diffs -- i doubt most git users even know about `git commit-tree`.

This is why i don't really get the "git is wrong and should work on diffs" crowd. If it did, the user experience would be 99% the same, unless you're manually editing your diffs before committing them.


> Git could just as well store commits as diffs from other commits, and 99% of users would never notice.

Git does delta compression, so in fact it does usually store diffs on disk. That, however, is a technical detail that doesn't actually influence the user and can be 100% ignored as long as you don't mess with its internal files by hand.

What the user actually operates on in the repository are snapshots, with diffs being merely an intermediate representation useful for factoring, reading or distributing stuff. Git is good at confusing the user that it works on diffs, but the sooner you realize that it's not true, the easier it will be for you to work with Git.

And while "git cherry-pick" (and in turn, "git rebase" too) seems like an automated "git format-patch + git am" at first glance, it actually goes further and is using three-way merge for better conflict resolution. It works on snapshots, not diffs.


I know how it works (what I wrote is technically correct); there's no need to be condescending. Anyway.

> Git does delta compression, so in fact it does usually store diffs on disk.

Yep. The comment you replied to originally was pointing out that snapshots and diffs are isomorphic, so i'm glad we all seem to agree.

> What the user actually operates on in the repository are snapshots

They don't, though. Git doesn't show you the tree IDs in normal operation, and you can't actually make commits that point to specific trees (snapshots) without unusual commands.

> it actually goes further and is using three-way merge for better conflict resolution. It works on snapshots, not diffs

It doesn't really matter what it's working on, the best mental model to understand cherry-picking and rebase is that of applying diffs. Can you (in general) even explain things like rebase and cherry-pick without the terminology of diffs? The git manual doesn't bother.


> there's no need to be condescending

Didn't want to, consider "you" to be plural in my last comment, or replace it with "one". I honestly believe that reasoning about commits as "diffs" leads to nothing but confusion.

> They don't, though.

They do. The fact that to write a letter you type each character separately doesn't mean that what you're operating on in a text editor are one-char diffs. You're composing a single letter to save - just like in Git, where you're composing a single state of the repo to then commit (or in other words, to snapshot it). How exactly you compose that state (by using index, or commit-tree, or subtrees, or placing files directly in .git, or...) is irrelevant to the resulting repository graph - and that graph of snapshots is ultimately the data structure that you're conceptually operating on (regardless of how it's represented on the disk).

You work on commits, not trees. Commits are snapshots of your files. Trees and blobs are just how Git represents your files - almost an implementation detail. From the user PoV, Git could even be creating new directories in .git with copies of your whole working dir for each commit and nothing would change conceptually, it's irrelevant to the high-level mental model of a Git repository.

> Can you (in general) even explain things like rebase and cherry-pick without the terminology of diffs?

A diff is a result of an operation applied to two snapshots. Cherry-pick executes that operation and uses the result of it (at least conceptually). You can't think of it as operating directly on "commits as diffs", because then things like cherry-picking a merge commit wouldn't make any sense, while they still make perfect sense and are easy to explain with the "commit is a snapshot" mental model. Some things in Git calculate diffs between two commits and use that result in some way, but that doesn't change the model of the repository.

And because Git often shows you diffs for convenience, it's easy to develop a wrong mental model of the repository - a model that most people operate on, but which will bite you sooner or later. That's exactly why so many people end up being confused with Git.


> things like cherry-picking a merge commit wouldn't make any sense, while they still make perfect sense and are easy to explain with the "commit is a snapshot" mental model

IMO it makes just as much sense either way. When cherry-picking a merge commit you have to specify which parent to diff against, which could just as easily be explained in terms of diffs (i.e. which part of an n-way diff to apply).

> it's easy to develop a wrong mental model of the repository - a model that most people operate on, but which will bite you sooner or later. That's exactly why so many people end up being confused with Git

This doesn't match my experience (as "that guy that people go to for git help"). Perhaps you have a concrete example.

Just to be clear, i'm not advocating for teaching or believing that git works in a way that it doesn't, that would be silly. More that being able to think about it in different (equivalent) ways in different situations is helpful.


As also "that guy that people go to for git help", I've seen people clearly surprised that:

- cherry-pick creates "duplicated" commits

- you can checkout a commit, rather than a branch

- two commits in the same repo may be topologically unrelated to each other

- a merge commit can contain changes unrelated to its parents (usually after making some by accident)

- you can git reset --soft

Those are just the ones that came to my head on a whim (and don't get me started on rebase-heavy workflows). I've had people telling me that it "finally clicked" when made aware that commit is a state rather than a change. Explaining what I just did to "fix" someone's repo is also often easier after a proper "commit as a snapshot" prelude. If people who used Git for years say "oh, that makes much more sense now" after being presented with basic Git concepts, it suggests that their mental model may have been somewhat flawed.

Once you internalize that commit is a snapshot, going from that to thinking about diffs between two commits is easy. The other way around is not so obvious - when checking out a commit from complicated topology, it may not be immediately clear how to get from one state to another by "applying diffs" even if it's technically equivalent, so simple operations will end up seeming like undecipherable magic to you simply because they don't fit your mental model very well.


I don't see any of those things as working differently in a diff-based SCM, sorry. If those things conflict with someone's model, then that model is more/differently wrong, and not the one that i'm talking about, which is equivalent to how it actually works.


We're talking about models used by novice people to learn and understand Git, not about theoretical equivalency of mathematical graph transformations - the latter is a truism and doesn't bring much to the table, while the former will obviously be more or less flawed for a while until you become more experienced and fill in the gaps. People who may even have no familiarity with graph theory don't gain anything from thinking about commits as diffs in Git.

The whole thread started with a notion of "particle-wave duality", which is technically true, but useless in practice when "diff between two arbitrary commits" is one of the simplest operations to think about in terms of graphs of snapshots.


You claimed the diff was stored - that's what the correction is about. Git's primary storage model is of complete snapshots.


https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...

Tom's 15-year-old text remains the only correct, useful, easy-to-understand explanation of the why and how of git I've ever encountered in popular circulation.

Post like the OP's do far more harm than good by introducing mental models comprised of best-effort confused half-knowledge.

If (generic) you ever wanted to "really learn git", do yourself a favor and read it.


When we ask git to show us a commit, it shows us a diff, because that’s the most helpful model.


If we want to ask git to show us a commit, we need to type

    git cat-file commit HEAD
`git show` does not "show a commit" as such, its man page is pretty clear about that:

       For commits it shows the log message and textual diff.


That’s exactly the point. git show HEAD shows the log and textual diff because that’s what matters most from the user’s perspective.

The internal representation doesn’t matter for most users and use cases.


> Of course, you see "a diff", not "the diff", because there is no "the diff"

10–15 years ago when I cared about this sort of thing and I was unhappy with the way that Mercurial diffed two sets of files, I'd take steps just before submitting a patch to edit it by hand so both I and the reviewer had a clearer picture that matched the actual change I was making, in contrast to some goofy-looking and confusing insertion/deletion combo keyed off an arbitrary closing brace or whatever.


> when you change 1 line of code, the diff may or may not show it as having changed one line of code.

Can you offer an example where an editor is used to change one line of code, and a resulting diff leads the user to believe that the number of changed lines is not one?

I don't think people interpret an adjacent opposing pair (one line removed, immediately followed by one line added) as not show[ing] it as having changed one line of code.


A diff like this is perfectly valid:

  -foo
  -bar
  -baz
  +foo
  +bar
Although I don't know any tool which would actually generate such a diff.


Oh! I was keeping my thoughts scoped to the output of existing tools. Beyond that doesn't seem worth considering in this particular context.


The existing tools already let you choose the algorithm used to generate the diff, so it's not a big stretch to imagine an algorithm that may do just this. It doesn't make much sense for such 1 line example (although it could if you had, say, an algorithm that considers whole paragraphs, or blocks in some programming language, rather than lines - could be useful for line art perhaps), but with more complex changes there won't be a single "optimal" representation, especially when what's "optimal" depends on what you want to optimize for.


> Instead, what is stored is "what is the smallest/fastest/simplest way to create new file from old file".

I think this may be misleading to people reading your comment. Git is a snapshot based version control system (Maybe you know this). Each commit contains a complete exact snapshot of the entire repository. On the other hand it does contain optimizations to reduce the size of these snapshots in the form of git packs, and in that sense your comment is technically true, but this is really just an implementation detail.


Even the fact that it's snapshots (but possibly compressed) is an implementation detail. I actually really like the generalized "smallest/fastest/simplest way to create new file from old file" explanation. "Fastest" was a primary goal of git. Same with Mercurial. Other distributed revision control tools that were designed around the same time (bzr, darcs) didn't optimize for speed and you could tell.


> I actually really like the generalized "smallest/fastest/simplest way to create new file from old file" explanation.

You might like that explanation because it fits a mental model you may have acquired previously, but at the user interface level, Git stores snapshots. It's not an implementation detail, it's what the entire command line is based on.


> at the user interface level, Git stores snapshots

Where in the user interface? All the common commands act like they are working with diffs.

It amazes me that people keep repeating this (there's another thread of the same stuff here https://news.ycombinator.com/item?id=38896658#38897488 ). It's like "actually git commits are snapshots not diffs" is such a powerful meme that those who repeat it forget about their day-to-day experience of working with git, in which it tries as hard as possible to hide the snapshots, and let you work with diffs.

The onion really has three layers:

2) In the user interface, all common commands work as if commits store a reference to a parent commit and a diff.

1) Commits actually store a reference to a tree (snapshot) and a parent commit; diffs are generated on-demand.

0) Trees (snapshots) may actually be stored in different ways, where "smallest/fastest/simplest way to create new file from old file" makes sense.

Most people live comfortably on layer two. Occasionally you might need to know about layer 1 to do something weird, and layer 0 is only really for people interested in implementation details.


It’s not a meme, it’s explained in one of the first chapters of the official documentation, in a section titled “snapshots, not differences”: https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3...

You and some others in this comment section are simply wrong about this. Everybody is wrong sometimes, that’s no problem. But your righteous tone (“meme”, “amazes me”) after clearly having spent zero minutes of research is not what I would consider a constructive way to engage in a discussion.


Yeah, i'm sorry about the tone, that wasn't my intent.

By meme, i don't mean that it's wrong, it isn't, just that in my opinion it's a thing that people like to repeat because they have heard others say it. That's my only explanation for why someone would say "at the user interface level, Git stores snapshots" when the interface is, i think clearly, mostly about working with commits as changes.

> You and some others in this comment section are simply wrong about this. Everybody is wrong sometimes, that’s no problem.

About what, exactly? As far as i know, what i wrote is factually accurate (or an opinion, which can not be wrong).

Yes, the git manual writes about snapshots, because it's literally true (and sometimes necessary to understand), but it also repeatedly refers to commits as changes, because that is a helpful and equivalent way to think about commits, modulo some details which mostly don't matter.


> That's my only explanation for why someone would say "at the user interface level, Git stores snapshots" when the interface is, i think clearly, mostly about working with commits as changes.

That's what the documentation says, no need to assume a conspiracy. I don't know on what authority you claim to know better, you haven't cited anything.

> About what, exactly? As far as i know, what i wrote is factually accurate (or an opinion, which can not be wrong).

Opinions about facts can be wrong. I've cited the source, maybe you have a better one than the official documentation, but you haven't given it.

> Yes, the git manual writes about snapshots, because it's literally true (and sometimes necessary to understand), but it also repeatedly refers to commits as changes

But it doesn't. Again, you didn't cite anything. Conceptually, and in the documentation, changes are always something that's computed after the fact. I've triple checked just now, the official documentation always talks about the changes between commits, or the change introduced by a commit (which means the changes to the parent commit) -- that's fine and in fact exactly the point. It never uses "commit" as a synonym for "change", or claims that a commit somehow stores a change.


> the official documentation always talks about the changes between commits, or the change introduced by a commit (which means the changes to the parent commit)

That's exactly what i was referring to. If you have a commit, that is both a snapshot (as stored in the tree reference) and a change (because it has references to parent commits, which are also snapshots). Of course the wording is factually accurate, but it refers to computing the change, because that's how it is best explained, and how those commands (rebase/cherry-pick/revert) work conceptually.

If you try to understand those commands without thinking about diffs, you can come unstuck.

Commands like reset of course don't talk about changes, because that's not relevant for how they work. Understanding both models and their equivalence is helpful, IMO.


Basically all atomic-commit based VCs store snapshots. That's how they achieve atomic commits.

Git is copy-on-write, so it's not storing full snapshots either, because that would be space-prohibitive.

All of the storage mechanisms (deltas, etc) are time/space tradeoffs. I don't think there is anything odd about that?

It's true in lots of things beyond VCen


Aren’t there situations with divergent branches where you end up wanting to apply or rewind commits but cannot do so, because the difference between commits is just that, a difference, and not a full snapshot of the repository?


No, the previous post above is correct:

> Each commit contains a complete exact snapshot of the entire repository. On the other hand it does contain optimizations to reduce the size of these snapshots in the form of git packs

I think the problem that you describe is that the divergent branches have made changes in the same file that cannot be automatically reconciled?


I disagree, snapshots is the fundamental model. Otherwise, shallow clones would not be possible.


Of course they would be possible, just harder and less efficient to implement.

The snapshot representation can be transformed into a chain of diffs representation and the other way round without losing anything except performance.


It's _a_ way that an algorithm found out of the infinitely many edit scripts representing the same change; the underlying nitpick in this thread is that an algorithm found it and maybe it's not a great choice (we've likely all seen diff algorithms making some odd choices in the presence of closing braces, for example). But there's not particularly any adjectives to be expected about it.


I'm not sure why it's misleading?

I said smallest/simplest/fastest.

Fastest is fulltext or pointers to existing objects. Smallest is deltas from other objects.

If your complaint is that i'm somehow implying it's file based, i'm not. But as we'll see, your claim that it stores complete exact snapshots is wrong. It is correct at the UI level, but we aren't talking about the UI level, but the storage level. Git is object based, where most objects are files or directories, and it does in fact choose between fastest/smallest/simplest to decide how to represent changes between trees. One way it reduces space is to be copy-on-write, which is why it doesn't store exact snapshots.

I do know git (at the ui level) is snapshot based - i worked on VC systems for about a decade, and remember when git was born. I also used to sit next to Shawn Pearce (RIP :( ) and Junio Hamano's office for 7 or 8 years, so i acquired lots of git knowledge by osmosis.

On my own I go far enough back to have watched Graydon work on monotone before rust ;)

You are correct it is a snapshot based VC, but all parts of the storage are implementation details as well.

The last non-snapshot based VC to have serious followership was CVS, which used individual RCS files and non-atomic commits.

With atomic commits came snapshots. Even SVN was a snapshot-based VC - SVN servers stored things abstractly the same way Git does: each commit was a copy of the entire FS, with pointers to unchanged objects.

More importantly, Git is copy-on-write, so it also doesn't store a complete exact snapshot of the entire repository. It stores the objects changed since the last tree, with pointers into that last tree for unchanged objects.
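This is easy to see by poking at the object store directly. A throwaway sketch (file names are made up) showing that a commit object is just metadata plus pointers, and that an unchanged file's blob is shared between two snapshots rather than stored twice:

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name "You"
echo unchanged > stable.txt
echo v1 > churn.txt
git add . && git commit -qm first
echo v2 > churn.txt
git add . && git commit -qm second

# A commit is metadata plus pointers (tree, parent) -- no diff is stored:
git cat-file -p HEAD

# The unchanged file resolves to the same blob hash in both snapshots,
# i.e. the second tree points back at the object the first commit created:
git rev-parse HEAD:stable.txt
git rev-parse HEAD^:stable.txt
```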

So what you wrote is not correct either.

This is why, say, corruption in older commits can break newer commits.

This is also theoretically an implementation detail. As mentioned SVN had an FS abstraction that presented each commit to the rest of the system as a copy of the filesystem, like git does. The different storage backends under FS chose to store the FS in different ways - some deltas, some plaintext, some pointers to existing plaintext.

However, none of these VC would be really useful if they didn't trade time for space. If they actually stored "a complete exact snapshot", the space usage would be insane.

To wit - a "complete exact snapshot of the entire repository" would be space-prohibitive. The kernel tree has over 1 million commits and the tree is ~4 gigabytes. Without some of the "implementation details", it would be unusable (since this is 4PB, i'm going to assume the fact it was smaller in the past is in the noise).

All storage in VCen is an implementation detail. Git exposes its implementation details more to users than others do.

Git doesn't really have clean abstractions, and because its abstractions break at various levels, it's hard to talk about in the way we are here without saying something that is "wrong" at some level.

For example, git is really object based, not snapshot based, at the lower levels. It's basically an object store of different types of objects, some of them trees, etc. To the user, it appears mostly snapshot based, but you can go mess with the objects in various ways.

SVN (again, for example) has as its main abstraction an FS snapshot, and the files/directories/things in it. There is no access to a lower-level abstraction like you have in git.

This is actually snapshot based - there is no abstraction breaking.


Put another way - if you think about something similar with cleaner abstractions - we'll use an FS with O(1) snapshots (BtrFS, ZFS, etc.)

At the UI layer - to the user they see a snapshot as a complete, exact copy of the filesystem for each snapshot. How this happens is an implementation detail. As far as the user is concerned the system is storing a complete, exact copy of the FS at each snapshot.

At the FS layer, is a snapshot stored as a complete, exact copy of the filesystem? Absolutely not :) It is stored as a pointer to some root object most of the time. If you go further down, the objects may or may not be stored complete. They may be compressed. They may have portions of the object shared with other objects (dedupe and block cloning/sharing), etc.

As you go further down the abstractions, at some point it's not an implementation detail but instead fairly fundamental to the system.

In this case, as with git, the system could not afford to store complete exact copies of the FS, even if it had a way of doing so in O(1) time, which isn't possible in our universe anyway.


Yeah, I know your username, and I'm sure you understand git internals and CAS, etc. I still think your post is misleading to readers as most will walk away thinking git is stacking up deltas on each commit. This is evidenced by three people trying to correct you in this thread. Regarding CAS/CoW, I don't think that is any less of a complete exact snapshot. There are no deltas involved, and the representation is completely independent of any other commits (other than the parent commit hashes).


This is an excellent comment, and one should also note that the claim is wrong. If one line is changed, "a diff" will likely have two lines: one removed, one added. If you change the sentence "I am studying tonight" to "I am not studying tonight", git diff will show one line deleted and another added, when conceptually I didn't delete anything.


> This is an excellent comment and one should also note that the claim is wrong.

I’ll pretend that this isn’t pedantry. I can understand why you’d say this, but I think you just haven’t explored the way that git diff works, including the configuration options it takes.

  $ git diff 'HEAD^'
  diff --git a/file b/file
  index efb8e76..ca03496 100644
  --- a/file
  +++ b/file
  @@ -1,3 +1,3 @@
   This is context.
  -I am studying tonight.
  +I am not studying tonight.
   This is also context.
Try it with --word-diff…

  $ git diff 'HEAD^' --word-diff
  diff --git a/file b/file
  index efb8e76..ca03496 100644
  --- a/file
  +++ b/file
  @@ -1,3 +1,3 @@
  This is context.
  I am {+not+} studying tonight.
  This is also context.


You can also specify format knowing about languages: https://git-scm.com/docs/gitattributes#_defining_a_custom_hu... Or even whole diff converters.


`git diff --word-diff` is wonderful for this

  diff --git 1/old 2/new
  index 7164d6e..61a276c 100644
  --- 1/old
  +++ 2/new
  @@ -1 +1 @@
  I am {+not+} studying tonight


Wow, why isn’t this the default?!


git diff by default outputs valid patch files that you can feed to patch or git apply.

--word-diff does not

Kernel development involves working with patches a lot, so it made sense to default to that, and --word-diff was added later anyway.


Ah, makes sense, thanks for the explanation.

I’ll have to create an alias for my use!
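A minimal sketch of such an alias (the name `wdiff` is arbitrary):

```shell
# Define a word-diff alias. --global writes to ~/.gitconfig;
# use --local inside a repo to scope the alias to that repo instead.
git config --global alias.wdiff 'diff --word-diff'

# Usage: git wdiff HEAD^
```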


> GIT

Just out of curiosity: What do you think GIT stands for?


I don't spend a lot of time editing my comments for word correctness as it's HN, not something that matters.

So you will find plenty of occasional misspellings, etc.


Gust in Time


Gee, It's Torvalds


As an open source project it needs to be recursive:

    GIT: It's Torvald's


GIT: GIT Is Torvald’s


If you care enough about it, you add a comment and it shows up in the diff?


The exact way in which git handles commits is very muddied - it's snapshots on the surface, a bit of diffs when packed, and a lot of operations on commits are actually 3-way merges (including merges, rebases, cherry-picks and reverts). Keeping track of all these matters (esp. the operations that use diffs), but it can also get overwhelming for a tool.

In my opinion, it's probably good enough to understand the model git is trying to emulate. Commits are stored more or less like snapshot copies of the working tree directory with commit information attached. The fact that there is de-duplication and packing behind the scenes is more a matter of trying to increase storage efficiency than of any practical difference from the directory snapshot model. Meanwhile, the more complex git operations (merges, rebases, reverts, etc) use actual diff algorithms and 3-way merges (way more often than you'd imagine) to propagate changes between these snapshots. This is especially apparent in the case of rebases, where the snapshot model falls completely on its side (modifying a commit will cause the same change in all subsequent commits).

This actually makes sense if you consider the development workflow of linux kernel before git. Versions were directories or on CVS and a lot of development was based on quilt, diffutils and patchutils. Git covers all these use cases, though it may not be immediately apparent.

Added later: It's also interesting to look at Mercurial's model. Like Git, Mercurial uses both snapshot and diffs for storage. But unlike the Git way of layering these two, Mercurial interleaves them - as diffs with full snapshots occasionally. This is more like the video codec concept of keyframes (I think that's what inspired it). This means that Mercurial, unlike Git, doesn't need repacking. And while Git exposes its internal model in its full glory, Mercurial manages to more or less abstract it away.


Well-said, although I disagree that it's "muddled".

The data model is that commits are snapshots, and diffs between snapshots are computed as needed. The whole system is designed around this.

Packing is an implementation detail.

The fact that internally it can store snapshots as diffs is more or less unrelated to the user-facing diffs. IMO it's confusing to even mention it in an educational context, except in response to the question of "how does Git prevent repo size from exploding?".


> Packing is an implementation detail.

It's so much of an implementation detail that even if the pack has a diff/delta between the two objects to diff, that WON'T be used to produce the output from git diff.


It's so much of an implementation detail, that git didn't have packing at first! All it had was loose objects ("disk space is cheap"). Packing was later added as an optimization, but the object model is still the same. It doesn't matter whether an object is in a pack file or not, it's treated the same.
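A quick way to see this (throwaway repo, made-up file name): the same object answers identically whether it is loose or packed, because packing changes the storage, not the object model.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name "You"
echo data > f
git add f && git commit -qm c1
blob=$(git rev-parse HEAD:f)
git cat-file -p "$blob"       # read the loose object
git gc --quiet                # repack everything into a pack file
git cat-file -p "$blob"       # same answer, now served from the pack
```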


> although I disagree that it's "muddled"

I understand. I meant muddied (not muddled) in the sense that it can be confusing for beginners. For some reason, many long-time git users also don't seem to progress beyond the initial image they have. (That includes me too - I struggled with rebases for a long time). That description wasn't a criticism of the git model. Git model is clear if you take some time to study it.

> Packing is an implementation detail.

> The fact that internally it can store snapshots as diffs is more or less unrelated to the user-facing diffs.

My point exactly! To summarize, a git user needs to remember only two things:

1. Git commits are modeled as snapshots of work tree.

2. Many operations are (user-facing) diff-based.

Every other detail is a finer implementation detail that's good to know but not essential to get started.


The thing that's muddied for beginners are bad YouTube tutorials (which the internet is full of), not Git or the actual documentation. People should really read the Git documentation, it's very well-written and explains the correct mental model.

Also, people really shouldn't teach implementation details to beginners. Or intermediates. Perhaps anyone who casually mentions that Git stores diffs to anyone not currently opening the source code for Git itself should be disqualified from ever giving explanations for technical stuff ever again.


I agree with your point about the official Git documentation. It is the only one I learned from and it's easy and comprehensive partly due to the involvement of actual git developers. But there is one area where I wish they stressed a bit more. Git documentation talks about the snapshot model so many times - you're never left in doubt how it's stored (including packing). But they don't stress particularly upon the fact that rebases, merges, cherrypicks and reverts are based on diffs (3-way merges). For example, I was expecting the 'drop' operation in interactive rebases to just delete that commit and leave all the subsequent commits intact (except for the DAG linkage). But to my surprise, the change introduced by that commit disappeared from all subsequent commits - leading me to suspect that they were using diffs in this stage. I eventually found a single confirmation of this in the official documents. But it's obscure. In fact, I tried and failed to find it for reference in this reply.
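The `drop` behavior can be reproduced non-interactively with `git rebase --onto`, which replays everything after the dropped commit onto its parent. A sketch with made-up names, dropping the middle commit B of an A-B-C history:

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name "You"
for c in A B C; do
  echo "$c" > "$c.txt"
  git add . && git commit -qm "$c"
done
# Drop B: replay HEAD~1..HEAD (i.e. just C) onto A.
git rebase --quiet --onto HEAD~2 HEAD~1
ls      # A.txt and C.txt survive; B.txt is gone from the rewritten C
```

Because the replay is a 3-way application of C's change, B's file vanishes from every subsequent snapshot, exactly the surprise described above.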


That's a great point, and I think we all agree that the documentation does a poor job of distinguishing between when the "snapshots" and the "diff" models are in use. But what is never exposed in the docs or user interface is the internal implementation details of how snapshots are stored. And that's what I was arguing is not one of Git's many documentation and UX problems.


> But to my surprise, the change introduced by that commit disappeared from all subsequent commits - leading me to suspect that they were using diffs in this stage.

A good way to think about rebase is that it's nothing more than automated reset and cherry-pick. You can rebase by hand without using `git rebase`; it's a convenience tool just like `git bisect`. `drop` does nothing, and removing the line does the same thing - it's just not cherry-picking that particular commit, skipping right to the next line. You can even add new lines to an interactive rebase and cherry-pick completely unrelated commits this way.

Once you know how cherry-pick works, rebase (and revert) becomes clear too.
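A sketch of that equivalence, doing the rebase by hand with only `reset` and `cherry-pick` (all branch and file names are made up):

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name "You"
main=$(git symbolic-ref --short HEAD)   # whatever the default branch is named
echo base > base.txt && git add . && git commit -qm base
git checkout -qb topic
echo t1 > t1.txt && git add . && git commit -qm t1
echo t2 > t2.txt && git add . && git commit -qm t2
git checkout -q "$main"
echo moved > moved.txt && git add . && git commit -qm moved   # main moved on
# Equivalent of `git rebase main topic`:
git checkout -q topic
git branch -f topic-old                 # bookmark the old tip
git reset -q --hard "$main"             # start over from main's snapshot
git cherry-pick topic-old~1 topic-old   # replay topic's two changes
```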


It’s confusing right up until it isn’t. If you rename a file, and edit it then it is helpful to understand that whether that is shown as a creation and a deletion or as a rename plus an edit is immaterial to some parts of git, and yet very important to others, and can change if you squash or rebase.

Git’s abstractions are pretty leaky.


I agree -- so many of the advertised advantages of git depend on operations on diffs, especially the confusing ones that people find difficult to learn, which makes it very confusing for beginners and casual users when they hear "commits are snapshots" said in a tone that seems to imply that thinking of them as diffs is an abhorrent error. Yes, understanding that they are conceptually snapshots is useful, but git wouldn't be git if it didn't do a ton of work on diffs day to day.


> This is especially apparent in the case of rebases, where the snapshot model falls completely on its side (modifying a commit will cause the same change in all subsequent commits).

I disagree. During a rebase is precisely the time the diff model is problematic. A modified commit does not cause changes in subsequent commits.

Modifying commit A is modifying that commits snapshot into A~.

Now the subsequent commits will be cherry-picked on top of A~.

If there are subsequent commits with changes that depends on A, you have a merge conflict.

A~ does not cause changes in B~, C~. The changes of B are applied on top of A~ becoming B~, the changes of C are applied on B~ etc.

Thinking of commits as diffs during rebase is a recipe for confusion


What you've described is a bunch of operations that apply diffs.


You missed the point I was making about snapshots and diffs. In git, the identity of a commit isn't a diff/change. It's a snapshot. Many operations like commit, push, fetch, etc require you to think so too. Based on that definition, the commits are essentially changed if snapshots change - even if the change introduced by them remains the same.

It's clear by your own definition that B~ and C~ are not the same snapshots/commits as B and C. They have absorbed the changes from A to A~ (or the delta from A to A~ is now reflected in snapshots B~ and C~). The fact that diffs on B and C remained the same in B~ and C~ is irrelevant to the commits' identity.

> Now the subsequent commits will be cherry-picked on top of A~

Here is the important point. Cherry-picking is implemented as a 3-way merge. It involves an actual diffing algorithm.

> Thinking of commits as diffs during rebase is a recipe for confusion

Here again, there are two issues. I didn't say that commit have to be thought of as diffs. I said many operations (incl rebase and cherrypicking) use diffs to propagate changes between snapshots. This reasoning is necessary to understand why snapshots B~ and C~ are different from B and C.

The second part is that thinking of rebases in terms of diffs is far from a recipe for confusion (3-way merges actually, but diff is an easier approximation). It actually helped me understand the operations and allowed me to predict the results of different operations in advance. That single realization made Git far more approachable for me and gave me the confidence that I can solve most Git issues without having to delete the copy and clone it again.


I think it's great if thinking of commits as diffs in the context of a rebase works for you. I only caution against it because there are many situations during a rebase where the results can be very confusing with such a perspective. Precisely because a 3-way merge can make things much more complicated.

I think you're muddling the concepts of tree (a snapshot) and commit somewhat. A commit is not merely a snapshot, it's a tree as well as metadata.

> the commits are essentially changed if snapshots change - even if the change introduced by them remains the same.

If by commit you mean tree, then yes. One can think of B and B~ "introducing" the same changes if the diff between A and B is the same as A~ and B~.

For example, say you add a new file in A~ and then cherry-pick B on it, the tree of B~ will not be the same as B, but the diffs of A and B will be the same as A~ and B~.
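One way to see this concretely is `git patch-id`, which hashes the introduced change rather than the snapshot. A sketch with made-up names: cherry-picking B onto a new base yields a different commit and tree, yet the same patch-id.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name "You"
echo base > base.txt && git add . && git commit -qm base      # commit A
git checkout -qb side
echo fix > fix.txt && git add . && git commit -qm fix         # commit B
git checkout -q -
echo extra > extra.txt && git add . && git commit -qm extra   # new base A~
git cherry-pick side                                          # B~
# Different commit and tree hashes ...
git rev-parse HEAD 'HEAD^{tree}' side 'side^{tree}'
# ... but the same patch-id, i.e. the same "introduced change":
git show side | git patch-id --stable
git show HEAD | git patch-id --stable
```

This same mechanism is how git detects already-applied commits during a rebase.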

The main reason I caution against this perspective is that you can easily end up "introducing" other changes when you reorder commits.

Change A-B-C to A-C~-B~ and very often you'll find yourself "introducing" changes from B in C~

That's not to say that doing git show REBASE_HEAD to view the diff of B-C is a bad idea, just that thinking of commits as diffs during a rebase, imo, is often a false friend


> I think you're muddling the concepts of tree (a snapshot) and commit somewhat. A commit is not merely a snapshot, it's a tree as well as metadata.

My intention was to approximate definitions to the bare essentials without losing too much fidelity. This criticism feels like a nitpick (apologies if that wasn't your intention) because the metadata was implied as it's well understood.

> If by commit you mean tree, then yes. One can think of B and B~ "introducing" the same changes if the diff between A and B is the same as A~ and B~.

That is the diff model. You are cautioning against treating commits as diffs during rebasing, and yet insist on using that definition to oppose my notion. My stand is a bit more consistent here. Treat commits as snapshots. But rebase and similar operations use diffs on those snapshots.

> Change A-B-C to A-C~-B~ and very often you'll find yourself "introducing" changes from B in C~

I find this claim bizarre. The change works exactly as expected when viewed as diff (3-way merge) operations. The diffs introduced by C and (B -> B~) end up in commit B~ (tree snapshot + metadata and whatever else necessary - just to be pedantic).


> is “how Git implements it” really the right way to explain it?

Of course not, and nowhere outside of software engineering will you find such widespread confusion between concepts and their implementation.

Do we think of pressing the gas pedal in a car as "accelerate", or as "regulate the intake manifold"?

Commits are one of the most high-level concepts in Git. If an explanation for them resorts to implementation details, it's a bad explanation.


It's funny you mention that, because Design of Everyday Things has a whole part about how abstracting away the implementation leads to people having extremely wrong ideas about how things work.

I think abstraction is good in theory, but at the end of the day the implementation is what is happening. The classic thing of people relying on undocumented behavior, etc etc.

What good is a conceptual representation that doesn't actually align with what happens? It's not like the gas pedal being pressed down is actually what makes a car accelerate, after all! Whenever you're dealing with something _not_ working as expected, those who know the implementation are going to be in a much more comfortable position in general IMO.


>I think abstraction is good in theory, but at the end of the day the implementation is what is happening.

Abstraction is absolutely necessary in practice. It's the language we use to tell a system what we want it to achieve. It's the only way for us to know what the system can do without starting a science project. It's the only way for the implementer to know what to implement. And it's the only way to check whether a system actually does what it's supposed to do.

Of course, knowing how a system actually works is always better than not knowing it. Every abstraction is leaky as so many side-channel attacks clearly demonstrate. But that doesn't mean the abstraction is some theoretical pie in the sky that we can live without.


Is it actually a problem that people have the wrong ideas about how things work?

Sure, if they need to actually know how something works (because, say, they want to dig in and modify the system itself), then of course they need to understand what happens below the abstraction.

But if my goal is "drive a car", then it may not be helpful to know anything about that pedal beyond "car starts moving when I press the accelerator pedal". Because the nice thing is that I don't need to know about fuel injection or whatever happens when I press the accelerator on an ICE car. And the really nice thing is that if I switch from an ICE car to an EV, the basic usage is the same, and it doesn't matter that something completely, entirely different happens when I press that same pedal on those two different cars.

(Ok, well, now we have one-pedal mode on some EVs, but... otherwise...)


The problem with models is that once you've come up with a mental model, you will make assumptions about what happens when you do certain actions based on that model. And even if your model is very good, there will pretty much always be cases where your mental model differs from real life.

For example, with the car analogy, you could have a very simple model that says that the accelerator pedal makes the car go faster - the harder you press the pedal, the faster the car goes. This model will break down, however, when you start driving in rough, wet, or snowy conditions. For those conditions, you need a more complex model that can take into account things like torque and grip. (I think: this analogy is taking me to the limit of my car knowledge!)

I worked on a project involving a device that could be configured. The backend developer's mental model was that the device was always in a configured state, and that you could then save the current configuration as a kind of preset, or replace the current configuration from an existing preset, or just change the configuration on the fly.

The users, however, had a mental model more like files: they would load a configuration, make changes to that configuration, and then select a different configuration, expecting the previous configuration to be updated. In the end, these two mental models proved to be too incongruous, and we needed to switch the backend to use a different system internally that better matched user expectations. (Another option would be to build a UI that made the backend model more clearly.)

In the case of git, the problem to me is that both mental models (diff-based and snapshot-based) are true and shown in different places. If you rely entirely on the diff-based model, you can use git very well, but you'll get weird errors when you start cherry-picking commits. If you rely entirely on the snapshot-based model, you'll be fine up until you want to rebase something, and then git will feel like dark magic. Both of these models can represent a lot of git operations, but if you take them too far, both will allow you to make a number of assumptions that aren't true.

(I think there's also a nice analogy to physics as well here: Newtonian physics is a really good model of the way the universe works, it allows you to do a lot of very precise things with maths. But if you try and build a GPS system using only Newtonian physics, you'll find that your model doesn't match up to reality any more.)


RE: The users had a different mental model

The problem seem to be that the interface wasn't clear about whether it would save implicitly or explicitly.

Which makes me think leaky abstractions aren't so bad as long as they don't facilitate a mental model that works opposite to "what it leaks". Essentially it seems fine as long as it's close enough, like in the example of the car.


> Is it actually a problem that people have the wrong ideas about how things work?

It can become a problem when someone wants to make the switch from the consuming side to the producing side of things, or even just be part of a team that wants to do that. Each person first needs to be able to identify their own faulty concept as a misconception, and any decision made during this process involves a lot of risk.


I have to say this is the first time I've heard accelerating by gas pedal described as undocumented behavior.

> It's not like the gas pedal being pressed down is actually what makes a car accelerate

In proximity to the driver it certainly is the cause.


> It's funny you mention that, because Design of Everyday Things has a whole part about how people abstracting away the implementation leads to people having extremely wrong ideas about how things work.

Which, funnily enough, isn't a problem at all, unless the abstraction is leaky, or the purpose of the thing itself is poorly explained.

The abstraction is "the way things work". If not, something is fundamentally wrong with the design.


Every abstraction is leaky.


Sure, but the amount of leakiness varies a lot. And in practice, most of the time we just don't need to care.

Of course the canonical, annoying examples of the leaky abstractions involve software stacks, where we end up writing a bunch of code, and find that things work great until weird edge cases happen, and then we get upset and have to dig through the abstractions to figure out what's going on.

But the leaks in a lot of real-world abstractions just don't matter. Like the accelerator pedal in a car. That one does leak a bit in some places: for example, pressing the pedal isn't linear. On a (usually ICE) car with a geared transmission, pressing the pedal hard when in gear one gives you different results than when you press it hard when in gear six. But then you switch to an EV and the response is different. Or even switch to a different ICE car with a different transmission, and the response is different. But ultimately that doesn't really matter; the person driving gets used to it in very short order. The leaks in that abstraction just don't matter all that much.


> That one does leak a bit in some places: for example, pressing the pedal isn't linear.

That’s the least of your problems, because it leaks profusely. We could start with the fact that in many scenarios the “accelerator” pedal does not, in fact, accelerate. This is not an edge case - any driver who has had to drive on ice, snow, dirt, or otherwise without traction knows that. What it does do reliably is make the wheels (some or all, depending on your car) go faster.

> But ultimately that doesn't really matter; the person driving gets used to it in very short order.

Let’s recap. I believe you were saying that the abstraction is leaky, the system does not always do what we expect, and even though many of us don’t really care to understand how the system works it’s OK because we sort of adjust and roll with it because for all its faults the system is good enough (and, perhaps much more importantly, we’re used to it). Hold on, it’s almost as if you’re talking about one well-known version control system…

…and on balance I’d agree with you, except I think a serious software engineer is more like an auto mechanic or professional racer rather than a casual driver—“press the pedal to go faster” doesn’t really cut it, and we should have a good and correct model of how Git works.


Not true in practice. Billions of websites work just fine when viewed on wildly different platform stacks, without developers or users ever having to care about differences in CPU architecture or kernel design.


That’s interface compatibility, not abstraction. For your example, you can just as well say that billions of cars work well on wildly different types of roads.

Abstraction as in the throttle example is more like ORM vs. SQL (we all know someone who ran into issues with the former in a non-trivial project). You simplify the idea of X to Y, which works in Z% but breaks down otherwise.


Not understanding how things work can result in very suboptimal decisions, and throttle control is a good example.

Even in a good old ICE car the throttle is not as straightforward as “more ‘gas’ equals more acceleration” (depending on many factors, such as road surface conditions, it can also make you go slower or stop completely).

An airplane would take it to a whole new level. What the throttle does depends on fuel mixture, air density at your altitude, and the configuration of other controls; in fact, it’s probably easier to enumerate the limited scenarios in which the throttle actually makes you go faster, as opposed to straight up crash.

An argument can be made that smart systems could/should, with their layers of abstraction, maintain the illusion that ‘gas’ = acceleration, and that a human should not have to know anything else; but would you trust your life to such an arrangement?

One could argue that the stakes are high in the case of air travel and low in the case of a VCS; but that falls apart once you consider that source code managed in a VCS can itself power a life-critical system. If not knowing the VCS well enough leads to development overhead and eventually to bugs in that system, the human cost is real.


The purpose of the VCS is to abstract away how its concepts are implemented. It should be possible to replace the standard Git implementation with any other that provides the same external interface without the user being affected, or even noticing it.

If that isn't possible, the problem lies with Git, not with users "not understanding" it.


That's going a bit far. Git is heavily optimized for speed. If the implementation changed the UI could possibly stay the same but users would notice.

Now, aside from that I think I generally agree with what you are saying. Git's UI, especially the way things are named, does feel very much tied to the implementation in ways that are a problem.


Have you read the later book Living with Complexity?

It is pretty much this topic; how to reconcile complexity and confusion.

He talks a lot about reaching designs of irreducible complexity as a way to mitigate confusion.

For what it’s worth, I’d say that Git is pretty far from irreducibly complex and thus has preventable confusion baked into it.


See: LLMs and intelligence


People who must diagnose, modify, optimize, or establish procedures for operating the car think of it in the context of the system. Meanwhile, people who strictly want to move forward think of it as the accelerator.

Likewise for engineering tools. If you'd like to use a simple abstraction as your mental model, you'll be able to operate the tool in the basic sense without issue. But as engineers, when something goes wrong with our tools we often don't immediately bring it into a mechanic. We need our tools to do very specific things, and we must understand what they did each time we perform an operation. We establish workflows that build on top of the tools and depend on the tool matching our precise expectations of its behavior. In essence, we are the mechanic for our own car.

And of course, our needs are much more diverse than "make car go forward". So inevitably, we run into trouble more often. What user-facing function of a car is as precise, powerful or intricate as a textual merge, or as reconciling two diverged work histories? There are many ways those operations can be done, and many questions the user must answer about their desired result.


Based on your first paragraph, it's not all engineers who need an in-depth understanding of git. Only git engineers.


> is “how Git implements it” really the right way to explain it?

>> Of course not,

As long as the user is a software developer, I would disagree. If a developer cannot understand an explanation in terms of implementation concepts, I don't want to use or review their code either. By implementation concepts I mean things like object storage, branches being pointers to commit objects, etc. Of course, implementation details and optimizations like packfiles don't matter.

Every user-friendly explanation will have its limits once we get merges, rebases etc. For software developers it is important that their mental models don't hit such limits.

If the user is not a software developer you have a point.


> Do we think of pressing the gas pedal in a car as "accelerate", or as "regulate the intake manifold"?

I'm just curious -- which of the two do you think is the correct mental model to safely drive a car? If you think it's the former, my guess is you've never driven where there are mountains and/or never driven manual.


I've driven in mountains (I'm actually in the mountains, with snow on the ground, even, right now) and for 15 years drove manual (I sometimes miss it, but mostly do not, in the same way that I run Debian now and can't be bothered to run a distro like Gentoo anymore), but I literally never thought of pressing the gas pedal as "regulate the intake manifold". I honestly don't even know what an intake manifold is, though I'm sure if I cared enough, Wikipedia would tell me.

Certainly my understanding of it is a bit more nuanced than simply "accelerate", but ultimately.... "press pedal harder, car moves faster" is... fine? Sure, you have to consider road conditions and (if driving manual) your current gear, speed, RPMs, etc., but... I dunno, I still think "press pedal harder, car moves faster" is a useful mental model to have, to some non-zero level of approximation.

Put another way: if I were to think of all the details of what the accelerator of a car actually does under the hood every time I pressed it, and consider the implications of it and allow that to inform how I press the accelerator, I would have precious little time to actually drive the car... like, the things that actually matter, like steering and staying on the road and avoiding other cars and people and whatnot. When I press the accelerator, I think about the effect it's going to have on the car. And I have a good idea of what effect it's going to have based on my experience driving that car (or any car, really). Not based on any knowledge -- knowledge that I lack! -- of what pressing the accelerator actually does.


The thing I tried to hint at is this: When you approach an incline, you'll naturally push down the pedal to keep your speed. Thus, even if you don't articulate it that way, your model of the gas pedal has nothing to do with acceleration. Yes, "intake manifold" is ridiculous, I grant that, but it's much more likely that the model people develop has to do with something like "effort".


Maintaining speed on an incline is still acceleration. Gravity is just also accelerating you in the opposite direction. The problem here isn't with the parent's mental model of how pressing the pedal is supposed to work, it's with your mental model of what acceleration means. It doesn't mean the number on the speedometer goes up.


The correct way to safely drive a car is intuition – not a "mental model" of anything.

That's why experienced drivers drive so much more safely than novice ones. It's certainly not because in the meantime, they somehow learned how the throttle works internally.

The same basic lesson applies to software. The best design is the one that lends itself most to human intuition. Because when saving a photo from an image editor, users sure as hell aren't thinking "now an inode is created with this and that metadata".


Intuition implies a mental model, just not a conscious one.


I think of it as "regulate the intake manifold" or revving the engine, because I feel it makes more sense. I don't really understand the relationship between how much you push the pedal and how much acceleration it produces. I've given up on trying to understand that, so I don't think of it as acceleration.


I'd argue that you intuitively do. When driving, you know when to press it lightly, and when more, as well as what the expected result will be.

That would mean, I think, that the implementation details are irrelevant, because it's abstracted well enough to serve its purpose - and it's so easy that usually the FIRST driving lesson takes you into actual traffic (at least in Poland).


I disagree. I see it as a control system. I have a target acceleration and I have to change the input to the system to home in on that target. I don't intuitively know how to accelerate at 10 mph/s, but by watching a gauge I could be the control system that tries to do so.


Both need to be known by someone, but I still think this is a useful insight. The implementation details are key if you have to troubleshoot and want to know your tools at a deep level. It's also useful for anyone trying to implement their own version of a product with the same objective from scratch. You want to know the history of how something has been implemented by other attempts. That is a key principle of engineering in general.

But ultimately, the implementation should involve choices and the possibility of change. git commit is saying the current state of the staged tree should be recoverable by any future user of this same repo. There are many possible ways to achieve this goal and how git happens to currently do it is worth knowing for some people. It's certainly been fascinating to read the months of output Julia Evans has put into this git plumbing exploration series. But I don't think it's critical for every single user of git to understand at this level of detail exactly how git works, and in principle, the developers of git should be able to change this without changing the semantics of what it means to commit in any user-facing way.


In German (and curiously also in Chinese) "adding" or "giving gas" is the common idiom for accelerating. It's also sometimes used more generally for putting in more effort.

Git actually presents a very simple view to the user: repository states across time. These states can be compared to each other.


> Git actually presents a very simple view to the user: repository states across time.

That doesn't explain commands like rebase or cherry-pick or even patch. The diff algorithm that generates diffs based on history is very much integral to git, even if it is configurable. There are plenty of git commands that assume you can meaningfully transform a commit into a diff from its parent(s). So commits are sometimes snapshots sometimes diffs, depending on the tool you're using to interact with them.


That's true. However, that's implicit in the way these commands are defined. It has to be that way to make working with Git efficient; Git is optimized for common use cases, such as cherry-picking a single commit.


Or even worse -- regulating the manifold pressure. A resting gas pedal means the engine is pulling a vacuum in the manifold, whereas a fully depressed pedal means the manifold is restored to atmospheric pressure!


> restored to atmospheric pressure!

Cries in sad turbo noises


> the merged commit can actually be literally anything

That cannot be stressed enough. Years ago I searched for a weird bug and could not find it. Of course I did not look at merge conflicts because "they don't really introduce anything new". But it turned out that a merge commit had introduced an arbitrary line, not related to its parents.

Also the opposite is true: You can make a conflict free merge of two correct parents and the result is broken code. It does not seem to happen very often in real life, but it's a risk one should be aware of.
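The conflict-free-but-broken case is easy to reproduce in a throwaway repo. A sketch (the file and branch names here are made up for the demo; assumes `git`, a POSIX shell, and `python3`):

```shell
#!/bin/sh
# One branch renames a function; another branch adds a new caller of the
# old name in a different file. Git merges both without conflicts, but
# the merged result is broken code.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

echo 'def greet(): return "hi"' > lib.py
git add lib.py
git commit -qm 'base: lib.py defines greet()'
base=$(git symbolic-ref --short HEAD)   # works whether the default branch is master or main

git checkout -qb rename                 # branch 1: rename greet -> hello
echo 'def hello(): return "hi"' > lib.py
git commit -qam 'rename greet to hello'

git checkout -q "$base"
git checkout -qb caller                 # branch 2: add a new caller of greet
printf 'import lib\nprint(lib.greet())\n' > app.py
git add app.py
git commit -qm 'add app.py calling greet'

git checkout -q "$base"
git merge -q --no-edit rename           # merges cleanly (fast-forward)
git merge -q --no-edit caller           # merges cleanly too: the branches touch different files
python3 app.py 2>/dev/null && echo 'merge was fine' || echo 'conflict-free merge, broken code'
```

The same shape shows up with renamed variables, changed signatures, deleted config keys, and so on; a textual merge has no idea the two changes interact.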

I wonder how many merge commits there are in the Linux source tree that introduce something new by conflict resolution or break something existing without conflict resolution involved.

At work we are a small enough organization that we don't really merge, all our merges are fast-forward. But that means you move the risks above to rebasing.


> You can make a conflict free merge of two correct parents and the result is broken code. It does not seem to happen very often in real life, but it's a risk one should be aware of.

Correct, and I would say even more than that: in a system of more than minimal complexity, you need to run your tests on your branch with main merged into it before you merge your branch into main. Too often, CI is set up to only test the HEAD of the new branch.


Yes! It is driving me insane that this is not the default setup. If you do the merge of main into the branch yourself, there is a race condition: by the time the tests have completed and you want to merge the branch into main, main may have changed again, so you have to repeat. Precisely because of that, it would be nice if CI guaranteed a correct sequencing of merging and testing.


You need to use a merge queue (aka merge train). It should be completely standard at this point (though it unfortunately isn't). Doesn't help that it's a Gitlab Premium feature.


Ah, that's what a merge train is. I always found the term confusing. We pay for Premium, but we don't use merge trains. We use only fast forward merging, so I guess it's a no-op in that case.


It's not a nop. Merge queues solve the following issue:

1. Author 0 makes branch_0 and tests it in CI. It passes.

2. Author 1 makes branch_1 and tests it in CI. It passes.

3. They both merge their branches.

This can break `master`, irrespective of the kind of merging you use. Fundamentally two people can make changes that are valid (pass CI) independently but not when combined.

That kind of breakage gets more common the more different people you have working on a project. If you just have a handful of people it's relatively rare so you can just accept `master` breaking occasionally. If you have a few dozen people or more then it's pretty much a requirement.

I don't think there are any serious downsides to enabling it.


If you are limited to fast-forward merging, only one of your authors can merge. The second one needs to rebase, and their updated MR will go through CI again, now on top of the first author's commit that has already been merged.


Ah I see what you mean. Yeah that is roughly equivalent to a manual merge queue. It's going to fall apart on medium/large teams because you'll constantly be rebasing and rerunning CI.


I think that is unique to Gitlab though. For example Azure DevOps will rebase the branch for you when completing a pull request.


Oops, that was a slip:

> Of course I did not look at merge conflicts

Should of course been "merge commits".

Sorry, too late to edit.


lots; the tongue-in-cheek name is "evil merge" but it's honestly pretty common precisely because, as you say, the mere textual merge could be a broken result.

"conflict" is a bad name because it sounds like some kind of error; it's just a place where a simple algorithm can't make a judgment call so a human has to. and it's not the only place.

the only reason an "evil merge" is "evil" is because the tooling breaks down a little around it (however! if you look at diffs for commits, nearly all tooling will show you the useful information here, e.g. in gitk it's color coded bold black)


Depending on the context, I definitely think of commits as all three of diff, snapshot, or history.

When I'm coding, I think of myself as "authoring a change" and the commit I post for code review, and later rebase, is the change I'm pushing. (but simultaneously the commit I'm developing against is a snapshot!).

Once code is pushed, I flip over to "history". At that point, I'm using the commit as an identifier for a release, which generally contains multiple changes. The primary questions at that point are "what's in this release?" and "How does it differ from the previous release?" and mentally I see that as a set of changes rather than a single diff, even though production doesn't care how many commits are in the difference.

The unscientific poll excluded "all three". I guess I have to go with "diff" as "most true" but it simply makes no sense when using a commit id to identify what to check out or release.


Well, git makes a distinction between a delta and a diff, and it uses both.

A delta is a lot like a diff, and loosely speaking, you could say it's one kind of diff. And deltas are used (in packfiles) for storing git objects, so in that sense, a commit is (sometimes) stored as a delta.

But each commit takes a snapshot of the entire tree, so in that sense it is a complete snapshot.

However, even without delta object storage, if a file does not change between one commit and another, then its blob object will be reused because its hash will match, so in that sense, a commit stores (loosely speaking, again) a kind of diff.

Also, a commit object refers to a tree object but also zero or more parent commit objects, and via these parent object(s), you can reach past commits, so a commit does also, in a sense, store (loosely speaking) a history of every past commit. (Well, not every, but every one reachable from this branch.)

So, none of the options is 100% wrong, depending on your interpretation and how you define terms.
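For what it's worth, all three views are easy to poke at with plumbing commands in a scratch repo (a sketch, assuming `git` and a POSIX shell):

```shell
#!/bin/sh
# Inspect what a commit object actually contains, and see blob reuse
# for an unchanged file across two commits.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

echo stable > keep.txt
echo v1 > churn.txt
git add .
git commit -qm 'first'
echo v2 > churn.txt
git commit -qam 'second'

# "History": the commit object is a tree pointer, parent pointer(s),
# and metadata -- nothing else.
git cat-file -p HEAD

# "Snapshot": the tree names every file in full, not a delta.
git ls-tree HEAD

# "Diff-like" reuse: the unchanged file resolves to the same blob in
# both commits, so its content is stored only once.
test "$(git rev-parse HEAD:keep.txt)" = "$(git rev-parse HEAD~1:keep.txt)"
echo "keep.txt blob shared: $(git rev-parse HEAD:keep.txt)"
```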


I think you're getting a bit confused. The delta compression of commits in packfiles is purely a storage optimisation. It has zero effect on the semantics of git. It's completely invisible to users and has nothing to do with diffs between commits and their parents.

Git doesn't use diffs at all, except ephemerally for some operations (e.g. displaying diffs between commits, or rebases). Diffs are never stored.


I understand all that.

I'm saying, if you write a survey and one of the possible answers is "diff", but you don't clearly define what you mean by "diff", then don't be surprised if respondents use any reasonable definition that makes sense to them. Ask an ambiguous question, get a mishmash of answers.

The thing that Git uses for packfiles is called a "delta" by Git, but it's also reasonable to call it a "diff". After all, Git's delta algorithm is "greatly inspired by parts of LibXDiff from Davide Libenzi"[1]. Not LibXDelta but LibXDiff.

Yes, how Git stores blobs (using deltas) is orthogonal to how Git uses blobs. But while that orthogonality is useful for reasoning about Git, it's not wrong to think of a commit as the totality of what Git does, including that optimization. (Some people, when learning Git, stumble over the way it's described as storing full copies, thinking it's wasteful. For them to wrap their heads around Git, they have to understand that the optimization exists. Which makes sense because Git probably wouldn't be practical if it lacked that optimization.)

The reason I'm bringing all this up is, if you're trying to explain Git, which is what the original article is about, then it's very important to keep in mind that someone who is learning Git needs to know what you mean when you say "diff". Most people who already know Git would tend to gravitate toward the definition of "diff" that you're assuming (the thing that Git computes on the fly and never stores), but people who already know Git aren't the target audience when you're teaching Git.

---

[1] https://github.com/git/git/blob/master/diff-delta.c


It's pretty clear that "diff" means "what git diff shows", not some obscure internal compression system that most users have no idea exists.


Commits are also used as diffs by several git commands - rebase, cherry-pick, patch, probably others. These commands take a commit and a target branch, but they use the commit as a diff from its parent, not as a snapshot. So it's fair to say that git's own API treats commits as diffs in some situations.


Yes.

It depends what you're doing. Most of the basic commands, they're snapshots. When you get into things like rebase especially, you have to think in terms of diffs.

I'm not sure what "histories" are as a distinct thing from snapshots; it kind of seems like the same thing.


I agree with you, just want to explain what a history is.

A "history" is what other version control systems call a "branch." It's the commit and all of its ancestors.

Some people think of things this way because the commit ID does depend on its ancestors by depending on its parent (which depends on its parent, which depends on its parent...).

This is why, if you rebase a commit, and the content ends up exactly the same (maybe someone already made the equivalent change in both branches), the commit ID still changes: the parent has changed, so the hash changes, so the ID changes.

I hope that makes sense.
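This is easy to verify with the `git commit-tree` plumbing command, which builds a commit by hand. In this sketch the dates are pinned, so the rebuilt child has the same tree, message, author, and dates as the original; only its parent pointer differs, and the ID still changes (assumes `git` and a POSIX shell):

```shell
#!/bin/sh
# Rebuild a commit with identical content, message, author, and dates,
# changing only its parent -- the commit ID still changes.
set -e
export GIT_AUTHOR_DATE='2024-01-01T00:00:00Z'
export GIT_COMMITTER_DATE='2024-01-01T00:00:00Z'
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

echo a > a.txt; git add a.txt; git commit -qm 'parent'
echo b > b.txt; git add b.txt; git commit -qm 'child'
old_child=$(git rev-parse HEAD)
old_tree=$(git rev-parse 'HEAD^{tree}')

# A "reworded" parent: same tree as the original parent, new message.
new_parent=$(git commit-tree "$(git rev-parse 'HEAD~1^{tree}')" -m 'parent, reworded')

# The child rebuilt on top of it: same tree, same message, same dates.
new_child=$(git commit-tree "$old_tree" -p "$new_parent" -m 'child')

test "$(git rev-parse "$new_child^{tree}")" = "$old_tree"   # identical snapshot
test "$new_child" != "$old_child"                           # different ID
echo "same content, different IDs: $old_child vs $new_child"
```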


Regarding rebase, I disagree. Thinking about a branch as a series of diffs can lead to a lot of unexpected outcomes


Snapshots and diffs aren't enough. They don't describe the DAG structure.


There's not a lot more to Git's DAG in a commit besides "Snapshot", "Comment", and "Pointer to previous commits". Even signatures are just a bit of porcelain to the data structure.


One way to explain git that I think is highly useful for beginners (though I haven't been able to prove it) is to say that commits are basically the "project (copy)" folders that people without git create in order to "save".

Then you can explain that git saves them in a highly efficient format (even better than zipping them) and allows fast operations like comparing them.


"history vs snapshot" is a kind of weird distinction that could use some pulling apart.

A commit is a snapshot, but that snapshot includes a reference to prior history. It "isn't" the history per se, but includes a way to get it. When we `git log` it's clearly thinking in terms of a sequence of commits, not a single commit with multiple snapshots.

A commit is not a full history; a lot of stuff is only in the reflog, and the reflog isn't permanent. Losing the branch name is a big deal but there are no good solutions and mercurial's attempt is definitely worse (flashbacks of `svn rename`, which I'm not sure I ever used personally but heard enough horror stories about that I basically refused to actually use any VCS until git saved the world)

A commit (or more often, a series of commits) can be used as a diff by virtue of looking at the history pointer. But there are enough footguns that thinking in terms of diffs alone is clearly harmful.


If you don't like losing branch names then I'm genuinely curious what you think is wrong with Mercurial's solution?


Mercurial branches get in the way. Branches should be recorded in commits, not contain or constrain commits.

`git` records one side's branch name for merge commits, which is great ... except it doesn't record the other, or record anything for fast-forward or rebase. And what it does record isn't directly machine-readable.


Hmm. Are we both talking about Mercurial branches and not maybe bookmarks or something different? It's been a while since I've used mercurial but from what I remember the branch name is recorded as part of the commit?

I'm not sure I understand what you mean by containing or constraining or being in the way, either, sorry.

If you are rebasing you are modifying history generally to purposefully remove a branch in the DAG. I'm not sure you'd want to keep a branch name in that case?


I cannot find any reference for "`git` records one side's branch name for merge commits". From what I understand a merge commit just has two parent references (commit hashes) instead of one.


It sets the commit message to:

  Merge branch 'feature'
(several variations exist) and when people add a real commit message, they generally preserve that much at least.

With certain reasonable git workflows this technically does preserve sufficient metadata, but there are far too many edge cases for completeness.


If we don't care about implementation details then isn't it mathematically equivalent to say that a commit is a diff or a snapshot?

Each commit knows its parent. So from the diff you can calculate the snapshot, and from the snapshot you can calculate the diff.
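In the small, the round trip is straightforward to demonstrate (a sketch; `f.txt` is a made-up file in a scratch repo):

```shell
#!/bin/sh
# Snapshots -> diff, and parent snapshot + diff -> child snapshot.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

printf 'one\n' > f.txt; git add f.txt; git commit -qm 'c1'
printf 'one\ntwo\n' > f.txt; git commit -qam 'c2'

# From two snapshots, compute the diff.
git diff HEAD~1 HEAD -- f.txt > change.patch
git show HEAD:f.txt > expected.txt

# From the parent snapshot plus the diff, rebuild the child snapshot.
git show HEAD~1:f.txt > f.txt
git apply change.patch
cmp -s f.txt expected.txt && echo 'round trip OK'
```

(The caveat from elsewhere in this thread applies: what git computes is *a* diff, not *the* diff, so the equivalence is of resulting states, not of how the change was actually made.)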


Yeah I think you're mostly right though the snapshot mental model is definitely simpler IMO. Also there are some behaviours that would make no sense if it was really stored as diffs, e.g. shallow clones.

Also the ability to generate diffs between arbitrary commits means the mental model for that involves reconstructing a snapshot for each commit from diffs, and then calculating a diff between those snapshots, which I think is a much more complex thing to understand than the real design, even if it would be technically possible to implement it that way.

Good point though!


As it turns out, we care about details like “does diffing two branches show output inside of a week”. Diff-based systems were very slow at constructing a diff between divergent branches.


Sure, but in terms of understanding what to expect git to output, the two points of view should be equally valid.


In a merge commit with 4 parents, what diff does the commit represent?


The diff from the first parent. This isn't breaking any symmetry between the parents, since git already treats the first parent as special.


I've slowly come to realise that many (perhaps even a majority as the blog suggests) think of commits as diffs. This is strange to me.

You know that "report-v2-021119-final.doc" thing you see from people who don't use git? That's version control. That's all git is doing, except it gives you commit hashes, messages and tags instead of ad hoc filenames.

Diffs only come into play during rebases and cherry picking. Otherwise you can and should think of them as snapshots. But not because that's how git implements it. That is indeed irrelevant. You should think of them as snapshots because that's what they are. Because git is version control.


Interesting but for me reading the article left more questions than it answered.

If you are curious about the internals, the referenced blogpost[1] has a detailed explanation of object file reperesentation(s) and I cannot recommend it enough.

[1] https://codewords.recurse.com/issues/three/unpacking-git-pac...


or the git book


That’s interesting! I think of them as diffs, though I know they’re snapshots. More familiar with Mercurial though, which may influence my thinking.


I've used git nearly since it was first released, and I also have (more or less) always known how git implements commits. But to me, they are still diffs.


> It gets a little weird with merge commits, but maybe you just say it’s stored as a diff from the first parent of the merge.

Depending on the specific operation it can act like a diff from any of the parents (though the first is special) or all of them; it's (typically, and the way operations treat it) the result of a three-way merge using the parents and the most recent common ancestor found via graph walking, eliding some complexity.

To the author: perhaps refer back to your earlier post about three-way merging in this section?


I think it's an error to conflate the history of commits with the "history" of the repository, because doing so assumes that each commit represents a progression from the previous one.

To illustrate: I could create 10 small commits from a list of unstaged changes that follow a pattern of progression; I could create 3 commits that group changes by kind (e.g. "add interface and implementation -> lint -> add tests"); or I could do all of it in a single commit. None of those represent reality, because none of these commits were made as the work was done; they were arranged after the fact to conform to a personal standard that has nothing to do with timeline objectivity. It's not a history of anything other than how the committer decided to layer their changes.


Wow, I just realized that I've been playing around with git and mercurial for nearly 17 years! And people are still confused about git! That's amazing. I still love that it's distributed, fast, and it handles branches and merges better than anything before it. I do wish the UI was better.


How often do you use Mercurial, compared to Git?

> I do wish the UI was better.

The times I use Mercurial, it feels like it manages to abstract away the gory details while still achieving almost everything git does. Mercurial even has queues built-in, whereas Git needs something like stgit or topgit to achieve the same. Would you say that Mercurial meets your expectations?


I wish mercurial had won and we were all using it instead of git. I used hg exclusively from 2009-ish to 2016-ish and I still believe it has the better UI (but not perfect). I will say that git made mercurial better over the years, and quite probably mercurial made git better. It's too bad mercurial has all but died; I think general VCS progress has probably slowed without it.


I think the next useful development in source control systems will be to store and process not text but some sort of AST-like structure representing the code. This should allow more intelligent handling of history, merges, refactoring, and code review than is currently achievable.


That's a very old idea that never really worked. There are two main objections against that concept:

* Any tree structure would be highly language-specific. Your VCS wouldn't support experimental or new or proprietary languages. Even worse, for most real languages, operations like refactoring require not a concrete or simple abstract syntax tree but some transformed or enriched (e.g., with type information) variant that is highly specific to the tool that performs the operation (think Eclipse's internal tree structures). Language servers might help here, but they're far from the only implementation.

* In practice, if I have a textual diff and a tool that can generate the required tree, I can always apply that tool to the textual representation, just paying a little bit of compute. Compute is cheap.


I think this idea is a dead end. The Git model is already good for decently large and long-running projects like Linux (although not giant corporation monorepos).

So the version control problem is solved both for all past and future languages which are textual. But with an AST system you make the storage itself much more opinionated. Java and C ASTs are different, surely? So then the same storage can’t be used directly.

I have also thought that these ideas you bring up could be built on top of Git. Git solves the primitive task of storing code in a completely unopinionated way: it doesn't matter if "the tests pass" or whatever. Then what you propose lies on top of that primitiveness.


This feels like a false dichotomy. Is a shadow the absence of light or is it a black blob? Well: both! In a purely physical sense it's the absence of light but in your vision it's a black blob, and these are both true and useful.

Given that she wrote a whole post about it the author obviously knows that both are useful, but I'd expect most people to use both frames of reference and switch as necessary rather than picking just one.


> I think wrong mental models are often extremely useful, and this one doesn’t seem very problematic to me for every day Git usage.

In my experience of helping people with Git, this mental model is usually the main source of their every day confusion. It used to confuse me as well in the past before I realized that mental model was wrong. So yeah, I find it quite problematic.


It feels like OP is a bit stuck in their own preconceptions? I almost feel offended when I read:

> I feel like there’s a whole very coherent “wrong” set of ideas you can have about git

I feel like there's a missed opportunity here where they could have dived into git and figured out the concepts themselves. Too much talk about what should and could be left me an unsatisfied reader.


Reading through this and then thinking about rebases in particular.... I think whoever comes up with a good mechanical explanation of rebasing deserves a prize. I just imagine it as "it figures out diffs and then tries to apply them over at another place" but this runs up against reality way too often.


Does this explanation help?

http://bryan-murdock.blogspot.com/2022/12/git-rebase-explain...

Rebase is really a merge (actually one merge per commit you are rebasing). At least it should be. With mercurial it really seems to be. With git I'm not so sure, because it feels like it's always worse at resolving conflicts than mercurial. I haven't tested it thoroughly or looked at the implementations.


I don't think that's a very good explanation. It only considers one possible use of rebase, to resync a feature/topic branch with a main branch that has diverged since the branch was created. Another common use case for rebase is to clean up an already-linear history.

And even then, when using rebase to "resync" different branches, I don't think it's accurate to characterize it as a merge at all. It's a hard reset plus a series of cherry-picks.

(As an aside, this is why rebase can be much more annoying than a merge! Merging a branch onto another is usually quick and easy. Git's merge algorithms usually do a decent job of figuring out how to do a merge without conflicts. And if there are conflicts, you only have to resolve them once. If instead you rebase, you may have to resolve the same (or similar) conflicts over and over and over, once for each cherry-pick. And there might be conflicts that wouldn't have even occurred with a simple merge. Git's rerere mechanism does automate some of it, sometimes, though.)


A couple things. That blog post shows the case of rebasing a single commit and so yes, it doesn't talk about interactive rebase that lets you reorder and edit commits, and it doesn't talk about rebasing multiple commits.

In the case of multiple commits, it has to perform a merge operation for each of those commits; that's why you end up seeing the same merge conflicts over and over.

For interactive rebase, there are so many scenarios there that it would probably take a few posts to explore them all.


Interesting that this article omits any mention of "cherry-pick".


I recommend striving to understand what `git cherry-pick` does, since that's equivalent to rebasing a single commit without doing anything else. That part is where you get rebase conflicts from. Everything else `git rebase` does is a combination of amending (`git commit --amend`) and squashing commits.


That's a very good point. Thanks for that!


How does this "run up against reality"? It is the reality, full stop, not even eliding tricky details. Whatever made you say that is the part of you that's confused here.


It's not the reality. It is a 3-way merge of each commit you are rebasing, not just applying diffs. Except sometimes when I'm resolving conflicts, git mergetool doesn't seem to supply 3 files to kdiff3 (just two), which really, really bothers me. Never saw that with Mercurial.


It's the reality, it's just applying diffs. The OP's blog actually has an earlier post about how applying diffs uses a 3-way merge; before ort and its more versatile API which can handle just applying diffs, rebase literally used apply machinery.


Sorry, when I think of applying diffs I think of using the patch command that literally takes a set of files and a diff and applies the diff to the files. patch doesn't know any history and doesn't do 3-way merges. In my mind a 3-way merge is different than just "applying diffs"

I'll look for her post about rebase, it sounds interesting, thanks!


diff(old, new) is sort of a degenerate case of diff3(left, base, right) where left = base; if your patch came _from_ git in some capacity it can identify base and doesn't need to fall back
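The degenerate case can be sketched in a few lines of Python (a toy, line-aligned merge — real diff3 aligns chunks via a longest-common-subsequence computation, but the principle is the same):

```python
def naive_merge3(left, base, right):
    """Toy line-aligned 3-way merge (assumes equal line counts).

    Keep whichever side changed relative to base; if both sides
    changed the same line differently, that's a conflict.
    """
    merged = []
    for l, b, r in zip(left, base, right):
        if l == b:              # left didn't touch this line -> take right
            merged.append(r)
        elif r == b or l == r:  # right didn't touch it, or both agree
            merged.append(l)
        else:
            raise ValueError(f"conflict: {l!r} vs {r!r}")
    return merged

base = ["a", "b", "c"]
right = ["a", "B", "c"]
# left == base is the degenerate case: merging just reproduces right,
# i.e. it behaves exactly like "applying the diff" base -> right
assert naive_merge3(base, base, right) == right
```

When neither `left == base` nor `right == base` holds, the merge genuinely needs all three inputs, which is what a plain two-way patch can't give you.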


patch is a program that operates on diffs, which are also sometimes called patches, confusingly. Your reply seems to be conflating patch the program with diffs (AKA patches).


Here's some Python pseudocode of the mental model that works well enough most of the time?

  def rebase(base, curr=HEAD):
      common_ancestor = git_merge_base(base, curr)
      git_checkout(base)  # start from the new base, not the ancestor
      for commit in commit_range(common_ancestor, curr):
          git_cherry_pick(commit)
Does that help?


I imagine it as separating a branch from the rest of the tree and attaching it to a new base commit ("rebase"). Naturally you need to apply all diffs (interpreting commits as diffs here) from the original branch to the new base commit. So you re-based it.


This is like asking if you think of a DAG as nodes, edges, or paths. It’s all just different aspects of the same thing. This is also not specific to Git, all VCSs have that same duality (triplity?), and just different terminology for what they call a node (revision, changeset, whatever).


Is “both” an option?

The guy who invented git uses it to serialize code as diffs over email, so I think that speaks for itself.

At the same time, merge commits basically insist on being understood as snapshots.

My understanding is that for efficiency and performance, what’s under the hood is a little bit of both.


>The guy who invented git

What an inefficient way to spell Linus


The terminology here is a bit muddled. Git commits form a Merkle tree, where each commit describes changes to a graph of objects (aka files and directories) represented by the ancestor commit (or commits, in the case of a merge). In the case of a merge, the commit includes conflict resolutions for any lines modified by multiple ancestors.

The way that is stored internally with binary deltas is clever and interesting but not all that relevant to end users. It's convenient that it's compact and fast. But otherwise it's just an implementation detail. From the outside, each commit basically represents a snapshot of the whole system. And that's exactly what you get when you check one out.

It's the fluff in the git porcelain and various Git UIs that muddies the waters by giving us different tools to produce new commits: stashes, staging, tools for flagging and resolving conflicts (two commits modifying the same thing), rebasing, squashing, etc.

That's where diffs, git patches, merge algorithms and other tools come in. Some of that is part of git. And some of it can be added via third party tools, uis, etc. Or even git alternatives that work in a compatible way. But in the end they just help you produce a new git commit.

Most of us just use the stuff that comes with Git and our IDEs. And of course it doesn't help that most git users have not read the manual section on the Git internals, have a poor understanding of how all this works, and generally only know about and use a small fraction of the git commands and options. It's a complicated tool. And it undeniably works very well for extremely large open source communities.


> where each commit describes changes to a graph of objects

You're propagating misinformation in this sentence. Each commit describes a complete snapshot plus a pointer to the previous one. You do at least describe it correctly later.


Less commented ongoing discussion (currently 1 comment) https://news.ycombinator.com/item?id=38884498


I wonder why that didn't get deduped


diff model - considered harmful

For me, the incongruity between the way git was supposed to work, based on the idea that it stored deltas, and actual observed behavior was maddening.

More than once I completely ripped out a git repo just so I could get around the stupid merge conflicts I would end up with on a single user project.

We should always teach that git stores snapshots, and shows diffs as a convenient fiction for the user.


I suppose that depends on what matters to the user.

I'm not trying to deny you your experience, but I don't think (in 15+ years of use) that knowing that git stores snapshots (and not diffs) has ever really made much of a difference to me.

A commit to me is still a diff. When I commit something, conceptually -- to me, at least -- I'm storing a change, not a new snapshot. Because I didn't create a new empty directory on my hard drive and then retype everything, in the new state that I wanted. I took the existing code and modified it to the new state. How git stores that is not relevant to what I actually did.


I often had "merge conflicts" which I didn't know how to resolve. I thought I had to undo any changes before the delta could be applied. I've wasted more energy and effort as a result, than I like to think about.

Had I known I could just force a new snapshot, I wouldn't have had to nuke the .git folder and its contents to start a new repo, trash the GitHub copy, and start it over, etc.


OK, but if you do some work, and make a backup with `tar -cf ../.tars/backup-001.tar .`, then do some more work, and make another backup with `tar -cf ../.tars/backup-002.tar .`, then you also didn't create a new empty directory and retype everything, but you'd still be storing snapshots, not changes.

(Which is also one way to think about how git conceptually handles it.)


commit == snapshot + reference to parent commit
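That one-liner can be sketched as a toy content-addressed store (everything here is made up for illustration; real git hashes a binary object format with headers, not JSON):

```python
import hashlib
import json

store = {}  # toy object database: hash -> object

def put(obj):
    """Content-address an object: its name is the hash of its content."""
    data = json.dumps(obj, sort_keys=True)
    oid = hashlib.sha1(data.encode()).hexdigest()
    store[oid] = obj
    return oid

def commit(files, parent, message):
    """A commit = a snapshot of the whole tree + a parent pointer."""
    tree = put({path: put(text) for path, text in files.items()})
    return put({"tree": tree, "parent": parent, "message": message})

c1 = commit({"a.txt": "hello"}, None, "init")
c2 = commit({"a.txt": "hello", "b.txt": "world"}, c1, "add b")
# unchanged files dedupe for free: both trees reference the same
# blob hash for a.txt, so "storing the whole snapshot" is cheap
```

The diff you see in `git show` is then computed on demand by comparing a commit's tree with its parent's tree, not read out of storage.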


How it's implemented in git, yes, but I don't think that's useful as a mental model.

A commit is a change -- diff -- from the previous version of the state of the repository. When I make a commit, I don't think of it as making a copy of my repository, adding my changes, and storing the entire result (though that's exactly what many people used to do before version control was common!). I think of it as storing the change that I made, and only the change.


How it's implemented in git, yes, but I don't think that's useful as a mental model.

pointers to content, right. "Remember, git does not track changes, it tracks content," I kept telling myself when I learned how the system works.

As conceptually great as that is, in the early days of learning git, I found the terminology in the command-line interface VERY off-putting, and it played havoc with my mental model. (I say this lest you think I'm unconditionally cheerleading git or the implementation.)

BTW, perhaps because effective use of a VCS is so vital to programming without stress, people tend to think they can learn it on the fly, which leads to all sorts of insecurities about not really knowing how a certain operation works. (I've worked in toxic places, so seen how fear has myriad weird manifestations.)

Anyway, not intending to disagree or correct you, just offering my experience while I suffer winter flu.


You may think of it that way, but it is actually storing a snapshot of the current state, with a pointer to the previous state. And that's what a commit is, it's a snapshot.


I'm well aware of how it actually works, but I don't think git's implementation details are a useful mental model of source control in general.


My answer is Yes.


snapshots in time where all the bits should work... many small tested changes with people coordinating to keep out of each-others way.

Or... a dumpster fire as that one guy we all know breaks the build with a backlog of commits on a Friday... 1 hour before the end of the week.

We all meet "that one guy" eventually. =)


A git commit is a change from one state of the repository to another. A change is a diff, because... that's what a diff -- difference -- is, by definition. So a commit is a diff. (However, the reverse is not true: a diff is not a commit! A diff can be a commit, but a diff can also be a representation of a series of commits, or something else entirely.)

A git tree is a snapshot. A snapshot is, by definition, a view of the world at a point in time. A commit doesn't show you the world: it only shows you what's changed between the previous world and the next world. The state of the tree -- a checkout that comprises all the files in a project, at a particular time, is a snapshot. I guess I can see how a commit might be confused for this: you can check out a tree at a particular commit, and that would be a snapshot. But the commit is not the tree! The commit is not the snapshot! You could also look at it this way: you are not checking out a commit; you are checking out a tree that happens to be identified by the ref that points to a commit. (Technically that's not even the case, or rather it is, but only coincidentally: while we can check out trees by giving git a commit ref, in reality the tree itself has a different object name.)

To relate the two paragraphs above, a commit is the description of the change between two different trees (aka snapshots).

Note that I am deliberately ignoring how git actually implements these things, as I don't think git's implementation is remotely useful as a mental model. Consider: git implements commits as snapshots, but is your mental model of "making a commit" that you make a fresh, empty directory, and then retype all your code from scratch, but with the changes you want to make in the commit? No, of course not, that's nonsense. (Ok, maybe that's too extreme, but I think it's still nonsense to think about it as making a copy of your repository, making your changes to the copy, and then storing the entire new copy.)

A (single) commit is sort of a history, but not really in a useful sense. To me, history is a progression over time. Sure, a single commit gives you the "history" (meh) of a single change, over a short period of time. But when I think of my "git history" or my "commit history", I'm thinking about a series of changes (commits) over some longer period of time. Like "the list of commits between these two tags show the history of changes between these two releases", or, more generally, "all the commits on the main branch show the entire history of the project".
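That "series of commits" view is really just walking parent pointers from a branch tip back to the root — a toy sketch (commit names are made up):

```python
# toy: a repo reduced to commits pointing at their parents;
# "history" falls out of following those pointers from a tip
parents = {"c3": "c2", "c2": "c1", "c1": None}

def history(parents, tip):
    """Walk parent pointers, newest first, like `git log`."""
    out = []
    while tip is not None:
        out.append(tip)
        tip = parents[tip]
    return out

assert history(parents, "c3") == ["c3", "c2", "c1"]
```

(With merges, a commit has multiple parents and the walk becomes a graph traversal rather than a straight line, but the idea is the same.)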

> is “how Git implements it” really the right way to explain it?

No, I don't think so. Implementation details (in the general sense, not specific to git) often don't represent the mental model that's most useful for understanding a system. Git is a little interesting in that it actually is a lot easier to understand how to use git and be productive using git if you understand a bit about its implementation. But that doesn't mean your mental model of a version control system should depend on implementation details.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: