git backend, hg cli

LWN has an article with a nice chunky comment thread talking about the history of DVCSes and how git has basically taken over the category. Mozilla, of course, still mostly uses Mercurial, but there are a lot of people who prefer git now, and there are bridges and stuff.

I have a weird perspective on all of this. I hacked on Monotone back in the day, so I have the basic DVCS concept cold, and Mercurial is only a little different; it never surprises me. Git, however… I read the documentation, and I think I understand what’s going on, and then I do something that according to (my understanding of) the documentation should do what I want, and instead it mangles my local repo and I get to spend an hour or two repairing it. Or, in one memorable case, it mangled the remote, shared repo—thankfully that was easily fixed once I figured out what it had done, but I still don’t know why it did that instead of what I expected it to. (A matter of which branch’s HEAD pointer got updated with the result of a merge.) I’ve been actively hacking on projects whose primary VCS is Git for over a year now and this consistently happens to me about once every 20 to 40 hours of coding time.

So I don’t trust Git and I don’t like using it. I do, however, appreciate its speed, which as far as I can tell is down to back-end stuff—storage format, network protocol, and so on. So here’s what I want: I want someone to write an exact clone of the Mercurial CLI that uses git’s back end. I have no time, but I would totally contribute money to the development of this. It has to be an exact clone in terms of command line behavior, though. If that means throwing away front-end features of Git, I am 100% fine with that. I would happily lose the index/working copy distinction, for instance. I could also live with losing support for arbitrary Mercurial extensions; I would miss MQ in principle but I suspect there’s an alternate development model for Mozilla that doesn’t need it. Everyone else seems to manage.

Anyone else interested in something like that?

Responses to “git backend, hg cli”

  1. Ed

    Sounds great! Maybe someone could do it as a Kickstarter project?

  2. Stephan Sokolow

    Ironically, I feel almost an exact mirror of your pain.

    I saw a lot of cursing over Mercurial mangling repositories during my time lurking in the Audacious Media Player development channel, so I’ll never trust it with my data. (They’ve since moved to Git.)

    I can’t live without the index/working copy distinction.

    Every now and then, I go looking to see if anyone’s written a Mercurial-focused analogue to git-svn.

    1. Zack Weinberg

      Curious; short of actual bugs, I can only think of one hg command that has the potential to destroy data, and it’s quite clearly labeled as such (hg strip). In contrast, every git command that either brings new changesets into the repository, or changes the relationship of the working copy to history, seems to have the potential to destroy data. In my hands, anyway. Do you happen to remember what Audacious devs were doing that was causing so much trouble?

      I find index/working copy entirely unnecessary (and an irritating speed bump for people used to older VCSes) but I haven’t had it blow up in my face, so that’s a point in its favor. What do you need it for?

      1. Justin L.

        I totally agree that Git’s command-line interface is confusing, and its docs are only useful if you already understand what the command does.

        “In contrast, every git command that either brings new changesets into the repository, or changes the relationship of the working copy to history, seems to have the potential to destroy data. In my hands, anyway.”

        This is not true at all.

        The only command in Git that destroys commit data (as far as I’m aware, and I’m no expert, so this is probably wrong) is git gc. There are other commands which will wipe out uncommitted changes to your repository, of course, just like hg up -C.

        If you mess up a git rebase or a git reset --hard or anything else, you have not lost any data! All you’ve done is change your branch from pointing to commit X to pointing to commit Y. If you want to change it back to X, you can run git reflog, find the hash of X, and then git reset --hard X. Nothing was lost.
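The recovery described above can be sketched with a toy model. This is only an illustration under simplifying assumptions, not how git is implemented: the `ToyRepo` class and its method names are hypothetical, and a single list stands in for the reflog. It shows the idea that a reset only moves a pointer, while the commit and the record of the old pointer value both survive.

```python
# Toy model only: real git keeps refs and reflogs on disk and does far more.
class ToyRepo:
    def __init__(self):
        self.commits = {}   # commit id -> parent id (None for the first commit)
        self.branch = None  # the commit the current branch points at
        self.reflog = []    # every value the branch pointer has had, newest first

    def _move_branch(self, commit_id):
        self.branch = commit_id
        self.reflog.insert(0, commit_id)

    def commit(self, commit_id):
        # A new commit records the old branch tip as its parent.
        self.commits[commit_id] = self.branch
        self._move_branch(commit_id)

    def reset_hard(self, commit_id):
        # Only the pointer moves; nothing is removed from self.commits.
        if commit_id not in self.commits:
            raise KeyError(commit_id)
        self._move_branch(commit_id)


repo = ToyRepo()
repo.commit("A")
repo.commit("X")

repo.reset_hard("A")       # the "mistake": the branch no longer points at X
previous = repo.reflog[1]  # the entry just before the mistaken move
repo.reset_hard(previous)  # point the branch back at X
```

In real git the corresponding steps are the ones named above: look up the old hash with git reflog, then git reset --hard to it.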

        It’s not intuitive, but it /is/ easy once you get the hang of it.

        1. Zack Weinberg

          It may be true that more skilled hands could have repaired all of my broken Git repositories without having to re-clone. I have occasionally been able to do it myself, even.

          But I don’t especially care, because Mercurial never puts me in this situation in the first place, whereas Git does it to me at least once a week (of active development time). I’m really not kidding about either of those estimates.

      2. Stephan Sokolow

        I don’t remember exactly what they were doing and I don’t think I was logging channel content at the time, but I’ll ask them if they remember and get back to you.

        I know it was something similar to what you’ve experienced with git, though: following their intuitions from another VCS and doing things in a way experienced Mercurial users don’t.

        As for index/working, I started out with an attitude like yours. Now, I use it the way it was intended:

        1. I accidentally make more than one change without committing
        2. I run git gui
        3. I use the right mouse button to select which lines and hunks belong to each change
        4. I make several separate, clean commits

        (I keep git gui open on my second monitor. You can also run it in make-one-commit-then-exit mode as git citool.)

        If you do want to ignore the index/working distinction, just set up an alias like co = commit -a and run git co all the time. It’s mentioned in most tutorials.

        1. Stephan Sokolow

          I asked nenolod but he doesn’t remember what it was and, if Chainsaw still comes on, I haven’t run into him.

          Either way, it was a similar scenario to what Gregory Szorc described. Mercurial trashed the repo badly enough that they had to re-clone and manually copy over their changes.

          Git’s documentation could definitely use some work (and, ideally, a really polished novice-to-expert tutorial that isn’t still under construction), but it really is excellent when it comes to keeping lost data until you git gc in case you need to recover it.

          The only time I’ve ever seen it trash a repo is when I’ve given it a command that makes sense in some obscure use case but didn’t do what I thought it would… and then it did exactly what it was told and no bugs in the code were involved.

        2. Zack Weinberg

          Regarding the value of Git’s staging area, I think this is a matter of the kind of changes I tend to make and the kind of code I tend to be working on: When I’m in the situation where several changes have gotten tangled together, being able to select individual files, or individual diff hunks, or even individual lines and pull them out to their own patch wouldn’t actually help me any, because the interdependence is inevitably within code expressions. My process in that situation is to dump the entire mess into an MQ patch and then manually break it up with Emacs’ diff-mode, which allows me to make arbitrary edits to both sides of the split.

  3. Dirkjan Ochtman

    Yes. In fact, I’ve gotten started on it, although I haven’t gotten very far yet. I have some C code and I even tried some stuff in Rust.

    TBH, I’m not sure you can do an exact clone. Yes, I would get rid of the index and staging and whatever, but I think you have to more or less adopt the branching model (i.e. branches are just pointers to changesets, outside of history) – not that that’s a bad thing, it would just be different from Mercurial.

    Anyway, once I get some further traction on this, I’ll let you know.

    1. Zack Weinberg

      “… you more or less have to adopt the branching model …”

      This worries me. I don’t know why git keeps mangling my repositories, but I know it has something to do with branches, because it never happens as long as the history is perfectly linear and I don’t need to go back in time. I believe my problem is with the CLI rather than the data model, but I could be wrong!

      Anyhow, I shall be very interested to hear what comes of your project, and if there’s anything I can do to help (given my extremely limited time) please let me know.

      1. Justin L.

        I feel like what you need is a decent git-for-mozilla-hackers tutorial. You’re not so much of an old dog that the only hope is to make one tool look exactly like another.

        I’ve been meaning to write this for a while. Maybe I’ll get to it soon. :)

      2. Dirkjan Ochtman

        Yeah, I’m pretty sure it’s not the data model, which I think is solid. It has been explained to me (similar to what Justin mentions) that it’s really hard to lose data with git even when using things that would be equivalent to hg strip; they just disappear from the UI. There are tools to get them back (unless the gc happens, of course). In fact, the guy who explained this to me said that he had had more data loss with Mercurial, in particular by using MQ, which can indeed be a bit dangerous, because it’s far too easy to mess up a patch file on disk.

        Mercurial is actually moving in the direction of being more git-like in this regard with the recent support for changeset phases and the near-future support for changeset obsolescence, which in the long run should replace MQ for most users.

      1. Dirkjan Ochtman

        It’s basically an attempt to build a repository class for git storage that works with all of Mercurial’s other code (i.e. the UI).

        I think the problems with hg-git end up being that because of the way the hashes are built up, you basically need a whole lot of state to be able to talk to remote git repositories from your Mercurial repo, to the point where it’s just easier to keep a .git dir around, too. So hgit is an attempt to just have the .git dir and replace all of the code to implement Mercurial’s storage with git storage equivalents (based on dulwich, which AIUI is a pretty nice Python library for this).
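To make the point about hashes concrete, here is a minimal sketch of why the two systems cannot share identifiers. The git side uses the well-known object id scheme (SHA-1 over a type-and-length header plus the body). The Mercurial side is an assumption on my part: a simplified version of the revlog node hash (SHA-1 over the two sorted parent ids plus the revision text). The function names are hypothetical, and this is not hg-git's or hgit's actual code.

```python
import hashlib

NULL_ID = b"\0" * 20  # stand-in for "no parent"


def git_object_id(kind: str, body: bytes) -> str:
    # Git hashes "<type> <length>\0" followed by the object body.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def hg_style_node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> str:
    # Assumption: simplified Mercurial-style node id, which mixes the
    # sorted parent ids into the hash ahead of the revision text.
    low, high = sorted((p1, p2))
    return hashlib.sha1(low + high + text).hexdigest()


content = b"hello world\n"
git_id = git_object_id("blob", content)
hg_id = hg_style_node_id(content)

# Identical content, different ids, so a bridge has to persist a
# translation table like this for every object it has ever converted.
id_map = {hg_id: git_id}
```

That persistent table is the kind of extra state described above; keeping only a .git directory avoids having to maintain it.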

  4. Axel Hecht

    I guess the difference is between people who work with the DAG in mind and those who don’t. I, too, am very unsure about what git might be doing when I enter the next command.

    I’m not sure whether it’s possible to port Mercurial over to the git backend or vice versa. The concepts behind git that I saw via the GitPython lib seemed rather alien compared with what I’m used to when interfacing with the Mercurial Python code.

    1. Zack Weinberg

      The DAG is conceptually the same between mercurial and git, though. The only difference I know of is that mercurial revisions are indelibly tagged with their branch of origin while git revisions aren’t, and it seems like that shouldn’t be a big deal. A minor point in mercurial’s favor when you need to do archaeology, perhaps.
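The difference described above can be sketched as two tiny data models. All class and variable names here are hypothetical, and both systems store much more than this. The point is only where the branch name lives: inside the changeset (Mercurial-style named branches) or in a separate, movable table of pointers (git-style).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HgStyleChangeset:
    node: str
    parents: tuple
    branch: str  # recorded in the changeset itself, so it never changes


@dataclass(frozen=True)
class GitStyleCommit:
    node: str
    parents: tuple  # no branch field: the commit does not know its branch


hg_history = [
    HgStyleChangeset("a", (), "default"),
    HgStyleChangeset("b", ("a",), "feature"),
]

git_history = [
    GitStyleCommit("a", ()),
    GitStyleCommit("b", ("a",)),
]
# Branches are pointers kept outside history.
git_refs = {"master": "a", "feature": "b"}


def hg_branch_of(node: str) -> str:
    # Archaeology is a direct lookup: the answer is stored with the changeset.
    return next(cs.branch for cs in hg_history if cs.node == node)


# Deleting the pointer forgets the name, but the commit itself is untouched.
del git_refs["feature"]
```

After the deletion, nothing in the git-style model says that commit "b" was ever on "feature", while the Mercurial-style changeset still carries the name.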

  5. Gregory Szorc

    I’ve been hacking on hg-git the last few weekends to make it faster. As part of my work, I’ve been trying to factor out things into a generic library (one that isn’t so tightly coupled with being a Mercurial extension). Hopefully the end result could be used to power a frontend like you desire.

    From a technical perspective, using Git as the sole backend for Mercurial should be doable. Git supports arbitrary metadata in commit objects. So, everything you store in Mercurial can be captured by Git. There may be some rewriting going on, but that would all be transparent to the driver. The only incompatible difference between their storage models is that Git supports N>2 parents on a commit (an octopus merge) whereas Mercurial only supports up to 2 parents. Also, Mercurial does store per-file history. So, some file-level query operations would not translate well to Git at the API level.
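As a small illustration of the parent-count mismatch, the sketch below scans a commit graph for merges that a two-parent model could not represent directly. The function name and the sample graph are hypothetical. It says nothing about how such commits would actually be rewritten.

```python
def commits_needing_rewrite(parents_by_commit, max_parents=2):
    """Return commits with more parents than a Mercurial-style model allows."""
    return sorted(
        commit
        for commit, parents in parents_by_commit.items()
        if len(parents) > max_parents
    )


# Hypothetical git-style history: three topic commits, one ordinary merge,
# and one octopus merge with three parents.
dag = {
    "base": [],
    "t1": ["base"],
    "t2": ["base"],
    "t3": ["base"],
    "merge2": ["t1", "t2"],
    "octopus": ["t1", "t2", "t3"],
}

problem_commits = commits_needing_rewrite(dag)
```

Only the three-parent merge is flagged; the ordinary two-parent merge fits either model.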

    While I’m here, I will weigh in that I’ve had Mercurial destroy repository data multiple times, forcing me to re-clone. I’ve never had a Git repo get so corrupted that I couldn’t dig myself out from looking at the reflogs combined with git reset --hard. Git is actually very good about not throwing away data. I will admit that the tooling for recovery when you shoot yourself in the foot could be better.

    1. Zack Weinberg

      As I say upthread, it’s quite possible that the Git repositories I gave up on could have been repaired without re-cloning, but I don’t consider that a point in Git’s favor, because Mercurial never puts me in that situation in the first place. Never is not an exaggeration. I’d really, seriously like to know what you did to blow up a Mercurial repo, because it just doesn’t seem possible to me.

      1. Robert O'Callahan

        Hitting ctrl-C during various operations has corrupted my repo in the past, forcing me to reclone.

        The git interface seems overly complex to me, but I don’t think one can avoid learning git these days, so it makes sense for everyone to standardize on git as the DVCS until/unless something fundamentally better comes along.

        1. Gregory Szorc

          Nearly all of my corruptions have occurred as a result of hitting ctrl+c as well. I now know a) never to ctrl+c hg, and b) not to take any chances if I do, and to just re-clone in case.