A Zany Scheme for Compact Secure Hashes

Lots of current and near-future tech relies heavily on secure hashes as identifiers; these are usually represented as hexadecimal strings. For instance, in a previous post I threw out the strawman h: URN scheme that looks like this:

<!-- jQuery 1.5.2 -->
<script src="h:sha1,b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad"></script>

Now the problem with this is, these hexadecimal strings are inconveniently long and are only going to get longer. SHA-1 (as shown above) produces 160-bit hashes, which take 40 characters to represent in hex. That algorithm is looking kinda creaky these days; the most convenient replacement is SHA-256. As the name implies, it produces 256-bit hashes, which take 64 characters to write out in hex. The next generation of secure hash algorithms, currently under development at NIST, are also going to produce 256-bit (and up) hashes. The inconvenience of these lengthy hashes becomes even worse if we want to use them as components of a URI with structure to it (as opposed to being the entirety of a URN, as above). Clearly some encoding other than hex, with its 2x expansion, is desirable.

Hashes are incompressible, so we can’t hope to pack a 256-bit hash into fewer than 32 characters, or a 160-bit hash into fewer than 20 characters. And we can’t just dump the raw binary string into our HTML, because HTML is not designed for that—there is no way to tell the HTML parser the next 20 characters are a binary literal. However, what we can do is find 256 printable, letter-like characters within the first few hundred Unicode code points and use them as an encoding of the 256 possible bytes. Continuing with the jQuery example, that might look something like this:

<script src="h:sha1,пՎЦbηúFԱщблMπĒÇճԴցmЩ"></script> <!-- jQuery 1.5.2 -->

See how we can fit the annotation on the same line now? Even with sha256, it’s still a little shorter than the original in hex:

<!-- jQuery 1.5.2 -->
<script src="h:sha256,ρKZհνàêþГJEχdKmՌYψիցyԷթνлшъÁÐFДÂ"></script>

Here’s my proposed encoding table:

    0              0 1              1
    0123456789ABCDEF 0123456789ABCDEF
 20 fghijklmnopqrstu vwxyzþ0123456789
 60 àèìòùáéíóúâêîôûç äëïöüāēīōūăĕĭŏŭð
 80 αβγδεζηθικλμνξπρ ςστυφχψωϐϑϒϕϖϞϰϱ
 A0 БГДЖЗИЙЛПФЦЧШЩЪЬ бгджзийлпфцчшщъь
 E0 աբգդեզէըթժիխծկհձ ղճմյնշոչպջռսվրցւ

All of the characters in this table have one- or two-byte encodings in UTF-8. Every punctuation character below U+007F is given special meaning in some context or other, so I didn’t use any of them. This unfortunately does mean that only 62 of the 256 bytes get one-byte encodings, but storage compactness is not the point here, and it’s no worse than hex, anyway. What this gets us is display compactness: a 256-bit hash will occupy exactly 32 columns in your text editor, leaving room for at least a few other things on the same line.

Choosing the characters is a little tricky. A whole lot of the code space below U+07FF is taken up by characters we can’t use for this purpose—composing diacritics, control characters, punctuation, and right-to-left scripts. I didn’t want to use diacritics (even in precomposed form) or pairs of characters that might be visually identical to each other in some (combination of) fonts. Unfortunately, even with the rich well of Cyrillic and Armenian to work with, I wasn’t able to avoid using a bunch of Latin-alphabet diacritics. Someone a little more familiar with the repertoire might be able to do better.

Responses to “A Zany Scheme for Compact Secure Hashes”

  1. Justin Dolske

    I suppose, if you really want to shorten such things, that distributed lookup table would be even shorter. Perhaps piggyback something on (secure) DNS or some custom lookup service.

    1. Zack Weinberg

      Well, there’s probably going to be a distributed lookup table keyed by these hashes, but taking the full hash out of the document would defeat the purpose, which is to have enough information in the document that you can validate subsidiary resources without having to trust anything but the document itself.

  2. Mysterious Andy

    It’s an interesting idea, but it falls flat if the file/stream uses an encoding, like Windows-1252 (sometimes called ANSI, often mistaken for ISO-8859-1), which lacks one or more of scripts you chose.

    There just isn’t space for 256 printable characters in any single-byte encoding, and I’m not even sure all the multi-bytes include them all.

    Given the history of character encoding, you’re probably really only safe selecting characters from the mother-of-all-fallbacks, US-ASCII. I can’t recall encountering a single encoding in the real world that doesn’t include ASCII while I can think of several off the top of my head that lack Cyrillic.

    It only has 90-something printable characters (and you won’t want to use a bunch of those), but you can still beat hex for density.

    In fact, unless there’s a reason not to use + or / how about Base64? It’s well-known, simple, and gets you 6 bits per character, so the hashes would be a hair over 2/3 as long as hex representations.

    1. Zack Weinberg
      It’s an interesting idea, but it falls flat if the file/stream uses an encoding, like Windows-1252 (sometimes called ANSI, often mistaken for ISO-8859-1), which lacks one or more of scripts you chose.

      Indeed, there is no legacy encoding that contains all of the characters I listed. You have to encode documents that use this scheme in UTF-8. That is 100% fine by me. I’m much more concerned about people not having convenient access to fonts that include all these characters; reading this post on my smartphone, all the Armenian letters were replaced with little boxes.

      I haven’t decided how serious I am about this scheme — that’s why I called it a zany scheme — but the goal is to be more compact than base64 in terms of space taken up on screen. Also, yes, +, /, and = are troublesome in URIs (especially /). Every ASCII punctuation character is assigned meaning by some spec relevant to the web. (The least troublesome characters are -, “_, and.“. I wish base64 had used those instead.)

      1. Jack Lloyd

        There is no reason you can’t use base64 with different punctuation, though. For instance there are multiple filesystem safe base64 encodings that also avoid using /, and some libs implement variants directly (for instance Python’s base64.urlsafe_b64encode).

  3. David Baron

    I think there are plenty of usable characters in the range you want such that you still avoid accents and punctuation and right-to-left and a bunch of other wierd things. In particular, using the (intentionally old) Unicode 4.0 database, if I run:

    head -2040 unicode/4.0/UNIDATA/UnicodeData.txt |
        grep "^0[0-7]" | grep -v '' | grep -v " WITH " |
        grep -v "MODIFIER LETTER " | grep -v "COMBINING " |
        grep -v "SPACING " | grep -v "LIGATURE" |
        grep -v "\(ARABIC\|HEBREW\|SYRIAC\|THAANA\)" |
        cut -d\; -f1-3 | grep "\(N\|L\).$"

    (the very last bit limits it to numeric and letter character classes), I get 506 available characters.

    Now there’s a bunch in there I’d avoid, but I think there’s still plenty. In particular, if you need 256, I’d suggest maybe picking from the following list of 300 (as unicode ranges, in hex): 30-39, 41-5a, 61-7a,c6,d0,de,de,df,e6,f0,fe,391-3a9,3b1-3c9,3e2-3ef,401-45f (maybe exclude some of those), 531-556,561-586.

    1. Zack Weinberg

      Okay, so that’s not that different from the set I used. Your suggested 300 is

      0030-0039 0123456789
      0061-007a abcdefghijklmnopqrstuvwxyz
      00c6      Æ
      00d0      Ð
      00de      Þ
      00df      ß
      00e6      æ
      00f0      ð
      00fe      þ
      03b1-03c9 αβγδεζηθικλμνξοπρςστυφχψω
      03e2-03ef ϢϣϤϥϦϧϨϩϪϫϬϭϮϯ
      0401-040f ЁЂЃЄЅІЇЈЉЊЋЌЍЎЏ
      0430-044f абвгдежзийклмнопрстуфхцчшщъыьэюя
      0450-045f ѐёђѓєѕіїјљњћќѝўџ
      0561-0586 աբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆ

      from which we would definitely need to pull out the Greek and Cyrillic characters that are visually identical (modulo font) to Latin characters. I’d probably also throw out some of those accented Cyrillics, and I don’t want æ or ß because they’re sometimes considered ligatures and might get mangled in transit. But the 3e2-3ef range looks like it should be useful.

      1. Zack Weinberg

        If I’m as picky as I want to be, I wind up 48 characters short:

           0                1
           0123456789ABCDEF 0123456789ABCDEF
        00 0123456789ABCDEF GHIJKLMNOPQRSTUV
        20 WXYZÞabcdefghijk lmnopqrstuvwxyzþ
        60 աբգդեզէըթժիխծկհձ ղճմյնշոչպջռվտրփք
        80 БГДЖИЛПФЦЧШЩЪЭЮЯ бгджилпфцчшщъэюя
        A0 ΔΘΛΞΣΦΨΩϢϤϦϨϪϬϮα βγδζθλμξπσφχψϣϥϧ
        C0 ՖֆЂЉЊЋЏђљњћџϩϫϭϯ
  4. tom jones

    why stop there? if your goal is display compactness, why not encode 12 (or even more) bits per character? i am sure you can easily find 4096 letter-like symbols in unicode.

    and i don’t think you need to be careful when choosing characters, for example to ensure that they don’t look alike. hashes don’t ever differ in a single character or two, they always differ in all characters, so you can easily visually compare two hashes, even when some of the characters might be confused.

    1. Zack Weinberg

      It’s tempting, but I’m having enough trouble finding 256 characters that are sufficiently visually distinct and are all included in one monospace font that everyone has. Also, right-to-left alphabets cannot be used for this, so I very much doubt I could get 4096 letter-like symbols without resorting to CJK ideographs (which take up 2 columns per glyph, defeating the purpose).

      There is no mathematical guarantee that two different documents won’t hash to nearly the same fingerprint. (You may be thinking of the avalanche effect which is about small changes to the input.)

      1. tom jones

        well, there is no mathematical guarantee that two different documents won’t hash to exactly the same fingerprint – there is only very high probability that it wont happen.

        similarly, there is very high (slightly lower, but still very high) probability that any documents you work with (in your lifetime) won’t hash to similar fingerprints..

        (similar principle explains why Git can get away with only showing 7 characters of the hash by default in many places – not that there are no objects with equal first 7 chars, only the probability of two being in the same project/branch/revision is miniscule)

        1. Zack Weinberg

          The people designing cryptographic hashes have gone to considerable trouble to ensure that two different documents are overwhelmingly likely to have different fingerprints.

          They have not gone to any particular trouble to ensure that two different documents are overwhelmingly likely to have fingerprints that differ by more than one bit (except as an accidental byproduct of other desirable properties). In other words, your claim that any documents you work with (in your lifetime) won’t hash to similar fingerprints is only true by accident, and is not something to be relied on.

          I have personally seen Git produce collisions on the first 7 characters of a hash, and I’m sure there are dozens within any repository of significant size.

          1. tom jones

            The people designing cryptographic hashes have gone to considerable trouble to ensure that two different documents are overwhelmingly likely to have different fingerprints.

            IMHO, this statement is factually incorrect. i have researched this, both previously, and now for this conversation, and i am yet to find any sentence that supports this claim. i know that not finding something is not equivalent as proof of non-existence, and i know i am about to quote wikipedia, but still.. (OTOH, it would take for you to cite just one paper to disprove my reasoning, so i am ok with shifting the burden to you – a security researcher ;)

            except for the avalanche effect (which only applies to similar documents), the only thing that ensures two different (unrelated) documents are likely to have different hashes is the size of the hash key, and nothing else. all the requirements for cryptographic hash functions ensure that it is difficult to find two documents that hash to the same fingerprint, not that there are not any (or that they are unlikely): WP:Cryptographic_hash_function

            • it is easy (but not necessarily quick) to compute the hash value for any given message,
            • it is infeasible to generate a message that has a given hash,
            • it is infeasible to modify a message without hash being changed,
            • it is infeasible to find two different messages with the same hash.
          2. to contrast, the best that any hash function can hope for (except for perfect hash functions which is a different thing) is to be as good (unique) as a random sequence of bits: WP:Hash_function#Universal_hashing (emphasis mine)

            A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data.

            to get back to my original point, IMHO, using an alphabet with 20-30% characters that can be confused with another is equivalent to using 256bit vs 512bit SHA2, meaning that you don’t need longer 512bit hash to ensure uniqueness (or to visually distinguish between keys/documents), you need more bits to prevent brute force attacks (or even attacks that exploit a weakness in the algorithm).