Lots of current and near-future tech relies heavily on secure hashes as identifiers; these are usually represented as hexadecimal strings. For instance, in a previous post I threw out the strawman h: URN scheme that looks like this:
<!-- jQuery 1.5.2 -->
<script src="h:sha1,b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad"></script>
Now the problem with this is, these hexadecimal strings are inconveniently long and are only going to get longer. SHA-1 (as shown above) produces 160-bit hashes, which take 40 characters to represent in hex. That algorithm is looking kinda creaky these days; the most convenient replacement is SHA-256. As the name implies, it produces 256-bit hashes, which take 64 characters to write out in hex. The next generation of secure hash algorithms, currently under development at NIST, are also going to produce 256-bit (and up) hashes. The inconvenience of these lengthy hashes becomes even worse if we want to use them as components of a URI with structure to it (as opposed to being the entirety of a URN, as above). Clearly some encoding other than hex, with its 2x expansion, is desirable.
Hashes are incompressible, so we can’t hope to pack a 256-bit hash into fewer than 32 characters, or a 160-bit hash into fewer than 20 characters. And we can’t just dump the raw binary string into our HTML, because HTML is not designed for that—there is no way to tell the HTML parser “the next 20 characters are a binary literal”. However, what we can do is find 256 printable, letter-like characters within the first few hundred Unicode code points and use them as an encoding of the 256 possible bytes. Continuing with the jQuery example, that might look something like this:
<script src="h:sha1,пՎЦbηúFԱщблMπĒÇճԴցmЩ"></script><!-- jQuery 1.5.2 -->
See how we can fit the annotation on the same line now? Even with sha256, it’s still a little shorter than the original in hex:
<!-- jQuery 1.5.2 -->
<script src="h:sha256,ρKZհνàêþГJEχdKmՌYψիցyԷթνлшъÁÐFДÂ"></script>
Here’s my proposed encoding table:
0 0 1 1
0123456789ABCDEF 0123456789ABCDEF
00 ABCDEFGHIJKLMNOP QRSTUVWXYZÞabcde
20 fghijklmnopqrstu vwxyzþ0123456789
40 ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛÇ ÄËÏÖÜĀĒĪŌŪĂĔĬŎŬÐ
60 àèìòùáéíóúâêîôûç äëïöüāēīōūăĕĭŏŭð
80 αβγδεζηθικλμνξπρ ςστυφχψωϐϑϒϕϖϞϰϱ
A0 БГДЖЗИЙЛПФЦЧШЩЪЬ бгджзийлпфцчшщъь
C0 ԱԲԳԴԵԶԷԸԹԺԻԽԾԿՀՁ ՂՃՄՅՆՇՈՉՊՋՌՍՎՐՑՒ
E0 աբգդեզէըթժիխծկհձ ղճմյնշոչպջռսվրցւ
All of the characters in this table have one- or two-byte encodings in UTF-8. Every punctuation character below U+007F is given special meaning in some context or other, so I didn’t use any of them. This unfortunately does mean that only 62 of the 256 bytes get one-byte encodings, but storage compactness is not the point here, and it’s no worse than hex, anyway. What this gets us is display compactness: a 256-bit hash will occupy exactly 32 columns in your text editor, leaving room for at least a few other things on the same line.
Choosing the characters is a little tricky. A whole lot of the code space below U+07FF is taken up by characters we can’t use for this purpose—composing diacritics, control characters, punctuation, and right-to-left scripts. I didn’t want to use diacritics (even in precomposed form) or pairs of characters that might be visually identical to each other in some (combination of) fonts. Unfortunately, even with the rich well of Cyrillic and Armenian to work with, I wasn’t able to avoid using a bunch of Latin-alphabet diacritics. Someone a little more familiar with the repertoire might be able to do better.
I suppose, if you really want to shorten such things, that distributed lookup table would be even shorter. Perhaps piggyback something on (secure) DNS or some custom lookup service.
Well, there’s probably going to be a distributed lookup table keyed by these hashes, but taking the full hash out of the document would defeat the purpose, which is to have enough information in the document that you can validate subsidiary resources without having to trust anything but the document itself.
It’s an interesting idea, but it falls flat if the file/stream uses an encoding, like Windows-1252 (sometimes called “ANSI”, often mistaken for ISO-8859-1), which lacks one or more of scripts you chose.
There just isn’t space for 256 printable characters in any single-byte encoding, and I’m not even sure all the multi-bytes include them all.
Given the history of character encoding, you’re probably really only safe selecting characters from the mother-of-all-fallbacks, US-ASCII. I can’t recall encountering a single encoding in the real world that doesn’t include ASCII while I can think of several off the top of my head that lack Cyrillic.
It only has 90-something printable characters (and you won’t want to use a bunch of those), but you can still beat hex for density.
In fact, unless there’s a reason not to use “+” or “/” how about Base64? It’s well-known, simple, and gets you 6 bits per character, so the hashes would be a hair over 2/3 as long as hex representations.
Indeed, there is no legacy encoding that contains all of the characters I listed. You have to encode documents that use this scheme in UTF-8. That is 100% fine by me. I’m much more concerned about people not having convenient access to fonts that include all these characters; reading this post on my smartphone, all the Armenian letters were replaced with little boxes.
I haven’t decided how serious I am about this scheme — that’s why I called it a zany scheme — but the goal is to be more compact than base64 in terms of space taken up on screen. Also, yes, “+”, “/”, and “=” are troublesome in URIs (especially “/”). Every ASCII punctuation character is assigned meaning by some spec relevant to the web. (The least troublesome characters are “-”, “_”, and “.”. I wish base64 had used those instead.)
There is no reason you can’t use base64 with different punctuation, though. For instance there are multiple ‘filesystem safe’ base64 encodings that also avoid using ‘/’, and some libs implement variants directly (for instance Python’s base64.urlsafe_b64encode).
I think there are plenty of usable characters in the range you want such that you still avoid accents and punctuation and right-to-left and a bunch of other wierd things. In particular, using the (intentionally old) Unicode 4.0 database, if I run:
head -2040 unicode/4.0/UNIDATA/UnicodeData.txt | grep “^0[0-7]” | grep -v ” | grep -v ” WITH ” | grep -v “MODIFIER LETTER ” | grep -v “COMBINING ” | grep -v “SPACING ” | grep -v “LIGATURE” | grep -v “\(ARABIC\|HEBREW\|SYRIAC\|THAANA\)” | cut -d\; -f1-3 | grep “\(N\|L\).$”
(the very last bit limits it to numeric and letter character classes), I get 506 available characters.
Now there’s a bunch in there I’d avoid, but I think there’s still plenty. In particular, if you need 256, I’d suggest maybe picking from the following list of 300 (as unicode ranges, in hex): 30-39, 41-5a, 61-7a,c6,d0,de,de,df,e6,f0,fe,391-3a9,3b1-3c9,3e2-3ef,401-45f (maybe exclude some of those), 531-556,561-586.
Okay, so that’s not that different from the set I used. Your suggested 300 is
from which we would definitely need to pull out the Greek and Cyrillic characters that are visually identical (modulo font) to Latin characters. I’d probably also throw out some of those accented Cyrillics, and I don’t want æ or ß because they’re sometimes considered ligatures and might get mangled in transit. But the 3e2-3ef range looks like it should be useful.
If I’m as picky as I want to be, I wind up 48 characters short:
why stop there? if your goal is “display compactness”, why not encode 12 (or even more) bits per character? i am sure you can easily find 4096 “letter-like” symbols in unicode.
and i don’t think you need to be careful when choosing characters, for example to ensure that they don’t look alike. hashes don’t ever differ in a single character or two, they always differ in all characters, so you can easily visually compare two hashes, even when some of the characters might be confused.
It’s tempting, but I’m having enough trouble finding 256 characters that are sufficiently visually distinct and are all included in one monospace font that everyone has. Also, right-to-left alphabets cannot be used for this, so I very much doubt I could get 4096 letter-like symbols without resorting to CJK ideographs (which take up 2 columns per glyph, defeating the purpose).
There is no mathematical guarantee that two different documents won’t hash to nearly the same fingerprint. (You may be thinking of the “avalanche effect” which is about small changes to the input.)
well, “there is no mathematical guarantee that two different documents won’t hash to exactly the same fingerprint” – there is only very high probability that it wont happen.
similarly, there is very high (slightly lower, but still very high) probability that any documents you work with (in your lifetime) won’t hash to similar fingerprints..
(similar principle explains why Git can get away with only showing 7 characters of the hash by default in many places – not that there are no objects with equal first 7 chars, only the probability of two being in the same project/branch/revision is miniscule)
The people designing cryptographic hashes have gone to considerable trouble to ensure that two different documents are overwhelmingly likely to have different fingerprints.
They have not gone to any particular trouble to ensure that two different documents are overwhelmingly likely to have fingerprints that differ by more than one bit (except as an accidental byproduct of other desirable properties). In other words, your claim that “any documents you work with (in your lifetime) won’t hash to similar fingerprints” is only true by accident, and is not something to be relied on.
I have personally seen Git produce collisions on the first 7 characters of a hash, and I’m sure there are dozens within any repository of significant size.
“The people designing cryptographic hashes have gone to considerable trouble to ensure that two different documents are overwhelmingly likely to have different fingerprints.”
IMHO, this statement is “factually incorrect”. i have researched this, both previously, and now for this conversation, and i am yet to find any sentence that supports this claim. i know that “not finding something” is not equivalent as “proof of non-existence”, and i know i am about to quote wikipedia, but still.. (OTOH, it would take for you to cite just one paper to disprove my reasoning, so i am ok with shifting the burden to you – a security researcher ;)
except for the avalanche effect (which only applies to similar documents), the only thing that ensures two different (unrelated) documents are likely to have different hashes is the size of the hash key, and nothing else. all the requirements for cryptographic hash functions ensure that it is difficult to find two documents that hash to the same fingerprint, not that there are not any (or that they are unlikely): WP:Cryptographic_hash_function
to contrast, the best that any hash function can hope for (except for perfect hash functions which is a different thing) is to be as good (unique) as a random sequence of bits: WP:Hash_function#Universal_hashing (emphasis mine)
to get back to my original point, IMHO, using an alphabet with 20-30% characters that can be confused with another is equivalent to using 256bit vs 512bit SHA2, meaning that you don’t need longer 512bit hash to ensure uniqueness (or to visually distinguish between keys/documents), you need more bits to prevent brute force attacks (or even attacks that exploit a weakness in the algorithm).