HTML Fragment Parser with Substitution and Syntactic Sugar

This is a little off my usual beaten path, but what the heck.

These are two related proposals: one for a new DOM feature, document.parseDocumentFragment, and one for JS syntactic sugar for that feature. They are a response to Ian Hickson’s E4H Strawman, and are partially inspired by the general quasi-literal proposal for ES-Harmony.

Compared to Hixie’s proposal, this avoids embedding a subset of the HTML grammar in the JS grammar, while at the same time being more likely to conform with author expectations, since the HTML actually gets parsed by the HTML parser. It should have at least equivalent expressivity and power.

Motivating Example

function addUserBox(userlist, username, icon, attrs) {
  var section = h`<section class="user" {attrs}>
                    <h1>{username}</h1>
                  </section>`;
  if (icon)
    section.append(h`<img src="{icon}" alt=""/>`);
  userlist.append(section);
}

Breaking things every six weeks

Attention conservation notice: 900 words of inside baseball about Mozilla. No security content whatsoever.

The Mozilla Project has been taking a whole lot of flak recently over its new “rapid release cycle”, in which there is a new major version of Firefox (and Thunderbird) every six weeks, and it potentially breaks all your extensions. Especially the big complicated extensions like Firebug that people cannot live without. One might reasonably ask, what the hell? Why would any software development team in their right mind—especially a team developing a critical piece of system infrastructure, which is what Web browsers are these days, like it or not—inflict unpredictable breakage on all their users at six-week intervals?

A Zany Scheme for Compact Secure Hashes

Lots of current and near-future tech relies heavily on secure hashes as identifiers; these are usually represented as hexadecimal strings. For instance, in a previous post I threw out the strawman h: URN scheme that looks like this:

 <!-- jQuery 1.5.2 -->
 <script src="h:sha1,b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad"></script>

Now the problem with this is, these hexadecimal strings are inconveniently long and are only going to get longer. SHA-1 (as shown above) produces 160-bit hashes, which take 40 characters to represent in hex. That algorithm is looking kinda creaky these days; the most convenient replacement is SHA-256. As the name implies, it produces 256-bit hashes, which take 64 characters to write out in hex. The next generation of secure hash algorithms, currently under development at NIST, is also going to produce 256-bit (and up) hashes. The inconvenience of these lengthy hashes becomes even worse if we want to use them as components of a URI with structure to it (as opposed to being the entirety of a URN, as above). Clearly some encoding other than hex, with its 2x expansion, is desirable.

Hashes are incompressible, so we can’t hope to pack a 256-bit hash into fewer than 32 characters, or a 160-bit hash into fewer than 20 characters. And we can’t just dump the raw binary string into our HTML, because HTML is not designed for that—there is no way to tell the HTML parser “the next 20 characters are a binary literal”. However, what we can do is find 256 printable, letter-like characters within the first few hundred Unicode code points and use them as an encoding of the 256 possible bytes. Continuing with the jQuery example, that might look something like this:

<script src="h:sha1,пՎЦbηúFԱщблMπĒÇճԴցmЩ"></script><!-- jQuery 1.5.2 -->

See how we can fit the annotation on the same line now? Even with sha256, it’s still a little shorter than the original in hex:

<!-- jQuery 1.5.2 -->
<script src="h:sha256,ρKZհνàêþГJEχdKmՌYψիցyԷթνлшъÁÐFДÂ"></script>

Here’s my proposed encoding table:

    0              0 1              1
    0123456789ABCDEF 0123456789ABCDEF
 00 ABCDEFGHIJKLMNOP QRSTUVWXYZÞabcde
 20 fghijklmnopqrstu vwxyzþ0123456789
 40 ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛÇ ÄËÏÖÜĀĒĪŌŪĂĔĬŎŬÐ
 60 àèìòùáéíóúâêîôûç äëïöüāēīōūăĕĭŏŭð
 80 αβγδεζηθικλμνξπρ ςστυφχψωϐϑϒϕϖϞϰϱ
 A0 БГДЖЗИЙЛПФЦЧШЩЪЬ бгджзийлпфцчшщъь
 C0 ԱԲԳԴԵԶԷԸԹԺԻԽԾԿՀՁ ՂՃՄՅՆՇՈՉՊՋՌՍՎՐՑՒ
 E0 աբգդեզէըթժիխծկհձ ղճմյնշոչպջռսվրցւ

All of the characters in this table have one- or two-byte encodings in UTF-8. Every punctuation character below U+007F is given special meaning in some context or other, so I didn’t use any of them. This unfortunately does mean that only 62 of the 256 bytes get one-byte encodings, but storage compactness is not the point here, and it’s no worse than hex, anyway. What this gets us is display compactness: a 256-bit hash will occupy exactly 32 columns in your text editor, leaving room for at least a few other things on the same line.
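To make the table concrete, here is a minimal sketch of an encoder and decoder built directly from it. The function names (hexToBytes, encodeHash, decodeHash, utf8Length) are made up for illustration and are not part of any proposal; the rows are transcribed from the table above, and the NFC normalization step is my own precaution against editors that decompose accented letters.

```javascript
// Encoding table transcribed from the post: one string per row, 32 byte values each.
const ROWS = [
  "ABCDEFGHIJKLMNOPQRSTUVWXYZÞabcde", // 0x00-0x1F
  "fghijklmnopqrstuvwxyzþ0123456789", // 0x20-0x3F
  "ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛÇÄËÏÖÜĀĒĪŌŪĂĔĬŎŬÐ", // 0x40-0x5F
  "àèìòùáéíóúâêîôûçäëïöüāēīōūăĕĭŏŭð", // 0x60-0x7F
  "αβγδεζηθικλμνξπρςστυφχψωϐϑϒϕϖϞϰϱ", // 0x80-0x9F
  "БГДЖЗИЙЛПФЦЧШЩЪЬбгджзийлпфцчшщъь", // 0xA0-0xBF
  "ԱԲԳԴԵԶԷԸԹԺԻԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՐՑՒ", // 0xC0-0xDF
  "աբգդեզէըթժիխծկհձղճմյնշոչպջռսվրցւ", // 0xE0-0xFF
];

// Assumption: precomposed (NFC) characters, so each symbol is one UTF-16 code unit.
const ALPHABET = ROWS.join("").normalize("NFC");

// Reverse lookup: character -> byte value.
const DECODE = new Map(Array.from(ALPHABET, (ch, i) => [ch, i]));

// Convert a hexadecimal digest string to raw bytes.
function hexToBytes(hex) {
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new Error("not a hexadecimal string");
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(2 * i, 2 * i + 2), 16);
  }
  return bytes;
}

// One output character per input byte.
function encodeHash(bytes) {
  return Array.from(bytes, (b) => ALPHABET[b]).join("");
}

// Inverse of encodeHash; rejects characters outside the table.
function decodeHash(text) {
  const chars = Array.from(text.normalize("NFC"));
  return Uint8Array.from(chars, (ch) => {
    const value = DECODE.get(ch);
    if (value === undefined) throw new Error(`character not in table: ${ch}`);
    return value;
  });
}

// Storage cost of a string in UTF-8 bytes, as opposed to display columns.
function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

// The jQuery 1.5.2 SHA-1 from the example above.
const sha1Hex = "b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad";
const compact = encodeHash(hexToBytes(sha1Hex));
console.log(compact, compact.length, utf8Length(compact));
```

With the table transcribed as above, the SHA-1 from the jQuery example should come out as the same 20-character string shown earlier, and decodeHash should turn it back into the original bytes. Comparing compact.length with utf8Length(compact) shows the trade being made: fewer columns on screen, not fewer bytes on disk.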

Choosing the characters is a little tricky. A whole lot of the code space below U+07FF is taken up by characters we can’t use for this purpose—composing diacritics, control characters, punctuation, and right-to-left scripts. I didn’t want to use diacritics (even in precomposed form) or pairs of characters that might be visually identical to each other in some (combination of) fonts. Unfortunately, even with the rich well of Cyrillic and Armenian to work with, I wasn’t able to avoid using a bunch of Latin-alphabet diacritics. Someone a little more familiar with the repertoire might be able to do better.

Legibility of embedded Web fonts

It’s recently become possible to embed fonts in your website, so that you aren’t limited to using the same old fonts that everyone already has on their computer. Yay! Unfortunately, there are a lot of gotchas. Lots of people discuss the technical gotchas, but when you get past that, you’ve still got to worry about legibility.

Consider the recently redesigned online fiction zine, Chiaroscuro. As of this writing, they’re using an embedded font called Merriweather. [EDIT 8 April: Chiaroscuro has removed the problematic font from its site.]

Here’s what the first paragraph of body text for volume 47 looked like on my Mac, using Firefox 4:

Specimen of the “Merriweather” font as rendered by Firefox 4 on Mac OS X

Pretty slick, yeah? Unfortunately … here’s what that same para looked like on Windows, with the same browser:

Specimen of the “Merriweather” font as rendered by Firefox 4 on Windows

The letters are squished together in places, and the lowercase Ns are too tall. It’s even worse on Linux: not all the strokes are the same thickness, some of the letters are still too tall (look carefully at the lowercase D, for instance) and others extend below the baseline when they shouldn’t (such as the lowercase R).

Specimen of the “Merriweather” font as rendered by Firefox 4 on Linux

What causes this radically different appearance of the same font in the same browser? At typical body-text sizes, the computer has to draw each letter using only 15 or so pixels in each direction. It’s not possible to draw each letter exactly as the typographer intended, and keep all the lines crisp and smooth, with so few pixels. Windows, OSX, and Linux all resolve this dilemma differently: to oversimplify a bit, OSX tries harder to preserve the font shapes, Windows tries harder to make the lines sharp, and Linux tries to do both at once and winds up achieving neither. (For lots of technical discussion of exactly what the difference is, see these blog posts from 2007: Respecting The Pixel Grid, Font rendering philosophies of Windows & Mac OS X, Texts Rasterization Exposures.) People argue, loudly, about which choice is better (as the above blog posts and their comment threads demonstrate), but I think it would be relatively uncontroversial to say that the Windows font-drawing algorithm only works well with help from the font itself. The Merriweather font on Chiaroscuro demonstrates this: it doesn’t provide this help (it doesn’t have enough “hinting” information), so it looks fine on OSX but horrible on Windows (and on Linux – although there it’s not quite so much the font’s fault).

This isn’t “just” a matter of aesthetics (scare quotes because nobody wants visitors to think their website is ugly); it can mean that people can’t read your text. I myself find Chiaroscuro unpleasant to read on Windows or Linux, but my acquaintance Rose Lemberg, who has weaker eyesight, says the site is illegible. I don’t think Chiaroscuro set out to be illegible, but I’ll bet cookies to donuts Chiaroscuro’s designer didn’t bother testing their new font on anything but a Mac.

I don’t want to deter people from using embeddable fonts altogether; however, this is another reason why you can’t just test your site on one operating system. At the very least you need to be testing on OSX and Windows (and I understand there are significant differences between XP and Vista/7 in this area, by the way); I would thank you for trying Linux as well (maybe you don’t care about desktop Linux, but Android uses the same font-drawing code). You might think that the font libraries at fontsquirrel.com or Google Web Fonts would have been checked for good rendering on all OSes, but it turns out Merriweather is available from both sites! So, while I’d still recommend starting with one of those libraries’ body-text fonts, it doesn’t get you out of testing.

(Note: Merriweather’s designer is aware that it looks terrible on Windows, and is working on it. Still, it seems to me that inclusion in public catalogs of fonts “designed for the web” was premature.)

Strawman: MIME type for fonts

For a little while now, it’s been possible for websites to embed fonts that all major browsers will pick up on. This of course implies fonts being served as HTTP resources. But it turns out that nobody has bothered to assign any of the common font formats a MIME type.[1] Fonts being embedded on the web nowadays come in two flavors and three kinds of container: you either have TrueType or PostScript CFF-style outline glyphs, and they are in a bare “OpenType” (really sfnt) container, or else compressed with either WOFF or EOT. (I am ignoring SVG fonts, which are spottily supported and open several cans of worms that I don’t want to get into right now.) In the future, people might also want to embed TTC font collections, which are also in a sfnt container and could thus also be compressed with WOFF (not sure about EOT there), and bare PostScript Type 1 fonts, but neither of these is supported in any browser at present, as far as I know. There is no official MIME type for any of these combinations; therefore, people deploying fonts over HTTP have been making them up. Without trying very hard, I found real sites using all of: application/ttf, application/otf, application/truetype, application/opentype, application/woff, application/eot, any of the above with an x- prefix, or any of the above in font/ instead of application/ (with or without the x-). There is no top-level font MIME category, making this last particularly egregious.

All of these made-up types work because browsers don’t pay any attention to the content type of a web-embedded font; they look at the data stream, and if it’s recognizably a font, they use it. Such “sniffing” has historically caused serious problems—recall my old post regarding CSS data theft—so you might expect me to be waving red flags and arguing for the entire feature to be pulled until we can get a standard MIME category for fonts, standard subtypes for the common ones, and browsers to start ignoring fonts served with the wrong type. But I’m not. I have serious misgivings about the whole “the server-supplied Content-Type header is gospel truth, content sniffing is evil” thing, and I think the font situation makes a nice test case for moving away from that model a bit.

Content types are a security issue because many of the file formats used on the web are ambiguous. You can make a well-formed HTML document that is simultaneously a well-formed CSS style sheet or JavaScript program, and attackers can and have taken advantage of this. But this isn’t necessarily the case for fonts. The sfnt container and its compressed variants are self-describing, unambiguously identifiable binary formats. Browsers thoroughly validate fonts before using them (because an accidentally malformed font can break the OS’s text drawing code), and don’t allow them to do anything but provide glyphs for text. A good analogy is to images: browsers also completely ignore the server’s content-type header for anything sent down for an <img>, and that doesn’t cause security holes—because images are also in self-describing binary formats, are thoroughly validated before use, and can’t do anything but define the appearance of a rectangle on the screen. We do not need filtering on the metadata, because we have filtering on the data itself.

Nonetheless, there may be value in having a MIME label for fonts as opposed to other kinds of binary blobs. For instance, if the server doesn’t think the file it has is a font, shouldn’t it be able to convince the browser of that, regardless of whether the contents of the file are indistinguishable from a font? (Old hands may recognize this as one of the usual rationales for not promoting text/plain to text/html just because the HTTP response body happens to begin with <!DOCTYPE.) The current draft standard algorithm for content sniffing takes this attitude with images, recommending that browsers only treat HTTP responses as images if their declared content type is in the image/ category, but ignore the subtype and sniff for the actual image format. With that in mind, here’s my proposal: let’s standardize application/font as the MIME type for all fonts delivered over the Internet, regardless of their format. Browsers should use only fonts delivered with that MIME type, but should detect the actual format based on the contents of the response body.
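Here is a rough sketch of what that rule could look like on the browser side: refuse anything not labeled application/font, then look at the first four bytes to work out the container. Everything here is illustrative. The function names are invented, application/font is only the type proposed in this post, and checking a signature is nowhere near the thorough validation browsers perform before using a font. EOT is left out of the sketch.

```javascript
// Hypothetical media type proposed in this post; it is not a registered type.
const FONT_MEDIA_TYPE = "application/font";

// Drop any parameters and normalize case: "Application/Font; x=y" -> "application/font".
function mediaTypeOf(contentType) {
  return contentType.split(";")[0].trim().toLowerCase();
}

// Identify the container from its four-byte signature, or return null.
function sniffFontFormat(body) {
  if (body.length < 4) return null;
  const tag = String.fromCharCode(body[0], body[1], body[2], body[3]);
  switch (tag) {
    case "\x00\x01\x00\x00":
      return "sfnt/truetype"; // sfnt container, TrueType outlines
    case "OTTO":
      return "sfnt/cff"; // sfnt container, PostScript CFF outlines
    case "wOFF":
      return "woff"; // WOFF-compressed sfnt
    default:
      return null;
  }
}

// The proposed rule: the label gates whether we look at all,
// and the content decides which format it actually is.
function classifyFontResponse(contentType, body) {
  if (contentType == null || mediaTypeOf(contentType) !== FONT_MEDIA_TYPE) {
    return null; // the server says this is not a font; take its word for it
  }
  return sniffFontFormat(body);
}

const woffHeader = Uint8Array.of(0x77, 0x4f, 0x46, 0x46, 0x00, 0x01, 0x00, 0x00);
console.log(classifyFontResponse("application/font", woffHeader));
console.log(classifyFontResponse("text/plain", woffHeader));
```

The point of splitting the check in two is that the server keeps a veto (a file served as text/plain is never treated as a font, whatever its bytes look like), while the subtype guessing game goes away.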

I can think of two potential problems with this scheme. First, it would be good if browsers could tell servers (using the normal Accept: mechanism) which specific font formats they understand. Right now, it’s reasonable to insist that browsers should be able to handle either TrueType or PostScript glyph definitions, in either bare sfnt or compressed WOFF containers, and ignore the other possibilities, but that state won’t endure forever. SVG fonts might become useful someday (if those cans of worms can be resolved to everyone’s satisfaction), or someone might come up with a new binary font format that has genuine advantages over OpenType. I think this should probably be handled with accept parameters, for instance Accept: application/font;container=sfnt could mean “I understand all OpenType fonts but no others”. The other problem is, what if someone comes up with a font format that can’t reliably be distinguished from an OpenType font based on the file contents? Well, this is pretty darn unlikely, and we can put it into the RFC defining application/font that future font formats need to be distinguishable or else get their own MIME type. The sfnt container keeps its magic number (and several other things that ought to be in the file header) in the wrong place, but as long as all the other font formats that we care about put their magic number at the beginning of the file where it belongs, that’s not a problem.
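For the Accept: idea, a server-side sketch might look like the following. This is strictly hypothetical: the container parameter exists only in this post, the function names are invented, and the code ignores q-values and wildcards such as */*, which a real implementation of HTTP content negotiation would have to honor.

```javascript
// Collect the containers a client claims to understand from an Accept header.
// "*" in the result means "application/font with no container restriction".
function acceptedFontContainers(acceptHeader) {
  const containers = new Set();
  for (const range of acceptHeader.split(",")) {
    const [type, ...params] = range.split(";").map((part) => part.trim());
    if (type.toLowerCase() !== "application/font") continue;
    const container = params
      .map((param) => param.split("="))
      .find(([name]) => name.trim().toLowerCase() === "container");
    if (container === undefined) {
      containers.add("*");
    } else {
      // Tolerate a missing value and optional double quotes around it.
      const value = (container[1] ?? "").trim().replace(/^"|"$/g, "");
      containers.add(value.toLowerCase());
    }
  }
  return containers;
}

// Pick the first variant (in the server's preference order) the client accepts.
function chooseVariant(acceptHeader, available) {
  const accepted = acceptedFontContainers(acceptHeader);
  const match = available.find((c) => accepted.has("*") || accepted.has(c));
  return match ?? null;
}

console.log(chooseVariant("text/html, application/font;container=sfnt", ["woff", "sfnt"]));
```

A server holding both a WOFF and a bare sfnt copy of a font would list them in its preferred order and serve whichever one chooseVariant returns.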


[1] To be precise, there is a standard MIME type for a font format: RFC 3073 defines application/font-tdpfr for the Bitstream PFR font format, which nobody uses anymore, except possibly some proprietary television-related products. Bitstream appear to have been trying to get it used for web fonts back in the days of Netscape 4, and then to have given up on it, probably because the font foundries’ attitude was NO YOU CAN’T HAS LICENSE FOR WEBS until just last year.

Data theft with CSS

Mozilla has released security updates to Firefox 3.5 and 3.6 that include defenses for an old, little-known, but serious security hole: cross-site data theft using CSS. These defenses have a small but significant chance of breaking websites that rely on “quirks mode” rendering and use a server in another DNS domain (e.g. a CDN) for their style sheets.

In this article I’ll describe the attack, what we’re doing about it, how you can ensure that your site will continue to work, and how you can protect your users who have not upgraded their browsers yet.

More on SSL errors

I got some great responses to my ideas for SSL errors and I thought I’d make a new post to talk about them, since that post is old enough that you can’t comment on it anymore. I should probably emphasize up front that I’m not on Firefox’s UX team, I don’t know if they’re listening to my suggestions, and anyway they were meant as a starting point rather than completely finished designs.

David Bolton wanted to know why some of the error screens asked the user to visit other sites manually, rather than doing checks behind the scenes. The main reason, honestly, is that that made a good example of something the user could do next. In practice we probably would want to do at least some checks in the background. Right now, another reason would be that error pages do not have “chrome” privileges so they can’t do anything of the sort (this is part of why the certificate error screen pops up a separate dialog box if you say you want to add an exception) but we may be able to get around that in a real implementation.

John Barton, in email, points out that SSL errors often come up in practice because of server-side configuration changes that ought to have been transparent to users, but a sysadmin goofed. I’ve been using the Certificate Patrol extension, which brings up warnings when a site’s cert changes in any way; this reveals that cert handling mistakes happen even on very popular and well-staffed sites (recently, for instance, mail.google.com flipped back and forth between its own cert and the generic *.google.com cert several times in one day). Of course that would have been invisible to most people, but it’s not much harder to make mistakes that do trigger warnings in a stock browser.

My general feeling on that is, yes, it is way too hard to administer an SSL-encrypted web site, and I would wholeheartedly support an initiative to make it easier, especially for sites that carry information of only moderate sensitivity (e.g. the plethora of Bugzilla instances with self-signed certs out there in the wild). I don’t think that should stop us from raising the visibility of SSL administration mistakes, as long as we improve the presentation and advice on those mistakes so we are not just training people to click through the errors.

John also points out that most people won’t have any idea what “Herdict” is or why it is trustworthy. The explicit mention of Herdict was mainly because I was riffing off Boriss’ earlier proposal to use Herdict information to improve page not found errors. Indeed, we should probably put it more like “Other people who try to visit this website get (something) which (is/isn’t) what you got.” We should credit whatever service we use for that information, but it doesn’t have to be as prominent as I made it.

Someone else (whose name I have lost; sorry, whoever you were!) pointed me at the Perspectives extension, which is said to do more or less exactly what I proposed, as far as comparing certificates seen by the user with those seen by “notaries” at other network locations. I like the use of the term “notary” and the proof of concept; unfortunately, Perspectives seems not to be actively maintained at the moment, and doesn’t work with Firefox 3.6. Also, for privacy, we want to make the queries to the notaries as uninformative as possible to an adversary that can observe network traffic. Reusing the same system that is used for “is this site down?” requests would help there. (Ideally, the notaries would also be unable to tell which users are asking what about which sites, but that might not be tractable.)

Mozilla Co. conference rooms

The Mozilla Corporation’s new(ish) office in downtown Mountain View has all its third-floor conference rooms named after Internet memes, except those that are named after rooms aboard the starship Enterprise. I’d like to share them with you now.

Small conference rooms (memes)

Large conference rooms (Star Trek)

Better SSL error screens

Right now, when you visit a website that uses encryption in Firefox and there’s anything at all wrong with the encrypted connection, you get this screen:

The current SSL warning screen, which is generic and uninformative unless you know how to read SSL certificates already

This is a big block of jargon which doesn’t do anything to tell the user how big the risk actually is, or help them distinguish a minor problem from a major one. If you click on “technical details” you get a little bit more information about what went wrong, but it still doesn’t make any effort to give advice.

The Firefox UI team has been talking about using Herdict or a similar service to improve network error screens, especially the site not found screen. I think we could get a lot of mileage out of that for SSL errors as well. We should also make use of the user’s history with the site, and pay attention to what exactly is wrong with the credential. Here are some examples.

Proposed warning screen for a website with a self-signed certificate

The only problem with self-signed certificates is they haven’t been signed by a trusted third party. The connection is secure, but you might not be talking to who you think you are. In the first section, we emphasize that the concern here is with identity, and we use Herdict information to deduce that this is probably not a hijacked site, because lots of people get the same credential. (“The same credential” means exactly the same, not just some self-signed cert, but we needn’t bother people with that unless they want to see the details.)

In the “What should I do?” section, we give some examples of things that might be unsafe to trust this site with, but we go ahead and let them visit the site, automatically storing the self-signed cert and marking it valid for this site only. We implement bug 251407, so we can promise to notify the user if the site’s credentials change in the future.

I’ve front-loaded the information that used to be in the “technical details” section, so it has been replaced with “Inspect the Credentials”. If you open that area up, it shows the certificate, but in a more user-friendly way than the existing certificate dialog box does. Especially important here is to reveal the interesting parts immediately, highlight suspicious things, and deemphasize the jargon and the long hexadecimal numbers.

“I understand the risks” is still there, but in this case, it’s for people who didn’t read the rest of the page. It’s meant to make people stop, slow down, and reread. If you click on it you get another link to the page.

Proposed warning screen when connection tampering has been detected

There are exploits in the wild that take over your WiFi hub, or your cable modem. Once they’ve done that, they are in a position to tamper with all your Internet traffic. I ran into one of these for reals last week; I was in a café and getting certificate errors on every secure site I tried to visit, including Mozilla’s mail server. (The theory is that you’ll just click through the error messages because you want to get your email, or whatever; one of the staff at the café did just that when I complained.) Here’s where Herdict could really come in handy: if you are getting certificate errors but nobody else is, we can deduce a problem near your computer.

Again, the first section tries to be clear and specific about the problem: we suspect that someone is tampering with your Internet connection, and here is why. The second section underlines how big a deal this is: “Do not log into any site or buy anything online.” It then suggests a test: visit another secure website and see if the problem persists. This scenario should put the whole browser into a paranoid mode, where it will not load saved passwords and continues to try to work out whether there’s something wrong with the local router. Ultimately, we should advise people in this boat to factory-reset their WiFi hub and/or contact their ISP for help, but we should take care to be certain in our diagnosis first.

In this scenario, the “I understand the risks” section gives you access to the certificate-exception dialogs, as it does now.

Proposed warning screen for a website whose server may have been hijacked

Finally, here’s what it looks like in the comparatively rare scenario that SSL certificates were originally intended to defend against: the server has been hijacked (but the attackers do not have access to the cert). We can tell from browser history that the cert has changed, and we can tell from Herdict that it has changed for everyone. We tell the user not to visit this website, and again, suggest trying another secure site. (We need to take care to distinguish this case from an expired or legitimately changed cert.)
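Taken together, the three screens amount to a small decision procedure, which could be sketched roughly like this. The field names and screen labels are invented for illustration; nothing here reflects how Firefox or Herdict actually expose this information, and the sketch does not attempt the hard part, which is telling a hijack apart from an expired or legitimately changed cert.

```javascript
// Hypothetical inputs:
//   selfSigned                - the cert is not signed by a trusted third party
//   certChangedSinceLastVisit - browser history has a different cert for this site
//   othersSeeSameCert         - true/false from a Herdict-like service, or null if unknown
function chooseWarningScreen({ selfSigned, certChangedSinceLastVisit, othersSeeSameCert }) {
  if (othersSeeSameCert == null) {
    return "generic"; // no outside evidence, so no specific diagnosis
  }
  if (!othersSeeSameCert) {
    // You get this credential and nobody else does: the problem is near you.
    return "connection-tampering";
  }
  if (certChangedSinceLastVisit) {
    // Everyone sees the new credential, and it differs from what we saw before.
    return "server-hijack";
  }
  if (selfSigned) {
    // Same credential for everyone and nothing has changed: an identity caveat only.
    return "self-signed";
  }
  return "generic";
}

console.log(
  chooseWarningScreen({ selfSigned: true, certChangedSinceLastVisit: false, othersSeeSameCert: true })
);
```

The ordering matters: evidence of local tampering is checked first because it calls for the strongest advice, and the self-signed screen is the fallback once the more alarming explanations have been ruled out.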

Boxes with Rounded Corners

A Russian translation of this article can be read at higher.com.ua.

The CSS 3 Backgrounds and Borders module introduces the border-radius property, which allows you to make the border of any CSS box be a rounded rectangle. Mozilla’s Gecko-based browsers (such as Firefox and SeaMonkey) have implemented parts of this feature for some time, as have WebKit-based browsers (such as Safari and Chrome). Firefox 3.5 adds support for elliptical corners, and brings the Gecko implementation into line with the standard on many details.
