“Four Ideas for a Better Internet” 2011

On Tuesday night I attended a talk at Stanford entitled “Four Ideas for a Better Internet.” Four groups of Harvard and Stanford Law students, having just completed a special seminar entitled “Difficult Problems in Cyberspace,” each presented a proposed improvement to the internets; they were then grilled on said proposal by a panel of, hm, let’s call them practitioners (many but not all were from the industry). Jonathan Zittrain moderated. In general, I thought all of the proposals were interesting, but none of them was ready to be implemented; they probably weren’t intended to be, of course, but I—and the panelists—could poke pretty serious holes in them without trying very hard.

The first proposal was to improve social network security by allowing you to specify a group of extra-trusted “friends” who could intervene to protect your social-network presence if it appeared to have been hijacked, or who could vouch for a request you might make that requires extra verification (for instance, a request to change the email address associated with your account). This is quite intentionally modeled on similar practices found offline; they made an analogy to the (never-yet used) procedure in section 4 of the 25th amendment to the U.S. Constitution which allows the Vice President, together with a majority of the Cabinet, to declare the President temporarily unable to do his job. It’s not a bad idea in principle, but they should have looked harder at the failure modes of those offline practices—A25§4 itself goes on to discuss what happens if the President objects to having been relieved of duty (Congress has to decide who’s right). More down-to-earth, one might ask whether this is likely to make messy breakups worse, and why the “hey, moderators, this account looks like it’s been hijacked” button (not to be confused with the “hey, moderators, this account appears to belong to a spammer” button) couldn’t be available to everyone.

The third and fourth proposals were less technical, and quite closely related. The third group wanted to set up a data haven specializing in video documenting human rights abuses by dictatorships. Naturally, if you do this, you have to anonymize the videos so the dictatorship can’t find the people in the video and punish them; you have to have some scheme for accepting video from people who don’t have unfiltered access to the net (they suggested samizdat techniques and dead drops); and you have to decide which videos are actually showing abuses (the cat videos are easy to weed out, but the security cam footage of someone getting mugged…not so much). The fourth group wanted to set up a clearinghouse for redacting leaked classified documents—there is no plausible way to put the Wikileaks genie back in the bottle, but (we hope) everyone agrees that ruining the life of J. Afghani who did a little translation work for the U.S. Army is not what we do, so maybe there could be an organization that talks off-the-record to both leakers and governments and takes care of making sure the names are removed.

It seems to me that while the sources are different, the redactions that should be done are more or less the same in both cases. It also seems to me that an organization that redacts for people—whoever they are, wherever the documents came from—is at grave risk of regulatory capture by the governments giving advice on what needs redacted. The panelists made an analogy to the difficulty of getting the UN to pass any resolution with teeth, and Clay Shirky suggested that what is really wanted here is a best-practices document enabling the leakers to do their own redactions; I’d add that this also puts the authors behind the veil of ignorance so they’re much less likely to be self-serving about it.

I’ve saved the second proposal for last because it’s the most personally interesting. They want to cut down on trolling and other toxic behavior on forums and other sites that allow participation. Making another analogy to offline practice, they point out that a well-run organization doesn’t allow just anyone who shows up to vote for the board of directors; new members are required to demonstrate their commitment to the organization and its values, usually by sticking around for several years, talking to older members, etc. Now, on the internets, there are some venues that can already do this. High-traffic discursive blogs like Making Light, Slacktivist, and Crooked Timber cultivate good dialogue by encouraging people to post under the same handle frequently. Community advice sites like StackOverflow often have explicit reputation scores which members earn by giving good advice. But if you’re a little bitty blog like this one, your commenters are likely to have no track record with you. In some contexts, you could imagine associating all the site-specific identities that use the same OpenID authenticator; StackOverflow’s network of spinoffs does this. But in other contexts, people are adamant about preserving a firewall between the pseudonym they use on one site and those they use elsewhere; witness what happened when Blizzard Entertainment tried to require real names on their forums. The proposal tries to solve all these issues with a trusted intermediary that aggregates reputation information from many sites and produces a “credibility score” that you can take wherever you wish to comment. Like a credit score, the details of how the score was computed are not available, so you can’t deduce someone’s identity on any other site. Further, you can have as many separate, unconnectable pseudonyms as you want, all with the same score.

People will try to game any such system, but that’s actually the easy problem, addressable with clever algorithms and human moderators. The more serious problem in my book is, “will produce quality comments” isn’t the sort of thing that you can reduce to a single number. To give an extreme example, the sort of comment that gets you mad props on /b/ is exactly what most other sites do not want. The team did propose to break it down as three or four numbers, but it’s not clear to me that that helps enough. (But if you expose too much detail to sites trying to consume the data, that may leave them unable to reach a conclusion.) And finally, anonymization of this kind of data is much harder than it looks: I need only point at the successful unmasking of two users within the Netflix Challenge data set. Anonymization is in tension with utility here, because the more information you expose about what sort of reputation someone has on which sites, the easier it becomes to unmask them.

I think the idea is not totally doomed, though. We could help it a great deal by turning it on its head: rate sites on the quality of their discourse. This would be done with a publicly documented, but subject to revision, scoring scheme that humans execute against a random sample of pages from the site; we might be able to use a set of seed scores to train some sort of expert system to do it automatically, but I think it’s not a disaster if we have to have humans do the site evaluations. This would be useful in itself, in that it would be a stick to beat sites with when their discourse is terrible. Meantime, each site exports its existing member-reputation scheme (or makes one up—even something simple like “average number of posts per month” would probably be useful) in a standard format. When you want to introduce yourself in a new context, you can bring along a “recommendation” from any number of sites of your choice, which is just each site’s discourse score + your reputation on that site. It is explicit in the UX for this that you are linking your identity on the new site to your identity on the others (I might even go as far as allowing people to click through to your posting history on the other sites). You then get some reputation spillover on the new site from that, which might be as limited as “doesn’t go through the mod queue the first time.” Contrariwise, if you don’t provide any recommendations, your new pseud gets to stay dissociated from your other identities, but doesn’t get any rep. Sprinkle with crypto, nonrepudiation schemes, and human moderator feedback as necessary.

Other people’s comments on this event can be found under Twitter hashtag #4ideas. Zittrain and his colleagues intend to do this again next year, and I look forward to seeing what they come up with next time.

Strawman: MIME type for fonts

For a little while now, it’s been possible for websites to embed fonts that all major browsers will pick up on. This of course implies fonts being served as HTTP resources. But it turns out that nobody has bothered to assign any of the common font formats a MIME type.1 Fonts being embedded on the web nowadays come in two flavors and three kinds of container: you either have TrueType or PostScript CFF-style outline glyphs, and they are in a bare “OpenType” (really sfnt) container, or else compressed with either WOFF or EOT. (I am ignoring SVG fonts, which are spottily supported and open several cans of worms that I don’t want to get into right now.) In the future, people might also want to embed TTC font collections, which are also in a sfnt container and could thus also be compressed with WOFF—not sure about EOT there—and bare PostScript Type 1 fonts, but neither of these is supported in any browser at present, as far as I know. There is no official MIME type for any of these combinations; therefore, people deploying fonts over HTTP have been making them up. Without trying very hard, I found real sites using all of: application/ttf, application/otf, application/truetype, application/opentype, application/woff, application/eot, any of the above with an x-prefix, or any of the above in font/ instead of application/ (with or without the x-). There is no top-level font MIME category, making this last particularly egregious.

All of these made-up types work because browsers don’t pay any attention to the content type of a web-embedded font; they look at the data stream, and if it’s recognizably a font, they use it. Such “sniffing” has historically caused serious problems—recall my old post regarding CSS data theft—so you might expect me to be waving red flags and arguing for the entire feature to be pulled until we can get a standard MIME category for fonts, standard subtypes for the common ones, and browsers to start ignoring fonts served with the wrong type. But I’m not. I have serious misgivings about the whole “the server-supplied Content-Type header is gospel truth, content sniffing is evil” thing, and I think the font situation makes a nice test case for moving away from that model a bit.

Content types are a security issue because many of the file formats used on the web are ambiguous. You can make a well-formed HTML document that is simultaneously a well-formed CSS style sheet or JavaScript program, and attackers can and have taken advantage of this. But this isn’t necessarily the case for fonts. The sfnt container and its compressed variants are self-describing, unambiguously identifiable binary formats. Browsers thoroughly validate fonts before using them (because an accidentally malformed font can break the OS’s text drawing code), and don’t allow them to do anything but provide glyphs for text. A good analogy is to images: browsers also completely ignore the server’s content-type header for anything sent down for an <img>, and that doesn’t cause security holes—because images are also in self-describing binary formats, are thoroughly validated before use, and can’t do anything but define the appearance of a rectangle on the screen. We do not need filtering on the metadata, because we have filtering on the data itself.

Nonetheless, there may be value in having a MIME label for fonts as opposed to other kinds of binary blobs. For instance, if the server doesn’t think the file it has is a font, shouldn’t it be able to convince the browser of that, regardless of whether the contents of the file are indistinguishable from a font? (Old hands may recognize this as one of the usual rationales for not promoting text/plain to text/html just because the HTTP response body happens to begin with <!DOCTYPE.) The current draft standard algorithm for content sniffing takes this attitude with images, recommending that browsers only treat HTTP responses as images if their declared content type is in the image/ category, but ignore the subtype and sniff for the actual image format. With that in mind, here’s my proposal: let’s standardize application/font as the MIME type for all fonts delivered over the Internet, regardless of their format. Browsers should use only fonts delivered with that MIME type, but should detect the actual format based on the contents of the response body.

I can think of two potential problems with this scheme. First, it would be good if browsers could tell servers (using the normal Accept: mechanism) which specific font formats they understand. Right now, it’s reasonable to insist that browsers should be able to handle either TrueType or PostScript glyph definitions, in either bare sfnt or compressed WOFF containers, and ignore the other possibilities, but that state won’t endure forever. SVG fonts might become useful someday (if those cans of worms can be resolved to everyone’s satisfaction), or someone might come up with a new binary font format that has genuine advantages over OpenType. I think this should probably be handled with accept parameters, for instance Accept: application/font;container=sfnt could mean “I understand all OpenType fonts but no others”. The other problem is, what if someone comes up with a font format that can’t reliably be distinguished from an OpenType font based on the file contents? Well, this is pretty darn unlikely, and we can put it into the RFC defining application/font that future font formats need to be distinguishable or else get their own MIME type. The sfnt container keeps its magic number (and several other things that ought to be in the file header) in the wrong place, but as long as all the other font formats that we care about put their magic number at the beginning of the file where it belongs, that’s not a problem.


1 To be precise, there is a standard MIME type for a font format: RFC 3073 defines application/font-tdpfr for the Bitstream PFR font format, which nobody uses anymore, except possibly some proprietary television-related products. Bitstream appear to have been trying to get it used for web fonts back in the days of Netscape 4, and then to have given up on it, probably because the font foundries’ attitude was NO YOU CAN’T HAS LICENSE FOR WEBS until just last year.