Research

Notes and essays about the topics I am currently doing academic research on. For the past decade this has been computer and network security, often having something to do with the (ab)use of the Internet for censorship and surveillance.

How To Choose Passwords

When I talk to people who aren’t security researchers about history sniffing, they want to know whether they should worry about it, and I say no: the only thing you can do to protect yourself is use the latest version of your favorite browser, which you should do anyway; besides, the interactive attacks will probably never appear in the wild. But if I only ever talk about computer security topics that are only relevant to researchers, I’m not helping people as much as I could, and I’m scaring them about things they can’t control. So this post is about something you should worry about, because it’s under your direct control; lots of people do it poorly and that does make them less safe online; and it’s easy to do well. That thing is choosing passwords.

You have probably heard that you shouldn’t reuse the same password on many different websites, and that your passwords should be long, contain numbers and punctuation, and avoid dictionary words. But you probably haven’t heard anyone explain why, and you probably have noticed that these two pieces of advice are hard to follow at the same time, because long gibberish passwords are hard to remember even if you only have one of them. I’m going to tell you why you should do these things, and how to do them without too much grief.

Don’t use the same password on many different websites

No matter how good your password is, the bad guys might discover what it is. For instance, if you log into an unencrypted website over an unencrypted wireless network, anyone else on the same wireless network can listen in on the radio traffic and discover your password. (It’s just like eavesdropping on a private conversation.) Or you might accidentally type your password into a website that looks like the real thing but is actually a fake created to trick you.

Suppose the bad guys have discovered your password for a Web forum. By itself, that’s not a big deal: the worst they can do is impersonate you on that one forum, and you might have to apologize to some people for letting some schmuck insult them while pretending to be you. But the bad guys know that people often use the same password on many different websites, so they’re going to try to log into your email with that password, and your bank, and so on. If they succeed—if you did use the same password—they might be able to ruin your life, or at least steal some of your money. But if you always use different passwords on different websites, the bad guys have to discover the password you use for your bank (and nothing else) in order to steal your money.

How do you manage to remember lots of different passwords, especially when (as I’m about to explain) they all need to be long and complicated? The best way is to let the computer—specifically, your browser’s password manager—do it for you. This may seem unsafe, but it’s actually much safer than using the same password for everything. The password manager cannot be fooled by phishing sites, and it has no trouble remembering lots of long, complicated passwords. Yes, all the passwords are in a file on your computer, but the only way the bad guys can get at that file is by physically stealing your computer or installing spyware on it remotely. If you keep your computer up to date with security patches, you don’t have to worry much about spyware. If your computer is in danger of being physically stolen (e.g. it’s a laptop), you should use the master password mode of your browser’s password manager, so that the file on your computer is encrypted. Whether or not you have to worry about theft, you should enable Sync, or an equivalent feature, even if you have no other computer to sync with; that way, if your computer breaks, there’s still a backup of all your passwords out there in the cloud (safely encrypted).

Use long, complicated passwords

The other way the bad guys discover passwords is by breaking into servers that store entire databases of them. If these databases have been designed correctly, that doesn’t tell them anything by itself, because the passwords are hashed. Hashing deserves a little explanation: suppose my password on some site is 12345 (the kind of thing that an idiot would have on his luggage). The server doesn’t store 12345 in its database, it stores 827ccb0eea8a706c4c34a16891f84e7b, which is the result of running 12345 through a cryptographic hash, in this case MD5. It’s easy to convert a password into its hash, but it’s prohibitively hard to do the reverse. MD5 is old and no longer considered a good choice for passwords (or anything, for that matter), but there is still no known algorithm to take an arbitrary MD5 hash and reveal an input that produces that hash, other than guess-and-check.
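If you want to see this for yourself, here’s a quick demonstration using Python’s standard hashlib module (any language with an MD5 implementation would do):

import hashlib

# Going from the password to the hash is instant...
password = "12345"
print(hashlib.md5(password.encode("utf-8")).hexdigest())
# prints: 827ccb0eea8a706c4c34a16891f84e7b

# ...but there is no function you can call to go the other way;
# the only route back from the hex string to 12345 is guessing.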

So the bad guys can’t just read the passwords from a database once they have it. But they can guess passwords, run the guesses through MD5 (or whatever was used), and compare the results to the database entries. (They can guess passwords even if they haven’t stolen a database, by feeding the guesses to the site’s login form—but that’s much slower and the site admins are likely to notice.) 12345 isn’t a good password because it’s easy to guess—but so is any five-digit number: a cheap laptop can calculate the MD5 of all 100,000 five-digit (or smaller) numbers in less than a second. There are something like 250,000 words in English—that’s maybe five seconds’ worth of work for the same laptop—so any word in the dictionary is bad, too. You can buy a 40-million-entry word list for $30 that has not only all the words in 20 different languages, but mangled versions of them (e.g. f0od)—that might take an hour or two to process.
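To make guess-and-check concrete, here’s a minimal sketch of the five-digit attack: hash every candidate once, then invert stolen database entries by table lookup. (The “stolen” database here is made up, of course.)

import hashlib

# A made-up stolen database mapping usernames to password hashes.
stolen = {"alice": "827ccb0eea8a706c4c34a16891f84e7b"}

# Hash every number with five or fewer digits. This loop takes well
# under a second on ordinary hardware.
table = {hashlib.md5(str(n).encode()).hexdigest(): str(n)
         for n in range(100000)}

for user, h in stolen.items():
    if h in table:
        print(user, "uses the password", table[h])
# prints: alice uses the password 12345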

The longer and more complicated your password is, the harder it is to guess; but that makes it harder to remember as well. Adding punctuation and numbers doesn’t help as much as one would like. There are 95 characters that you can type on a US keyboard, so there are 95⁸, or about a quadrillion (short scale), possible eight-character passwords, if you use all those characters. A quadrillion possibilities is out of the reach of a cheap laptop, but it’s a few weeks’ effort for a small cluster of beefy computers—a determined bad guy could do this for maybe $25,000.

The good news is, you can have passwords that can’t be guessed this way but are still easy to remember. The trick is to use phrases rather than words. One random English word is 250,000 possibilities. Two random English words are 62.5 billion possibilities—250,000 squared. That’s still not enough. But ten random English words are 250,000¹⁰ ≈ 10⁵⁴ possibilities, which is big enough that a modern supercomputer tasked with the problem would still be guessing when the Sun burns out five billion years from now.
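If you like to see the arithmetic spelled out, here’s the back-of-the-envelope version of that comparison:

import math

WORDS = 250000              # rough size of the English vocabulary

two_words = WORDS ** 2      # 6.25e10 — a laptop can still do this
ten_words = WORDS ** 10     # ~9.5e53, i.e. about 10**54

# In the units cryptographers use: ten random words are worth about
# 179 bits, far beyond any conceivable brute-force search.
print(math.log2(ten_words))  # 179.3...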

You can’t take just any phrase, though. The bad guys could easily try every phrase in the Oxford Dictionary of Quotations, because there are only 20,000 of them. I haven’t worked out the math, but I think guessing every sentence in the complete works of Shakespeare is doable. But nobody has a database of every sentence in every work of literature that was written with the Latin alphabet. A phrase taken from somewhere in the middle of an obscure but lengthy book is a good choice. Or you could follow this procedure:

  1. Go to Wikipedia and click on random article. (You can use any site with a random article feature for this step, if you’d rather.)
  2. Copy the URL of the page you get, and paste it into the Eater of Meaning. Leave the drop-down on Eat word endings.
  3. Choose ten consecutive words from the result. They don’t have to all come from the same sentence.

Don’t worry about finding a sentence that you can remember yourself, because you’re going to have the password manager do it (unless you’re trying to pick the master password).
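If you’d rather have the computer do the whole job, here’s a minimal sketch of the same idea: draw ten words uniformly at random from a big word list. The word-list path below is a common Unix location but may differ on your system, and the random-article-plus-Eater-of-Meaning step is replaced by the operating system’s cryptographic random number generator:

import secrets

# Any large word list will do; this path is common on Unix systems.
with open("/usr/share/dict/words") as f:
    words = sorted({w.strip().lower() for w in f if w.strip().isalpha()})

# secrets.choice draws from the OS's cryptographic RNG, so the
# resulting phrase is safe to use as a password.
print(" ".join(secrets.choice(words) for _ in range(10)))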

Some sites have limits on the length of their passwords. This is bad, and you should complain; but until they fix it, just use the first letter of each word in your ten-word phrase, with some numbers and punctuation if they insist on numbers and punctuation. That kind of password is theoretically crackable, as I said earlier, but it’s likely to be better than lots of other passwords in the database. So if the bad guys get the database, they will crack so many other people’s passwords before they get to yours that they probably won’t feel it’s worth the bother. (It’s kind of like the joke about how fast you need to run away from a lion.)
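The derivation is mechanical, so you can let the computer do this part too (the phrase here is made up for illustration):

# A hypothetical ten-word phrase, just for illustration.
phrase = "paper mountain eleven custard brave silent orchid feather maroon tulip"

# First letter of each word, plus whatever numbers and punctuation
# the site insists on, tacked on the end.
print("".join(word[0] for word in phrase.split()) + "7!")
# prints: pmecbsofmt7!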

If there’s no limit on the length of the password, but the site still insists on numbers and/or punctuation, put them in between the words; that’s easier to type.

Interactive history sniffing and its relatives

Readers of this blog will probably already know that, up till the middle of last year, it was possible to sniff browsing history by clever tricks involving CSS, JavaScript, and the venerable tradition of drawing hyperlinks to already-visited URLs in purple instead of blue. Last year, though, David Baron came up with a defense against history sniffing which has now been adopted by every major browser except Opera. One fewer thing to worry about when visiting the internets, hooray? Not so fast.

Imagine for a moment that the next time you visited an unfamiliar website and you wanted to leave a comment without creating an account, instead of one of those illegibly distorted codes that you have to type back in, you saw this:

Please click on all the chess pawns.

[Image: a six-by-six checkerboard grid with chess pawns in random locations; one of the pawns is green, with a mouse-cursor arrow pointing to it.]

As you click on the pawns, they turn green. Nifty, innit? Much easier than an illegibly distorted code. It’s also easy for a spambot equipped with image-processing software—but it turns out the distorted codes are not that hard for spambots anymore either, and probably no one has written the necessary image-processing code for this one yet. It’s possibly also easier on people with poor eyesight, and there could still be a link to an audio challenge for people with no eyesight.

… What’s this got to do with history sniffing? That chessboard isn’t really a CAPTCHA. All the squares have pawns on them. But each one is a hyperlink, and the pawns linked to sites you haven’t visited are being drawn in the same color as the square, so they’re invisible. You only click on the pawns you can see, of course, and so you reveal to the site which of those URLs you have visited. A little technical cleverness is required—the pawns have to be Unicode dingbats, not images; all the normal interactive behavior of hyperlinks has to be suppressed; et cetera—but nothing too difficult. Three other researchers with CMU Silicon Valley’s Web Security Group and I have tested this and a few other such fake CAPTCHAs on 300 people. We found them to be practical, although you have to be careful not to make the task too hard; for details please see our paper (to be presented at the IEEE Symposium on Security and Privacy, aka Oakland 2011).

An attacker obviously can’t use an interactive sniffing attack like this one to find out which sites out of the entire Alexa 10K your victim has visited—nobody’s going to work through that many chessboards—and for the same reason, deanonymization attacks that require the attacker to probe hundreds of thousands of URLs are out of reach. However, an attacker could reasonably probe a couple hundred URLs with an interactive attack, and according to Dongseok Jang’s study of actual history sniffing (paper), that’s about how many URLs real attackers want to sniff. It seems that the main thing real attackers want to know about your browsing history is which of their competitors you patronize, and that’s never going to need more than a few dozen URLs.

On the other hand, CAPTCHAs are such a hassle for users that they cause 10% to 33% attrition in conversion rates. And users don’t expect to see them on every visit to a site—just the first, usually, or each time they submit an anonymous comment. Even websites that were sniffing history when it was possible to do so automatically, and want to keep doing it, may consider that too high a price. But we can imagine similar attacks on higher-value information, where even a tiny success rate would be worth it. For instance, a malicious site could ask you to type a string of gibberish to continue—which happens to be your Amazon Web Services secret access key, IFRAMEd in from their management console. Amazon has taken steps to make this precise scenario difficult, but I’m not prepared to swear that it’s impossible, and other cloud services providers may have been less cautious.

Going forward, we also need to think carefully about how new web-platform capabilities might enable attackers to make similar end-runs around the browser’s security policies. In the aforementioned research project, we were also able to sniff history without user interaction, by using a webcam to detect the color of the light reflecting off the user’s face; even with our remarkably crude image-processing code, this worked great as long as the user held still. It’s not terribly practical—the user has to grant access to their webcam, and it involves putting an annoying flashing box on the screen—but it demonstrates the problem. We are particularly concerned about WebGL right now, since its shader programs can perform arbitrary computations and have access to cross-domain content that page JavaScript cannot see; there may well be a way for them to communicate back to page JavaScript that avoids the <canvas> element’s information-leakage rules. Right now it’s not possible to put the rendering of a web page into a GL texture, so this couldn’t be used to snoop on browsing history, but there are legitimate reasons to want to do that, so it might become possible in the future.

Securing the future net

Today I had the good fortune to attend a group discussion ambitiously entitled Future of Internet Security at Mozilla. What this was mostly about was: given that a recent incident has severely shaken everyone’s confidence in the PKIX (PDF, say sorry) mechanism that everyone currently uses to decide that a secure website is who it says it is, what can we do about it? I’m not going to attempt to summarize; instead I’m going to point at the Etherpad log and [2016: the Etherpad log is no longer available either from Mozilla or the Internet Archive] video record of the discussion, then plow boldly forward with my own (incontrovertibly correct, of course) opinion on the way forward, on the assumption that everyone who reads this will already be familiar enough with the context to know what I’m talking about.

I will quote in full the principles with which the discussion was kicked off, though (really they’re more like constraints on solutions acceptable to all parties).

  • Performance - large sites will not adopt solutions which bulk up the amount of data that must be exchanged to establish a secure connection.
  • Independence/Availability - large sites will not accept tying the uptime of their site to the uptime of infrastructure over which they have no control (e.g. an OCSP responder).
  • Accessibility/Usability - solutions should not put the cost of security, either for single sites or for large deployments, out of the reach of ordinary people.
  • Simplicity - solutions should be simple to deploy, or capable of being made simple.
  • Privacy - ideally, web users should not have to reveal their browsing habits to a third party.
  • Fail-closed - new mechanisms should allow us to treat mechanism and policy failures as hard failures (not doing so is why revocation is ineffective). (However, this trades availability for security, which has historically proven almost impossible.)
  • Disclosure - the structure of the system should be knowable by all parties, and users must know the identities of who they are trusting.

I should probably emphasize that this is a “walk, do not run, to the exits” situation. The status quo is dire, but we can afford to take the time to come up with a solution that solves the problem thoroughly; we do not need an emergency stopgap. Despite that, I think the short-term solution will be different from the long-term solution.

In the short term, the solution with the most traction, and IMO the best chance of actually helping, is DANE, an IETF draft standard for putting TLS server keys in the DNS. This can (at least on paper) completely replace the common DV certificates issued by traditional certificate authorities. However, to offer real security improvements relative to the status quo, I assert that the final version of the spec needs to:

  • Require clients to fail closed on any sort of validation failure, as in the sketch after this list. The current text of the spec does say this, but not clearly and not with enough RFC 2119 MUSTs.
  • Provide exclusion (trust no server keys but these, possibly also trust no CA but these) rather than inclusion (you should trust this server key). The current text of the spec can be read either way. A vocal minority of the DANE working group wants inclusion. It is my considered opinion that inclusion is completely useless—all it does is add the DNS root signing key to the existing pool of trusted CAs, which doesn’t solve the untrustworthy CA problem.
  • Require the use of DNSSEC. It has recently been suggested that a signed DNS zone is not necessary for exclusion, but then a DNS-tampering attacker can deny service by injecting a bogus DANE record, which will deter deployment. (It doesn’t matter that a DNS-tampering attacker can also deny service by messing up the A records; this is a new risk, which scares people more than an existing risk.)
  • Clearly indicate that it does not provide EV-level validation, leaving a business model for traditional CAs to retreat to.
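To make the first three requirements concrete, here’s a sketch of the client-side logic I have in mind. It is only a sketch under my reading of the draft: the record format and lookup API are still in flux, so the DNS inputs are abstracted into plain function parameters rather than real lookups, and every name here is hypothetical.

# A sketch of fail-closed DANE "exclusion" checking.

class ValidationFailure(Exception):
    """Any DANE failure must abort the connection (fail closed)."""

def check_server_key(dane_key_hashes, dnssec_valid, presented_key_hash):
    # dane_key_hashes: key hashes published in DNS for this server
    # dnssec_valid: True only if the zone's DNSSEC signatures verified
    # presented_key_hash: hash of the key the server actually sent
    if not dane_key_hashes:
        return "no-policy"  # no DANE records; fall back to classic PKIX
    if not dnssec_valid:
        # An unsigned answer could have been injected by an attacker;
        # this is exactly the case that must fail closed.
        raise ValidationFailure("DANE records present but not signed")
    # Exclusion semantics: the published hashes are the ONLY keys this
    # client will accept, no matter what any trusted CA says.
    if presented_key_hash not in dane_key_hashes:
        raise ValidationFailure("server key not on the published list")
    return "ok"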

In the longer term, I think we’re going to want to move to some sort of content-based addressing. DANE gets rid of the CA mess, but it substitutes the DNS as a single point of failure. Here’s a half-baked scheme that we could start rolling out real soon for URIs that don’t need to be user-comprehensible:

<!-- jQuery 1.5.2 -->
<script src="h:sha1,b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad"></script>

The browser somehow knows how to expand h: URIs to something it can go ask a server on the net for. What the server produces MUST be discarded if it does not have the specified hash (and the browser can go try some other server). We don’t need to worry about where the browser got the content or whether it was transferred under encryption—if it’s not what was wanted, it’ll fail the hash check. Still to be worked out: how to do the expansion without reintroducing that single point of failure; how to disseminate these URIs; how to fit dynamic content into the scheme; under what circumstances h: URIs should, or should not, be considered same-origin with the requesting page.
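The verification step, at least, is trivial; here’s a sketch in Python, reusing the jQuery hash from the snippet above. How the browser turns an h: URI into a list of candidate servers is exactly the unsolved part, so that list is just a parameter here:

import hashlib
import urllib.request

def fetch_content_addressed(h_uri, candidate_urls):
    # h_uri looks like "h:sha1,b8dcaa1c..."; candidate_urls is the
    # result of the (still unspecified) expansion step.
    algo, expected = h_uri[2:].split(",", 1)
    for url in candidate_urls:
        data = urllib.request.urlopen(url).read()
        # It doesn't matter who served this or whether the connection
        # was encrypted: if the hash matches, it's the content we
        # wanted; if not, discard it and try the next server.
        if hashlib.new(algo, data).hexdigest() == expected:
            return data
    raise IOError("no server produced content with the right hash")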

Four Ideas for a Better Internet 2011

On Tuesday night I attended a talk at Stanford entitled Four Ideas for a Better Internet. Four groups of Harvard and Stanford Law students, having just completed a special seminar entitled Difficult Problems in Cyberspace, each presented a proposed improvement to the internets; they were then grilled on said proposal by a panel of, hm, let’s call them practitioners (many but not all were from the industry). Jonathan Zittrain moderated. In general, I thought all of the proposals were interesting, but none of them was ready to be implemented; they probably weren’t intended to be, of course, but I—and the panelists—could poke pretty serious holes in them without trying very hard.

The first proposal was to improve social network security by allowing you to specify a group of extra-trusted friends who could intervene to protect your social-network presence if it appeared to have been hijacked, or who could vouch for a request you might make that requires extra verification (for instance, a request to change the email address associated with your account). This is quite intentionally modeled on similar practices found offline; they made an analogy to the (never yet used) procedure in section 4 of the 25th Amendment to the U.S. Constitution, which allows the Vice President, together with a majority of the Cabinet, to declare the President temporarily unable to do his job. It’s not a bad idea in principle, but they should have looked harder at the failure modes of those offline practices—A25§4 itself goes on to discuss what happens if the President objects to having been relieved of duty (Congress has to decide who’s right). More down-to-earth, one might ask whether this is likely to make messy breakups worse, and why the “hey, moderators, this account looks like it’s been hijacked” button (not to be confused with the “hey, moderators, this account appears to belong to a spammer” button) couldn’t be available to everyone.

The third and fourth proposals were less technical, and quite closely related. The third group wanted to set up a data haven specializing in video documenting human rights abuses by dictatorships. Naturally, if you do this, you have to anonymize the videos so the dictatorship can’t find the people in the video and punish them; you have to have some scheme for accepting video from people who don’t have unfiltered access to the net (they suggested samizdat techniques and dead drops); and you have to decide which videos are actually showing abuses (the cat videos are easy to weed out, but the security-cam footage of someone getting mugged…not so much). The fourth group wanted to set up a clearinghouse for redacting leaked classified documents—there is no plausible way to put the Wikileaks genie back in the bottle, but (we hope) everyone agrees that ruining the life of J. Afghani, who did a little translation work for the U.S. Army, is not what we do, so maybe there could be an organization that talks off-the-record to both leakers and governments and takes care of making sure the names are removed.

It seems to me that while the sources are different, the redactions that should be done are more or less the same in both cases. It also seems to me that an organization that redacts for people—whoever they are, wherever the documents came from—is at grave risk of regulatory capture by the governments giving advice on what needs to be redacted. The panelists made an analogy to the difficulty of getting the UN to pass any resolution with teeth, and Clay Shirky suggested that what is really wanted here is a best-practices document enabling the leakers to do their own redactions; I’d add that this also puts the authors behind the veil of ignorance, so they’re much less likely to be self-serving about it.

I’ve saved the second proposal for last because it’s the most personally interesting. They want to cut down on trolling and other toxic behavior on forums and other sites that allow participation. Making another analogy to offline practice, they point out that a well-run organization doesn’t allow just anyone who shows up to vote for the board of directors; new members are required to demonstrate their commitment to the organization and its values, usually by sticking around for several years, talking to older members, etc. Now, on the internets, there are some venues that can already do this. High-traffic discursive blogs like Making Light, Slacktivist, and Crooked Timber cultivate good dialogue by encouraging people to post under the same handle frequently. Community advice sites like StackOverflow often have explicit reputation scores which members earn by giving good advice. But if you’re a little bitty blog like this one, your commenters are likely to have no track record with you. In some contexts, you could imagine associating all the site-specific identities that use the same OpenID authenticator; StackOverflow’s network of spinoffs does this. But in other contexts, people are adamant about preserving a firewall between the pseudonym they use on one site and those they use elsewhere; witness what happened when Blizzard Entertainment tried to require real names on their forums. The proposal tries to solve all these issues with a trusted intermediary that aggregates reputation information from many sites and produces a credibility score that you can take wherever you wish to comment. Like a credit score, the details of how the score was computed are not available, so you can’t deduce someone’s identity on any other site. Further, you can have as many separate, unconnectable pseudonyms as you want, all with the same score.

People will try to game any such system, but that’s actually the easy problem, addressable with clever algorithms and human moderators. The more serious problem, in my book, is that “will produce quality comments” isn’t the sort of thing you can reduce to a single number. To give an extreme example, the sort of comment that gets you mad props on /b/ is exactly what most other sites do not want. The team did propose to break it down as three or four numbers, but it’s not clear to me that that helps enough. (But if you expose too much detail to sites trying to consume the data, that may leave them unable to reach a conclusion.) And finally, anonymization of this kind of data is much harder than it looks: I need only point at the successful unmasking of two users within the Netflix Prize data set. Anonymization is in tension with utility here, because the more information you expose about what sort of reputation someone has on which sites, the easier it becomes to unmask them.

I think the idea is not totally doomed, though. We could help it a great deal by turning it on its head: rate sites on the quality of their discourse. This would be done with a publicly documented, but subject to revision, scoring scheme that humans execute against a random sample of pages from the site; we might be able to use a set of seed scores to train some sort of expert system to do it automatically, but I think it’s not a disaster if we have to have humans do the site evaluations. This would be useful in itself, in that it would be a stick to beat sites with when their discourse is terrible. Meantime, each site exports its existing member-reputation scheme (or makes one up—even something simple like average number of posts per month would probably be useful) in a standard format. When you want to introduce yourself in a new context, you can bring along a recommendation from any number of sites of your choice, which is just each site’s discourse score + your reputation on that site. It is explicit in the UX for this that you are linking your identity on the new site to your identity on the others (I might even go as far as allowing people to click through to your posting history on the other sites). You then get some reputation spillover on the new site from that, which might be as limited as “doesn’t go through the mod queue the first time.” Contrariwise, if you don’t provide any recommendations, your new pseud gets to stay dissociated from your other identities, but doesn’t get any rep. Sprinkle with crypto, nonrepudiation schemes, and human moderator feedback as necessary.
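As a sketch of the arithmetic I have in mind—the weighting rule and all the numbers here are entirely made up, and the discourse scores would come from the human site evaluations:

# Hypothetical discourse-quality scores (0 to 1) from the human
# evaluations; a recommendation is worth more if it comes from a
# site whose discourse is known to be good.
DISCOURSE_SCORES = {"makinglight.example": 0.9, "randomforum.example": 0.2}

def spillover_reputation(recommendations):
    # recommendations: (site, member_reputation) pairs the user chose
    # to present, with member_reputation normalized to 0..1.
    if not recommendations:
        return 0.0  # a fresh, unlinked pseud starts with no reputation
    weighted = [DISCOURSE_SCORES.get(site, 0.0) * rep
                for site, rep in recommendations]
    # Take the best single recommendation rather than the average, so
    # presenting an extra, weaker recommendation never hurts you.
    return max(weighted)

print(spillover_reputation([("makinglight.example", 0.8)]))  # 0.72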

CCS 2010, day 1

I’m attending the 2010 ACM Conference on Computer and Communications Security (in Chicago this year), which started yesterday (I’m skipping the workshops on Monday and Friday). I was a little too tired last night to write up what I thought of yesterday’s talks, so here are some brief thoughts about them now.

Before lunch, I probably should have gone to the security analysis session, but I really wanted to see Justin Samuel’s talk on practical advice for dealing with compromised keys, mostly aimed at people doing signed software distribution—which could also be relevant for people running secure web sites, especially if browsers start paying more attention to changes in the server certificates. The other two talks in this session didn’t really grab me.

After lunch, there was lots of good stuff in the session on wireless and phone security. Husted and Myers described how a malicious group of cooperating cell phones can track the majority of other cell-phone users in an area—this is not easy now, but it will only get easier. Halevi and Saxena (no link available) comprehensively broke all the current schemes for acoustically pairing small widgets together (you put your Bluetooth earbud against your phone, for instance, and it vibrates a code, which your phone detects), even at a distance, thanks to the magic of parabolic microphones. And a large group from Georgia Tech showed their technique for fingerprinting the networks through which a phone call passes, based on the characteristics of each network’s acoustic compression algorithm.

After that I decided to skip the tutorials and go for a walk. The conference hotel is right on the south bank of the Chicago River and only a few blocks from Lake Michigan, so I walked down the riverfront to the lake and then looped around to the south and back. I’ve never been to Chicago before and it’s very interesting, architecturally. I will post more on this when I can upload photos (left the cable at home, silly me).

The poster session was, unfortunately, a bit of a blur; by that point my brain was full. Collin introduced me to a bunch of people doing ’net security at Cal and we all went out to dinner, which involved more wandering around the downtown looking for a restaurant that had a table for eight.