I spoke briefly at PETS 2014 about which websites are censored in which countries, and what we can learn just from the lists.
These long chewy posts are more about Internet security (which is my field of research) than they are about HTML, but often they are about both.
For all countries for which Herdict contains enough reports to be credible (concretely, such that the error bars below cover less than 10% of the range), the estimated probability that a webpage will be inaccessible. Vertically sorted by the left edge of the error bar. Further right is worse. I suspect major systemic errors in this data set, but it’s the only data set in town.
| result | count | percent |
|---|---:|---:|
| total | 5 838 383 | 100.000 |
| ok | 2 212 565 | 37.897 |
| ok (redirected) | 1 999 341 | 34.245 |
| network or protocol error | 798 231 | 13.672 |
| hostname not found | 166 623 | 2.854 |
| page not found (404/410) | 110 241 | 1.888 |
| forbidden (403) | 75 054 | 1.286 |
| service unavailable (503) | 18 648 | 0.319 |
| server error (500) | 15 150 | 0.259 |
| bad request (400) | 14 397 | 0.247 |
| authentication required (401) | 9 199 | 0.158 |
| redirection loop | 2 972 | 0.051 |
| proxy error (502/504/52x) | 1 845 | 0.032 |
| other HTTP response | 1 010 | 0.017 |
| syntactically invalid URL | 19 | 0.000 |
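For the record, the percentage column is simply each count as a share of the total; a quick check of the first few rows (the counts come from the table above, the script is mine):

```python
# Recompute the percentage column of the outcome table from the raw counts.
total = 5_838_383

counts = {
    "ok": 2_212_565,
    "ok (redirected)": 1_999_341,
    "network or protocol error": 798_231,
}

for outcome, n in counts.items():
    print(f"{outcome}: {n / total * 100:.3f}%")
```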
For a while now, when people ask me how they can improve their websites’ security, I tell them: start by turning on HTTPS for everything. Run a separate server on port 80 that issues nothing but permanent redirects to the https:// version of the same URL. There’s lots more you can do, but that’s the easy first step. There are a number of common objections to this plan; today I want to talk about the “it should be the user’s choice” objection, expressed for instance by Robert L. Mitchell. It goes something like this:

Why should I (the operator of the website) assume I know better than each of my users what their security posture should be? Maybe this is a throwaway account, of no great importance to them. Maybe they are on a slow link that is intrinsically hard to eavesdrop upon, so the extra network round-trips involved in setting up a secure channel make the site annoyingly slow for no benefit.
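The port-80 redirect server from the plan above can be sketched in a few lines; this is a minimal illustration, not a production configuration (in practice you’d use your web server’s native redirect support):

```python
# Sketch of a port-80 server whose only job is to redirect to HTTPS.
from http.server import BaseHTTPRequestHandler, HTTPServer

def https_location(host: str, path: str) -> str:
    """Map an insecure request to its https:// equivalent."""
    host = host.split(":")[0]  # drop any explicit :80 port suffix
    return f"https://{host}{path}"

class RedirectToHTTPS(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)  # permanent redirect
        self.send_header("Location",
                         https_location(self.headers.get("Host", ""), self.path))
        self.end_headers()

    do_HEAD = do_GET  # answer HEAD the same way

# To run it (needs privileges for port 80):
#   HTTPServer(("", 80), RedirectToHTTPS).serve_forever()
```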
This objection ignores the
public health benefits of secure channels. I’d like to make an analogy to immunization, here. If you get vaccinated against the measles (for instance), that’s good for you because you are much less likely to get the disease yourself. But it is also good for everyone who lives near you, because now you can’t infect them either. If enough people in a region are immune, then nobody will get the disease, even if they aren’t immune; this is called herd immunity. Secure channels have similar benefits to the general public—unconditionally securing a website improves security for everyone on the ’net, whether or not they use that website! Here’s why.
Most of the criminals who
crack websites don’t care which accounts they gain access to. This surprises people; if you ask users, they often say things like “well, nobody would bother breaking into my email / bank account / personal computer, because I’m not a celebrity and I don’t have any money!” But the attackers don’t care about that. They break into email accounts so they can send spam; any
@gmail.com address is as good as any other. They break into bank accounts so they can commit credit card fraud; any given person’s card is probably only good for US$1000 or so, but multiply that by thousands of cards and you’re talking about real money. They break into PCs so they can run botnets; they don’t care about data stored on the computer, they want the CPU and the network connection. For more on this point, see the paper by Rick Wash. This is the most important reason why security needs to be unconditional. Accounts may be
throwaway to their users, but they are all the same to the attackers.
Often, criminals who
crack websites don’t care which websites they gain access to, either. The logic is similar: the legitimate contents of the website are irrelevant. All the attacker wants is to reuse a legitimate site as part of a spamming scheme or to copy the user list, guess the weaker passwords, and try those username+password combinations on
more important websites. This is why everyone who has a website, even if it’s tiny and attracts hardly any traffic, needs to worry about its security. This is also why making websites secure improves security for everyone, even if they never intentionally visit that website.
Now, how does HTTPS help with all this? The easiest several ways to break into websites involve snooping on unsecured network traffic to steal user credentials. This is possible even with the common-but-insufficient tactic of sending only the login form over HTTPS, because every insecure HTTP request after login includes a piece of data called a
session cookie that can be stolen and used to impersonate the user for most purposes without having to know the user’s password. (It’s often not possible to change the user’s password without also knowing the old password, but that’s about it. If an attacker just wants to send spam, and doesn’t care about maintaining control of the account, a session cookie is good enough.) It’s also possible even if all logged-in users are served only HTTPS, but you get an unsecured page until you login, because then an attacker can modify the unsecured page and make it steal credentials. Only applying channel security to the entire site for everyone, whoever they are, logged in or not, makes this class of attacks go away.
Unconditional use of HTTPS also enables further security improvements. For instance, a site that is exclusively HTTPS can use the Strict-Transport-Security mechanism to put browsers on notice that they should never communicate with it over an insecure channel: this is important because there are turnkey attack tools that work by intercepting that very first insecure request. Such a site can also mark all of its cookies “secure” and “httponly”, which cuts off more ways for attackers to steal user credentials. And if a site runs complicated code on the server, exposing that code to the public Internet two different ways (HTTP and HTTPS) enlarges the server’s attack surface. If the only thing on port 80 is a boilerplate “try again with HTTPS” permanent redirect, this is not an issue. (Bonus points for invalidating session cookies and passwords that just went over the wire in cleartext.)
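A minimal sketch of the response headers involved, assuming a framework that accepts a dict of headers; the function name and the max-age value are my own illustration, not a prescribed configuration:

```python
# Security headers an HTTPS-only site can send (illustrative values).
def security_headers(session_id: str) -> dict:
    return {
        # Tell browsers to refuse plain-HTTP connections to this host
        # (and its subdomains) for the next year.
        "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
        # Secure: never send this cookie over plain HTTP.
        # HttpOnly: never expose it to page JavaScript.
        "Set-Cookie": f"session={session_id}; Secure; HttpOnly; Path=/",
    }
```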
Finally, I’ll mention that if a site’s users can turn security off, then there’s a per-user toggle switch in the site’s memory banks somewhere, and the site operators can flip that switch off if they want. Or if they have been, shall we say, leaned on. It’s a lot easier for the site operators to stand up to being leaned on if they can say “that’s not a thing our code can do.”
Alexa publishes a list of
the top 1,000,000 sites on the web. Here is some trivia about this list (as it was on September 27, 2013):
- No entries contain a URL scheme.
- Only 247 entries contain the string
- Only 13,906 entries contain a path component.
- There are 987,661 unique hostnames and 967,933 unique domains (public suffix + 1).
- If you tack “http://” on the beginning of each entry and “/” on the end (if there wasn’t a path component already), then issue a GET request for that URL and chase HTTP redirects as far as you can (without leaving the site root, unless there was a path component already), you get 916,228 unique URLs.
- Of those 916,228 unique URLs, only 352,951 begin their hostname component with “www.” and only 14,628 are HTTPS.
- 84,769 of the 967,933 domains do not appear anywhere in the list of canonicalized URLs; these either redirected to a different domain or responded with a network or HTTP error.
- 52,139 of those 84,769 domains do respond to a GET request if you tack “www.” on the beginning of the domain name and then proceed as above.
- But only 41,354 new unique URLs are produced; the other 10,785 domains duplicate entries in the earlier set.
- 39,966 of the 41,354 new URLs begin their hostname component with “www.”
- 806 of the new URLs are HTTPS.
- Merging the two sets produces 957,582 unique URLs (of which 392,917 begin the hostname with “www.” and 15,434 are HTTPS), 947,474 unique hostnames and 928,816 unique domains.
- 42,734 registration names (that is, the +1 component in a “public suffix + 1” name) appear in more than one public suffix. 11,748 appear in more than two; 5,516 in more than three; 526 in more than ten.
- 44,299 of the domains in the original list do not appear in the canonicalized set.
- 5,183 of the domains in the canonicalized set do not appear in the original list.
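The string-manipulation half of the canonicalization described in the bullets could be sketched like this (my reconstruction; the author’s actual script may differ, and the redirect-chasing half needs a real HTTP client):

```python
def canonical_candidate(entry: str, with_www: bool = False) -> str:
    """Turn a bare list entry into a fetchable URL: prepend a scheme,
    append a trailing slash when there is no path component, and
    optionally try a www. prefix for domains that fail otherwise."""
    host, _, path = entry.partition("/")
    if with_www and not host.startswith("www."):
        host = "www." + host
    return f"http://{host}/{path}"  # path is "" when no path component

# After fetching, one would chase HTTP redirects (urllib.request follows
# them by default) and record the final URL as the canonical form.
```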
Today’s exercise in data cleanup was brought to you by the I Can’t Believe This Took Me An Entire Week Foundation. If you ever need to do something similar, this script may be useful.
For the past several weeks a chunk of the news has been all about how the NSA in conjunction with various other US government agencies, defense contractors, telcos, etc. has, for at least seven years and probably longer, been collecting mass quantities of data about the activities of private citizens, both of the USA and of other nations. The data collected was largely what we call traffic analysis data: who talked to whom, where, when, using what mechanism. It was mostly not the actual contents of the conversations, but so much can be deduced from
who talked to whom, when that this should not reassure you in the slightest. If you haven’t seen the demonstration that just by compiling and correlating membership lists, the British government could have known that Paul Revere would’ve been a good person to ask pointed questions about revolutionary plots in the colonies in 1772, go read that now.
I don’t think it’s safe to assume we know anything about the details of this data collection: especially not the degree of cooperation the government obtained from telcos and other private organizations. There are too many layers of secrecy involved, there’s probably no one who has the complete picture of what the various three-letter agencies were supposed to be doing (let alone what they actually were doing), and there’s too many people trying to bend the narrative in their own preferred direction. However, I also don’t think the details matter all that much at this stage. That the program existed, and was successful enough that the NSA was bragging about it in an internal PowerPoint deck, is enough for the immediate conversation to go forward. (The details may become very important later, though: especially details about who got to make use of the data collected.)
Lots of other people have been writing about why this program is a Bad Thing: Most critically, large traffic-analytic databases are easy to abuse for politically-motivated witch hunts, which can and have occurred in the US in the past, and arguably are now occurring as a reaction to the leaks. One might also be concerned that this makes it harder to pursue other security goals; that it gives other countries an incentive to partition the Internet along national boundaries, harming its resilience; that it further harms the US’s image abroad, which was already not doing that well; or that the next surveillance program will be even worse if this one isn’t stopped. (Nothing new under the sun: Samuel Warren and Louis Brandeis made much the same argument in “The Right to Privacy” back in 1890.)
I want to talk about something a little different; I want to talk about why the secrecy of these ubiquitous surveillance programs is at least as harmful to good governance as the programs themselves.
Seems like every time I go to a security conference these days there’s at least one short talk where people are proposing to start over and rebuild the computer universe from scratch and make it simple and impossible to use wrong this time and it will be so awesome. Readers, it’s not going to work. And it’s not just a case of nobody’s going to put in enough time and effort to make it work. The idea is doomed from eight o’clock, Day One.
We all know from practical experience that a software module that’s too complicated is likely to harbor internal bugs and is also likely to induce bugs in the code that uses it. But we should also know from practice that a software module that’s too simple may work perfectly itself but will also induce bugs in the code that uses it!
One-size-fits-all APIs are almost always too inflexible, and so accumulate a “scar tissue” of workarounds, which are liable to be buggy. Is this an accident of our human fallibility? No, it is an inevitable consequence of oversimplification.
To explain why this is so, I need to talk a little about cybernetics. In casual usage, this word is a sloppy synonym for robotics and robotic enhancements to biological life (cyborgs), but as a scientific discipline it is the study of dynamic control systems that interact with their environment, ranging in scale from a simple closed-loop feedback controller to entire societies.1 The Wikipedia article is decent if you want more detail. The key insight, embodied in Stafford Beer’s viable system model, is that a working system must be at least as complex as the systems it interacts with. If it isn’t, it will be unable to cope with all possible inputs. This is a theoretical explanation for the practical observation above, and it lets us put a lower bound on the complexity of a real-world computer system.
Let’s just look at one external phenomenon nearly every computer has to handle: time. Time seems like it ought to be an easy problem. Everyone on Earth could, in principle, agree on what time it is right now. Making a good clock requires precision engineering, but the hardware people have that covered; a modern $5 wristwatch could have earned you twenty thousand pounds in 1714. And yet the task of converting a count of seconds to a human-readable date and vice versa is so hairy that people write 500-page books about that alone, and the IANA has to maintain a database of time zones that has seen at least nine updates a year every year since 2006. And that’s just one of the things computers have to do with time. And handling time correctly can, in fact, be security-critical.
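To make the date-conversion hairiness concrete, here is a small standard-library-only illustration; the instant is arbitrary, and the fixed offsets deliberately ignore DST, which is exactly the simplification real time zones refuse to make:

```python
from datetime import datetime, timezone, timedelta

# One count of seconds since the Unix epoch...
instant = 1_000_000_000  # an arbitrary moment in September 2001

utc = datetime.fromtimestamp(instant, tz=timezone.utc)
# ...is a different calendar date depending on the observer's zone:
tokyo_ish = utc.astimezone(timezone(timedelta(hours=9)))     # fixed UTC+9
newyork_ish = utc.astimezone(timezone(timedelta(hours=-4)))  # fixed UTC-4

print(utc.date(), tokyo_ish.date(), newyork_ish.date())
# Fixed offsets are already a lie: real zones shift with DST and politics,
# which is why the IANA database needs constant updates.
```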
I could assemble a demonstration like this for many other phenomena whose characteristics are set by the non-computerized world: space, electromagnetic waves, human perceptual and motor abilities, written language, mathematics, etc. etc. (I leave the biggest hairball of all—the global information network—out, because it’s at least nominally in-scope for these radical simplification projects.) Computers have to cope with all of these things in at least some circumstances, and they all interact with each other in at least some circumstances, so the aggregate complexity is even higher than if you consider each one in isolation. And we’re only considering here things that a general-purpose computer has to be able to handle before we can start thinking about what we want to use it for; that’ll bring in all the complexity of the problem domain.
To be clear, I do think that starting over from scratch and taking into account everything we’ve learned about programming language, OS, and network protocol design since 1970 would produce something better than what we have now. But what we got at the end of that effort would not be notably simpler than what we have now, and although it might be harder to write insecure (or just buggy) application code on top of it, it would not be impossible. Furthermore, a design and development process that does not understand and accept this will not produce an improvement over the status quo.
1 The casual-use meaning of
cybernetics comes from the observation (by early AI researchers) that robots and robotic prostheses were necessarily cybernetic systems, i.e. dynamic control systems that interacted with their environment.
Your post advocates a
□ software □ hardware □ cognitive □ two-factor □ other ___________
universal replacement for passwords. Your idea will not work. Here is why it won’t work:
□ It’s too easy to trick users into revealing their credentials
□ It’s too hard to change a credential if it’s stolen
□ It initiates an arms race which will inevitably be won by the attackers
□ Users will not put up with it
□ Server administrators will not put up with it
□ Web browser developers will not put up with it
□ National governments will not put up with it
□ Apple would have to sacrifice their extremely profitable hardware monopoly
□ It cannot coexist with passwords even during a transition period
□ It requires immediate total cooperation from everybody at once
Specifically, your plan fails to account for these human factors:
□ More than one person might use the same computer
□ One person might use more than one computer
□ One person might use more than one type of Web browser
□ People use software that isn’t a Web browser at all
□ Users rapidly learn to ignore security alerts of this type
□ This secret is even easier to guess by brute force than the typical password
□ This secret is even less memorable than the typical password
□ It’s too hard to type something that complicated on a phone keyboard
□ Not everyone can see the difference between red and green
□ Not everyone can make fine motor movements with that level of precision
□ Not everyone has thumbs
and technical obstacles:
□ Clock skew
□ Unreliable servers
□ Network latency
□ Wireless eavesdropping and jamming
□ Zooko’s Triangle
□ Computers do not necessarily have any USB ports
□ SMTP messages are often recoded or discarded in transit
□ SMS messages are trivially forgeable by anyone with a PBX
□ This protocol was shown to be insecure by ________________, ____ years ago
□ This protocol must be implemented perfectly or it is insecure
and the following philosophical objections may also apply:
□ It relies on a psychologically unnatural notion of identity
□ People want to present different facets of their identity in different contexts
□ Not everyone trusts your government
□ Not everyone trusts their own government
□ Who’s going to run this brand new global, always-online directory authority?
□ I should be able to authenticate a local communication without Internet access
□ I should be able to communicate without having met someone in person first
□ Anonymity is vital to robust public debate
To sum up,
□ It’s a decent idea, but I don’t think it will work. Keep trying!
□ This is a terrible idea and you should feel terrible.
□ You are the Russian Mafia and I claim my five pounds.
(hat tip to the original anti-spam solutions checklist)
It’s professional-organization management election time again. This is my response to everyone who’s about to send me an invitation to vote for them:
When it comes to ACM and IEEE elections, I am a single-issue voter, and the issue is open access to research. I will vote for you if and only if you make a public statement committing to aggressive pursuit of the following goals within your organization, in decreasing order of priority:
1. As immediately as practical, begin providing to the general public zero-cost, no-registration, no-strings-attached online access to new publications in your organization’s venues.
2. Commit to a timetable (also as quickly as practical, though it may be somewhat slower than the above) for opening up your organization’s older publications to the same zero-cost, no-registration, no-strings-attached online access.
3. Abandon the practice of requiring authors to assign copyright to your organization; instead, require only a license substantively similar to that requested by USENIX (exclusive publication rights for no longer than 12 months, with an exception for posting an electronic copy on your own website, and a nonexclusive right to continue disseminating afterward).
4. On a definite timetable, revert copyright to all authors who published under the old copyright policy, retaining only the rights requested under the new policy.
Thank you for your consideration.
The ACM held its annual Conference on Computer and Communications Security two weeks ago today in Raleigh, North Carolina. CCS is larger than Oakland and has two presentation tracks; I attended less than half of the talks, and my brain was still completely full afterward. Instead of doing one exhaustive post per day like I did with Oakland, I’m just going to highlight a handful of interesting papers over the course of the entire conference, plus the pre-conference Workshop on Privacy in the Electronic Society.