Research

Notes and essays about the topics I am currently doing academic research on. For the past decade this has been computer and network security, often having something to do with the (ab)use of the Internet for censorship and surveillance.

I Didn’t Learn Unix By Reading All The Manpages

Originally drafted as a thread on hackers.town, after Abbie Normal asked me to expand on a side comment in a discussion of documentation.

There’s a story old Unix beards tell about how they learned Unix. “We just read all the manpages,” they say, “that’s how well written they are, you don’t need to read anything else or take any classes. Maybe also pick up a copy of K&R if you’re a little iffy on C.”

I consider myself an old Unix beard, even though I don’t have a beard and I only got into the game in the days of SunOS 4.1, and until quite recently I thought this was how I learned Unix. I did read all the manpages, without any formal coursework, and trained myself up as a programmer to the point where I could get a job in the industry. It took three years of self-study and experimentation, consuming nearly all my free time, and in retrospect I wouldn’t recommend the experience, but, y’know, it worked out, right?

But the thing is, this story completely neglects all the things I’d already learned about computers and programming before I got to college.

Continued…

Open beta for ICLab TagTeam

I’m pleased to announce the open beta test of ICLab’s clearinghouse for data about censored websites. This site will aggregate manual and automated test reports, facilitate more efficient use of automated test resources, and help policy analysts draw conclusions about what gets censored in particular countries.

[EDIT 19 Jan 2021: The clearinghouse had to be taken down almost immediately because no one had time to maintain it. Someday the project it is part of may be continued. Read on for details on what we had and what we aspired to.]

Continued…

Call for Volunteers: Active Geolocation

For the past few months I’ve been working on a research study of active geolocation algorithms. These attempt to determine where in the world a computer is, by measuring how long it takes network messages from that computer to reach other computers in known locations.
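To make the underlying idea concrete, here is a minimal sketch of how a single round-trip time bounds distance. It is not one of the algorithms under study. The propagation speed of 200 km per millisecond (roughly two-thirds of the speed of light, a common rule of thumb for optical fiber) and the function name are assumptions made for illustration.

```python
# Assumed propagation speed: about two-thirds of the speed of light,
# a rule of thumb for signals in optical fiber.
SPEED_KM_PER_MS = 200.0


def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on the distance to a host, given a round-trip time.

    The message travels out and back, so only half of the round trip
    counts toward one-way distance. Real network paths are slower and
    less direct, so the true distance is usually well under this bound.
    """
    if rtt_ms < 0:
        raise ValueError("round-trip time cannot be negative")
    return (rtt_ms / 2.0) * SPEED_KM_PER_MS
```

For example, `max_distance_km(10.0)` says a host answering in 10 ms is at most 1000 km away. Combining such bounds from several computers in known locations narrows down the region where the target can be.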

In order to test some of these algorithms thoroughly, I need volunteers who are willing to run my measurement software on their computers, and tell me where they are. I’m especially interested in data reported from computers that are in neither Europe nor North America, but data from anywhere is useful. Currently, running the software takes a fair bit of technical skill; if you’re not comfortable with the Unix command line, please wait for the friendlier web-based version, which is in development.

If you’re interested, please go to https://research.owlfolio.org/active-geo/ for further instructions.

(For legal reasons, you must be at least 18 years old to volunteer.)

(Reproduction and dissemination of this call for volunteers is encouraged.)

Bootstrapping trust in compilers

The other week, an acquaintance of mine was kvetching on Twitter about how the Rust compiler is written in Rust, and so to get started with the language you have to download a binary, and there’s no way to validate it—you could use the binary plus the matching compiler source to recreate the binary, but that doesn’t prove anything, and also if the compiler were really out to get you, you would be screwed the moment you ran the binary.

This is not a new problem, nor is it a Rust-specific problem. I recall having essentially the same issue back in 2000, give or take, with GNAT, the Ada front-end for GCC. It is written in Ada, and (at the time, anyway) not just any Ada compiler would do, you had to have a roughly contemporaneous version of … GNAT. It was especially infuriating compared to the rest of GCC, which (again, at the time) bent over backward to be buildable with any C compiler you could get your hands on, even a traditional one that didn’t support all of the 1989 language standard. But even that is problematic for someone who would rather not trust any machine code they didn’t verify themselves.

One way around the headache is diverse recompilation, in which you compile the same compiler with two different compilers, then recompile it with itself-as-produced-by-each, and compare the results. But this requires you to have two different compilers in the first place. As of this writing there is only one Rust compiler. There aren’t that many complete implementations of C++ out there, either, and you need one of those to build LLVM (which Rust depends on). I think you could devise a compiler virus that could propagate itself via both LLVM and GCC, for instance.
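The procedure can be sketched with a toy model. Everything here is hypothetical: `Binary` and `toy_compile` stand in for real executables and real compiler runs, and the sketch assumes builds are deterministic, so that two honest paths produce bit-identical output.

```python
from typing import Callable, NamedTuple


class Binary(NamedTuple):
    """Toy stand-in for an executable (hypothetical, not a real format)."""
    codegen: str   # which code generator laid out these bytes
    behavior: str  # what the program does when it runs


def toy_compile(compiler: Binary, source: str) -> Binary:
    """Pretend to run `compiler` on `source`.

    The layout of the output depends on how the compiler behaves, and a
    trojaned compiler silently copies its trojan into whatever it builds.
    """
    behavior = source
    if "+trojan" in compiler.behavior:
        behavior += "+trojan"
    return Binary(codegen=compiler.behavior, behavior=behavior)


def diverse_recompile(
    compile_with: Callable[[Binary, str], Binary],
    bootstrap_a: Binary,
    bootstrap_b: Binary,
    source: str,
) -> bool:
    """Return True if both bootstrap paths converge on the same binary."""
    # Stage 1: build the compiler's source with each bootstrap compiler.
    # These outputs may legitimately differ, since the code generators differ.
    stage1_a = compile_with(bootstrap_a, source)
    stage1_b = compile_with(bootstrap_b, source)
    # Stage 2: rebuild the same source with each stage-1 result.
    # Honest paths now share both source and behavior, so they should match.
    stage2_a = compile_with(stage1_a, source)
    stage2_b = compile_with(stage1_b, source)
    return stage2_a == stage2_b
```

In this model a trojan in one bootstrap compiler makes the stage-2 results differ, so the check returns False. If both bootstrap compilers carry the same trojan, the results match and the check passes, which is the weakness described above.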

What’s needed, I think, is an independent root of correctness. A software environment built from scratch to be verifiable, maybe even provably correct, and geared specifically to host independent implementations of compilers for popular languages. They need not be terribly good at optimizing, because the only thing you’d ever use them for is to be one side of a diversely-recompiled bootstrap sequence. It has to be a complete and isolated environment, though, because it wouldn’t be impossible to propagate a compiler virus through the operating system kernel, which can see every block of I/O, after all.

And it seems to me that this environment naturally divides into four pieces. First, a tiny virtual machine. I’m thinking a FORTH interpreter, which is small enough that one programmer can code it by hand in assembly language, and having done that, another programmer can audit it by hand. You need multiple implementations of this, so you can check them against each other to guard against malicious lower layers—it could run on the bare metal, maybe, but the bare metal has an awful lot of clever embedded in it these days. But hopefully this is the only thing you need to implement more than once.
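To give a sense of how little machinery the core of such an interpreter needs, here is a minimal sketch of a FORTH-style stack evaluator. It is written in Python only for readability, whereas the proposal above calls for hand-written assembly. The word set is an assumption chosen for illustration, and it omits colon definitions, memory access, and control flow, all of which a usable FORTH would need.

```python
from typing import Callable, Dict, List


def run_forth(program: str) -> List[int]:
    """Evaluate a whitespace-separated program and return the final stack."""
    stack: List[int] = []

    def binary(op: Callable[[int, int], int]) -> Callable[[], None]:
        def word() -> None:
            b = stack.pop()  # top of stack is the right-hand operand
            a = stack.pop()
            stack.append(op(a, b))
        return word

    def dup() -> None:
        stack.append(stack[-1])

    def swap() -> None:
        stack[-1], stack[-2] = stack[-2], stack[-1]

    def drop() -> None:
        stack.pop()

    words: Dict[str, Callable[[], None]] = {
        "+": binary(lambda a, b: a + b),
        "-": binary(lambda a, b: a - b),
        "*": binary(lambda a, b: a * b),
        "dup": dup,
        "swap": swap,
        "drop": drop,
    }

    for token in program.split():
        if token in words:
            words[token]()
        else:
            stack.append(int(token))  # anything else must be an integer literal
    return stack
```

For example, `run_forth("2 3 + 4 *")` leaves `[20]` on the stack.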

Second, you use the FORTH interpreter as the substratum for a more powerful language. If there’s a language in which each program is its own proof of correctness, that would be the obvious choice, but my mental allergy to arrow languages has put me off following that branch of PL research. Lisp is generally a good language to write compilers in, so a small dialect of that would be another obvious choice. (Maybe leave out the call/cc.)

Third, you write compilers in the more powerful language, with both the FORTH interpreter and more conventional execution environments as code-generation targets. These compilers can then be used to compile other stuff to run in the environment, and conversely, you can build arbitrary code within the environment and export it to your more conventional OS.

The fourth and final piece is a way of getting data in and out of the environment. I imagine it as strictly batch-oriented, not interactive at all, simply because that cuts out a huge chunk of complexity from the FORTH interpreter; similarly it does not have any business talking to the network, nor having any notion of time, maybe not even concurrency—most compile jobs are embarrassingly parallel, but again, huge chunk of complexity. What feels not-crazy to me is some sort of trivial file system: ar archive level of trivial, all files write-once, imposed on a linear array of disk blocks.
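A file system of roughly that level of triviality might look like the following sketch. The class name, the 512-byte block size, and the in-memory index are all assumptions for illustration; an ar-style layout would instead store a small header in front of each member on the disk itself.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 512  # assumed block size


class WriteOnceArchive:
    """Write-once files laid end to end on a linear array of blocks."""

    def __init__(self) -> None:
        self.blocks: List[bytes] = []              # the "disk"
        self.index: Dict[str, Tuple[int, int]] = {}  # name -> (first block, byte length)

    def write(self, name: str, data: bytes) -> None:
        if name in self.index:
            raise FileExistsError(name)  # files can never be overwritten
        first = len(self.blocks)
        for offset in range(0, len(data), BLOCK_SIZE):
            chunk = data[offset:offset + BLOCK_SIZE]
            # Pad the final chunk so every block has the same size.
            self.blocks.append(chunk.ljust(BLOCK_SIZE, b"\0"))
        self.index[name] = (first, len(data))

    def read(self, name: str) -> bytes:
        first, length = self.index[name]
        count = -(-length // BLOCK_SIZE)  # ceiling division
        return b"".join(self.blocks[first:first + count])[:length]
```

Because files are only ever appended and never modified, there is no free-space management, no fragmentation, and no deletion logic to audit.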

It is probably also necessary to reinvent Make, or at least some sort of batch job control language.

PETS rump session talk

I spoke briefly at PETS 2014 about which websites are censored in which countries, and what we can learn just from the lists.

another small dispatch from the coalface

[Figure: for every country for which Herdict contains enough reports to be credible (concretely, such that the error bars cover less than 10% of the range), the estimated probability that a webpage will be inaccessible. Countries are sorted vertically by the left edge of the error bar; further right is worse.] I suspect major systemic errors in this data set, but it’s the only data set in town.

a small dispatch from the coalface

category                           count        %
total                          5 838 383  100.000
ok                             2 212 565   37.897
ok (redirected)                1 999 341   34.245
network or protocol error        798 231   13.672
timeout                          412 759    7.070
hostname not found               166 623    2.854
page not found (404/410)         110 241    1.888
forbidden (403)                   75 054    1.286
service unavailable (503)         18 648    0.319
server error (500)                15 150    0.259
bad request (400)                 14 397    0.247
authentication required (401)      9 199    0.158
redirection loop                   2 972    0.051
proxy error (502/504/52x)          1 845    0.032
other HTTP response                1 010    0.017
crawler failure                      329    0.006
syntactically invalid URL             19    0.000

Sorry about the non-tabular figures.

Secure channels are like immunization

For a while now, when people ask me how they can improve their websites’ security, I tell them: Start by turning on HTTPS for everything. Run a separate server on port 80 that issues nothing but permanent redirects to the https:// version of the same URL. There’s lots more you can do, but that’s the easy first step. There are a number of common objections to this plan; today I want to talk about the “it should be the user’s choice” objection, expressed for instance in “Google to Gmail customers: You WILL use HTTPS” by Robert L. Mitchell. It goes something like this:

Why should I (the operator of the website) assume I know better than each of my users what their security posture should be? Maybe this is a throwaway account, of no great importance to them. Maybe they are on a slow link that is intrinsically hard to eavesdrop upon, so the extra network round-trips involved in setting up a secure channel make the site annoyingly slow for no benefit.

This objection ignores the public health benefits of secure channels. I’d like to make an analogy to immunization, here. If you get vaccinated against the measles (for instance), that’s good for you because you are much less likely to get the disease yourself. But it is also good for everyone who lives near you, because now you can’t infect them either. If enough people in a region are immune, then nobody will get the disease, even if they aren’t immune; this is called herd immunity. Secure channels have similar benefits to the general public—unconditionally securing a website improves security for everyone on the ’net, whether or not they use that website! Here’s why.

Most of the criminals who crack websites don’t care which accounts they gain access to. This surprises people; if you ask users, they often say things like “well, nobody would bother breaking into my email / bank account / personal computer, because I’m not a celebrity and I don’t have any money!” But the attackers don’t care about that. They break into email accounts so they can send spam; any @gmail.com address is as good as any other. They break into bank accounts so they can commit credit card fraud; any given person’s card is probably only good for US$1000 or so, but multiply that by thousands of cards and you’re talking about real money. They break into PCs so they can run botnets; they don’t care about data stored on the computer, they want the CPU and the network connection. For more on this point, see the paper “Folk Models of Home Computer Security” by Rick Wash. This is the most important reason why security needs to be unconditional. Accounts may be throwaway to their users, but they are all the same to the attackers.

Often, criminals who crack websites don’t care which websites they gain access to, either. The logic is similar: the legitimate contents of the website are irrelevant. All the attacker wants is to reuse a legitimate site as part of a spamming scheme or to copy the user list, guess the weaker passwords, and try those username+password combinations on more important websites. This is why everyone who has a website, even if it’s tiny and attracts hardly any traffic, needs to worry about its security. This is also why making websites secure improves security for everyone, even if they never intentionally visit that website.

Now, how does HTTPS help with all this? Several of the easiest ways to break into websites involve snooping on unsecured network traffic to steal user credentials. This is possible even with the common-but-insufficient tactic of sending only the login form over HTTPS, because every insecure HTTP request after login includes a piece of data called a session cookie that can be stolen and used to impersonate the user for most purposes without having to know the user’s password. (It’s often not possible to change the user’s password without also knowing the old password, but that’s about it. If an attacker just wants to send spam, and doesn’t care about maintaining control of the account, a session cookie is good enough.) It’s also possible even if all logged-in users are served only HTTPS, but you get an unsecured page until you log in, because then an attacker can modify the unsecured page and make it steal credentials. Only applying channel security to the entire site for everyone, whoever they are, logged in or not, makes this class of attacks go away.

Unconditional use of HTTPS also enables further security improvements. For instance, a site that is exclusively HTTPS can use the Strict-Transport-Security mechanism to put browsers on notice that they should never communicate with it over an insecure channel. This is important because there are turnkey “SSL stripping” tools that lurk in between a legitimate site and a targeted user and make it look like the site wasn’t HTTPS in the first place. There are subtle differences in the browser’s presentation that a clever human might notice, or you could direct the computer to pay attention, and then it will notice. But this only works, again, if the site is always HTTPS for everyone. Similarly, an always-secured site can mark all of its cookies secure and httponly, which cuts off more ways for attackers to steal user credentials. And if a site runs complicated code on the server, exposing that code to the public Internet two different ways (HTTP and HTTPS) enlarges the server’s attack surface. If the only thing on port 80 is a boilerplate “try again with HTTPS” permanent redirect, this is not an issue. (Bonus points for invalidating session cookies and passwords that just went over the wire in cleartext.)
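As a rough sketch of what those two measures look like on the wire, the following builds the relevant response headers using only the Python standard library. The function name, the cookie name `session`, and the one-year `max-age` are assumptions for illustration, not recommendations taken from this post.

```python
from http.cookies import SimpleCookie
from typing import List, Tuple


def security_headers(session_id: str, max_age: int = 31536000) -> List[Tuple[str, str]]:
    """Headers an always-HTTPS site might add to a response.

    `max_age` is in seconds; the default is one year (an assumed value).
    """
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["path"] = "/"
    cookie["session"]["secure"] = True    # never send over plain HTTP
    cookie["session"]["httponly"] = True  # hide from page scripts
    return [
        # Tells the browser to refuse insecure connections to this host.
        ("Strict-Transport-Security", f"max-age={max_age}"),
        ("Set-Cookie", cookie["session"].OutputString()),
    ]
```

Browsers only honor Strict-Transport-Security when it arrives over HTTPS, which is one more reason the site has to be HTTPS for everyone.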

Finally, I’ll mention that if a site’s users can turn security off, then there’s a per-user toggle switch in the site’s memory banks somewhere, and the site operators can flip that switch off if they want. Or if they have been, shall we say, leaned on. It’s a lot easier for the site operators to stand up to being leaned on if they can say “that’s not a thing our code can do.”
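For reference, the redirect-only server on port 80 recommended at the start of this post could be sketched as follows, using Python’s standard `http.server` module. The names `https_location`, `RedirectHandler`, and `serve` are hypothetical, and a production site would more likely configure this in its existing web server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def https_location(host_header: str, path: str) -> str:
    """Build the https:// URL for the same host and path, dropping any port."""
    host = host_header.strip()
    if host.startswith("["):          # IPv6 literal such as "[::1]:80"
        host = host[: host.index("]") + 1]
    else:
        host = host.split(":", 1)[0]  # drop an explicit port such as ":80"
    return f"https://{host}{path}"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers every request with a permanent redirect and nothing else."""

    def _redirect(self) -> None:
        host = self.headers.get("Host")
        if not host:
            self.send_error(400, "Missing Host header")
            return
        self.send_response(301)
        self.send_header("Location", https_location(host, self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = _redirect
    do_HEAD = _redirect
    do_POST = _redirect


def serve(port: int = 80) -> None:
    # Binding to port 80 usually requires elevated privileges.
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

Calling `serve()` starts the listener. Because the handler runs no application code at all, the cleartext port adds almost nothing to the attack surface.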

some trivia about the Alexa 1M

Alexa publishes a list of the top 1,000,000 sites on the web. Here is some trivia about this list (as it was on September 27, 2013):

  • No entries contain a URL scheme.
  • Only 247 entries contain the string www.
  • Only 13,906 entries contain a path component.
  • There are 987,661 unique hostnames and 967,933 unique domains (public suffix + 1).
  • If you tack http:// on the beginning of each entry and / on the end (if there wasn’t a path component already), then issue a GET request for that URL and chase HTTP redirects as far as you can (without leaving the site root, unless there was a path component already), you get 916,228 unique URLs.
  • Of those 916,228 unique URLs, only 352,951 begin their hostname component with www. and only 14,628 are HTTPS.
  • 84,769 of the 967,933 domains do not appear anywhere in the list of canonicalized URLs; these either redirected to a different domain or responded with a network or HTTP error.
  • 52,139 of those 84,769 domains do respond to a GET request if you tack www. on the beginning of the domain name and then proceed as above.
  • But only 41,354 new unique URLs are produced; the other 10,785 domains duplicate entries in the earlier set.
  • 39,966 of the 41,354 new URLs begin their hostname component with www.
  • 806 of the new URLs are HTTPS.
  • Merging the two sets produces 957,582 unique URLs (of which 392,917 begin the hostname with www. and 15,434 are HTTPS), 947,474 unique hostnames and 928,816 unique domains.
  • 42,734 registration names (that is, the +1 component in a public suffix + 1 name) appear in more than one public suffix. 11,748 appear in more than two; 5,516 in more than three; 526 in more than ten.
  • 44,299 of the domains in the original list do not appear in the canonicalized set.
  • 5,183 of the domains in the canonicalized set do not appear in the original list.
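The first step of that cleanup, and the kind of counting done on the results, can be sketched as below. This is not the script mentioned in this post. The function names are hypothetical, and the sketch deliberately leaves out the network part (issuing GET requests and chasing redirects) as well as public-suffix handling.

```python
from typing import Dict, Iterable
from urllib.parse import urlsplit


def entry_to_url(entry: str) -> str:
    """Turn a bare list entry into a URL to fetch."""
    entry = entry.strip()
    if "/" not in entry:  # no path component, so use the site root
        entry += "/"
    return "http://" + entry


def summarize(urls: Iterable[str]) -> Dict[str, int]:
    """Count unique URLs, unique hostnames, www. hostnames, and HTTPS URLs."""
    unique = set(urls)
    parts = [urlsplit(u) for u in unique]
    return {
        "unique_urls": len(unique),
        "unique_hostnames": len({p.hostname for p in parts}),
        "www_urls": sum(1 for p in parts if (p.hostname or "").startswith("www.")),
        "https": sum(1 for p in parts if p.scheme == "https"),
    }
```

Here `entry_to_url("example.com")` gives `http://example.com/`, while an entry that already has a path, such as `example.com/foo`, keeps it. `summarize` would then be run over the final URLs reached after following redirects.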

Today’s exercise in data cleanup was brought to you by the I Can’t Believe This Took Me An Entire Week Foundation. If you ever need to do something similar, this script may be useful.

Institutional secrecy culture is antidemocratic

For the past several weeks a chunk of the news has been all about how the NSA, in conjunction with various other US government agencies, defense contractors, telcos, etc., has, for at least seven years and probably longer, been collecting mass quantities of data about the activities of private citizens, both of the USA and of other nations. The data collected was largely what we call traffic analysis data: who talked to whom, where, when, using what mechanism. It was mostly not the actual contents of the conversations, but so much can be deduced from “who talked to whom, when” that this should not reassure you in the slightest. If you haven’t seen the demonstration that just by compiling and correlating membership lists, the British government could have known that Paul Revere would’ve been a good person to ask pointed questions about revolutionary plots in the colonies in 1772, go read that now.

I don’t think it’s safe to assume we know anything about the details of this data collection: especially not the degree of cooperation the government obtained from telcos and other private organizations. There are too many layers of secrecy involved, there’s probably no one who has the complete picture of what the various three-letter agencies were supposed to be doing (let alone what they actually were doing), and there are too many people trying to bend the narrative in their own preferred direction. However, I also don’t think the details matter all that much at this stage. That the program existed, and was successful enough that the NSA was bragging about it in an internal PowerPoint deck, is enough for the immediate conversation to go forward. (The details may become very important later, though: especially details about who got to make use of the data collected.)

Lots of other people have been writing about why this program is a Bad Thing: most critically, large traffic-analytic databases are easy to abuse for politically-motivated witch hunts, which have occurred in the US in the past, and arguably are occurring now as a reaction to the leaks. One might also be concerned that this makes it harder to pursue other security goals; that it gives other countries an incentive to partition the Internet along national boundaries, harming its resilience; that it further harms the US’s image abroad, which was already not doing that well; or that the next surveillance program will be even worse if this one isn’t stopped. (Nothing new under the sun: Samuel Warren and Louis Brandeis’ argument in “The Right to Privacy” in 1890 is still as good an explanation as any you’ll find of why the government should not spy on the general public.)

I want to talk about something a little different; I want to talk about why the secrecy of these ubiquitous surveillance programs is at least as harmful to good governance as the programs themselves.

Continued…