This site should no longer be causing certain versions of Firefox (particularly on Mac) to crash. If it still crashes for you, please flush your browser cache and try again. If it still crashes, please let me know about it.

As an unfortunate side effect of the changes required, there is no longer an owl at the bottom of each page. I’d appreciate advice on how to put it back. The trouble is persuading it to be at the bottom of the rightmost sidebar, but only if there is enough space below the actual content—formerly this was dealt with by replicating the background color on the <body> into the content elements for the sidebar, but now it’s all background images and there are visible seams if I do it that way. Note that body::after is already in use for something else, html::after can’t AFAIK be given the desired horizontal alignment, and (again AFAIK) media queries cannot measure the height of the page, only the window; so that excludes any number of more obvious techniques.

(If you mention Flexbox I will make the sad face at you.)

If you get this error message, the Internets may lead you to believe that you have no option but to change magic numbers in the source code and recompile flex. Reader, it is not so. Try the -Ca option before doing anything else.

No, I don’t know why an option that’s documented to be all about size/speed tradeoffs in the generated (DFA) scanner also has the effect of raising the hard limit on the number of NFA states (from 32,000 to about 2³¹), but I already feel dirty just having looked at the code enough to discover this, so I’m going to stop digging while I’m ahead.

Alexa publishes a list of “the top 1,000,000 sites on the web.” Here is some trivia about this list (as it was on September 27, 2013):

• No entries contain an URL scheme.
• Only 247 entries contain the string “www.”
• Only 13,906 entries contain a path component.
• There are 987,661 unique hostnames and 967,933 unique domains (public suffix + 1).
• If you tack “http://” on the beginning of each entry and “/” on the end (if there wasn’t a path component already), then issue a GET request for that URL and chase HTTP redirects as far as you can (without leaving the site root, unless there was a path component already), you get 916,228 unique URLs.
• Of those 916,228 unique URLs, only 352,951 begin their hostname component with “www.” and only 14,628 are HTTPS.
• 84,769 of the 967,933 domains do not appear anywhere in the list of canonicalized URLs; these either redirected to a different domain or responded with a network or HTTP error.
• 52,139 of those 84,769 domains do respond to a GET request if you tack “www.” on the beginning of the domain name and then proceed as above.
• But only 41,354 new unique URLs are produced; the other 10,785 domains duplicate entries in the earlier set.
• 39,966 of the 41,354 new URLs begin their hostname component with “www.”
• 806 of the new URLs are HTTPS.
• Merging the two sets produces 957,582 unique URLs (of which 392,917 begin the hostname with “www.” and 15,434 are HTTPS), 947,474 unique hostnames and 928,816 unique domains.
• 42,734 registration names (that is, the +1 component in a “public suffix + 1” name) appear in more than one public suffix. 11,748 appear in more than two; 5,516 in more than three; 526 in more than ten.
• 44,299 of the domains in the original list do not appear in the canonicalized set.
• 5,183 of the domains in the canonicalized set do not appear in the original list.
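For concreteness, the first step of that pipeline—tack “http://” on the beginning, “/” on the end unless there’s already a path, optionally prepend “www.”—fits in a few lines. This is not the script linked below, just an illustrative sketch of the rules as stated (the redirect-chasing and deduplication stages are omitted):

```python
def candidate_url(entry, www=False):
    """Turn a raw Alexa list entry into a fetchable URL: add the scheme,
    add a trailing slash unless the entry already has a path component,
    and optionally prepend "www." to the hostname."""
    host, slash, path = entry.partition("/")
    if www and not host.startswith("www."):
        host = "www." + host
    if slash:  # entry already had a path component: keep it as-is
        return "http://" + host + "/" + path
    return "http://" + host + "/"

print(candidate_url("example.com"))            # → http://example.com/
print(candidate_url("example.com/archive"))    # → http://example.com/archive
print(candidate_url("example.com", www=True))  # → http://www.example.com/
```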

Today’s exercise in data cleanup was brought to you by the I Can’t Believe This Took Me An Entire Week Foundation. If you ever need to do something similar, this script may be useful.

I’ve never liked Los Angeles. It’s too hot, to begin with, and it’s the Platonic ideal of everything that’s been wrong with American urban planning since Eisenhower (if not longer): strangling on its own traffic yet still car-mad, built where the water isn’t, and smeared over a ludicrous expanse of landscape. In the nearly twenty years since I moved away, all these things have only gotten worse. I only come back out of family obligation (my parents still live here), which doesn’t help.

This time, though, I find that I am enjoying myself regardless. I’m here with my sister, who does like it here and knows fun things to do and people to hang out with. We’re not clear out at the west end of the San Fernando Valley near our parents’ house; we’re in North Hollywood, a surprisingly short subway ride from downtown. (There is a subway now. A heavily used, grungy, practical subway. I can hardly believe it.) People even seem to be building somewhat denser. I was able to walk to the nearest dry cleaners’, which is also hardly believable.

We went to a show at Theatre of NOTE last night called “Eat the Runt”: billed as a black comedy, but really more of a farce, packed full of in-jokes about museums, grantwriting, and the entertainment biz, and with the cast randomly assigned to roles by pulling names out of a hat before each show. It was hilarious, although I wonder how much it depends on those in-jokes.

At the corner of Hollywood and Vine, the Walk of Fame has a special plaque for the Apollo 11 astronauts, shaped like the moon instead of a star, but still with the little brass old-timey TV. (I suppose it was a television broadcast of great significance, although memorializing it as such seems to miss the point.) I am not sure how I have managed never to notice this before.

Tonight, there will be more theater. Tomorrow, there will be the Huntington Library. Monday, back on an airplane.

For the past several weeks a chunk of the news has been all about how the NSA, in conjunction with various other US government agencies, defense contractors, telcos, etc., has, for at least seven years and probably longer, been collecting mass quantities of data about the activities of private citizens, both of the USA and of other nations. The data collected was largely what we call traffic analysis data: who talked to whom, where, when, using what mechanism. It was mostly not the actual contents of the conversations, but so much can be deduced from “who talked to whom, when” that this should not reassure you in the slightest. If you haven’t seen the demonstration that just by compiling and correlating membership lists, the British government could have known that Paul Revere would’ve been a good person to ask pointed questions about revolutionary plots in the colonies in 1772, go read that now.

I don’t think it’s safe to assume we know anything about the details of this data collection: especially not the degree of cooperation the government obtained from telcos and other private organizations. There are too many layers of secrecy involved, there’s probably no one who has the complete picture of what the various three-letter agencies were supposed to be doing (let alone what they actually were doing), and there are too many people trying to bend the narrative in their own preferred direction. However, I also don’t think the details matter all that much at this stage. That the program existed, and was successful enough that the NSA was bragging about it in an internal PowerPoint deck, is enough for the immediate conversation to go forward. (The details may become very important later, though: especially details about who got to make use of the data collected.)

Lots of other people have been writing about why this program is a Bad Thing. Most critically, large traffic-analytic databases are easy to abuse for politically motivated witch hunts, which have occurred in the US in the past and arguably are occurring now as a reaction to the leaks. One might also be concerned that this makes it harder to pursue other security goals; that it gives other countries an incentive to partition the Internet along national boundaries, harming its resilience; that it further harms the US’s image abroad, which was already not doing that well; or that the next surveillance program will be even worse if this one isn’t stopped. (Nothing new under the sun: Samuel Warren and Louis Brandeis’ argument in “The Right to Privacy” in 1890 is still as good an explanation as any you’ll find of why the government should not spy on the general public.)

I want to talk about something a little different; I want to talk about why the secrecy of these ubiquitous surveillance programs is at least as harmful to good governance as the programs themselves.


Since the last time I was seriously considering writing my own static site generator, a whole bunch of people have actually written static site generators, and gee, it’d be nice if I could use one of them and save myself some effort. Trouble is, none of the generators I’ve seen solve my particular, somewhat unusual use case, or if they do, I can’t tell that they do.

I have a bunch of different projects, each of which lives in its own little VCS repository, and you can think of each as being a black box which, when you push the button on the side, spits out one or more HTML documents and resources required by those documents. The site generator needs to take all those black boxes, push all the buttons, gather up the results, apply overarching site style, and glue everything together into a coherent URL tree. So imagine a source tree that looks something like this:

index.md
robots.txt
foo.html
bar.gif
[some project]/
    tblgen.py
    [other stuff]
rngstats/
    process.R
    [other stuff]
...


and we want that to become a rendered document tree looking something like this:

index.html
robots.txt
foo.html
bar.gif
[some project]/
    index.html
    sprite.png
rngstats/
    index.html
    d3.js
    aes.csv
    arc4.csv
    ...
...


Notice that each “black box” is a program written in an arbitrary language, with an arbitrary name (tblgen.py, process.R). In practice I suspect most of them will be written in (Numeric) Python, especially as the real “rngstats” just hit a brick wall named “R doesn’t appear to support reading 64-bit integers out of an HDF file,” but I want the option of using something else if it’s convenient for that particular project. I’m prepared to write a certain amount of glue—but only once, not every time I add a new project. This rules out a whole swathe of existing site generators immediately: those with no extension mechanism and those which cannot be persuaded to invoke external programs. More traditional formatting plugins that are stored outside the source root and can do things like make a nice HTML page out of a LaTeX document or a gallery out of a pile of JPEGs are also desirable.

Also, the [other stuff] isn’t supposed to be scanned directly by the site generator. Some of it will be used by the black box (tblgen.py or process.R here), some of it will be referenced by the output of the black box, and some of it is irrelevant to the process of packaging up the site and shouldn’t appear in the final result. In fact, I want the site generator to start with index.md and other designated root documents (such as robots.txt) and traverse the tree of links, generating only what is reachable. (With the option to generate a directory index for things like the existing /scratchpad.) I am not aware of anything that can do that.
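To be concrete about that reachability pass: what I want is an ordinary graph traversal from the designated roots. The sketch below uses a toy in-memory link map standing in for real HTML parsing and black-box invocation; the file names are made up for illustration:

```python
# "Generate only what is reachable": breadth-first traversal of the link
# graph starting from the designated root documents.
from collections import deque

SITE = {  # page -> resources/pages it links to (hypothetical data)
    "index.html": ["foo.html", "rngstats/index.html"],
    "foo.html": ["bar.gif"],
    "rngstats/index.html": ["rngstats/d3.js", "rngstats/aes.csv"],
    "orphan.html": ["never-reached.png"],  # not linked from any root
}

def reachable(roots):
    """Return the set of pages/resources reachable from the roots."""
    seen, queue = set(roots), deque(roots)
    while queue:
        page = queue.popleft()
        for target in SITE.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

out = reachable(["index.html", "robots.txt"])
print(sorted(out))  # orphan.html and never-reached.png are pruned
```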

Second tier requirements are: I want something that’s smart enough to not regenerate things that haven’t changed—some of these black boxes are quite expensive—and that integrates with my VCS of choice (which isn’t necessarily Git) to do so. It needs to understand nested repositories and trigger rebuilds when any of them updates. I also want it to apply minification and pre-compression to everything for which this makes sense, at site build time, so I don’t have to save minified JS or whatever into my source repo. Being able to pull library dependencies from a designated URL at build time might also be nice. Being able to inline a dependency used by a single page into that page would be super nice.
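Even without VCS integration, a generator could get most of the don’t-rebuild-what-hasn’t-changed behavior by fingerprinting each project’s source tree and rebuilding only when the fingerprint moves. A stdlib-only sketch of the idea (not any particular generator’s mechanism):

```python
# Fingerprint a directory tree: hash every file's relative path and
# contents into a single digest.  If the digest is unchanged since the
# last build, the black box inside doesn't need to be re-run.
import hashlib
import os
import tempfile

def tree_fingerprint(root):
    """Return a hex digest covering all files under root."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()

# Demonstrate: editing a source file changes the fingerprint.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "process.R"), "w") as f:
        f.write("plot(runif(100))\n")
    before = tree_fingerprint(d)
    with open(os.path.join(d, "process.R"), "a") as f:
        f.write("# tweak\n")
    after = tree_fingerprint(d)

print(before != after)  # → True: the project needs a rebuild
```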

As a final wrinkle, I’m largely unimpressed by all the templating languages out there. Only Genshi, of the engines I’m aware of, actually understands HTML structure to the extent that you don’t have to manually specify the proper escaping on each and every substitution; and it seems to be dead upstream, and moreover has the XML disease. (Which is how it manages to understand HTML structure to that extent, but surely someone can figure out a way to split the difference…?) I suppose I can live with manual escaping provided I don’t have to write very many templates.
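To spell out what I mean by not having to specify escaping manually: an engine that knows it is producing HTML can escape every substitution by default, so the template author never has to remember to. Genshi gets this by working on an XML infoset; here’s a deliberately toy stand-in using Python’s html.escape, just to show the escape-by-default behavior:

```python
# Toy template renderer that escapes every substituted value for HTML
# by default, so injected markup is neutralized without any effort from
# the template author.  (Real engines must also handle attribute, URL,
# and script contexts; this shows only the simplest case.)
from html import escape

def render(template, **values):
    """Substitute {name} placeholders, HTML-escaping each value."""
    return template.format(**{k: escape(str(v)) for k, v in values.items()})

page = render("<p>Hello, {user}!</p>", user="<script>alert(1)</script>")
print(page)  # the markup in the value comes out as inert text
```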

So, what should I be looking at?

Seems like every time I go to a security conference these days there’s at least one short talk where people are proposing to start over and rebuild the computer universe from scratch and make it simple and impossible to use wrong this time and it will be so awesome. Readers, it’s not going to work. And it’s not just a case of nobody’s going to put in enough time and effort to make it work. The idea is doomed from eight o’clock, Day One.

We all know from practical experience that a software module that’s too complicated is likely to harbor internal bugs and is also likely to induce bugs in the code that uses it. But we should also know from practice that a software module that’s too simple may work perfectly itself but will also induce bugs in the code that uses it! “One size fits all” APIs are almost always too inflexible, and so accumulate a “scar tissue” of workarounds, which are liable to be buggy. Is this an accident of our human fallibility? No, it is an inevitable consequence of oversimplification.

To explain why this is so, I need to talk a little about cybernetics. In casual usage, this word is a sloppy synonym for robotics and robotic enhancements to biological life (cyborgs), but as a scientific discipline it is the study of dynamic control systems that interact with their environment, ranging in scale from a simple closed-loop feedback controller to entire societies.¹ The Wikipedia article is decent, and if you want more detail, the essay “Cybernetics of Society” is a good starting point. Much of the literature on cybernetics talks about interacting systems of people—firms, governments, social clubs, families, etc.—but is equally applicable to systems of, around, or within computers. One of the fundamental conclusions of cybernetics, evident for instance in Stafford Beer’s viable system model, is that a working system must be at least as complex as the systems it interacts with. If it isn’t, it will be unable to cope with all possible inputs. This is a theoretical explanation for the practical observation above, and it lets us put a lower bound on the complexity of a real-world computer system.

Let’s just look at one external phenomenon nearly every computer has to handle: time. Time seems like it ought to be an easy problem. Everyone on Earth could, in principle, agree on what time it is right now. Making a good clock requires precision engineering, but the hardware people have that covered; a modern $5 wristwatch could have earned you twenty thousand pounds in 1714. And yet the task of converting a count of seconds to a human-readable date and vice versa is so hairy that people write 500-page books about that alone, and the IANA has to maintain a database of time zones that has seen at least nine updates a year every year since 2006. And that’s just one of the things computers have to do with time. And handling time correctly can, in fact, be security-critical.
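To make the hairiness concrete, here is a small stdlib-Python sketch. The particular instant (the night of a UK clock change in 2010) is my choice of illustration; the point is that even with the IANA database wired into the language runtime, the seconds-to-date conversion is full of traps:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9; uses the IANA tz database

# One count of seconds, several readings: 1288483200 is 2010-10-31 00:00:00 UTC.
instant = datetime.fromtimestamp(1288483200, tz=timezone.utc)
print(instant.astimezone(ZoneInfo("America/New_York")))  # → 2010-10-30 20:00:00-04:00
print(instant.astimezone(ZoneInfo("Europe/London")))     # → 2010-10-31 01:00:00+01:00

# Going the other way is worse: 01:30 on the clock in London happened twice
# that night (once in BST, once in GMT), so an extra "fold" flag is needed
# just to say which occurrence you mean.
early = datetime(2010, 10, 31, 1, 30, tzinfo=ZoneInfo("Europe/London"), fold=0)
late = datetime(2010, 10, 31, 1, 30, tzinfo=ZoneInfo("Europe/London"), fold=1)
print(early.utcoffset(), late.utcoffset())  # → 1:00:00 0:00:00
```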

I could assemble a demonstration like this for many other phenomena whose characteristics are set by the non-computerized world: space, electromagnetic waves, human perceptual and motor abilities, written language, mathematics, etc. etc. (I leave the biggest hairball of all—the global information network—out, because it’s at least nominally in-scope for these radical simplification projects.) Computers have to cope with all of these things in at least some circumstances, and they all interact with each other in at least some circumstances, so the aggregate complexity is even higher than if you consider each one in isolation. And we’re only considering here things that a general-purpose computer has to be able to handle before we can start thinking about what we want to use it for; that’ll bring in all the complexity of the problem domain.

To be clear, I do think that starting over from scratch and taking into account everything we’ve learned about programming language, OS, and network protocol design since 1970 would produce something better than what we have now. But what we got at the end of that effort would not be notably simpler than what we have now, and although it might be harder to write insecure (or just buggy) application code on top of it, it would not be impossible. Furthermore, a design and development process that does not understand and accept this will not produce an improvement over the status quo.

¹ The casual-use meaning of “cybernetics” comes from the observation (by early AI researchers) that robots and robotic prostheses were necessarily cybernetic systems, i.e. dynamic control systems that interacted with their environment.

You may recall a month and a half ago I posted Notes on the Cross-Platform Availability of Header Files and then promptly had to take most of it down because it was insufficiently researched. Well, the research is ongoing, but I’ve got a shiny new set of results, some high-level conclusions, and several ways Viewers Like You can help!

First, the high-level conclusions:

• Except perhaps in deeply-embedded environments, all of C89’s library is universally available.
• Code not intended to run on Windows can also assume most of C99 and much of POSIX. The less-ubiquitous headers from these categories are also the less-useful headers.
• Code that is intended to run on Windows should only use C89 headers and <stdint.h>. If MSVC 2008 support is required, not even <stdint.h> can be used. (Windows compilers do provide a small handful of POSIX headers, but they do not contain the expected set of declarations!)
• Many different Unix variants ship a similar set of nonstandard headers. We don’t yet know whether the contents of these headers are reliable cross-platform.
• There is a large set of obsolete headers that are still widespread but should not be used in new code. This is underdocumented.
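For the curious, the core of this kind of probe is surprisingly small: generate a one-line program that includes the header, and ask the compiler whether it compiles. The sketch below is not the actual survey code (see the repo above, which also inventories the *contents* of each header); `cc` and the flags are assumptions about your toolchain:

```python
# Probe whether a header exists on this platform by test-compiling a
# trivial program that includes it.
import os
import subprocess
import tempfile

def probe_source(header):
    """The test program: include the header, do nothing else."""
    return "#include <%s>\nint main(void) { return 0; }\n" % header

def header_available(header, cc="cc"):
    """True if `cc` can compile a program that includes `header`.
    Assumes a Unixy compiler driver accepting -c and -o."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "probe.c")
        with open(path, "w") as f:
            f.write(probe_source(header))
        result = subprocess.run(
            [cc, "-c", path, "-o", os.path.join(d, "probe.o")],
            capture_output=True)
        return result.returncode == 0

print(probe_source("stdint.h"))
```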

The full results may be seen here: http://hacks.owlfolio.org/header-survey/
The raw data is here: https://github.com/zackw/header-survey/

If you want to help, we need more inventories (especially for OSes further from the beaten path), and I’m also very interested in improvements to the giant generated HTML table. Y’all on Planet Mozilla can probably tell I’m not a Web designer. If you are an old beard, there are also places where I’m not entirely sure of my methodology – see the README in the source repo.

Art by Dave Mottram. Found on G+.

In honor of the Feast of All Fools, and because if anyone has noticed it, they haven’t told me, I hereby announce that there is a joke in the references of my most recently published paper. Whoever first correctly identifies it will win the right to suggest a joke to be added to my next paper, which is currently in preparation. Post your guesses in the comments; so as not to spoil it for anyone, comments will not be visible until after the contest ends.

One guess per person. Must provide a working email address (or I won’t be able to contact you if you win). Do not suggest a joke now; the winner will be notified of the topic of the upcoming paper, so they can think of something appropriate. Management reserves the right to reject joke suggestions, in which case the next person in line will get a crack at it.