Pulling a site together from lots of little page generators

Since the last time I was seriously considering writing my own static site generator, a whole bunch of people have actually written static site generators, and gee, it’d be nice if I could use one of them and save myself some effort. Trouble is, none of the generators I’ve seen solve my particular, somewhat unusual use case, or if they do, I can’t tell that they do.

I have a bunch of different projects, each of which lives in its own little VCS repository, and you can think of each as being a black box which, when you push the button on the side, spits out one or more HTML documents and resources required by those documents. The site generator needs to take all those black boxes, push all the buttons, gather up the results, apply overarching site style, and glue everything together into a coherent URL tree. So imagine a source tree that looks something like this:

index.md
robots.txt
scratchpad/
    foo.html
    bar.gif
header-survey/
    tblgen.py
    [other stuff]
rngstats/
    process.R
    [other stuff]
...

and we want that to become a rendered document tree looking something like this:

index.html
robots.txt
scratchpad/
    foo.html
    bar.gif
header-survey/
    index.html
    sprite.png
rngstats/
    index.html
    d3.js
    aes.csv
    arc4.csv
    ...

Notice that each black box is a program written in an arbitrary language, with an arbitrary name (tblgen.py, process.R). In practice I suspect most of them will be written in (Numeric) Python, especially as the real rngstats just hit a brick wall named “R doesn’t appear to support reading 64-bit integers out of an HDF file,” but I want the option of using something else if it’s convenient for that particular project. I’m prepared to write a certain amount of glue—but only once, not every time I add a new project. This rules out a whole swathe of existing site generators immediately: those with no extension mechanism, and those that cannot be persuaded to invoke external programs. Plugins of the more traditional sort, stored outside the source root, that can do things like turn a LaTeX document into a nice HTML page or a pile of JPEGs into a gallery, are also desirable.
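
To make the shape of that glue concrete, here is a minimal sketch of a driver that pushes every button. The _generate.cfg manifest convention is invented purely for illustration; none of the projects above actually has one.

    import pathlib, subprocess

    SOURCE = pathlib.Path("src")        # the source tree shown above
    OUTPUT = pathlib.Path("rendered")   # intermediate output, before linking

    def build_all():
        # Assumed convention: each project directory carries a one-line
        # manifest, _generate.cfg, naming the command that generates it.
        for manifest in sorted(SOURCE.glob("*/_generate.cfg")):
            project = manifest.parent
            command = manifest.read_text().strip().split()
            destdir = OUTPUT / project.name
            destdir.mkdir(parents=True, exist_ok=True)
            # The black box (tblgen.py, process.R, ...) decides what to
            # emit; the driver only tells it where to put the results.
            subprocess.run(command + [str(destdir.resolve())],
                           cwd=project, check=True)

    if __name__ == "__main__":
        build_all()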

Also, the [other stuff] isn’t supposed to be scanned directly by the site generator. Some of it will be used by the black box (tblgen.py or process.R here), some of it will be referenced by the output of the black box, and some of it is irrelevant to the process of packaging up the site and shouldn’t appear in the final result. In fact, I want the site generator to start with index.md and other designated root documents (such as robots.txt) and traverse the tree of links, generating only what is reachable. (With the option to generate a directory index for things like the existing /scratchpad.) I am not aware of anything that can do that.
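
Here is a rough sketch of what I mean by link-driven traversal; as far as I know no existing tool behaves this way, so treat every name in it as hypothetical.

    import pathlib, posixpath, shutil
    from html.parser import HTMLParser
    from urllib.parse import urlparse

    class LinkCollector(HTMLParser):
        """Collect local href/src targets from one HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name not in ("href", "src") or not value:
                    continue
                parts = urlparse(value)
                if parts.scheme or parts.netloc:
                    continue            # external link; not our problem
                if parts.path:
                    self.links.append(parts.path)

    def publish(rendered, sitedir, roots=("index.html", "robots.txt")):
        """Copy into sitedir only what is reachable from the root documents."""
        rendered, sitedir = pathlib.Path(rendered), pathlib.Path(sitedir)
        queue, seen = list(roots), set()
        while queue:
            rel = queue.pop()
            if not rel or rel in seen:
                continue
            seen.add(rel)
            src = rendered / rel
            (sitedir / rel).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, sitedir / rel)
            if src.suffix == ".html":
                collector = LinkCollector()
                collector.feed(src.read_text(encoding="utf-8"))
                for link in collector.links:
                    if link.startswith("/"):   # site-root-relative
                        queue.append(link.lstrip("/"))
                    else:                      # relative to this document
                        queue.append(posixpath.normpath(
                            posixpath.join(posixpath.dirname(rel), link)))
        return seen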

Second tier requirements are: I want something that’s smart enough to not regenerate things that haven’t changed—some of these black boxes are quite expensive—and that integrates with my VCS of choice (which isn’t necessarily Git) to do so. It needs to understand nested repositories and trigger rebuilds when any of them updates. I also want it to apply minification and pre-compression to everything for which this makes sense, at site build time, so I don’t have to save minified JS or whatever into my source repo. Being able to pull library dependencies from a designated URL at build time might also be nice. Being able to inline a dependency used by a single page into that page would be super nice.
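
Change detection doesn’t strictly need VCS integration, of course; as a fallback it could be as dumb as hashing each project’s inputs. A sketch of that fallback (the stamp-file name is invented):

    import hashlib, json, pathlib

    STAMP_FILE = pathlib.Path(".build-stamps.json")   # invented name

    def tree_digest(project_dir):
        """Hash every file in the project, so any edit changes the digest."""
        project_dir = pathlib.Path(project_dir)
        h = hashlib.sha256()
        for path in sorted(project_dir.rglob("*")):
            if path.is_file():
                h.update(str(path.relative_to(project_dir)).encode())
                h.update(path.read_bytes())
        return h.hexdigest()

    def needs_rebuild(project_dir):
        stamps = json.loads(STAMP_FILE.read_text()) if STAMP_FILE.exists() else {}
        return stamps.get(str(project_dir)) != tree_digest(project_dir)

    def record_build(project_dir):
        stamps = json.loads(STAMP_FILE.read_text()) if STAMP_FILE.exists() else {}
        stamps[str(project_dir)] = tree_digest(project_dir)
        STAMP_FILE.write_text(json.dumps(stamps, indent=2))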

As a final wrinkle, I’m largely unimpressed by all the templating languages out there. Only Genshi, as far as I know, actually understands HTML structure well enough that you don’t have to manually specify the proper escaping on each and every substitution; and it seems to be dead upstream, and moreover has the XML disease. (Which is how it manages to understand HTML structure to that extent, but surely someone can figure out a way to split the difference…?) I suppose I can live with manual escaping provided I don’t have to write very many templates.
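
For the record, this is the manual-escaping tax I’m grumbling about, in miniature (plain string substitution, nothing exotic):

    import html
    from string import Template

    page = Template("<h1>$title</h1>\n<p>$body</p>")
    print(page.substitute(
        # Every single value has to go through html.escape() by hand;
        # forget one and the markup breaks, or worse, becomes injectable.
        title=html.escape("Ben & Jerry's <i>finest</i>"),
        body=html.escape('She said "hello"'),
    ))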

So, what should I be looking at?

Radical simplification will not save the world

Seems like every time I go to a security conference these days there’s at least one short talk where people are proposing to start over and rebuild the computer universe from scratch and make it simple and impossible to use wrong this time and it will be so awesome. Readers, it’s not going to work. And it’s not just that nobody’s going to put in enough time and effort to make it work. The idea is doomed from eight o’clock, Day One.

We all know from practical experience that a software module that’s too complicated is likely to harbor internal bugs and is also likely to induce bugs in the code that uses it. But we should also know from practice that a software module that’s too simple may work perfectly itself, yet still induce bugs in the code that uses it! One-size-fits-all APIs are almost always too inflexible, and so accumulate a scar tissue of workarounds, which are liable to be buggy. Is this an accident of our human fallibility? No, it is an inevitable consequence of oversimplification.

To explain why this is so, I need to talk a little about cybernetics. In casual usage, this word is a sloppy synonym for robotics and robotic enhancements to biological life (cyborgs), but as a scientific discipline it is the study of dynamic control systems that interact with their environment, ranging in scale from a simple closed-loop feedback controller to entire societies.1 The Wikipedia article is decent, and if you want more detail, the essay Cybernetics of Society is a good starting point. Much of the literature on cybernetics talks about interacting systems of people—firms, governments, social clubs, families, etc.—but it is equally applicable to systems of, around, or within computers. One of the fundamental conclusions of cybernetics, evident for instance in Stafford Beer’s viable system model, is that a working system must be at least as complex as the systems it interacts with. If it isn’t, it will be unable to cope with all possible inputs. This is a theoretical explanation for the practical observation above, and it lets us put a lower bound on the complexity of a real-world computer system.

Let’s just look at one external phenomenon nearly every computer has to handle: time. Time seems like it ought to be an easy problem. Everyone on Earth could, in principle, agree on what time it is right now. Making a good clock requires precision engineering, but the hardware people have that covered; a modern $5 wristwatch could have earned you twenty thousand pounds in 1714. And yet the task of converting a count of seconds to a human-readable date and vice versa is so hairy that people write 500-page books about that alone, and the IANA has to maintain a database of time zones that has seen at least nine updates a year every year since 2006. And that’s just one of the things computers have to do with time. And handling time correctly can, in fact, be security-critical.
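
One tiny illustration of the hairiness, for the Python-inclined: during a fall-back transition the same wall-clock time happens twice, so a local time alone doesn’t even name a unique instant.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo   # Python 3.9+; needs tz data installed

    tz = ZoneInfo("America/New_York")
    # 2021-11-07 01:30 local time occurred twice; `fold` disambiguates.
    first  = datetime(2021, 11, 7, 1, 30, tzinfo=tz, fold=0)
    second = datetime(2021, 11, 7, 1, 30, tzinfo=tz, fold=1)
    print(first.astimezone(timezone.utc))    # 2021-11-07 05:30:00+00:00 (was EDT)
    print(second.astimezone(timezone.utc))   # 2021-11-07 06:30:00+00:00 (now EST)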

I could assemble a demonstration like this for many other phenomena whose characteristics are set by the non-computerized world: space, electromagnetic waves, human perceptual and motor abilities, written language, mathematics, etc. etc. (I leave the biggest hairball of all—the global information network—out, because it’s at least nominally in-scope for these radical simplification projects.) Computers have to cope with all of these things in at least some circumstances, and they all interact with each other in at least some circumstances, so the aggregate complexity is even higher than if you consider each one in isolation. And we’re only considering here things that a general-purpose computer has to be able to handle before we can start thinking about what we want to use it for; that’ll bring in all the complexity of the problem domain.

To be clear, I do think that starting over from scratch and taking into account everything we’ve learned about programming language, OS, and network protocol design since 1970 would produce something better than what we have now. But what we got at the end of that effort would not be notably simpler than what we have now, and although it might be harder to write insecure (or just buggy) application code on top of it, it would not be impossible. Furthermore, a design and development process that does not understand and accept this will not produce an improvement over the status quo.

1 The casual-use meaning of cybernetics comes from the observation (by early AI researchers) that robots and robotic prostheses were necessarily cybernetic systems, i.e. dynamic control systems that interacted with their environment.

More Notes on the Cross-Platform Availability of Header Files

You may recall a month and a half ago I posted Notes on the Cross-Platform Availability of Header Files and then promptly had to take most of it down because it was insufficiently researched. Well, the research is ongoing, but I’ve got a shiny new set of results, some high-level conclusions, and several ways Viewers Like You can help!

First, the high-level conclusions:

  • Except perhaps in deeply-embedded environments, all of C89’s library is universally available.
  • Code not intended to run on Windows can also assume most of C99 and much of POSIX. The less-ubiquitous headers from these categories are also the less-useful headers.
  • Code that is intended to run on Windows should only use C89 headers and <stdint.h>. If MSVC 2008 support is required, not even <stdint.h> can be used. (Windows compilers do provide a small handful of POSIX headers, but they do not contain the expected set of declarations!)
  • Many different Unix variants ship a similar set of nonstandard headers. We don’t yet know whether the contents of these headers are reliable cross-platform.
  • There is a large set of obsolete headers that are still widespread but should not be used in new code. This is underdocumented.

The raw data is here: https://github.com/zackw/header-survey/

If you want to help, we need more inventories (especially for OSes further from the beaten path), and I’m also interested in finding a good way to crunch the raw data into something presentable. (I used to have a giant generated HTML table, but I gave up on that; it was too big to be readable.) If you are an old beard, there are also places where I’m not entirely sure of my methodology – see the README in the source repo.
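
If you’re wondering what an inventory involves, the core of it is nothing more than asking a compiler whether each header is usable; something along these lines (a simplified sketch, not the actual survey script):

    import os, subprocess, tempfile

    def header_available(header, cc="cc"):
        """True if `#include <header>` compiles with the given compiler."""
        with tempfile.TemporaryDirectory() as d:
            src = os.path.join(d, "probe.c")
            with open(src, "w") as f:
                f.write("#include <%s>\nint main(void) { return 0; }\n" % header)
            result = subprocess.run(
                [cc, "-c", src, "-o", os.path.join(d, "probe.o")],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return result.returncode == 0

    if __name__ == "__main__":
        for h in ("stdint.h", "unistd.h", "sys/socket.h"):
            print(h, header_available(h))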

Caffeinated owls

[Image: semi-anthropomorphic sketches of six owls, each with a different facial expression and labeled with the name of a different coffee-related beverage: decaf (asleep), half-caf (awake, but not happy about it), regular (a little more awake and still not happy about it), Irish coffee (cheerfully buzzed), espresso (unable to blink), double espresso (oh dear, it’s gone all the way to knurd).]

Art by Dave Mottram. Found on G+.

A Contest

In honor of the Feast of All Fools, and because if anyone has noticed it, they haven’t told me, I hereby announce that there is a joke in the references of my most recently published paper. Whoever first correctly identifies it will win the right to suggest a joke to be added to my next paper, which is currently in preparation. Post your guesses in the comments; so as not to spoil it for anyone, comments will not be visible until after the contest ends.

One guess per person. Must provide a working email address (or I won’t be able to contact you if you win). Do not suggest a joke now; the winner will be notified of the topic of the upcoming paper, so they can think of something appropriate. Management reserves the right to reject joke suggestions, in which case the next person in line will get a crack at it.

Adria Richards Did Nothing Wrong

Editor’s note, 2022 November: There used to be an angry rant here. I still think Adria Richards did nothing wrong, and I think I was right to say so at the time, but I no longer think the rant itself is contributing much of anything. I suggest you read instead:

If you really want to see the rant, the Internet Archive has preserved a copy.

Notes on the Cross-Platform Availability of Header Files

These header files are guaranteed to be available in a C89 hosted environment. (Strictly speaking, iso646.h, wchar.h, and wctype.h were only added by the 1995 amendment to C89, but they are just as widespread.) All interesting portability targets nowadays are C89 hosted environments (bare-metal environments are still relevant, but not as portability targets).

assert.h
ctype.h
errno.h
float.h
iso646.h
limits.h
locale.h
math.h
setjmp.h
signal.h
stdarg.h
stddef.h
stdio.h
stdlib.h
string.h
time.h
wchar.h
wctype.h

Beyond C89, interesting portability targets divide into three classes. Complete Unix environments are always compliant with C99 and POSIX.1-2001 nowadays, but not necessarily with all of the optional modules of the latter, nor with any more recent standard. Windows has several different competing C runtimes, some of which offer more C99 support than others, and none of which are at all conformant with POSIX. Finally, the major embedded environments are presently all cut-down versions of a specific identifiable complete Unix or of Windows. Those that are derived from Unix usually have most of the POSIX headers but may be missing a few.

EDIT: Everything after this point in the original version of this post was insufficiently thoroughly researched and may be wrong. Corrected tables will appear Real Soon. If you are interested in helping me with that, please see https://github.com/zackw/header-survey.

On Replacements for Passwords

Your post advocates a

□ software □ hardware □ cognitive □ two-factor □ other ___________

universal replacement for passwords. Your idea will not work. Here is why it won’t work:

□ It’s too easy to trick users into revealing their credentials
□ It’s too hard to change a credential if it’s stolen
□ It initiates an arms race which will inevitably be won by the attackers
□ Users will not put up with it
□ Server administrators will not put up with it
□ Web browser developers will not put up with it
□ National governments will not put up with it
□ Apple would have to sacrifice their extremely profitable hardware monopoly
□ It cannot coexist with passwords even during a transition period
□ It requires immediate total cooperation from everybody at once

Specifically, your plan fails to account for these human factors:

□ More than one person might use the same computer
□ One person might use more than one computer
□ One person might use more than one type of Web browser
□ People use software that isn’t a Web browser at all
□ Users rapidly learn to ignore security alerts of this type
□ This secret is even easier to guess by brute force than the typical password
□ This secret is even less memorable than the typical password
□ It’s too hard to type something that complicated on a phone keyboard
□ Not everyone can see the difference between red and green
□ Not everyone can make fine motor movements with that level of precision
□ Not everyone has thumbs

and technical obstacles:

□ Clock skew
□ Unreliable servers
□ Network latency
□ Wireless eavesdropping and jamming
□ Zooko’s Triangle
□ Computers do not necessarily have any USB ports
□ SMTP messages are often recoded or discarded in transit
□ SMS messages are trivially forgeable by anyone with a PBX
□ This protocol was shown to be insecure by ________________, ____ years ago
□ This protocol must be implemented perfectly or it is insecure

and the following philosophical objections may also apply:

□ It relies on a psychologically unnatural notion of trustworthiness
□ People want to present different facets of their identity in different contexts
□ Not everyone trusts your government
□ Not everyone trusts their own government
□ Who’s going to run this brand new global, always-online directory authority?
□ I should be able to authenticate a local communication without Internet access
□ I should be able to communicate without having met someone in person first
□ Anonymity is vital to robust public debate

To sum up,

□ It’s a decent idea, but I don’t think it will work. Keep trying!
□ This is a terrible idea and you should feel terrible.
□ You are the Russian Mafia and I claim my five pounds.

hat tip to the original

In case those were real questions rather than spam vehicles,

my answers may be found on the Contact page, under Answers to Frequent Rhetorical Questions.

Dear Everyone Running for ACM or IEEE Management

It’s professional-organization management election time again. This is my response to everyone who’s about to send me an invitation to vote for them:

When it comes to ACM and IEEE elections, I am a single-issue voter, and the issue is open access to research. I will vote for you if and only if you make a public statement committing to aggressive pursuit of the following goals within your organization, in decreasing order of priority:

  1. As immediately as practical, begin providing to the general public zero-cost, no-registration, no-strings-attached online access to new publications in your organization’s venues.

  2. Commit to a timetable (which should also be as quick as practical, but could be somewhat slower than for the above) for opening up your organization’s older publications to zero-cost, no-registration, no-strings-attached online access.

  3. Abandon the practice of requiring authors to assign copyright to your organization; instead, require only a license substantively similar to that requested by USENIX (exclusive publication rights for no longer than 12 months with exception for posting an electronic copy on your own website, nonexclusive right to continue disseminating afterward).

  4. On a definite timetable, revert copyright to all authors who published under the old copyright policy, retaining only the rights requested under the new policy.

Thank you for your consideration.