URL Canonicalizer

suppose, for the sake of argument
that you had a giant list
of partial URLs
(you know, like www.example.com/blurf)
and you needed to canonicalize them
and chase redirects
and remove duplicates
and dead sites
and further you were aware
that this is much harder than it might sound
not to mention
that many websites do not like urllib
well then
you might be looking for this program
which was written by me
with a little help from serge broslavsky.
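
To make the canonicalizing and duplicate-removal steps concrete, here is a minimal sketch in Python. It is not the program announced above. The names `canonicalize`, `dedupe`, `browser_like_request` and `BROWSER_UA` are hypothetical, and so are the `http` default scheme and the particular normalization rules. Chasing redirects and detecting dead sites need real network requests and are left out. The last helper only builds a request carrying a browser-like User-Agent, on the assumption that this is one reason some sites reject urllib's default.

```python
import re
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

# Assumed browser-like User-Agent string; purely illustrative.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

# Matches a leading "scheme://" so that bare hosts can be told apart.
_HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")
_DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(partial_url, default_scheme="http"):
    """Turn a partial URL like 'www.example.com/blurf' into a normalized absolute URL.

    Assumed rules for this sketch: add a scheme if missing, lowercase the
    scheme and host, drop default ports, userinfo and fragments, and use
    '/' for an empty path.
    """
    url = partial_url.strip()
    if not _HAS_SCHEME.match(url):
        url = default_scheme + "://" + url
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    if parts.port is not None and parts.port != _DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe(partial_urls):
    """Canonicalize each URL and keep the first occurrence of each result."""
    seen = set()
    result = []
    for raw in partial_urls:
        canon = canonicalize(raw)
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result


def browser_like_request(url):
    """Build (but do not send) a request with a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": BROWSER_UA})


if __name__ == "__main__":
    for url in dedupe(["www.example.com/blurf", "http://WWW.EXAMPLE.COM/blurf"]):
        print(url)
```

Two URLs that differ only in case of the host, or in a missing scheme, collapse to one entry here. Two URLs that only turn out to be the same page after a redirect do not, which is part of why the real job is harder than it sounds.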

Responses to “URL Canonicalizer”

  1. monk.e.boy

    I used urllib and urllib2 for hundreds of thousands of sites and it was fine on all of them (millions of URLs)…

    We did a lot of spidering.

    1. Zack Weinberg

      Huh. I spidered less than 500 sites and something like 10% of them blocked me until I started spoofing Firefox’s user-agent.