URL Canonicalizer

Supposing, for the sake of argument, that you had a giant list of partial URLs (you know, like www.example.com/blurf), and you needed to canonicalize them, chase redirects, and remove duplicates and dead sites; and supposing further that you were aware that this is much harder than it might sound, not to mention that many websites do not like urllib: well then, you might be looking for this program, which was written by me with a little help from Serge Broslavsky.
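To give a rough idea of what the first steps involve, here is a minimal sketch. It is not the program itself, and every name in it (`canonicalize`, `dedupe`, `build_request`, `USER_AGENT`) is made up for illustration. It assumes a partial URL with no scheme means plain `http`, that the fragment and any userinfo can be thrown away, and that sending a browser-like `User-Agent` string (the one below is only an example value) is what it takes to placate sites that dislike urllib's default. It does not chase redirects or detect dead sites, which is where most of the difficulty lives.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

# Example browser-like value (an assumption); substitute whatever is appropriate.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(partial):
    """Turn a partial URL such as 'www.example.com/blurf' into a normalized absolute URL."""
    text = partial.strip()
    if "://" not in text:
        # Assumption: a missing scheme means plain HTTP.
        text = "http://" + text
    parts = urlsplit(text)
    scheme = parts.scheme.lower()
    # .hostname is already lowercased and drops any userinfo and port.
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literal: .hostname strips the brackets, so put them back.
        host = "[" + host + "]"
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        # Keep only ports that differ from the scheme's default.
        host = "%s:%d" % (host, port)
    path = parts.path or "/"
    # Drop the fragment; it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe(urls):
    """Canonicalize each URL and keep the first occurrence of each, in order."""
    seen = set()
    result = []
    for url in urls:
        canon = canonicalize(url)
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result


def build_request(url):
    """Build (but do not send) a request carrying the browser-like User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result of `build_request` to `urllib.request.urlopen` would follow redirects with urllib's default handlers, and the final address is then available from the response's `geturl()` method. Two URLs that differ only textually collapse to one entry here, but two that differ only by a redirect do not, so real deduplication has to happen after the redirects have been chased.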

Responses to “URL Canonicalizer”

  1. monk.e.boy

    I used urllib and urllib2 for hundreds of thousands of sites and it was fine on all of them (millions of URLs)…

    We did a lot of spidering.

    1. Zack Weinberg

      Huh. I spidered less than 500 sites and something like 10% of them blocked me until I started spoofing Firefox’s user-agent.