some trivia about the Alexa 1M

Alexa publishes a list of the top 1,000,000 sites on the web. Here is some trivia about this list (as it was on September 27, 2013):

  • No entries contain a URL scheme.
  • Only 247 entries contain the string www.
  • Only 13,906 entries contain a path component.
  • There are 987,661 unique hostnames and 967,933 unique domains (public suffix + 1).
  • If you tack http:// on the beginning of each entry and / on the end (if there wasn’t a path component already), then issue a GET request for that URL and chase HTTP redirects as far as you can (without leaving the site root, unless there was a path component already), you get 916,228 unique URLs.
  • Of those 916,228 unique URLs, only 352,951 begin their hostname component with www. and only 14,628 are HTTPS.
  • 84,769 of the 967,933 domains do not appear anywhere in the list of canonicalized URLs; these either redirected to a different domain or responded with a network or HTTP error.
  • 52,139 of those 84,769 domains do respond to a GET request if you tack www. on the beginning of the domain name and then proceed as above.
  • But only 41,354 new unique URLs are produced; the other 10,785 domains duplicate entries in the earlier set.
  • 39,966 of the 41,354 new URLs begin their hostname component with www.
  • 806 of the new URLs are HTTPS.
  • Merging the two sets produces 957,582 unique URLs (of which 392,917 begin the hostname with www. and 15,434 are HTTPS), 947,474 unique hostnames and 928,816 unique domains.
  • 42,734 registration names (that is, the +1 component in a public suffix + 1 name) appear in more than one public suffix. 11,748 appear in more than two; 5,516 in more than three; 526 in more than ten.
  • 44,299 of the domains in the original list do not appear in the canonicalized set.
  • 5,183 of the domains in the canonicalized set do not appear in the original list.
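
The URL-construction step described above (prepend http://, append / when there is no path component) and the first three counts can be sketched roughly as follows. This is only an illustration, not the script linked below; entry_to_url and simple_stats are hypothetical names, and the sample entries are invented.

```python
from urllib.parse import urlsplit


def entry_to_url(entry: str) -> str:
    """Turn a bare list entry (no scheme) into a URL to request."""
    entry = entry.strip()
    # An entry has a path component if anything follows the hostname.
    if "/" in entry:
        return "http://" + entry
    return "http://" + entry + "/"


def simple_stats(entries):
    """Count entries with a scheme, a 'www.' substring, or a path."""
    return {
        "scheme": sum("://" in e for e in entries),
        "www": sum("www." in e for e in entries),
        # Ignore any scheme before looking for a path separator.
        "path": sum("/" in e.split("://", 1)[-1] for e in entries),
    }


# Invented sample entries, for illustration only.
sample = ["example.com", "www.example.org", "example.net/some/page"]
urls = [entry_to_url(e) for e in sample]
hosts = {urlsplit(u).hostname for u in urls}
stats = simple_stats(sample)
```

Counting unique domains (public suffix + 1) additionally needs a copy of the Public Suffix List, which this sketch leaves out.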

Today’s exercise in data cleanup was brought to you by the I Can’t Believe This Took Me An Entire Week Foundation. If you ever need to do something similar, this script may be useful.

Responses to “some trivia about the Alexa 1M”

  1. karl

    Gorgeous set and analysis. Another interesting fact about the Alexa data: the per-country lists differ from one another and give different results.

    I see that in your script you chose "Mozilla/5.0 (Macintosh; rv:24.0) Gecko/20100101 Firefox/24.0"

    Note that the results may differ depending on your UA string, for example if you send the Opera Presto desktop one.

    And it gets even more fun ;) with, for example, Firefox OS vs. WebKit iOS vs. Opera Mobile. That will introduce variability in your results.

  2. karl

    Ah, in your script, change self._headers to:

    self._headers = {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; rv:24.0) Gecko/20100101 Firefox/24.0",
      "Accept":
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
    }

      1. karl

        Here is an example of how it sometimes changes things.

        First, with Accept: */*

        $ http -v HEAD http://www.yahoo.com/ \
          User-Agent:'Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0'
        
        HEAD / HTTP/1.1
        Accept: */*
        Accept-Encoding: gzip, deflate, compress
        Host: www.yahoo.com
        User-Agent: Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0
        
        HTTP/1.1 200 OK
        Age: 0
        Cache-Control: private
        Connection: keep-alive
        Content-Encoding: gzip
        Content-Type: text/html; charset=utf-8
        Date: Thu, 03 Oct 2013 00:34:36 GMT
        P3P: policyref="http://info.yahoo.com/w3c/p3p.xml", CP="CAO DSP
             COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi
             OTRi UNRi PU Bi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL
             HEA PRE LOC GOV"
        Server: YTS/1.20.13
        Set-Cookie: B=70abjgp94pess&b=3&s=v0; expires=Sun, 04-Oct-2015 00:34:36
                    GMT; path=/; domain=.yahoo.com
        Vary: Accept-Encoding
        Via: HTTP/1.1 ir2.fp.bf1.yahoo.com (YahooTrafficServer/1.20.13 [c sSf ])

        The exact same request with Accept: text/html

        $ http -v HEAD http://www.yahoo.com/ \
          User-Agent:'Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0' \
          Accept:"text/html"
        
        HEAD / HTTP/1.1
        Accept: text/html
        Accept-Encoding: gzip, deflate, compress
        Host: www.yahoo.com
        User-Agent: Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0
        
        HTTP/1.1 302 Found
        Age: 0
        Cache-Control: private
        Connection: keep-alive
        Content-Type: text/html; charset=utf-8
        Date: Thu, 03 Oct 2013 00:34:30 GMT
        Location: http://ca.yahoo.com/?p=us
        P3P: policyref="http://info.yahoo.com/w3c/p3p.xml", CP="CAO DSP
             COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi
             OTRi UNRi PU Bi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL
             HEA PRE LOC GOV"
        Server: YTS/1.20.28
        Set-Cookie: B=8ilfp2594pesm&b=3&s=6j; expires=Sun, 04-Oct-2015 00:34:30
                    GMT; path=/; domain=.yahoo.com
        Set-Cookie: DNR=deleted; expires=Wed, 03-Oct-2012 00:34:29 GMT; path=/;
                    domain=.www.yahoo.com
        Set-Cookie: DNR=deleted; expires=Wed, 03-Oct-2012 00:34:29 GMT; path=/;
                    domain=.yahoo.com
        Set-Cookie: fpc=d=1F4LfD_slFvUDE4IrXvEHp4lfCNMCXt0c8DdAn_3GvF06gBtNGC_R
                    lkUciQ_kuxauM90qXdTjBSG8gyTQuFqIPdNCKPfZz3GqV.FTScyQ3X.f0Ec
                    pmNZhDF4tToCw14HJk3nqr830qGGu68TH50VQvhaspjhdkjnya91CiWZxns
                    AOhy_OTs2kqPsdb6GLDittMH5x0U-&v=2;
                    expires=Fri, 03-Oct-2014 00:34:30 GMT; path=/;
                    domain=www.yahoo.com
        Vary: Accept-Encoding
        Via: HTTP/1.1 ir16.fp.bf1.yahoo.com (YahooTrafficServer/1.20.28 [c s f ])
        X-Frame-Options: SAMEORIGIN

        Fun, isn’t it ;)

        1. Zack Weinberg

          Huh, it looks like Yahoo is doing geo-targeting based on the IP of the client, but only when the Accept: header asks for text/html rather than */*. For my use case I want to minimize geo-targeting. Of course it could just as easily go the other way…
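
One way to repeat the comparison above from Python is to send the same HEAD request twice, changing only the Accept header, without following redirects. The sketch below uses the standard library's http.client, which does not follow redirects on its own; build_headers and head_status are hypothetical helper names, and what any given site returns today may well differ from the 2013 responses shown above.

```python
import http.client

MOBILE_UA = "Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0"


def build_headers(accept: str, user_agent: str = MOBILE_UA) -> dict:
    """Request headers that differ only in the Accept value."""
    return {"User-Agent": user_agent, "Accept": accept}


def head_status(host: str, accept: str):
    """Send HEAD / to host and return (status, Location header or None)."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", "/", headers=build_headers(accept))
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Example use (performs network requests, so it is left commented out):
# for accept in ("*/*", "text/html"):
#     print(accept, head_status("www.yahoo.com", accept))
```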

  3. karl

    Btw, with your current script

    → python canonurls.py -q top-1million-urls.txt > results.txt

    It seems that the results are held in memory. I know it is just text, but wouldn’t it be better to have the queue also write the results out in batches, say every 100 URIs? That way you would flush what is in memory, and if you ran tail on results.txt you could watch the results come in.

    1. Zack Weinberg

      I was going to say it has to store the results in memory for deduplication, but that’s not actually incompatible with your suggestion. I’ll make that change next time I revise the program. Thanks!
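
A minimal sketch of how the two ideas in this exchange can coexist: keep a set of already-seen URLs in memory for deduplication, but write and flush new results every batch_size entries so the output file can be followed with tail. This is not the code in canonurls.py; write_deduplicated and its parameters are hypothetical, and the URLs are invented.

```python
import io


def write_deduplicated(results, out, batch_size=100):
    """Write each distinct URL once, flushing every batch_size new URLs."""
    seen = set()    # kept for the whole run, for deduplication
    pending = []    # new URLs not yet written
    written = 0
    for url in results:
        if url in seen:
            continue
        seen.add(url)
        pending.append(url)
        if len(pending) >= batch_size:
            out.write("".join(u + "\n" for u in pending))
            out.flush()
            written += len(pending)
            pending.clear()
    # Write whatever is left over from the last partial batch.
    if pending:
        out.write("".join(u + "\n" for u in pending))
        out.flush()
        written += len(pending)
    return written


# Invented URLs; a StringIO stands in for the real output file.
buf = io.StringIO()
count = write_deduplicated(
    ["http://a.example/", "http://b.example/", "http://a.example/"],
    buf,
    batch_size=2,
)
```

Only the set of seen URLs has to stay in memory for the whole run; the pending list never grows beyond batch_size.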