Number of domains in HTTPArchive


#1

Why are there only about 470K domains, even though the FAQ states that HTTP Archive crawls the Alexa Top 1 Million?

Along those lines, when crawling, do you use the updated Alexa file from that day? For example, if you run a crawl on 9/1, do you use the Alexa file from 9/1, or is there a fixed set that you are scanning?


#2

Hi @amirian, good questions!

HTTP Archive does use the Alexa Top 1M websites to seed its list of pages to crawl. However, from that list of 1M we only use the first 500K, because that’s all we can fit in the two-week window between crawls! So then why are you seeing 470K instead of 500K? Approximately 30K of the tests fail in each crawl. A failure could be due to an error in WebPageTest or in the website itself. Check out https://github.com/HTTPArchive/httparchive/issues/115 to learn more about these errors and how we plan to mitigate them.
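To make the arithmetic concrete, here is a minimal sketch in Python. The constant and function names are made up for illustration, and the 30K figure is the approximate number quoted above, not an exact count from any particular crawl.

```python
# Approximate figures from this thread; names are illustrative only.
SEED_LIST_SIZE = 1_000_000     # size of the Alexa Top 1M seed list
CRAWL_LIMIT = 500_000          # only the first 500K URLs are crawled
APPROX_FAILED_TESTS = 30_000   # roughly how many tests fail per crawl


def expected_successful_pages(crawl_limit: int, failed_tests: int) -> int:
    """Return how many pages you would expect to see in a crawl's results."""
    return crawl_limit - failed_tests


successful = expected_successful_pages(CRAWL_LIMIT, APPROX_FAILED_TESTS)
failure_rate = APPROX_FAILED_TESTS / CRAWL_LIMIT
seed_coverage = successful / SEED_LIST_SIZE

print(successful)              # 470000
print(f"{failure_rate:.0%}")   # 6%
print(f"{seed_coverage:.0%}")  # 47%
```

So a result set of about 470K pages corresponds to a failure rate of roughly 6% of the 500K URLs attempted.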

Alexa has stopped updating their list of 1M sites. HTTP Archive uses an older snapshot of the list for each of the crawls, so you should expect to see the same URLs and page ranks in all of the crawls this year.

The link above has more info about the Alexa list and some alternatives.


#3

If my classified ads website is not in the Alexa top 1 million, will I not be able to see it in HTTP Archive?

Or what else is taken into account?


#4

If my classified ads website is not in the Alexa top 1 million, will I not be able to see it in HTTP Archive?

Correct, by default only the first 500,000 websites in Alexa’s top 1 million list are included.
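If you have a copy of an Alexa-style list, you can check for yourself whether a site falls inside that cutoff. The sketch below assumes a CSV with one `rank,domain` row per line, sorted by rank; the function name `is_in_crawl` and the sample data are hypothetical, and this is not how HTTP Archive itself builds its URL list.

```python
import csv
import io

CRAWL_LIMIT = 500_000  # only the first 500K ranked sites are crawled by default


def is_in_crawl(domain, csv_file, limit=CRAWL_LIMIT):
    """Return True if `domain` appears with a rank <= limit.

    `csv_file` is any open text file of `rank,domain` rows sorted by rank
    (an assumption about the list format).
    """
    target = domain.strip().lower()
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        rank, name = int(row[0]), row[1].strip().lower()
        if rank > limit:
            break  # rows are sorted, so nothing further can match
        if name == target:
            return True
    return False


# Tiny in-memory sample standing in for the real list.
sample = "1,google.com\n2,youtube.com\n3,example.org\n"
print(is_in_crawl("youtube.com", io.StringIO(sample), limit=2))  # True
print(is_in_crawl("example.org", io.StringIO(sample), limit=2))  # False
```

With a real file you would pass `open("top-1m.csv", newline="")` instead of the `io.StringIO` sample and leave `limit` at its default.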

HTTP Archive does have a way to add URLs that are not in the top 500K. Go to http://httparchive.org/addsite.php and enter your website’s URL to add it to future crawls. Be advised that this functionality may not work on the future beta site.