Missing 2016_12_01 data? Too slow getting the latest data?


#1

It is December 19th and I am still not seeing data for 2016_12_01. I want to query some things out of these tables, but neither is present yet:

httparchive.runs.2016_12_01_pages
httparchive.runs.2016_12_01_requests
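For reference, the two table names above follow a run-date naming convention, so a small sketch can generate the expected table names for any crawl date (assuming the `runs.YYYY_MM_DD_{pages,requests}` pattern shown above holds for every crawl):

```python
from datetime import date

def run_table_names(crawl_date: date) -> list[str]:
    """Build the expected httparchive.runs table names for a crawl date.

    Assumes the runs.YYYY_MM_DD_{pages,requests} naming pattern above.
    """
    stamp = crawl_date.strftime("%Y_%m_%d")
    return [f"httparchive.runs.{stamp}_pages",
            f"httparchive.runs.{stamp}_requests"]

print(run_table_names(date(2016, 12, 1)))
# → ['httparchive.runs.2016_12_01_pages', 'httparchive.runs.2016_12_01_requests']
```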

I understand this site may be maintained by a non-profit org, or similar? I wonder whether any effort is needed to speed up this importing process. Are volunteers needed or something? Or do you, @igrigorik, have any plans to improve the freshness? If I hope the 2016_12_01 data can be available on the 2nd, and the 2016_12_15 data on the 16th, how much work has to be done? How far are we from there now?


#2

In the past (3 months ago) the data appeared about 12 days after each crawl cycle; for example, the 2016_09_01 data was available on 2016-09-12. Why has the processing time now increased to 19 days? Is this trend still growing?


#3

/cc @patmeenan for input.

It’s a question of server resources. We crawl the top 500K sites, once with mobile UA and once with desktop UA. Executing that many tests takes a while, and hence the ~12 day lag to process, upload, and archive all the data. I don’t believe we have any plans to expand server resources ATM…

That said, it does appear that recent crawls are starting to take a lot longer, and we need to investigate why that’s the case.


#4

I think there were issues with the 12/1 crawl that make it abnormal (it started on 12/2 and as a result it looks like a bunch of things internally called it the 12/2 crawl which may have broken some of the HAR pipeline).

Mid-July we doubled the number of URLs being tested which extended the testing time but the actual testing should still finish in 12-13 days (which is longer than the 8-9 days it used to take but not 19). There are a few delays in the pipeline after the testing runs that I know of (and probably several that I don’t):

  • The batch re-submits any failed tests 2 more times after the run completes to catch any intermittent failures. From the looks of the graphs there is a decent delay before each one. Re-submitting failed tests immediately to get them queued could eliminate the delay.
  • There may be delays after the crawl before actually doing the database dump.
  • I believe there is a 1-day delay or so from when the dump is created before it is backed up and deleted from local storage.
  • I’m pretty sure that the HA import pulls the archived versions and may have its own additional delays.

I’m guessing that with some tuning and guaranteed “completed” markers we could eliminate some of the delays (PRs always welcome).
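A minimal sketch of what a “completed” marker could look like (the file name, layout, and polling interval here are all hypothetical, not the actual pipeline): the upstream stage drops a marker file once it finishes, and the downstream stage polls for it instead of waiting out a fixed delay.

```python
import tempfile
import time
from pathlib import Path

def mark_complete(stage_dir: Path) -> None:
    """Upstream: drop a marker file once a pipeline stage finishes."""
    (stage_dir / "COMPLETE").touch()

def wait_for_complete(stage_dir: Path,
                      poll_secs: float = 1.0,
                      timeout: float = 60.0) -> bool:
    """Downstream: poll for the marker instead of sleeping a fixed delay."""
    marker = stage_dir / "COMPLETE"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll_secs)
    return False

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stage = Path(tmp)
        mark_complete(stage)  # e.g. run right after the database dump finishes
        print(wait_for_complete(stage, poll_secs=0.1, timeout=5.0))  # True
```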

As far as getting results the next day, that’s not really a goal. Our only real system goal is to make sure a crawl completes before the next one starts. Arguably, by using almost the full window we are making the best use of the limited hardware capacity, and I can pretty much guarantee that if more hardware were thrown at it, more URLs would be added the next day (originally we wanted to crawl the full 1M URL list but just don’t have the capacity). If we ran the crawl in one day and idled the hardware for the rest of the period, it would be a pretty big waste.

For discussion’s sake, here are some rough numbers on what it would take to scale up to run the tests in ~1 day:

  • The current system consists of two 2U Supermicro Fat Twin chassis with 4 servers in each (8 servers total). Fully loaded, each chassis ran around $25k. There are 134 Windows VMs running - figure $100 each for OS licenses. Grand total for one unit of HA capacity (not including hosting/power/bandwidth charges): ~$63,400
  • To get the 12 day crawl down to 1 day assume it takes 12x the hardware (agents scale linearly and this assumes some level of server scaling): ~$760k total (~$700k additional)
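The figures above can be checked with simple arithmetic, using the rough numbers quoted in the two bullets:

```python
# Rough 2016 numbers quoted above.
chassis_cost = 25_000   # per fully loaded 2U Fat Twin (4 servers)
num_chassis = 2
vm_license = 100        # per Windows OS license
num_vms = 134

one_unit = num_chassis * chassis_cost + num_vms * vm_license
print(one_unit)         # 63400 — one unit of HA capacity

scale = 12              # 12-day crawl compressed into ~1 day
total = one_unit * scale
print(total, total - one_unit)  # 760800 697400 — i.e. ~$760k total, ~$700k additional
```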

Managing and hosting that much hardware would also involve some pretty substantial recurring charges for personnel, power and CoLo.

And before someone asks about just scaling up/down in the cloud: it is much more flexible, but in no way, shape or form cheaper. 1,608 Windows instances x 24 hours of c4.large is roughly $7,500 per crawl (not including bandwidth costs).
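The cloud estimate also checks out with back-of-the-envelope arithmetic. Note that 1,608 is exactly 12x the 134 VMs above, and the hourly rate here is my assumption (roughly the 2016-era c4.large Windows on-demand price), not a figure from the thread:

```python
instances = 1_608   # Windows agents: 12x the 134 VMs in the current system
hours = 24
rate = 0.193        # ASSUMED c4.large Windows on-demand $/hr, 2016-era

per_crawl = instances * hours * rate
print(round(per_crawl))  # 7448, i.e. roughly the $7,500 per crawl quoted above
```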