I think there were issues with the 12/1 crawl that make it abnormal (it started on 12/2, and as a result it looks like a bunch of things internally called it the 12/2 crawl, which may have broken some of the HAR pipeline).
Mid-July we doubled the number of URLs being tested, which extended the testing time, but the actual testing should still finish in 12-13 days (longer than the 8-9 days it used to take, but not 19). There are a few delays in the pipeline after the testing runs that I know of (and probably several that I don't):
- The batch re-submits any failed tests 2 more times after the run completes to catch any intermittent failures. From the looks of the graphs there is a decent delay before each one. Re-submitting failed tests immediately to get them queued could eliminate the delay.
- There may be delays after the crawl before actually doing the database dump.
- I believe there is a 1-day delay or so from when the dump is created before it is backed up and deleted from local storage.
- I'm pretty sure that the HA import pulls the archived versions and may have its own additional delays.
I'm guessing with some tuning and guaranteed "completed" markers we could eliminate some of the delays (PRs always welcome).
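To make the "re-submit immediately" idea concrete, here is a minimal sketch of re-queuing a failed test as soon as it fails instead of waiting for the run to finish. It is not the actual batch code: `run_crawl`, `run_test` and the queue are hypothetical names, and only the limit of 2 retries comes from the behavior described above.

```python
from collections import deque

# The batch re-submits failed tests 2 more times; everything else here is hypothetical.
MAX_RETRIES = 2


def run_crawl(urls, run_test):
    """Run every URL, putting failures back on the queue right away.

    run_test(url, attempt) is a stand-in for submitting one test and
    returning True on success.
    """
    queue = deque((url, 0) for url in urls)
    passed, failed = [], []
    while queue:
        url, attempt = queue.popleft()
        if run_test(url, attempt):
            passed.append(url)
        elif attempt < MAX_RETRIES:
            # Re-queue immediately rather than after the whole run completes.
            queue.append((url, attempt + 1))
        else:
            failed.append(url)
    return passed, failed


if __name__ == "__main__":
    # Fake test runner: "flaky" succeeds on its first retry, "bad" never does.
    def fake_test(url, attempt):
        return url == "ok" or (url == "flaky" and attempt >= 1)

    print(run_crawl(["ok", "flaky", "bad"], fake_test))
```

Retries then share the queue with first-time tests, so there is no idle gap before each re-submission pass.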
As far as getting results the next day, that's not really a goal. Our only real system goal is to make sure a crawl completes before the next one starts. Arguably, by using almost the full window we are making the best use of the limited hardware capacity, and I can pretty much guarantee that if more hardware were thrown at it, more URLs would be added the next day (originally we wanted to crawl the full 1M URL list but just don't have the capacity). If we ran the crawl in one day and idled the hardware for the rest of the period it would be a pretty big waste.
For discussion's sake, here are some rough numbers on what it would take to scale to be able to run the tests in ~1 day:
- The current system consists of two 2U Supermicro Fat Twin units with 4 servers in each (8 servers total). Fully loaded, each unit ran around $25k. There are 134 Windows VMs running - figure $100 each for OS licenses. Grand total for one unit of HA capacity (not including hosting/power/bandwidth charges): ~$63,400
- To get the 12-day crawl down to 1 day, assume it takes 12x the hardware (agents scale linearly and this assumes some level of server scaling): ~$760k total (~$700k additional)
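For anyone who wants to check or tweak the hardware estimate, here is the arithmetic as a short script. The inputs are the rough figures quoted above; the variable names are mine, and the 12x factor is the same linear-scaling assumption.

```python
# Rough inputs from the post (not exact invoices).
SERVER_UNITS = 2          # 2U units, 4 servers in each
COST_PER_UNIT = 25_000    # USD, fully loaded
WINDOWS_VMS = 134
LICENSE_PER_VM = 100      # USD, ballpark OS license cost
SCALE_FACTOR = 12         # assumes a 12-day crawl needs 12x hardware for 1 day

# One "unit" of capacity: hardware plus Windows licenses.
unit_cost = SERVER_UNITS * COST_PER_UNIT + WINDOWS_VMS * LICENSE_PER_VM
total_cost = unit_cost * SCALE_FACTOR
additional_cost = total_cost - unit_cost

print(f"one unit of capacity: ${unit_cost:,}")        # $63,400
print(f"12x capacity:         ${total_cost:,}")       # $760,800
print(f"additional spend:     ${additional_cost:,}")  # $697,400
```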
Managing and hosting that much hardware would also involve some pretty substantial recurring charges for personnel, power and CoLo.
And before someone asks about just scaling up/down in the cloud: it is much more flexible, but in no way, shape or form cheaper. 1,608 Windows instances (134 VMs x 12) x 24 hours of c4.large is roughly $7,500 per crawl (not including bandwidth costs).
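The cloud number works out like this. Note that the hourly rate below is an assumption back-solved from the ~$7,500 figure, not a quoted price, so plug in the current Windows on-demand rate if you want a real estimate.

```python
WINDOWS_VMS = 134
SCALE_FACTOR = 12      # same 12x assumption as the hardware estimate
HOURS_PER_CRAWL = 24
# Assumed rate in USD per instance-hour, back-solved from ~$7,500 per crawl.
HOURLY_RATE = 0.194

instances = WINDOWS_VMS * SCALE_FACTOR          # 1,608 instances
instance_hours = instances * HOURS_PER_CRAWL    # 38,592 instance-hours
cost_per_crawl = instance_hours * HOURLY_RATE

print(f"{instances:,} instances x {HOURS_PER_CRAWL} h = {instance_hours:,} instance-hours")
print(f"approx. cost per crawl: ${cost_per_crawl:,.0f}")
```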