I was playing around with sessionizing the data by crawlid, since it's available, to see whether we can infer how the average crawl duration has changed over time. This is what it looks like initially:
A crawl took around 4 days in 2013 and currently takes around 9 days to finish, based on the current data. I think the reason for the increase is the larger number of hosts being tracked.
To vet that hypothesis, given that the mobile crawls cover substantially fewer pages, let's compare the crawl time for those. I see it's roughly the same as desktop.
So essentially the crawl duration (defined as the max time for a crawl id minus the min time for the same crawl id) seems independent of the number of hosts to monitor. Is that right?
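For anyone who wants to reproduce the duration numbers, here is a minimal Python sketch of the definition above (max time minus min time per crawl id). The `(crawl_id, timestamp)` row shape, the function name, and the sample rows are all made up for illustration; they are not the actual table schema or real crawl data.

```python
from datetime import datetime, timedelta


def crawl_durations(rows):
    """Return {crawl_id: timedelta} from an iterable of (crawl_id, timestamp) pairs.

    Duration is the latest timestamp minus the earliest timestamp
    seen for each crawl id.
    """
    first_seen = {}
    last_seen = {}
    for crawl_id, ts in rows:
        if crawl_id not in first_seen or ts < first_seen[crawl_id]:
            first_seen[crawl_id] = ts
        if crawl_id not in last_seen or ts > last_seen[crawl_id]:
            last_seen[crawl_id] = ts
    return {cid: last_seen[cid] - first_seen[cid] for cid in first_seen}


# Hypothetical sample rows, not real crawl data.
sample_rows = [
    (1, datetime(2013, 1, 1, 0, 0)),
    (1, datetime(2013, 1, 3, 12, 0)),
    (1, datetime(2013, 1, 5, 0, 0)),
    (2, datetime(2013, 1, 15, 6, 0)),
    (2, datetime(2013, 1, 18, 6, 0)),
]

durations = crawl_durations(sample_rows)
for crawl_id, duration in sorted(durations.items()):
    print(crawl_id, duration)
```

Note that this measures wall-clock time between the first and last test of a crawl, so any idle time in the middle counts toward the duration.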
The number of hosts has a huge impact, but the duration also depends heavily on the test infrastructure. The mobile HA crawl runs on 2 iPhones (changing soon) and the IE desktop crawl runs on 96 VMs. It also depends on how many of those VMs are actually online at a given time. For example, at the start of the current crawl, 32 of the VMs were offline, and I fixed that shortly after the crawl started.
Things also change when we add more capabilities or otherwise change the crawl.
The only real guarantee we provide is that the previous crawl will finish before the next one starts (a 15-day upper bound), though I use the word “guarantee” lightly. It's more like “best effort”.
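If you want to check the data against that best-effort bound, something like the following sketch could work. It assumes you already have one `(crawl_id, start, end)` window per crawl; the function name, the tuple shape, and the sample windows are hypothetical, and the 15-day limit is just the upper bound mentioned above.

```python
from datetime import datetime, timedelta


def check_crawl_windows(windows, max_duration=timedelta(days=15)):
    """Return a list of (crawl_id, problem) tuples.

    windows: list of (crawl_id, start, end) tuples, one per crawl.
    A crawl is flagged if it ran longer than max_duration or if it
    was still running when the next crawl started.
    """
    problems = []
    ordered = sorted(windows, key=lambda w: w[1])  # order by start time
    for i, (crawl_id, start, end) in enumerate(ordered):
        if end - start > max_duration:
            problems.append((crawl_id, "longer than 15 days"))
        if i + 1 < len(ordered) and end > ordered[i + 1][1]:
            problems.append((crawl_id, "still running when the next crawl started"))
    return problems


# Hypothetical windows, not real crawl data.
sample_windows = [
    (1, datetime(2013, 1, 1), datetime(2013, 1, 10)),
    (2, datetime(2013, 1, 15), datetime(2013, 2, 2)),
    (3, datetime(2013, 2, 1), datetime(2013, 2, 9)),
]

problems = check_crawl_windows(sample_windows)
for crawl_id, problem in problems:
    print(crawl_id, problem)
```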