Does BigQuery contain HAR Archive or cookies of crawled webpages?

Think I figured it out (and yeah, it’s redirect-specific). The dev tools messages with the real response headers don’t have timestamps, and redirects re-use the same request ID. Usually the agent takes care of this automatically by generating artificial IDs for each step of the redirect path, keeping track of what the current ID is for a given dev tools request ID. The agent also sorts the events by timestamp, which causes a problem for the header events: they don’t have a timestamp, so they lose their place in the sequence and don’t get the automatic request ID remapping.
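To make the failure mode concrete, here is a rough sketch in Python. It is not the agent's actual code: the event shape, the `responseHeaders` event name, the `syntheticId` field, and the helper names are all made up for illustration, and sorting missing timestamps as `0.0` is an assumption about how the sort behaved. It only shows how a timestamp sort can pull timestamp-less header events away from the redirect step they belong to.

```python
def remap_request_ids(events):
    """Tag each event with a synthetic per-hop ID, processing events in the given order."""
    hop = {}  # dev tools request ID -> index of the current redirect hop
    out = []
    for ev in events:
        rid = ev["requestId"]
        if ev["method"] == "requestWillBeSent":
            # A repeated requestWillBeSent for the same ID starts a new redirect hop.
            hop[rid] = hop.get(rid, -1) + 1
        out.append({**ev, "syntheticId": f"{rid}-{hop.get(rid, 0)}"})
    return out


def sort_by_timestamp(events):
    # Assumption: events without a timestamp are keyed as 0.0, so they sort first.
    return sorted(events, key=lambda ev: ev.get("timestamp", 0.0))


def headers_by_id(mapped):
    """Collect the header events keyed by their synthetic ID."""
    return {ev["syntheticId"]: ev["headers"]
            for ev in mapped if ev["method"] == "responseHeaders"}


# Hypothetical event stream: one request that redirects once.
events = [
    {"method": "requestWillBeSent", "requestId": "1", "timestamp": 1.0},
    {"method": "responseHeaders", "requestId": "1", "headers": {"location": "/b"}},
    {"method": "requestWillBeSent", "requestId": "1", "timestamp": 2.0},
    {"method": "responseHeaders", "requestId": "1", "headers": {"content-type": "text/html"}},
]

in_order = headers_by_id(remap_request_ids(events))
after_sort = headers_by_id(remap_request_ids(sort_by_timestamp(events)))
```

In this toy stream, processing in arrival order gives each hop its own headers, while sorting first moves both header events ahead of the requests, so they land on the same synthetic ID and the redirect hop's headers are lost.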

The dev tools events naturally arrive in the correct order, so I got rid of the sort and it worked in local testing. I’m waiting for the release to roll out over the next hour to make sure it works correctly.

That means most of the July crawl will have correct headers for the redirect case, but it won’t be until August that it’s 100% correct.
