Hi,
Do we also have HAR archives of the crawled websites, or the cookies stored while crawling? I didn’t find any table or column for this in BigQuery. Am I missing something, or don’t we have such data?
Bests,
Nurullah
The requests dataset contains the request/response HAR payload for each request. For example, the JSON path $.response.headers may include Set-Cookie headers. I’m looking for examples to show, but finding results like {name: "Set-Cookie", value: "[110 bytes were stripped]"} which aren’t very useful.
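As a rough illustration of reading that path, here is a minimal JavaScript sketch that pulls every Set-Cookie value out of a single payload and flags the stripped placeholders. The function names (`setCookieValues`, `isStripped`) and the sample payload are made up for the example; only the `$.response.headers` path and the `[110 bytes were stripped]` shape come from this thread.

```javascript
// Placeholder pattern as seen in the dataset, e.g. "[110 bytes were stripped]".
const STRIPPED = /\[\d+ bytes were stripped\]/;

// Hypothetical helper: return all Set-Cookie values at $.response.headers.
// Returns an empty array if the payload is not valid JSON or has no headers.
function setCookieValues(payload) {
  let entry;
  try {
    entry = JSON.parse(payload);
  } catch (e) {
    return [];
  }
  const headers = (entry && entry.response && entry.response.headers) || [];
  return headers
    .filter(h => typeof h.name === 'string' && h.name.toLowerCase() === 'set-cookie')
    .map(h => String(h.value));
}

// Hypothetical helper: true when the value is a stripped placeholder.
function isStripped(value) {
  return STRIPPED.test(value);
}

// Made-up payload for illustration only.
const sample = JSON.stringify({
  response: {
    headers: [
      { name: 'Content-Type', value: 'text/html' },
      { name: 'Set-Cookie', value: '[110 bytes were stripped]' },
      { name: 'set-cookie', value: 'o=abc; path=/' },
    ],
  },
});

const values = setCookieValues(sample);
console.log(values.map(v => ({ value: v, stripped: isStripped(v) })));
```

The header name is compared case-insensitively because HTTP header names are not case-sensitive.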
I found a related comment:
@patmeenan is there still an issue with capturing cookies in Chrome? This would be really detrimental to the Cookies chapter we’re planning for the Web Almanac.
Cookie data should be there: https://www.webpagetest.org/result/200703_E4_2c49cd0de79b2757c89cbaf7557fc65d/1/details/#step1_request1
Set-Cookie: o=aa59c073dffeb41ea8a5e2e4f838d0fa9838da75; expires=Sat, 03-Jul-2021 01:10:11 GMT; Max-Age=31536000; path=/
There may be some edge cases where the request was only available from netlog, but they should be few and far between.
I’m seeing ~30% of cookies having stripped bytes:
CREATE TEMPORARY FUNCTION extractHeader(payload STRING, name STRING)
RETURNS STRING LANGUAGE js AS '''
  try {
    var $ = JSON.parse(payload);
    var header = $._headers.response.find(h => h.toLowerCase().startsWith(name.toLowerCase()));
    if (!header) {
      return null;
    }
    return header.substr(header.indexOf(':') + 1).trim();
  } catch (e) {
    return null;
  }
''';

SELECT
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) AS stripped_bytes,
  COUNT(0) AS total_reqs,
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) / COUNT(0) AS pct_stripped_bytes
FROM (
  SELECT
    extractHeader(payload, 'Set-Cookie') AS cookie
  FROM
    `httparchive.requests.2020_06_01_mobile`)
WHERE
  cookie IS NOT NULL
| stripped_bytes | total_reqs | pct_stripped_bytes |
|---|---|---|
| 19,671,309 | 68,839,626 | 28.58% |
Any chance you can provide a few page URLs (and the request URLs) where it is happening? If I can reproduce it I should be able to fix it.
Here’s a random sample of 20 instances:
Looks like it might be redirects that drop cookies that don’t get matched up. I’ll take a look over the weekend and see if I can track it down.
Thanks Pat! No need to work over the holiday weekend; it can wait until next week.
Think I figured it out (and yeah, it’s redirect-specific). The dev tools messages with the real response headers don’t have timestamps, and redirects re-use the same request ID. Usually the agent takes care of this automatically by generating artificial IDs for each step of the redirect path, keeping track of what the current ID is for a given dev tools request ID. The agent also sorts the events by timestamp, which causes a problem for the header events: they don’t have a timestamp, so they don’t get the automatic request ID remapping.
The dev tools events are in the correct order naturally so I got rid of the sort and it worked in local testing. Waiting for the release to roll out over the next hour to make sure it works correctly.
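To make the ordering problem concrete, here is a small JavaScript sketch of the idea. It is not the agent’s actual code: the event shape (`requestId`, `kind`, `ts`, `headers`), the `assignHeaders` function, the `A#0` style of artificial ID, and the choice to treat a missing timestamp as 0 when sorting are all assumptions made for illustration.

```javascript
// Hypothetical event shape: { requestId, kind: 'request' | 'headers', ts?, headers? }
// 'request' events carry a timestamp; 'headers' events do not.
function assignHeaders(events) {
  const step = new Map();   // dev tools request ID -> current redirect step
  const result = new Map(); // artificial ID ("<requestId>#<step>") -> headers
  let unmatched = 0;        // header events seen before any matching request

  for (const ev of events) {
    if (ev.kind === 'request') {
      // A repeated request ID means a redirect: advance to the next step.
      step.set(ev.requestId, (step.get(ev.requestId) ?? -1) + 1);
    } else if (ev.kind === 'headers') {
      if (!step.has(ev.requestId)) {
        unmatched++;
        continue;
      }
      result.set(`${ev.requestId}#${step.get(ev.requestId)}`, ev.headers);
    }
  }
  return { result, unmatched };
}

// Events in arrival order; the redirect re-uses request ID 'A'.
const arrival = [
  { requestId: 'A', kind: 'request', ts: 1 },
  { requestId: 'A', kind: 'headers', headers: ['Set-Cookie: first=1'] },
  { requestId: 'A', kind: 'request', ts: 2 },
  { requestId: 'A', kind: 'headers', headers: ['Set-Cookie: second=2'] },
];

// Arrival order: each header event lands on the redirect step it belongs to.
const inOrder = assignHeaders(arrival);

// Sorted by timestamp (missing timestamps assumed to sort as 0):
// the header events move ahead of the requests and lose their mapping.
const sorted = [...arrival].sort((a, b) => (a.ts ?? 0) - (b.ts ?? 0));
const afterSort = assignHeaders(sorted);

console.log(inOrder.result.size, inOrder.unmatched);
console.log(afterSort.result.size, afterSort.unmatched);
```

In this toy model, processing the events in arrival order matches both sets of headers to their redirect steps, while sorting first leaves both header events unmatched, which mirrors the dropped redirect cookies described above.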
That means most of the July crawl will have correct headers for the redirect case, but it won’t be 100% correct until the August crawl.
Yep, that appears to have done it: https://www.webpagetest.org/result/200705_XS_0e34853480e95f3ef066c06d9e49791b/1/details/#step1_request45
Hopefully there aren’t any unintended side-effects.
Thanks for fixing @patmeenan
Could it be that the issue is still present, @rviscomi @patmeenan? In the latest requests desktop table, some cookie values are stripped. We were hoping to analyze cookies in the privacy chapter of the Web Almanac 2021.
I reran the query from earlier in the thread against May 2021 data and the results show only 1.5% of cookies with stripped bytes. So you’re right that there’s still some data loss but it’s a much better situation than last year.
@patmeenan is this about what you’d expect or is it feasible to get down to 0%?
I imagine that resolving the remaining bit will probably require changes to Chrome (and a fair bit of investigation to figure out what is going on).
Anything is possible but I wouldn’t hold out for it soon.