Does BigQuery contain HAR Archive or cookies of crawled webpages?

Hi,

Do we also have the HAR files of crawled websites, or the cookies stored while crawling? I couldn’t find any such table or column in BigQuery. Am I missing something, or is this data not available?

Best,
Nurullah

The requests dataset contains the request/response HAR payload for each request. For example, the JSON path $.response.headers may include Set-Cookie headers. I’m looking for examples to show, but I’m only finding results like {name: "Set-Cookie", value: "[110 bytes were stripped]"}, which aren’t very useful.
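
As a rough illustration of what to look for in a payload, the sketch below scans a HAR-style `response.headers` array for `Set-Cookie` entries and separates intact values from stripped placeholders. The `classifySetCookies` helper and the sample payload are hypothetical; only the `{name, value}` shape and the placeholder text come from the results described above.

```javascript
// Hypothetical helper: split Set-Cookie headers into intact and stripped ones.
// Assumes the payload has the shape { response: { headers: [{name, value}] } }.
function classifySetCookies(payload) {
  const entry = JSON.parse(payload);
  const headers = (entry.response && entry.response.headers) || [];
  const result = { intact: [], strippedBytes: [] };
  for (const h of headers) {
    // Header names are case-insensitive.
    if (String(h.name).toLowerCase() !== 'set-cookie') continue;
    const m = /^\[(\d+) bytes were stripped\]$/.exec(h.value);
    if (m) {
      result.strippedBytes.push(Number(m[1])); // keep only the byte count
    } else {
      result.intact.push(h.value);
    }
  }
  return result;
}

// Made-up sample payload, for illustration only.
const samplePayload = JSON.stringify({
  response: {
    headers: [
      { name: 'Content-Type', value: 'image/gif' },
      { name: 'Set-Cookie', value: '[110 bytes were stripped]' },
      { name: 'set-cookie', value: 'o=abc; path=/' },
    ],
  },
});

console.log(classifySetCookies(samplePayload));
```

Keeping the stripped byte counts rather than discarding them makes it possible to see how much cookie data is missing, even when the values themselves are gone.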

I found a related comment:

@patmeenan is there still an issue with capturing cookies in Chrome? This would be really detrimental to the Cookies chapter we’re planning for the Web Almanac.

Cookie data should be there: https://www.webpagetest.org/result/200703_E4_2c49cd0de79b2757c89cbaf7557fc65d/1/details/#step1_request1

Set-Cookie: o=aa59c073dffeb41ea8a5e2e4f838d0fa9838da75; expires=Sat, 03-Jul-2021 01:10:11 GMT; Max-Age=31536000; path=/

There may be some edge cases where the request was only available from netlog, but they should be few and far between.

I’m seeing ~30% of cookies having stripped bytes:

CREATE TEMPORARY FUNCTION extractHeader(payload STRING, name STRING)
RETURNS STRING LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var header = $._headers.response.find(h => h.toLowerCase().startsWith(name.toLowerCase()));
  if (!header) {
    return null;
  }
  return header.substr(header.indexOf(':') + 1).trim();
} catch (e) {
  return null;
}
''';

SELECT
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) AS stripped_bytes,
  COUNT(0) AS total_reqs,
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) / COUNT(0) AS pct_stripped_bytes
FROM (
  SELECT
    extractHeader(payload, 'Set-Cookie') AS cookie
  FROM
    `httparchive.requests.2020_06_01_mobile`)
WHERE
  cookie IS NOT NULL

| stripped_bytes | total_reqs | pct_stripped_bytes |
|---|---|---|
| 19,671,309 | 68,839,626 | 28.58% |
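
For anyone who wants to check the UDF's behavior outside BigQuery, here is a minimal sketch that wraps the same logic in a plain function and runs it on a made-up payload. The sample data is an assumption; the `_headers.response` array of `Name: value` strings is the shape the UDF above already relies on.

```javascript
// Same logic as the UDF body above, wrapped in a plain function for local runs.
// (slice is used in place of substr; the result is the same here.)
function extractHeader(payload, name) {
  try {
    const $ = JSON.parse(payload);
    const header = $._headers.response.find(
      h => h.toLowerCase().startsWith(name.toLowerCase()));
    if (!header) {
      return null;
    }
    return header.slice(header.indexOf(':') + 1).trim();
  } catch (e) {
    // Malformed JSON or a missing _headers field ends up here.
    return null;
  }
}

// Made-up payload with two Set-Cookie lines on one response.
const udfSample = JSON.stringify({
  _headers: {
    response: [
      'HTTP/1.1 200 OK',
      'Set-Cookie: [138 bytes were stripped]',
      'Set-Cookie: id=42; path=/',
    ],
  },
});

console.log(extractHeader(udfSample, 'Set-Cookie'));
console.log(extractHeader('not json', 'Set-Cookie'));
```

Because `find` returns only the first match, the query counts responses by their first Set-Cookie line rather than counting every individual cookie, so the percentage is best read as a per-response figure.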

Any chance you can provide a few page URLs (and the request URLs) where it is happening? If I can reproduce it I should be able to fix it.

Here’s a random sample of 20 instances:

| wptid | cookie | page | url |
|---|---|---|---|
| 200512_Mx3S_7CYS | [159 bytes were stripped] | http://www.7-eleven-th.club/ | https://d.agkn.com/pixel/6639/?che=1589291970&sk=164751303419001301530&pd=&py=&b2b=&al=&cec=&wmt=&l0=https://idsync.rlcdn.com/379128.gif?partner_uid=164751303419001301530 |
| 200507_Mx8S_H6AR | [793 bytes were stripped] | http://danaghaibasli.over-blog.com/ | https://pr-bh.ybp.yahoo.com/sync/rubicon/WT6gDWxlaLKDkisnfI6REMn5EUdSAgOZEtemQ7w0kco?csrc= |
| 200505_Mx68_C6WG | [315 bytes were stripped] | http://rabour.ir/ | https://www.instagram.com/p/BR8DpvjBbU0/media/?size=t |
| 200517_Mx5A_AC30 | [138 bytes were stripped] | https://chto-posmotret.online/ | https://sync.1dmp.io/pixel.gif?cid=3cbc2ec8-1421-4677-89fe-2ac6fc52a09a&pid=w&o=au&cs=1 |
| 200507_Mx5F_ANZ5 | [138 bytes were stripped] | https://m.tut.by/ | https://sync.1dmp.io/pixel.gif?cid=3cbc2ec8-1421-4677-89fe-2ac6fc52a09a&pid=w&o=au&cs=1 |
| 200518_Mx12_2370 | [167 bytes were stripped] | https://bloodstained.fandom.com/ | https://sync.srv.stackadapt.com/sync?nid=1&gdpr=&gdpr_consent= |
| 200513_Mx7V_FB2N | [126 bytes were stripped] | https://kahramandabidi.blogspot.com/ | https://dpm.demdex.net/ibs:dpid=121998&dpuuid=77c9f707fdb617a7643eb5c81461b7df&redir=https%3A%2F%2Fsync.crwdcntrl.net%2Fmap%2Fc%3D9828%2Ftp%3DADBE%2Ftpid%3D%24{DD_UUID} |
| 200520_Mx3T_7EC1 | [160 bytes were stripped] | https://www.justjaredjr.com/ | https://ssc-cms.33across.com/ps/?ri=102&ru=https%3A%2F%2Fcms-xch-chicago.33across.com%2Fmatch%3Fbidder_id%3D102%26ttl%3D1592594860%26external_user_id%3D77451614-7018-44e3-b4d2-dfd14443fd15 |
| 200512_Mx7J_ESBA | [94 bytes were stripped] | https://podlodka.info/ | https://px.adhigh.net/p/cm/sape?u=0100007FB83ABB5E2C02E53D02A36821&bounced=1 |
| 200511_Mx5E_AMR0 | [82 bytes were stripped] | https://lepetitgael.wordpress.com/ | https://x.bidswitch.net/sync?ssp=the33across&custom_data=&gdpr=0&gdpr_consent= |
| 200518_Mx3E_6RPG | [148 bytes were stripped] | https://primbon-mimpigigi.blogspot.com/ | https://loadm.exelator.com/load/?p=204&g=260&buid=4dc5ce95d2d8f1e8d4a23131adaeed54&j=0&xl8blockcheck=1 |
| 200511_Mx5D_AK0T | [236 bytes were stripped] | http://gudangilmu79.blogspot.com/ | https://id.rlcdn.com/466606.gif?cparams=google_push%3DAQvitUJPO4KUr69UETlaQWrvD8FPkB3l8iueH9ZT1TKth8EE1WSspjG1FYnXnXFbxa5k4hiuoaqoFtFZ4WDGjSAMty5MLsuRS-yK&google_gid=CAESEEi11PyUMIlcWnnZE2fVp7w&google_cver=1 |
| 200510_Mx8C_GCJS | [143 bytes were stripped] | https://www.plink.com.co/ | http://cacerts.digicert.com/DigiCertSHA2SecureServerCA.crt |
| 200518_Mx3Y_7Q1W | [111 bytes were stripped] | https://westland-survival.fandom.com/ | https://nep.advangelists.com/xp/user-sync?acctid=319&redirect=https%3A%2F%2Fcms-xch-chicago.33across.com%2Fmatch%3Fbidder_id%3D100%26external_user_id%3D{PARTNER_VISITOR_ID} |
| 200520_Mx16_2AN1 | [138 bytes were stripped] | https://www.digitaltrends.com/ | https://pixel-eu.rubiconproject.com/exchange/sync.php?p=sovrn-onscroll&gdpr=0&gdpr_consent= |
| 200517_Mx5K_AY6Z | [315 bytes were stripped] | http://rumfanatic.pl/ | https://www.instagram.com/p/B_kmYh1Fdt-/media/?size=t |
| 200515_Mx21_3ZK6 | [68 bytes were stripped] | https://anikuribon.net/ | https://image6.pubmatic.com/AdServer/UCookieSetPug?oid=1&rd=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Dpmeb%26google_sc%3D1%26google_hm%3D%23%23B64_16B_PM_UID%26google_redir%3Dhttps%253A%252F%252Fimage8.pubmatic.com%252FAdServer%252FImgSync%253Fsec%253D1%2526p%253D156578%2526mpc%253D4%2526fp%253D1%2526pu%253Dhttps%25253A%25252F%25252Fimage4.pubmatic.com%25252FAdServer%25252FSPug%25253Fp%25253D156578%252526sc%25253D1&google_gid=CAESECp3S7jyaoyvP7ctUJc0xZQ&google_cver=1&google_push=AQvitUKXYkH15dM94gEbd7TC4memHTeWNCCWkUTjiDkzdk-tSHyiThxMqcXgZO5bQRxH8vvAUG63d96CGlaiT3Qy4tJSa8bbxjEZOQ |
| 200517_Mx26_48T3 | [131 bytes were stripped] | https://voala.org/ | https://match.adsrvr.org/track/cmf/generic?ttd_pid=tapad&ttd_tpi=1&ttd_puid=b7c87dc1-9835-11ea-8792-928c11c458fb%2C&gdpr=0&gdpr_consent= |
| 200502_MxF_Y1S | [353 bytes were stripped] | https://grit.trixstar.com/ | https://d.adroll.com/cm/n/out?adroll_fpc=df4d7f8d468d8e6e6e633ef313240586-1588396158332&arrfrr=https%3A%2F%2Fgrit.trixstar.com%2F&advertisable=JTND4B4H4RDJNAW722HN74 |
| 200510_Mx84_FWG2 | [373 bytes were stripped] | https://furniture-ideal.com/ | https://www.linkedin.com/px/li_sync?redirect=https%3A%2F%2Fpx.ads.linkedin.com%2Fcollect%3Fv%3D2%26fmt%3Djs%26pid%3D244251%26url%3Dhttps%3A%2F%2Ffurniture-ideal.com%2F%26time%3D1589160084639%26cookiesTest%3Dtrue%26liSync%3Dtrue |

Looks like it might be redirects that drop cookies that don’t get matched up. I’ll take a look over the weekend and see if I can track it down.

Thanks Pat! No need to work over the holiday weekend, it can wait until next week :slight_smile:

Think I figured it out (and yeah, it’s redirect-specific). The dev tools messages with the real response headers don’t have timestamps, and redirects reuse the same request ID. Usually the agent takes care of this automatically by generating artificial IDs for each step of the redirect path, keeping track of what the current ID is for a given dev tools request ID. The agent also sorts the events by timestamp, which causes a problem for the header events: they have no timestamp, so they don’t get the automatic request ID remapping.
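
To make the failure mode concrete, here is a toy sketch of the idea. It is not the WebPageTest agent's code: the event shapes, the `attachHeaders` helper, and the choice to sort timestamp-less events last are all assumptions made for illustration. It shows how processing events in arrival order keeps each hop's headers with the right artificial ID, while a timestamp sort can attach them to the wrong hop.

```javascript
// Hypothetical event stream: both redirect hops reuse dev tools requestId "A".
// Header events carry no timestamp (ts), as described above.
const events = [
  { type: 'requestStart', requestId: 'A', ts: 1 },
  { type: 'headers', requestId: 'A', setCookie: 'hop1=1' },
  { type: 'requestStart', requestId: 'A', ts: 2 },
  { type: 'headers', requestId: 'A', setCookie: 'hop2=1' },
];

// Give every hop an artificial ID by tracking the current ID per requestId,
// then attach each header event to whichever artificial ID is current.
function attachHeaders(evts) {
  const hopCount = new Map(); // requestId -> number of hops seen so far
  const current = new Map();  // requestId -> current artificial ID
  const cookies = {};         // artificial ID -> Set-Cookie values
  for (const e of evts) {
    if (e.type === 'requestStart') {
      const n = (hopCount.get(e.requestId) || 0) + 1;
      hopCount.set(e.requestId, n);
      const id = `${e.requestId}-${n}`;
      current.set(e.requestId, id);
      cookies[id] = [];
    } else if (e.type === 'headers') {
      const id = current.get(e.requestId);
      if (id) cookies[id].push(e.setCookie);
    }
  }
  return cookies;
}

// Assumption: events without a timestamp sort after all timestamped events.
const sortKey = e => (e.ts === undefined ? Number.MAX_SAFE_INTEGER : e.ts);
const sorted = [...events].sort((a, b) => sortKey(a) - sortKey(b));

console.log('arrival order:', attachHeaders(events));
console.log('after sorting:', attachHeaders(sorted));
```

In arrival order each hop keeps its own cookie. After the sort, both header events land on the second hop and the first hop ends up with none, which is the kind of mismatch that dropping the sort avoids.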

The dev tools events are in the correct order naturally so I got rid of the sort and it worked in local testing. Waiting for the release to roll out over the next hour to make sure it works correctly.

That means most of the July crawl will have correct headers for the redirect case, but it won’t be until August that it’s 100% correct.

Yep, that appears to have done it: https://www.webpagetest.org/result/200705_XS_0e34853480e95f3ef066c06d9e49791b/1/details/#step1_request45

Hopefully there aren’t any unintended side-effects.

Thanks for fixing @patmeenan :tada:

Could it be that the issue is still present, @rviscomi @patmeenan? In the latest requests desktop table, some cookie values are stripped. We were hoping to analyze cookies in the privacy chapter of the Web Almanac 2021.

I reran the query from earlier in the thread against May 2021 data, and the results show only 1.5% of cookies with stripped bytes. So you’re right that there’s still some data loss, but it’s a much better situation than last year.

@patmeenan is this about what you’d expect or is it feasible to get down to 0%?

I imagine that resolving the remaining bit will probably require changes to Chrome (and a fair bit of investigation to figure out what is going on).

Anything is possible but I wouldn’t hold out for it soon.
