Does BigQuery contain HAR Archive or cookies of crawled webpages?

Hi,

Do we also have HAR archives of the crawled websites, or the cookies stored while crawling? I didn’t find any such table or column in BigQuery. Am I missing something, or is this data not available?

Best,
Nurullah

The requests dataset contains the request/response HAR payload for each request. For example, the JSON path $.response.headers may include Set-Cookie headers. I’m looking for examples to show, but I’m finding results like {name: "Set-Cookie", value: "[110 bytes were stripped]"}, which aren’t very useful.
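To make the payload path concrete, here is a minimal sketch of reading Set-Cookie values out of a single payload and flagging the stripped placeholders. It assumes the payload is a JSON string whose $.response.headers is an array of {name, value} objects, as in the result quoted above. The function name setCookieValues and the sample payload are made up for illustration, and real payloads carry many more fields.

```javascript
// Sketch only: the payload shape is assumed from the JSON path
// $.response.headers and the {name, value} result quoted in the post.
const STRIPPED = /^\[(\d+) bytes were stripped\]$/;

// Return every Set-Cookie value in one request payload, flagging placeholders.
function setCookieValues(payload) {
  let har;
  try {
    har = JSON.parse(payload);
  } catch (e) {
    return []; // unparseable payload: report no cookies
  }
  const headers =
    har && har.response && Array.isArray(har.response.headers)
      ? har.response.headers
      : [];
  return headers
    .filter(h => h && typeof h.name === 'string' && h.name.toLowerCase() === 'set-cookie')
    .map(h => {
      const m = STRIPPED.exec(String(h.value));
      return {
        value: h.value,
        stripped: m !== null,
        strippedBytes: m ? Number(m[1]) : 0,
      };
    });
}

// Hypothetical sample payload, not taken from the dataset.
const samplePayload = JSON.stringify({
  response: {
    headers: [
      { name: 'Content-Type', value: 'text/html' },
      { name: 'Set-Cookie', value: '[110 bytes were stripped]' },
      { name: 'set-cookie', value: 'o=abc; path=/' },
    ],
  },
});

console.log(setCookieValues(samplePayload));
```

Header names are compared case-insensitively because servers vary in how they capitalize Set-Cookie.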

I found a related comment:

@patmeenan is there still an issue with capturing cookies in Chrome? This would be really detrimental to the Cookies chapter we’re planning for the Web Almanac.

Cookie data should be there: https://www.webpagetest.org/result/200703_E4_2c49cd0de79b2757c89cbaf7557fc65d/1/details/#step1_request1

Set-Cookie: o=aa59c073dffeb41ea8a5e2e4f838d0fa9838da75; expires=Sat, 03-Jul-2021 01:10:11 GMT; Max-Age=31536000; path=/

There may be some edge cases where the request was only available from netlog, but they should be few and far between.

I’m seeing stripped bytes in ~30% of responses that set a cookie:

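-- Returns the value of the first response header whose name starts with `name`
-- (case-insensitive), or NULL if there is none or the payload cannot be parsed.
CREATE TEMPORARY FUNCTION extractHeader(payload STRING, name STRING)
RETURNS STRING LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  // _headers.response is an array of raw "Name: value" strings.
  var header = $._headers.response.find(h => h.toLowerCase().startsWith(name.toLowerCase()));
  if (!header) {
    return null;
  }
  // Keep everything after the first colon as the header value.
  return header.slice(header.indexOf(':') + 1).trim();
} catch (e) {
  return null;
}
''';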

SELECT
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) AS stripped_bytes,
  COUNT(0) AS total_reqs,
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) / COUNT(0) AS pct_stripped_bytes
FROM (
  SELECT
    extractHeader(payload, 'Set-Cookie') AS cookie
  FROM
    `httparchive.requests.2020_06_01_mobile`)
WHERE
  cookie IS NOT NULL

stripped_bytes  total_reqs  pct_stripped_bytes
19,671,309      68,839,626  28.58%
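One caveat on the query: extractHeader uses find, so it returns only the first matching header per response, and the numbers are per request rather than per cookie. A per-cookie count would need every Set-Cookie line. Below is a rough sketch of that variant as plain JavaScript that can be run locally. It assumes, as the UDF does, that _headers.response is an array of raw "Name: value" strings. The name extractAllHeaders and the sample payload are hypothetical.

```javascript
// Sketch only: assumes _headers.response is an array of raw "Name: value"
// strings, as the extractHeader UDF in the query implies.
function extractAllHeaders(payload, name) {
  try {
    const $ = JSON.parse(payload);
    // Match "name:" exactly so that longer names sharing the prefix are skipped.
    const prefix = name.toLowerCase() + ':';
    return $._headers.response
      .filter(h => h.toLowerCase().startsWith(prefix))
      .map(h => h.slice(h.indexOf(':') + 1).trim());
  } catch (e) {
    return []; // unparseable or unexpected payload: report no headers
  }
}

// Hypothetical payload with two Set-Cookie headers, one of them stripped.
const examplePayload = JSON.stringify({
  _headers: {
    response: [
      'HTTP/1.1 200 OK',
      'Set-Cookie: [110 bytes were stripped]',
      'set-cookie: o=abc; path=/',
    ],
  },
});

const cookies = extractAllHeaders(examplePayload, 'Set-Cookie');
const strippedCount = cookies.filter(c => c.includes('bytes were stripped')).length;
console.log(cookies.length, strippedCount);
```

To use the same body in BigQuery, the UDF would have to be declared as returning ARRAY<STRING> and the result unnested before counting. I haven't run that against the dataset.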

Any chance you can provide a few page URLs (and the request URLs) where it is happening? If I can reproduce it I should be able to fix it.

Here’s a random sample of 20 instances:

wptid cookie page url
200512_Mx3S_7CYS [159 bytes were stripped] http://www.7-eleven-th.club/ https://d.agkn.com/pixel/6639/?che=1589291970&sk=164751303419001301530&pd=&py=&b2b=&al=&cec=&wmt=&l0=https://idsync.rlcdn.com/379128.gif?partner_uid=164751303419001301530
200507_Mx8S_H6AR [793 bytes were stripped] http://danaghaibasli.over-blog.com/ https://pr-bh.ybp.yahoo.com/sync/rubicon/WT6gDWxlaLKDkisnfI6REMn5EUdSAgOZEtemQ7w0kco?csrc=
200505_Mx68_C6WG [315 bytes were stripped] http://rabour.ir/ https://www.instagram.com/p/BR8DpvjBbU0/media/?size=t
200517_Mx5A_AC30 [138 bytes were stripped] https://chto-posmotret.online/ https://sync.1dmp.io/pixel.gif?cid=3cbc2ec8-1421-4677-89fe-2ac6fc52a09a&pid=w&o=au&cs=1
200507_Mx5F_ANZ5 [138 bytes were stripped] https://m.tut.by/ https://sync.1dmp.io/pixel.gif?cid=3cbc2ec8-1421-4677-89fe-2ac6fc52a09a&pid=w&o=au&cs=1
200518_Mx12_2370 [167 bytes were stripped] https://bloodstained.fandom.com/ https://sync.srv.stackadapt.com/sync?nid=1&gdpr=&gdpr_consent=
200513_Mx7V_FB2N [126 bytes were stripped] https://kahramandabidi.blogspot.com/ https://dpm.demdex.net/ibs:dpid=121998&dpuuid=77c9f707fdb617a7643eb5c81461b7df&redir=https%3A%2F%2Fsync.crwdcntrl.net%2Fmap%2Fc%3D9828%2Ftp%3DADBE%2Ftpid%3D%24{DD_UUID}
200520_Mx3T_7EC1 [160 bytes were stripped] https://www.justjaredjr.com/ https://ssc-cms.33across.com/ps/?ri=102&ru=https%3A%2F%2Fcms-xch-chicago.33across.com%2Fmatch%3Fbidder_id%3D102%26ttl%3D1592594860%26external_user_id%3D77451614-7018-44e3-b4d2-dfd14443fd15
200512_Mx7J_ESBA [94 bytes were stripped] https://podlodka.info/ https://px.adhigh.net/p/cm/sape?u=0100007FB83ABB5E2C02E53D02A36821&bounced=1
200511_Mx5E_AMR0 [82 bytes were stripped] https://lepetitgael.wordpress.com/ https://x.bidswitch.net/sync?ssp=the33across&custom_data=&gdpr=0&gdpr_consent=
200518_Mx3E_6RPG [148 bytes were stripped] https://primbon-mimpigigi.blogspot.com/ https://loadm.exelator.com/load/?p=204&g=260&buid=4dc5ce95d2d8f1e8d4a23131adaeed54&j=0&xl8blockcheck=1
200511_Mx5D_AK0T [236 bytes were stripped] http://gudangilmu79.blogspot.com/ https://id.rlcdn.com/466606.gif?cparams=google_push%3DAQvitUJPO4KUr69UETlaQWrvD8FPkB3l8iueH9ZT1TKth8EE1WSspjG1FYnXnXFbxa5k4hiuoaqoFtFZ4WDGjSAMty5MLsuRS-yK&google_gid=CAESEEi11PyUMIlcWnnZE2fVp7w&google_cver=1
200510_Mx8C_GCJS [143 bytes were stripped] https://www.plink.com.co/ http://cacerts.digicert.com/DigiCertSHA2SecureServerCA.crt
200518_Mx3Y_7Q1W [111 bytes were stripped] https://westland-survival.fandom.com/ https://nep.advangelists.com/xp/user-sync?acctid=319&redirect=https%3A%2F%2Fcms-xch-chicago.33across.com%2Fmatch%3Fbidder_id%3D100%26external_user_id%3D{PARTNER_VISITOR_ID}
200520_Mx16_2AN1 [138 bytes were stripped] https://www.digitaltrends.com/ https://pixel-eu.rubiconproject.com/exchange/sync.php?p=sovrn-onscroll&gdpr=0&gdpr_consent=
200517_Mx5K_AY6Z [315 bytes were stripped] http://rumfanatic.pl/ https://www.instagram.com/p/B_kmYh1Fdt-/media/?size=t
200515_Mx21_3ZK6 [68 bytes were stripped] https://anikuribon.net/ https://image6.pubmatic.com/AdServer/UCookieSetPug?oid=1&rd=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Dpmeb%26google_sc%3D1%26google_hm%3D%23%23B64_16B_PM_UID%26google_redir%3Dhttps%253A%252F%252Fimage8.pubmatic.com%252FAdServer%252FImgSync%253Fsec%253D1%2526p%253D156578%2526mpc%253D4%2526fp%253D1%2526pu%253Dhttps%25253A%25252F%25252Fimage4.pubmatic.com%25252FAdServer%25252FSPug%25253Fp%25253D156578%252526sc%25253D1&google_gid=CAESECp3S7jyaoyvP7ctUJc0xZQ&google_cver=1&google_push=AQvitUKXYkH15dM94gEbd7TC4memHTeWNCCWkUTjiDkzdk-tSHyiThxMqcXgZO5bQRxH8vvAUG63d96CGlaiT3Qy4tJSa8bbxjEZOQ
200517_Mx26_48T3 [131 bytes were stripped] https://voala.org/ https://match.adsrvr.org/track/cmf/generic?ttd_pid=tapad&ttd_tpi=1&ttd_puid=b7c87dc1-9835-11ea-8792-928c11c458fb%2C&gdpr=0&gdpr_consent=
200502_MxF_Y1S [353 bytes were stripped] https://grit.trixstar.com/ https://d.adroll.com/cm/n/out?adroll_fpc=df4d7f8d468d8e6e6e633ef313240586-1588396158332&arrfrr=https%3A%2F%2Fgrit.trixstar.com%2F&advertisable=JTND4B4H4RDJNAW722HN74
200510_Mx84_FWG2 [373 bytes were stripped] https://furniture-ideal.com/ https://www.linkedin.com/px/li_sync?redirect=https%3A%2F%2Fpx.ads.linkedin.com%2Fcollect%3Fv%3D2%26fmt%3Djs%26pid%3D244251%26url%3Dhttps%3A%2F%2Ffurniture-ideal.com%2F%26time%3D1589160084639%26cookiesTest%3Dtrue%26liSync%3Dtrue

Looks like it might be redirects that drop cookies that don’t get matched up. I’ll take a look over the weekend and see if I can track it down.


Thanks Pat! No need to work over the holiday weekend, it can wait until next week :)
