Does BigQuery contain HAR Archive or cookies of crawled webpages?

_nurullah · June 23, 2020, 7:26am

Hi,

Do we also have the HAR archive of crawled websites, or the cookies stored while crawling? I couldn’t find any such table or column in BigQuery. Am I missing something, or is this data not available?

Best,
Nurullah

rviscomi · July 2, 2020, 11:06pm

The requests dataset contains the request/response HAR payload for each request. For example, the JSON path $.response.headers may include Set-Cookie headers. I’m looking for examples to show, but I’m finding results like {name: "Set-Cookie", value: "[110 bytes were stripped]"}, which aren’t very useful.
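As a rough illustration of that lookup, here is a minimal JavaScript sketch that reads Set-Cookie entries from a HAR-style entry's response.headers array and flags values that were replaced by the stripped-bytes placeholder. The sample entry and the helper names getSetCookies and parseStripped are made up for illustration; only the {name, value} header shape and the placeholder text come from the result quoted above.

```javascript
// Hypothetical sample entry; only the {name, value} header shape and the
// placeholder text are taken from the thread.
const entry = {
  response: {
    headers: [
      { name: "Content-Type", value: "image/gif" },
      { name: "Set-Cookie", value: "[110 bytes were stripped]" },
      { name: "set-cookie", value: "o=abc; path=/" },
    ],
  },
};

// Return the values of all Set-Cookie headers (header names are case-insensitive).
function getSetCookies(entry) {
  const headers = (entry.response && entry.response.headers) || [];
  return headers
    .filter((h) => h.name.toLowerCase() === "set-cookie")
    .map((h) => h.value);
}

// If the value is the stripped placeholder, return its byte count, else null.
function parseStripped(value) {
  const m = /^\[(\d+) bytes were stripped\]$/.exec(value);
  return m ? Number(m[1]) : null;
}

for (const value of getSetCookies(entry)) {
  const stripped = parseStripped(value);
  console.log(stripped === null ? `cookie: ${value}` : `stripped: ${stripped} bytes`);
}
```

The byte count in the placeholder is still usable for size analysis even when the cookie value itself is gone.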

I found a related comment:

@patmeenan is there still an issue with capturing cookies in Chrome? This would be really detrimental to the Cookies chapter we’re planning for the Web Almanac.

patmeenan · July 3, 2020, 1:12am

Cookie data should be there: https://www.webpagetest.org/result/200703_E4_2c49cd0de79b2757c89cbaf7557fc65d/1/details/#step1_request1

Set-Cookie: o=aa59c073dffeb41ea8a5e2e4f838d0fa9838da75; expires=Sat, 03-Jul-2021 01:10:11 GMT; Max-Age=31536000; path=/

There may be some edge cases where the request was only available from netlog but they should be few and far between.

rviscomi · July 3, 2020, 9:39pm

I’m seeing ~30% of cookies having stripped bytes:

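-- Returns the value of the first response header whose name starts with the
-- given name (case-insensitive), or NULL if none is found or parsing fails.
CREATE TEMPORARY FUNCTION extractHeader(payload STRING, name STRING)
RETURNS STRING LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  // Each entry of _headers.response is treated as a raw "Name: value" string.
  var header = $._headers.response.find(h => h.toLowerCase().startsWith(name.toLowerCase()));
  if (!header) {
    return null;
  }
  // Keep everything after the first colon as the header value.
  return header.substr(header.indexOf(':') + 1).trim();
} catch (e) {
  return null;
}
''';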

SELECT
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) AS stripped_bytes,
  COUNT(0) AS total_reqs,
  COUNTIF(REGEXP_CONTAINS(cookie, r'bytes were stripped')) / COUNT(0) AS pct_stripped_bytes
FROM (
  SELECT
    extractHeader(payload, 'Set-Cookie') AS cookie
  FROM
    `httparchive.requests.2020_06_01_mobile`)
WHERE
  cookie IS NOT NULL

stripped_bytes	total_reqs	pct_stripped_bytes
19,671,309	68,839,626	28.58%
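
One detail of the query above: find returns only the first matching header, so a response carrying several Set-Cookie headers is counted once, based on its first one. The sketch below is a local JavaScript variant that collects all of them and computes the stripped share over a tiny sample. The sample payloads are invented, the _headers.response array-of-strings shape is assumed from how the UDF reads it, and matching on the name plus a colon is slightly stricter than the UDF's prefix match.

```javascript
// Collect every response header whose name matches, not just the first.
// Assumes payload._headers.response is an array of raw "Name: value" strings,
// which is how the UDF in the post above reads it.
function extractHeaders(payload, name) {
  try {
    const parsed = JSON.parse(payload);
    const prefix = name.toLowerCase() + ":";
    return parsed._headers.response
      .filter((h) => h.toLowerCase().startsWith(prefix))
      .map((h) => h.slice(h.indexOf(":") + 1).trim());
  } catch (e) {
    // Unparseable payloads contribute no headers.
    return [];
  }
}

// Invented sample payloads for illustration only.
const payloads = [
  JSON.stringify({ _headers: { response: [
    "HTTP/1.1 302 Found",
    "Set-Cookie: [138 bytes were stripped]",
  ] } }),
  JSON.stringify({ _headers: { response: [
    "HTTP/1.1 200 OK",
    "Set-Cookie: a=1; path=/",
    "set-cookie: b=2; path=/",
  ] } }),
  "not json",
];

const cookies = payloads.flatMap((p) => extractHeaders(p, "Set-Cookie"));
const stripped = cookies.filter((c) => c.includes("bytes were stripped")).length;
console.log(`${stripped} of ${cookies.length} cookies stripped`);
```

Counting per cookie rather than per request would give a different denominator than total_reqs, so the two percentages are not directly comparable.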

patmeenan · July 3, 2020, 10:31pm

Any chance you can provide a few page URLs (and the request URLs) where it is happening? If I can reproduce it I should be able to fix it.

rviscomi · July 3, 2020, 10:49pm

Here’s a random sample of 20 instances:

wptid	cookie	page	url
200512_Mx3S_7CYS	[159 bytes were stripped]	http://www.7-eleven-th.club/	https://d.agkn.com/pixel/6639/?che=1589291970&sk=164751303419001301530&pd=&py=&b2b=&al=&cec=&wmt=&l0=https://idsync.rlcdn.com/379128.gif?partner_uid=164751303419001301530
200507_Mx8S_H6AR	[793 bytes were stripped]	http://danaghaibasli.over-blog.com/	https://pr-bh.ybp.yahoo.com/sync/rubicon/WT6gDWxlaLKDkisnfI6REMn5EUdSAgOZEtemQ7w0kco?csrc=
200505_Mx68_C6WG	[315 bytes were stripped]	http://rabour.ir/	https://www.instagram.com/p/BR8DpvjBbU0/media/?size=t
200517_Mx5A_AC30	[138 bytes were stripped]	https://chto-posmotret.online/	https://sync.1dmp.io/pixel.gif?cid=3cbc2ec8-1421-4677-89fe-2ac6fc52a09a&pid=w&o=au&cs=1
200507_Mx5F_ANZ5	[138 bytes were stripped]	https://m.tut.by/	https://sync.1dmp.io/pixel.gif?cid=3cbc2ec8-1421-4677-89fe-2ac6fc52a09a&pid=w&o=au&cs=1
200518_Mx12_2370	[167 bytes were stripped]	https://bloodstained.fandom.com/	https://sync.srv.stackadapt.com/sync?nid=1&gdpr=&gdpr_consent=
200513_Mx7V_FB2N	[126 bytes were stripped]	https://kahramandabidi.blogspot.com/	https://dpm.demdex.net/ibs:dpid=121998&dpuuid=77c9f707fdb617a7643eb5c81461b7df&redir=https%3A%2F%2Fsync.crwdcntrl.net%2Fmap%2Fc%3D9828%2Ftp%3DADBE%2Ftpid%3D%24{DD_UUID}
200520_Mx3T_7EC1	[160 bytes were stripped]	https://www.justjaredjr.com/	https://ssc-cms.33across.com/ps/?ri=102&ru=https%3A%2F%2Fcms-xch-chicago.33across.com%2Fmatch%3Fbidder_id%3D102%26ttl%3D1592594860%26external_user_id%3D77451614-7018-44e3-b4d2-dfd14443fd15
200512_Mx7J_ESBA	[94 bytes were stripped]	https://podlodka.info/	https://px.adhigh.net/p/cm/sape?u=0100007FB83ABB5E2C02E53D02A36821&bounced=1
200511_Mx5E_AMR0	[82 bytes were stripped]	https://lepetitgael.wordpress.com/	https://x.bidswitch.net/sync?ssp=the33across&custom_data=&gdpr=0&gdpr_consent=
200518_Mx3E_6RPG	[148 bytes were stripped]	https://primbon-mimpigigi.blogspot.com/	https://loadm.exelator.com/load/?p=204&g=260&buid=4dc5ce95d2d8f1e8d4a23131adaeed54&j=0&xl8blockcheck=1
200511_Mx5D_AK0T	[236 bytes were stripped]	http://gudangilmu79.blogspot.com/	https://id.rlcdn.com/466606.gif?cparams=google_push%3DAQvitUJPO4KUr69UETlaQWrvD8FPkB3l8iueH9ZT1TKth8EE1WSspjG1FYnXnXFbxa5k4hiuoaqoFtFZ4WDGjSAMty5MLsuRS-yK&google_gid=CAESEEi11PyUMIlcWnnZE2fVp7w&google_cver=1
200510_Mx8C_GCJS	[143 bytes were stripped]	https://www.plink.com.co/	http://cacerts.digicert.com/DigiCertSHA2SecureServerCA.crt
200518_Mx3Y_7Q1W	[111 bytes were stripped]	https://westland-survival.fandom.com/	https://nep.advangelists.com/xp/user-sync?acctid=319&redirect=https%3A%2F%2Fcms-xch-chicago.33across.com%2Fmatch%3Fbidder_id%3D100%26external_user_id%3D{PARTNER_VISITOR_ID}
200520_Mx16_2AN1	[138 bytes were stripped]	https://www.digitaltrends.com/	https://pixel-eu.rubiconproject.com/exchange/sync.php?p=sovrn-onscroll&gdpr=0&gdpr_consent=
200517_Mx5K_AY6Z	[315 bytes were stripped]	http://rumfanatic.pl/	https://www.instagram.com/p/B_kmYh1Fdt-/media/?size=t
200515_Mx21_3ZK6	[68 bytes were stripped]	https://anikuribon.net/	https://image6.pubmatic.com/AdServer/UCookieSetPug?oid=1&rd=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Dpmeb%26google_sc%3D1%26google_hm%3D%23%23B64_16B_PM_UID%26google_redir%3Dhttps%253A%252F%252Fimage8.pubmatic.com%252FAdServer%252FImgSync%253Fsec%253D1%2526p%253D156578%2526mpc%253D4%2526fp%253D1%2526pu%253Dhttps%25253A%25252F%25252Fimage4.pubmatic.com%25252FAdServer%25252FSPug%25253Fp%25253D156578%252526sc%25253D1&google_gid=CAESECp3S7jyaoyvP7ctUJc0xZQ&google_cver=1&google_push=AQvitUKXYkH15dM94gEbd7TC4memHTeWNCCWkUTjiDkzdk-tSHyiThxMqcXgZO5bQRxH8vvAUG63d96CGlaiT3Qy4tJSa8bbxjEZOQ
200517_Mx26_48T3	[131 bytes were stripped]	https://voala.org/	https://match.adsrvr.org/track/cmf/generic?ttd_pid=tapad&ttd_tpi=1&ttd_puid=b7c87dc1-9835-11ea-8792-928c11c458fb%2C&gdpr=0&gdpr_consent=
200502_MxF_Y1S	[353 bytes were stripped]	https://grit.trixstar.com/	https://d.adroll.com/cm/n/out?adroll_fpc=df4d7f8d468d8e6e6e633ef313240586-1588396158332&arrfrr=https%3A%2F%2Fgrit.trixstar.com%2F&advertisable=JTND4B4H4RDJNAW722HN74
200510_Mx84_FWG2	[373 bytes were stripped]	https://furniture-ideal.com/	https://www.linkedin.com/px/li_sync?redirect=https%3A%2F%2Fpx.ads.linkedin.com%2Fcollect%3Fv%3D2%26fmt%3Djs%26pid%3D244251%26url%3Dhttps%3A%2F%2Ffurniture-ideal.com%2F%26time%3D1589160084639%26cookiesTest%3Dtrue%26liSync%3Dtrue

patmeenan · July 3, 2020, 11:18pm

Looks like it might be redirects that drop cookies that don’t get matched up. I’ll take a look over the weekend and see if I can track it down.

rviscomi · July 3, 2020, 11:20pm

Thanks Pat! No need to work over the holiday weekend, it can wait until next week.

patmeenan · July 5, 2020, 12:25am

Think I figured it out (and yeah, it’s redirect-specific). The dev tools messages with the real response headers don’t have timestamps, and redirects re-use the same request ID. Usually the agent takes care of this automatically by generating artificial IDs for each step of the redirect path, keeping track of what the current ID is for a given dev tools request ID. The agent also sorts the events by timestamp, which causes a problem for the header events: they don’t have a timestamp, so they don’t get the automatic request ID remapping.

The dev tools events are in the correct order naturally so I got rid of the sort and it worked in local testing. Waiting for the release to roll out over the next hour to make sure it works correctly.

That means most of the July crawl will have correct headers for the redirect case, but it won’t be until the August crawl that it’s 100% correct.
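
To make the failure mode concrete, here is a toy JavaScript model of it. This is not the agent's actual code: the event shapes, the assemble function, and the artificial ID format are all invented for illustration, and where untimestamped events land after sorting is an assumption (here they are placed last).

```javascript
// Toy model only: event shapes and names are invented, not the agent's code.
// "request" events carry a timestamp; "headers" events do not.
const events = [
  { type: "request", id: "7", ts: 1, url: "https://a.example/sync" },
  { type: "headers", id: "7", setCookie: "a=1" }, // headers of the redirect response
  { type: "request", id: "7", ts: 2, url: "https://b.example/pixel.gif" }, // redirect re-uses id 7
  { type: "headers", id: "7", setCookie: "b=2" },
];

// Give each redirect step its own artificial ID and attach header events
// to whichever step is current for that dev tools request ID.
function assemble(events) {
  const steps = new Map();    // dev tools ID -> number of steps seen so far
  const requests = new Map(); // artificial ID -> request record
  for (const ev of events) {
    if (ev.type === "request") {
      const step = (steps.get(ev.id) || 0) + 1;
      steps.set(ev.id, step);
      requests.set(`${ev.id}.${step}`, { url: ev.url, setCookie: null });
    } else {
      const current = requests.get(`${ev.id}.${steps.get(ev.id)}`);
      if (current) current.setCookie = ev.setCookie;
    }
  }
  return [...requests.values()];
}

// Assumption: events without a timestamp sort after the timestamped ones.
const key = (ev) => (ev.ts === undefined ? Number.MAX_VALUE : ev.ts);
const byTimestamp = (a, b) => key(a) - key(b);

const natural = assemble(events);
const sorted = assemble([...events].sort(byTimestamp));

console.log(natural.map((r) => r.setCookie)); // one cookie per redirect step
console.log(sorted.map((r) => r.setCookie));  // the first step has lost its cookie
```

In this model, processing events in their natural order keeps each header event next to the redirect step it belongs to, while sorting separates them so both header events land on the final step.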

patmeenan · July 5, 2020, 12:47am

Yep, that appears to have done it: https://www.webpagetest.org/result/200705_XS_0e34853480e95f3ef066c06d9e49791b/1/details/#step1_request45

Hopefully there aren’t any unintended side-effects.

rviscomi · July 5, 2020, 4:59pm

Thanks for fixing @patmeenan

ydimova · June 11, 2021, 2:55pm

Could it be that the issue is still present, @rviscomi @patmeenan? In the latest requests desktop table, some cookie values are stripped. We were hoping to analyze cookies in the privacy chapter of the Web Almanac 2021.

rviscomi · June 11, 2021, 5:49pm

I reran the query from earlier in the thread against May 2021 data and the results show only 1.5% of cookies with stripped bytes. So you’re right that there’s still some data loss but it’s a much better situation than last year.

@patmeenan is this about what you’d expect or is it feasible to get down to 0%?

patmeenan · June 11, 2021, 6:17pm

I imagine that resolving the remaining bit will probably require changes to Chrome (and a fair bit of investigation to figure out what is going on).

Anything is possible but I wouldn’t hold out for it soon.
