In a blog post on dev.to, I shared some insights into the size of cookies across the web using HTTP Archive data. This post shares the SQL queries and data I used to prepare that article.
Cookies are used on a lot of websites - 83.9% of the 5.7 million home pages tracked in the HTTP Archive to be specific. They are essentially a name/value pair set by a server and stored in a client’s browser. Sites can store these cookies by using the Set-Cookie
HTTP response header, or via JavaScript ( document.cookie
). On subsequent requests, these cookies are sent to the server in a Cookie
HTTP request header.
Creating the Cookies tables for this analysis
In the HTTP Archive, cookies are stored in the requests tables - which are quite large. For June 2020’s mobile dataset this table was 3.29 TB. Querying this table repeatedly would be expensive, so I ran the following to create a smaller table containing all of the Set-Cookie
response headers. The resulting table httparchive.scratchspace.2020_06_01_mobile_all_cookies
is 32.5GB and contains 109 million cookies.
-- Warning, this query processes 3.29TB of data!!!
-- The output of this query was saved to httparchive.scratchspace.2020_06_01_mobile_all_cookies
-- Please use that instead.
CREATE TEMPORARY FUNCTION getHeaders(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
var $ = JSON.parse(payload);
var headers = $.response.headers;
var cookies=[];
var j=0;
for (i in headers) {
if (headers[i].name.toLowerCase().indexOf('set-cookie')!= -1) {
cookies[j] = headers[i].value;
j++;
}
}
try {
return cookies;
} catch (e) {
return [];
}
""";
SELECT
page,
url,
set_cookie
FROM (
SELECT
page,
url,
getHeaders(payload) AS set_cookies
FROM
`httparchive.requests.2020_06_01_mobile`
)
CROSS JOIN
UNNEST(set_cookies) AS set_cookie
I also used a similar query to extract all the HTTP request cookies for favicon requests. The reason for looking specifically at favicon was that it was the last request made on the page, so all cookies set during that page load would be sent back. The results for this query were saved in httparchive.scratchspace.2020_06_01_mobile_favicon_cookie_request_headers
, which is 1GB and contains 3.5 million cookie headers.
-- Warning, this query processes 3.29TB of data!!!
-- The output of this query was saved to httparchive.scratchspace.2020_06_01_mobile_favicon_cookie_request_headers
-- Please use that instead.
CREATE TEMPORARY FUNCTION getHeaders(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
var $ = JSON.parse(payload);
var headers = $.request.headers;
var cookies=[];
var j=0;
for (i in headers) {
if (headers[i].name.toLowerCase().indexOf('cookie')!= -1) {
cookies[j] = headers[i].value;
j++;
}
}
try {
return cookies;
} catch (e) {
return [];
}
""";
SELECT
page,
url,
cookie
FROM (
SELECT
page,
url,
getHeaders(payload) AS cookies
FROM
`httparchive.requests.2020_06_01_mobile`
-- `httparchive.sample_data.requests_mobile_10k`
WHERE url LIKE "%favicon%"
AND NET.HOST(page) = NET.HOST(url)
)
CROSS JOIN
UNNEST(cookies) AS cookie
First vs Third Party Cookies, and their sizes
Third parties account for 79% of all cookies.
The median length of all cookies in the HTTP Archive is 36 bytes as of June 2020. That statistic is consistent across both first and third party cookies. The minimum is just a single byte, usually set by empty Set-Cookie headers (which is likely an error).
Cookie Size | First Party | Third Party |
---|---|---|
Min | 1 | 1 |
Median | 36 | 37 |
95th Percentile | 181 | 135 |
99th Percentile | 287 | 248 |
Max | 29,735 | 8,500 |
All of these stats were obtained from the following query, which uses UNION ALL
to combine an overall summary with first/third party summaries.
SELECT
IF (NET.REG_DOMAIN(url) = NET.REG_DOMAIN(page), "First Party", "Third Party") AS party,
MIN(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Min,
APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(50)] AS `Median`,
APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(95)] AS `Pct95th`,
APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(99)] AS `Pct99th`,
MAX(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Max,
COUNT(*) AS cookies
FROM
`httparchive.scratchspace.2020_06_01_mobile_all_cookies`
WHERE
set_cookie NOT LIKE '%bytes were stripped%'
GROUP BY party
UNION ALL
SELECT
"All Cookies" AS party,
MIN(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Min,
APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(50)] AS `Median`,
APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(95)] AS `Pct95th`,
APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(99)] AS `Pct99th`,
MAX(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Max,
COUNT(*) AS cookies
FROM
`httparchive.scratchspace.2020_06_01_mobile_all_cookies`
WHERE
set_cookie NOT LIKE '%bytes were stripped%'
To create the CDF plot, I ran the following query to create a histogram, and then created the CDF based on that data.
SELECT
LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')) AS cookieLen,
COUNT(*)
FROM
`httparchive.scratchspace.2020_06_01_mobile_all_cookies`
WHERE
set_cookie NOT LIKE '%bytes were stripped%'
GROUP BY cookieLen
ORDER BY cookieLen ASC
A similar query was used to create the CDF plot of request cookies:
SELECT
LENGTH(cookie) as cookie_length,
COUNT(*) AS cookies
FROM
`httparchive.scratchspace.2020_06_01_mobile_favicon_cookie_request_headers`
WHERE
cookie NOT LIKE '%bytes were stripped%'
GROUP BY cookie_length
ORDER BY cookie_length ASC
You can also see all the data used for these graphs in this Google Sheet.