Analysis of Cookie Size

In a blog post on dev.to, I shared some insights into the size of cookies across the web using HTTP Archive data. This post shares the SQL queries and data I used to prepare that article.

Cookies are used on a lot of websites - 83.9% of the 5.7 million home pages tracked in the HTTP Archive to be specific. They are essentially a name/value pair set by a server and stored in a client’s browser. Sites can store these cookies by using the Set-Cookie HTTP response header, or via JavaScript ( document.cookie ). On subsequent requests, these cookies are sent to the server in a Cookie HTTP request header.

Creating the Cookies tables for this analysis

In the HTTP Archive, cookies are stored in the requests tables - which are quite large. For June 2020’s mobile dataset this table was 3.29 TB. Querying this table repeatedly would be expensive, so I ran the following to create a smaller table containing all of the Set-Cookie response headers. The resulting table httparchive.scratchspace.2020_06_01_mobile_all_cookies is 32.5GB and contains 109 million cookies.

-- Warning, this query processes 3.29TB of data!!!
-- The output of this query was saved to httparchive.scratchspace.2020_06_01_mobile_all_cookies
-- Please use that instead.
CREATE TEMPORARY FUNCTION getHeaders(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
    var $ = JSON.parse(payload);
    var headers = $.response.headers;

    var cookies=[];
    var j=0;
    for (i in headers) { 
      if (headers[i].name.toLowerCase().indexOf('set-cookie')!= -1) {
        cookies[j] = headers[i].value;
        j++;
      }
   }
   
  try {
    return cookies;
  } catch (e) {
    return [];
  }

  
""";

SELECT 
   page, 
   url, 
   set_cookie
FROM (
  SELECT 
    page, 
    url, 
    getHeaders(payload) AS set_cookies
  FROM 
   `httparchive.requests.2020_06_01_mobile` 
)
CROSS JOIN 
   UNNEST(set_cookies) AS set_cookie

I also used a similar query to extract all the HTTP request cookies for favicon requests. The reason for looking specifically at favicon was that it was the last request made on the page, so all cookies set during that page load would be sent back. The results for this query were saved in httparchive.scratchspace.2020_06_01_mobile_favicon_cookie_request_headers, which is 1GB and contains 3.5 million cookie headers.

-- Warning, this query processes 3.29TB of data!!!
-- The output of this query was saved to httparchive.scratchspace.2020_06_01_mobile_favicon_cookie_request_headers
-- Please use that instead.
CREATE TEMPORARY FUNCTION getHeaders(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
    var $ = JSON.parse(payload);
    var headers = $.request.headers;

    var cookies=[];
    var j=0;
    for (i in headers) { 
      if (headers[i].name.toLowerCase().indexOf('cookie')!= -1) {
        cookies[j] = headers[i].value;
        j++;
      }
   }
   
  try {
    return cookies;
  } catch (e) {
    return [];
  }

  
""";

SELECT 
  page, 
  url, 
  cookie
FROM (
  SELECT 
    page, 
    url, 
    getHeaders(payload) AS cookies
    
  FROM 
   `httparchive.requests.2020_06_01_mobile` 
 -- `httparchive.sample_data.requests_mobile_10k` 
  WHERE url LIKE "%favicon%"
  AND NET.HOST(page) = NET.HOST(url)
)
CROSS JOIN 
   UNNEST(cookies) AS cookie

First vs Third Party Cookies, and their sizes

Third parties account for 79% of all cookies.

First vs Third Party Cookies

The median length of all cookies in the HTTP Archive is 36 bytes as of June 2020. That statistic is consistent across both first and third party cookies. The minimum is just a single byte, usually set by empty Set-Cookie headers (which is likely an error).

Cookie Size First Party Third Party
Min 1 1
Median 36 37
95th Percentile 181 135
99th Percentile 287 248
Max 29,735 8,500

All of these stats were obtained from the following query, which uses UNION ALL to combine an overall summary with first/third party summaries.

SELECT 
    IF (NET.REG_DOMAIN(url) = NET.REG_DOMAIN(page), "First Party", "Third Party") AS party,
       MIN(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Min,
       APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(50)] AS `Median`,      
       APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(95)] AS `Pct95th`,
       APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(99)] AS `Pct99th`,
       MAX(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Max,
       COUNT(*) AS cookies
FROM 
  `httparchive.scratchspace.2020_06_01_mobile_all_cookies` 
WHERE  
  set_cookie NOT LIKE '%bytes were stripped%'
GROUP BY party
UNION ALL 

SELECT 
  "All Cookies" AS party,
      MIN(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Min,
       APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(50)] AS `Median`,      
       APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(95)] AS `Pct95th`,
       APPROX_QUANTILES(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')), 100)[SAFE_ORDINAL(99)] AS `Pct99th`,
       MAX(LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+'))) as Max,
       COUNT(*) AS cookies
FROM 
  `httparchive.scratchspace.2020_06_01_mobile_all_cookies` 
WHERE  
  set_cookie NOT LIKE '%bytes were stripped%'

To create the CDF plot, I ran the following query to create a histogram, and then created the CDF based on that data.
Distribution of Set-Cookie sizes

SELECT 
  LENGTH(REGEXP_EXTRACT(set_cookie, r'^[^;]+')) AS cookieLen,
  COUNT(*)
FROM 
  `httparchive.scratchspace.2020_06_01_mobile_all_cookies` 
WHERE  
  set_cookie NOT LIKE '%bytes were stripped%'
GROUP BY cookieLen
ORDER BY cookieLen ASC

A similar query was used to create the CDF plot of request cookies:
Distribution of Request Cookie sizes

SELECT
       LENGTH(cookie) as cookie_length,
       COUNT(*) AS cookies
FROM
  `httparchive.scratchspace.2020_06_01_mobile_favicon_cookie_request_headers`
WHERE  
  cookie NOT LIKE '%bytes were stripped%'
GROUP BY cookie_length
ORDER BY cookie_length ASC

You can also see all the data used for these graphs in this Google Sheet.

2 Likes