How many base pages use origins/CDNs within the country borders?

Hello people,

I’m trying to classify the user activity of the users in India using the Crux Country dataset across the following categories based on the number of total base pages that

  1. Hit ‘Direct Origin’ within the country
  2. Hit ‘Direct Origin’ outside of the country
  3. Hit ‘CDN’ PoP within the country
  4. Hit ‘CDN’ PoP outside the country

What I’m getting is very strange. Out of 298415 base pages, I see only 9 base pages that are hitting a CDN PoP within India. That’s 0.004%. I’m looking for some help to both check the correctness of the logic and tuning of this SQL query. Here’s the result -

08%20PM

… And the SQL query

WITH numpages_origin_type_geotype_table as (
WITH page_ip_origintype_geoipdetails_table as (
WITH page_ip_origintype_table as
(WITH page_ip_cdn_table AS (#distinct pages list(298415 rows)
SELECT
DISTINCT requests.page as p, JSON_EXTRACT_SCALAR(payload,"._ip_addr") as ip, JSON_EXTRACT_SCALAR(payload,"._cdn_provider") as cdn

FROM
httparchive.requests.2019_08_01_* as requests
JOIN
(SELECT
DISTINCT url
FROM
httparchive.pages.2019_08_01_* as pages
JOIN
chrome-ux-report.country_in.201908 as crux
ON
CONCAT(origin, ‘/’) = url
GROUP BY
url) as pages_crux

ON (pages_crux.url = requests.page) and (pages_crux.url = requests.url))

#origin type addition (298415 rows)
SELECT p, ip, CASE WHEN cdn = ‘’ THEN “Direct Origin” WHEN cdn = ‘null’ THEN “Direct Origin” WHEN cdn iS NULL THEN “Direct Origin” ELSE “CDN Enabled” END as origin_type
FROM (
SELECT p, ip, cdn from page_ip_cdn_table))

#Geo details addition (298415 rows)
SELECT p, origin_type, NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(ip)) clientIpNum, TRUNC(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(ip))/(256*256)) classB
FROM page_ip_origintype_table)

#Geo_type addition (208030 rows)
SELECT p, origin_type, CASE WHEN (countryLabel = “India”) THEN “Within India” ELSE “Outside of India” END as geo_type
FROM page_ip_origintype_geoipdetails_table
JOIN fh-bigquery.geocode.geolite_city_bq_b2b as geotable
USING (classB)
WHERE clientIpNum BETWEEN startIpNum AND endIpNum)

#no of pages based on origin and geo
SELECT COUNT§ as num_pages, origin_type, geo_type
FROM numpages_origin_type_geotype_table
GROUP BY origin_type, geo_type
ORDER BY num_pages DESC;

Hi @saibalaji,

While you can use CrUX to find sites that receive traffic within certain countries, the HTTP Archive data does not provide any synthetic measurements from the context of a user in those countries. All measurements are all run from Redwood City, CA in the US - so ideally the IP addresses you are seeing traffic resolve to will mostly be on the US west coast.

There’s a few other factors at play here that will complicate this:

  • Some CDNs use Anycast, and your IP geolocation database will not provide an accurate result. Traceroutes can be helpful here, but you would need to run them from India.
  • The accuracy of the IP Geolocation database can limit some of your results. The geolite_city_bq_b2b table you are querying was last updated in 2013…

Paul

2 Likes

This really helps! Thanks for the clarification.

1 Like