Hello people,
I’m trying to classify the user activity of the users in India using the Crux Country dataset across the following categories based on the number of total base pages that
- Hit ‘Direct Origin’ within the country
- Hit ‘Direct Origin’ outside of the country
- Hit ‘CDN’ PoP within the country
- Hit ‘CDN’ PoP outside the country
What I’m getting is very strange. Out of 298415 base pages, I see only 9 base pages that are hitting a CDN PoP within India. That’s 0.004%. I’m looking for some help to both check the correctness of the logic and tuning of this SQL query. Here’s the result -
… And the SQL query
WITH numpages_origin_type_geotype_table as (
WITH page_ip_origintype_geoipdetails_table as (
WITH page_ip_origintype_table as
(WITH page_ip_cdn_table AS (#distinct pages list(298415 rows)
SELECT
DISTINCT requests.page as p, JSON_EXTRACT_SCALAR(payload,"._ip_addr") as ip, JSON_EXTRACT_SCALAR(payload,"._cdn_provider") as cdn
FROM
httparchive.requests.2019_08_01_*
as requests
JOIN
(SELECT
DISTINCT url
FROM
httparchive.pages.2019_08_01_*
as pages
JOIN
chrome-ux-report.country_in.201908
as crux
ON
CONCAT(origin, ‘/’) = url
GROUP BY
url) as pages_crux
ON (pages_crux.url = requests.page) and (pages_crux.url = requests.url))
#origin type addition (298415 rows)
SELECT p, ip, CASE WHEN cdn = ‘’ THEN “Direct Origin” WHEN cdn = ‘null’ THEN “Direct Origin” WHEN cdn iS NULL THEN “Direct Origin” ELSE “CDN Enabled” END as origin_type
FROM (
SELECT p, ip, cdn from page_ip_cdn_table))
#Geo details addition (298415 rows)
SELECT p, origin_type, NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(ip)) clientIpNum, TRUNC(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(ip))/(256*256)) classB
FROM page_ip_origintype_table)
#Geo_type addition (208030 rows)
SELECT p, origin_type, CASE WHEN (countryLabel = “India”) THEN “Within India” ELSE “Outside of India” END as geo_type
FROM page_ip_origintype_geoipdetails_table
JOIN fh-bigquery.geocode.geolite_city_bq_b2b
as geotable
USING (classB)
WHERE clientIpNum BETWEEN startIpNum AND endIpNum)
#no of pages based on origin and geo
SELECT COUNT§ as num_pages, origin_type, geo_type
FROM numpages_origin_type_geotype_table
GROUP BY origin_type, geo_type
ORDER BY num_pages DESC;