We know the CrUX report already splits its datasets by country. But an origin appearing in a country's dataset doesn't mean the site is local or developed by local developers. So, to understand the state of the web in a country, we're trying to find a way to identify which sites are actually developed by local developers. Below are 4 signals we're using to identify local websites in my country, Indonesia. Thanks to @rviscomi and @paulcalvano for helping with the IP location and language attribute signals.
Identifying based on the country domain suffix
Indonesia has the .id domain suffix. The domain is only available to Indonesians, because registering it requires a country legal document. For this signal we use the latest CrUX country dataset and find all origins ending in .id. For example, the query below uses the January 2019 dataset.
#standardSQL
SELECT DISTINCT origin FROM `chrome-ux-report.country_id.201901` WHERE origin LIKE '%.id'
Using the query above, we get 19,580 origins, or 11.6% of the 168,591 origins in the chrome-ux-report.country_id.201901 dataset. The .id domain suffix is the strongest signal that a website must be Indonesian, because there is no way to buy a .id domain without a government legal document.
Identifying based on server location
We assume that if a website's servers are located in a country that is not a data center hub for cloud services, the site must be deliberately deployed in that country: either because it's required to be (Indonesia has a regulation that websites handling financial data must host that data in Indonesia), or to be closer to its users.
Following a discussion with @rviscomi and @paulcalvano, we came up with this query:
#legacySQL
SELECT page, count(*)
FROM (
SELECT page, url,
INTEGER(PARSE_IP(REGEXP_REPLACE(JSON_EXTRACT(payload, "$._ip_addr"), r"([\"]+)",""))) AS clientIpNum,
INTEGER(PARSE_IP(REGEXP_REPLACE(JSON_EXTRACT(payload, "$._ip_addr"), r"([\"]+)",""))/(256*256)) AS classB
FROM [httparchive:latest.requests_mobile]
WHERE JSON_EXTRACT(payload, "$._final_base_page") = "true"
) AS a
JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON a.classB = b.classB
JOIN EACH (SELECT CONCAT(origin, '/') AS origin FROM [chrome-ux-report:country_id.201901]) crux
ON url = origin
WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum AND countryLabel = "Indonesia"
GROUP BY page
The query above produces 9,123 origins, or 5.4%.
If we combine the results of the suffix-based and IP-based signals, we get 24,340 total origins; intersecting the two, there are 4,363 origins from the IP location signal that also have a .id suffix.
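As a sketch, the overlap between the two signals could be computed with a join, assuming the IP-based results were saved to the `ecosystem-activation.indonesia.id_ip_location` table (the same table used in the combined query later in this post) with a `page` column that ends in a trailing slash:

```sql
#standardSQL
-- Sketch: count origins that appear in both the .id suffix signal
-- and the IP-location signal. Assumes the IP-location results were
-- materialized to `ecosystem-activation.indonesia.id_ip_location`.
SELECT COUNT(*) AS overlap
FROM (
  SELECT DISTINCT origin
  FROM `chrome-ux-report.country_id.201901`
  WHERE origin LIKE '%.id'
) suffix
JOIN (
  SELECT DISTINCT RTRIM(page, '/') AS origin
  FROM `ecosystem-activation.indonesia.id_ip_location`
) ip
USING (origin)
```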
Identifying based on language attribute
There is also the possibility of using the lang attribute in the HTML response, as @rviscomi explained here. Using the lang attribute with the query below, we can detect 19,506 origins:
#standardSQL
SELECT
DISTINCT origin
FROM
`httparchive.response_bodies.2019_01_01_mobile`
JOIN
`chrome-ux-report.country_id.201901`
ON
CONCAT(origin, '/') = page
WHERE REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')='id'
Of the 19,506 origins from the query above, only 518 exist in both the .id suffix signal and the IP location signal. 2,172 of them also exist in the .id suffix signal, and 1,291 also exist in the IP location signal.
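These overlap counts can be produced with a join against each signal, for example against the .id suffix signal, assuming the lang attribute results were saved to the `ecosystem-activation.indonesia.id_lang_attribute` table (the same table used in the combined query later) with an `origin` column:

```sql
#standardSQL
-- Sketch: overlap of the lang-attribute origins with the .id suffix
-- signal. Assumes the lang-attribute results were materialized to
-- `ecosystem-activation.indonesia.id_lang_attribute`.
SELECT COUNT(*) AS overlap
FROM `ecosystem-activation.indonesia.id_lang_attribute` lang
JOIN (
  SELECT DISTINCT origin
  FROM `chrome-ux-report.country_id.201901`
  WHERE origin LIKE '%.id'
) suffix
USING (origin)
```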
Identifying based on content language
We also assume that if a website provides content in a specific language, it could have been developed by Indonesians. So, to identify content in a specific language (in this case Bahasa Indonesia) using BigQuery, we use stop word detection. The HTTP Archive response bodies dataset gives us each page's response body as a string. We use a stop word list found on GitHub that contains 758 Bahasa Indonesia stop words.
Before we can detect the language of the HTML response body, we first need to tokenize the response string and store the tokens in a temporary table. Counting stop words directly in the response string hits BigQuery's memory limit, so the query can't run without the intermediate table. Below is the query that tokenizes the response strings from the HTTP Archive dataset.
#standardSQL
SELECT url, SPLIT(REPLACE(REPLACE(resp.body,".",""),",","")," ") AS tokens
FROM `httparchive.response_bodies.2019_02_01_mobile` AS resp
JOIN `chrome-ux-report.country_id.201901` AS crux
ON resp.url = CONCAT(crux.origin,'/')
WHERE resp.url=resp.page
Then we run a query to count how many stop words exist per origin:
#standardSQL
SELECT url, COUNT(token) AS found
FROM `ecosystem-activation.indonesia.body_tokenized` CROSS JOIN UNNEST(tokens) as token
WHERE token IN (SELECT word FROM `ecosystem-activation.indonesia.stop_words`)
GROUP BY url
The table ecosystem-activation.indonesia.stop_words is simply the stop word list we're using. The result gives us each origin and the number of stop words found in its response body. We then need to set a threshold: the minimum number of stop words that must appear in a website's content. For reference, below are the origin counts for different thresholds:
>=50 stop words = 54,062 URLs
>=30 stop words = 59,840 URLs
>=20 stop words = 63,250 URLs
We assume that content containing 50 or more stop words is in Bahasa Indonesia, which means there are 54,062 websites in Bahasa Indonesia.
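The threshold counts above can be reproduced in one pass over the per-origin stop word counts, assuming they were saved to the `ecosystem-activation.indonesia.id_urls_stopwords` table (the same table used in the combined query below):

```sql
#standardSQL
-- Sketch: count origins at each stop-word threshold. Assumes the
-- per-origin counts were materialized to
-- `ecosystem-activation.indonesia.id_urls_stopwords`.
SELECT
  COUNTIF(found >= 50) AS at_least_50,
  COUNTIF(found >= 30) AS at_least_30,
  COUNTIF(found >= 20) AS at_least_20
FROM `ecosystem-activation.indonesia.id_urls_stopwords`
```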
If we cross all the signals for Indonesian CrUX origins, we get only 431 origins, meaning only these websites appear in every signal we're using. With the query below we can see how many websites match each number of the 4 signals above.
#standardSQL
SELECT signals,COUNT(origin) as origin
FROM (
SELECT origin,COUNT(origin) as signals FROM (
SELECT DISTINCT origin FROM `chrome-ux-report.country_id.201901` WHERE origin LIKE '%.id'
UNION ALL
SELECT DISTINCT RTRIM(page,'/') as origin FROM `ecosystem-activation.indonesia.id_ip_location`
UNION ALL
SELECT RTRIM(url,'/') as origin FROM `ecosystem-activation.indonesia.id_urls_stopwords` WHERE found>=50
UNION ALL
SELECT origin FROM `ecosystem-activation.indonesia.id_lang_attribute`)
GROUP BY origin
)
GROUP BY signals ORDER BY origin DESC
The result below shows that only 25% of the 73,654 websites we were able to detect have more than 1 signal.
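To inspect the 431 origins that match every signal, a small variation of the same query can keep only origins counted 4 times:

```sql
#standardSQL
-- Sketch: list the origins that match all 4 signals, reusing the
-- union from the query above and filtering on signals = 4.
SELECT origin
FROM (
  SELECT origin, COUNT(origin) AS signals FROM (
    SELECT DISTINCT origin FROM `chrome-ux-report.country_id.201901` WHERE origin LIKE '%.id'
    UNION ALL
    SELECT DISTINCT RTRIM(page, '/') AS origin FROM `ecosystem-activation.indonesia.id_ip_location`
    UNION ALL
    SELECT RTRIM(url, '/') AS origin FROM `ecosystem-activation.indonesia.id_urls_stopwords` WHERE found >= 50
    UNION ALL
    SELECT origin FROM `ecosystem-activation.indonesia.id_lang_attribute`
  )
  GROUP BY origin
)
WHERE signals = 4
```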
Summary
Of the 4 signals above, the domain suffix and IP location are the strongest signals we have. The language attribute and stop-word content detection still need verification to see how reliably they detect local sites. We'd love your feedback on this approach, or suggestions for additional detection signals.
Thanks,