We know the CrUX report already splits its datasets by country. But an origin appearing in a country's dataset doesn't mean the site is local or developed by local developers. So, to understand the state of the web in a country, we're trying to find a way to identify which sites are actually developed by local developers. Below are 4 signals we're using to identify local websites in my country, Indonesia. Thanks to @rviscomi and @paulcalvano for helping with the IP location and language attribute signals.
Identifying based on the country domain suffix
Indonesia has the .id domain suffix. The domain is only available to Indonesians, because registering it requires a country legal document. For this signal we use the latest CrUX country dataset and find all origins ending in .id. For example, the query below uses the January 2019 dataset.
#standardSQL
SELECT DISTINCT origin FROM `chrome-ux-report.country_id.201901` WHERE origin LIKE '%.id'
Using the query above, we get 19,580 origins, or 11.6% of the 168,591 origins in the chrome-ux-report.country_id.201901 dataset. The .id domain suffix is the strongest signal that a website must be Indonesian, because there is no way to buy a .id domain without a government legal document.
Identifying based on server location
We assume that if a website's servers are located in a country that is not a data center hub for cloud services, the site must be deliberately deployed in that country: either because it's required to be (Indonesia has a regulation that websites handling financial data must host that data in Indonesia), or to be closer to its users.
Following a discussion with @rviscomi and @paulcalvano, we came up with this query:
#legacySQL
SELECT page, count(*)
FROM (
SELECT page, url,
INTEGER(PARSE_IP(REGEXP_REPLACE(JSON_EXTRACT(payload, "$._ip_addr"), r"([\"]+)",""))) AS clientIpNum,
INTEGER(PARSE_IP(REGEXP_REPLACE(JSON_EXTRACT(payload, "$._ip_addr"), r"([\"]+)",""))/(256*256)) AS classB
FROM [httparchive:latest.requests_mobile]
WHERE JSON_EXTRACT(payload, "$._final_base_page") = "true"
) AS a
JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON a.classB = b.classB
JOIN EACH (SELECT CONCAT(origin, '/') AS origin FROM [chrome-ux-report:country_id.201901]) crux
ON url = origin
WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum AND countryLabel = "Indonesia"
GROUP BY page
The query above produces 9,123 origins, or 5.4%.
If we combine the results of the suffix-based and IP-based signals, we get 24,340 total origins; intersecting the two, there are 4,363 origins from the IP location signal that also have a .id suffix.
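As a sketch, the overlap between the two signals could be computed with a join, assuming the IP-based results were saved to the `ecosystem-activation.indonesia.id_ip_location` table (the same table used in the combined query later in this post) with a `page` column that ends in a trailing slash:

```sql
#standardSQL
-- Sketch: count origins that appear in both the .id suffix signal
-- and the IP-location signal. Assumes the IP-location results were
-- materialized to `ecosystem-activation.indonesia.id_ip_location`.
SELECT COUNT(*) AS overlap
FROM (
  SELECT DISTINCT origin
  FROM `chrome-ux-report.country_id.201901`
  WHERE origin LIKE '%.id'
) suffix
JOIN (
  SELECT DISTINCT RTRIM(page, '/') AS origin
  FROM `ecosystem-activation.indonesia.id_ip_location`
) ip
USING (origin)
```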
Identifying based on language attribute
There is also the possibility of using the lang attribute in the HTML response, as @rviscomi explained here. Using the lang attribute with the query below, we can detect 19,506 origins:
#standardSQL
SELECT
DISTINCT origin
FROM
`httparchive.response_bodies.2019_01_01_mobile`
JOIN
`chrome-ux-report.country_id.201901`
ON
CONCAT(origin, '/') = page
WHERE REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')='id'
Of the 19,506 origins from the query above, only 518 exist in both the .id suffix signal and the IP location signal. 2,172 of them also exist in the .id suffix signal, and 1,291 also exist in the IP location signal.
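These overlap counts can be produced with a join against each signal, for example against the .id suffix signal, assuming the lang attribute results were saved to the `ecosystem-activation.indonesia.id_lang_attribute` table (the same table used in the combined query later) with an `origin` column:

```sql
#standardSQL
-- Sketch: overlap of the lang-attribute origins with the .id suffix
-- signal. Assumes the lang-attribute results were materialized to
-- `ecosystem-activation.indonesia.id_lang_attribute`.
SELECT COUNT(*) AS overlap
FROM `ecosystem-activation.indonesia.id_lang_attribute` lang
JOIN (
  SELECT DISTINCT origin
  FROM `chrome-ux-report.country_id.201901`
  WHERE origin LIKE '%.id'
) suffix
USING (origin)
```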
Identifying based on content language
We also assume that if a website provides content in a specific language, it could have been developed by Indonesians. So, to identify content in a specific language (in this case Bahasa Indonesia) using BigQuery, we use stop word detection. The HTTP Archive response bodies dataset gives us each page's response body as a string. We use a stop word list found on GitHub that contains 758 Bahasa Indonesia stop words.
Before we can detect the language of the HTML response body, we first need to tokenize the response string and store the tokens in a temporary table. Counting stop words directly in the response string hits BigQuery's memory limit, so the query can't run without the intermediate table. Below is the query that tokenizes the response strings from the HTTP Archive dataset.
#standardSQL
SELECT url, SPLIT(REPLACE(REPLACE(resp.body,".",""),",","")," ") AS tokens
FROM `httparchive.response_bodies.2019_02_01_mobile` AS resp
JOIN `chrome-ux-report.country_id.201901` AS crux
ON resp.url = CONCAT(crux.origin,'/')
WHERE resp.url=resp.page
Then we run a query to count how many stop words exist per origin:
#standardSQL
SELECT url, COUNT(token) AS found
FROM `ecosystem-activation.indonesia.body_tokenized` CROSS JOIN UNNEST(tokens) as token
WHERE token IN (SELECT word FROM `ecosystem-activation.indonesia.stop_words`)
GROUP BY url
The table ecosystem-activation.indonesia.stop_words is simply the stop word list we're using. The result gives us each origin and the number of stop words found in its response body. We then need to set a threshold: the minimum number of stop words that must appear in a website's content. For reference, below are the origin counts for different thresholds:
>=50 stop words = 54,062 URLs
>=30 stop words = 59,840 URLs
>=20 stop words = 63,250 URLs
We assume that content containing 50 or more stop words is in Bahasa Indonesia, which means there are 54,062 websites in Bahasa Indonesia.
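The threshold counts above can be reproduced in one pass over the per-origin stop word counts, assuming they were saved to the `ecosystem-activation.indonesia.id_urls_stopwords` table (the same table used in the combined query below):

```sql
#standardSQL
-- Sketch: count origins at each stop-word threshold. Assumes the
-- per-origin counts were materialized to
-- `ecosystem-activation.indonesia.id_urls_stopwords`.
SELECT
  COUNTIF(found >= 50) AS at_least_50,
  COUNTIF(found >= 30) AS at_least_30,
  COUNTIF(found >= 20) AS at_least_20
FROM `ecosystem-activation.indonesia.id_urls_stopwords`
```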
If we cross all the signals for Indonesian CrUX origins, we get only 431 origins, meaning only these websites appear in every signal we're using. With the query below we can see how many websites match each number of the 4 signals above.
#standardSQL
SELECT signals,COUNT(origin) as origin
FROM (
SELECT origin,COUNT(origin) as signals FROM (
SELECT DISTINCT origin FROM `chrome-ux-report.country_id.201901` WHERE origin LIKE '%.id'
UNION ALL
SELECT DISTINCT RTRIM(page,'/') as origin FROM `ecosystem-activation.indonesia.id_ip_location`
UNION ALL
SELECT RTRIM(url,'/') as origin FROM `ecosystem-activation.indonesia.id_urls_stopwords` WHERE found>=50
UNION ALL
SELECT origin FROM `ecosystem-activation.indonesia.id_lang_attribute`)
GROUP BY origin
)
GROUP BY signals ORDER BY origin DESC
The result below shows that only 25% of the 73,654 websites we were able to detect have more than 1 signal.
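To inspect the 431 origins that match every signal, a small variation of the same query can keep only origins counted 4 times:

```sql
#standardSQL
-- Sketch: list the origins that match all 4 signals, reusing the
-- union from the query above and filtering on signals = 4.
SELECT origin
FROM (
  SELECT origin, COUNT(origin) AS signals FROM (
    SELECT DISTINCT origin FROM `chrome-ux-report.country_id.201901` WHERE origin LIKE '%.id'
    UNION ALL
    SELECT DISTINCT RTRIM(page, '/') AS origin FROM `ecosystem-activation.indonesia.id_ip_location`
    UNION ALL
    SELECT RTRIM(url, '/') AS origin FROM `ecosystem-activation.indonesia.id_urls_stopwords` WHERE found >= 50
    UNION ALL
    SELECT origin FROM `ecosystem-activation.indonesia.id_lang_attribute`
  )
  GROUP BY origin
)
WHERE signals = 4
```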
Summary
Of the 4 signals above, the domain suffix and IP location are the strongest signals we have. The language attribute and stop-word content detection still need verification to see how reliably they detect local sites. We'd love your feedback on this approach, or suggestions for additional detection signals.
Thanks,