numDomElements data


#1

Discussions about complexity based on the size of the DOM come up very frequently. A number of features aren’t implemented because of the complexity they would create with large trees, that is, pages with a large number of DOM elements. Inevitably this leads to discussions about how big trees really are in practice or on average. There are pages like the single-page edition of the HTML Living Standard that have more than 140k elements, but it’s widely accepted that this isn’t the norm. The question for me has always been how we can get decent information on the norms…

I noticed that the HTTP Archive includes some information about the average number of DOM elements:

http://mobile.httparchive.org/interesting.php#numDomElements

However, this would seem to imply that 3200 is the maximum number of elements in a DOM tree among the top million sites… Are there really no sites in this dataset with more than 3200? If so, and assuming one did appear, would it show up as a new bucket with its own range, or would it just show up in a “>3200” bucket?


#2

As a start, a quick way to enumerate the “heaviest” sites with respect to number of DOM elements:

SELECT url, numDomElements
FROM [httparchive:runs.2016_05_01_pages]
ORDER BY numDomElements DESC
LIMIT 10

Results:

Row	url	numDomElements
1	http://www.faqexplorer.com/	120086
2	http://www.neejolie.fr/	89844
3	http://www.groupintown.it/	84249
4	http://www.webjars.org/	76956
5	http://www.nomadlist.com/	74064
6	http://www.xn--4ogoszenia-c0b.pl/	68784
7	http://www.sakc.jp/	64500
8	http://www.kyujin-ascom.com/	62185
9	http://www.mazmize.com/	60091
10	http://www.bravelets.com/	58149

Restricting to “top 1k” sites (strictly, sites ranked better than 1000):

SELECT url, numDomElements
FROM [httparchive:runs.2016_05_01_pages]
WHERE rank < 1000
ORDER BY numDomElements DESC
LIMIT 10

Results:

Row	url	numDomElements	 
1	http://www.imzog.com/	22482	 
2	http://www.seasonvar.ru/	19007	 
3	http://www.momoshop.com.tw/	12563	 
4	http://www.paytm.in/	11274	 
5	http://www.paytm.com/	10219	 
6	http://www.dianping.com/	9488	 
7	http://www.xcar.com.cn/	8541	 
8	http://www.newegg.com/	8321	 
9	http://www.txxx.com/	8208	 
10	http://www.staples.com/	8123	

Closer to your actual question… Quantiles for element counts. QUANTILES(expr, 101) returns 101 values with the minimum first and the maximum last, and NTH counts from 1, so the k-th percentile sits at position k + 1:

SELECT
  NTH(11, QUANTILES(numDomElements, 101)) tenth,
  NTH(21, QUANTILES(numDomElements, 101)) twentieth,
  NTH(31, QUANTILES(numDomElements, 101)) thirtieth,
  NTH(41, QUANTILES(numDomElements, 101)) fortieth,
  NTH(51, QUANTILES(numDomElements, 101)) fiftieth,
  NTH(61, QUANTILES(numDomElements, 101)) sixtieth,
  NTH(71, QUANTILES(numDomElements, 101)) seventieth,
  NTH(81, QUANTILES(numDomElements, 101)) eightieth,
  NTH(91, QUANTILES(numDomElements, 101)) ninetieth,
  NTH(96, QUANTILES(numDomElements, 101)) ninety_fifth,
  NTH(100, QUANTILES(numDomElements, 101)) ninety_ninth
FROM [httparchive:runs.latest_pages]
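
To make the bucket positions concrete, here is a small Python sketch. It is not BigQuery's implementation (QUANTILES there is approximate). The `quantiles` and `nth` helpers and the sample data are made up for illustration. They assume the legacy SQL behaviour that the requested bucket count includes the minimum and the maximum, and that NTH counts from 1.

```python
def quantiles(values, buckets):
    """Return `buckets` evenly spaced order statistics, min and max included.

    Hypothetical stand-in for QUANTILES(expr, buckets); exact, not approximate.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / (buckets - 1))] for i in range(buckets)]


def nth(n, seq):
    """1-based lookup, mirroring NTH(n, ...)."""
    return seq[n - 1]


# Made-up element counts: 0, 1, 2, ..., 1000.
counts = list(range(1001))
q = quantiles(counts, 101)

print(nth(1, q))    # minimum
print(nth(51, q))   # median (50th percentile)
print(nth(50, q))   # one bucket below the median (49th percentile)
print(nth(101, q))  # maximum
```

With 101 buckets, position 1 holds the minimum, so position 50 lands on the 49th percentile and position 51 on the median.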


#3

And perhaps the initial question is why the histogram stops at 3200. I believe I don’t show any histogram buckets where the percentage is below 1%. A possible improvement in this situation might be to label the last bucket “> 2800” (rather than “2801-3200”).
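
As a rough illustration of the open-ended last bucket, the Python sketch below folds every count above a cap into a single “> 2800” bucket. It is not the site's actual charting code. The bucket width of 400 and the cap of 2800 are assumptions inferred from the “2801-3200” label, `histogram` is a hypothetical helper, and the sample counts are mostly made up.

```python
from collections import Counter


def histogram(counts, width=400, cap=2800):
    """Map bucket label -> share of pages in percent.

    Counts above `cap` share one open-ended bucket instead of getting
    their own ranges. `width` and `cap` are assumed values.
    """
    tally = Counter()
    for n in counts:
        if n > cap:
            tally[f"> {cap}"] += 1
        else:
            # Buckets are 0-400, 401-800, ...: the upper edge is inclusive.
            upper = max(width, -(-n // width) * width)
            lower = 0 if upper == width else upper - width + 1
            tally[f"{lower}-{upper}"] += 1
    total = len(counts)
    return {label: 100 * k / total for label, k in tally.items()}


# Mostly made-up counts; the two largest appear in the results above.
sample = [120, 400, 401, 950, 2800, 2801, 3500, 22482, 120086]
shares = histogram(sample)

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```

Labelled this way, the chart no longer suggests that 3200 is a hard maximum, because every larger page is still counted in the last bar.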


#4

Yeah, this was effectively my question. The chart seems to imply that there just aren’t sites bigger than 3200 elements, which didn’t seem right. I guess I misunderstood, but maybe others would too.