The most frequently used anchor text

rviscomi · September 18, 2017, 6:01pm

Recently, in another project, the topic of descriptive anchor text came up. Specifically, we were looking to run an SEO audit on any given web page and call out links on the page that could have more descriptive text to assist search crawlers. That raises the question: what are examples of nondescript anchor text?

To answer this question, I instrumented HTTP Archive’s WebPageTest agents with a one-time custom metric that collects every anchor’s innerText on the page. The September 1 crawl’s HAR data includes this metric under the name _anchor_text.

To start working with the data, I created a new scratchspace table anchors and populated it with every single anchor value and the URL of the page on which it appeared. Here’s the query used to build the table:

CREATE TEMPORARY FUNCTION parseJson(anchors STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  try {
    anchors = JSON.parse(anchors);
    if (!Array.isArray(anchors)) return [];
    return anchors;
  } catch (e) {
    return [];
  }
""";

SELECT
  url,
  anchor
FROM (
  SELECT
    url,
    parseJson(TRIM(JSON_EXTRACT(payload, "$._anchor_text"), '"')) AS anchors
  FROM
    `httparchive.har.2017_09_01_chrome_pages`)
CROSS JOIN
  UNNEST(anchors) AS anchor
WHERE
  anchor != ''

The inner select statement uses a JS function to help parse the JSON-encoded array of anchor text out of the also-JSON-encoded HAR payload. Note that we’re only looking at the desktop tests’ results. The mobile data is available in the har.2017_09_01_android_pages table for the curious. CROSS JOIN UNNEST(anchors) AS anchor is a really useful way to convert a single row with repeated fields into multiple rows of fields. Finally, in the outer select statement, one anchor at a time is selected along with its corresponding URL (of the page, not the anchor’s href value) and empty anchors are excluded.
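For readers who want to see the same logic outside BigQuery, here is a minimal JavaScript sketch of the parse-then-flatten step. The function names and the sample pages are made up for illustration; only the try/catch parsing mirrors the UDF in the query.

```javascript
// Mirrors the parseJson UDF: return an array, or [] on any parse problem.
function parseAnchors(json) {
  try {
    const anchors = JSON.parse(json);
    return Array.isArray(anchors) ? anchors : [];
  } catch (e) {
    return [];
  }
}

// Rough equivalent of CROSS JOIN UNNEST(anchors) plus WHERE anchor != '':
// one output row per non-empty anchor, each carrying its page URL.
function flattenAnchors(pages) {
  return pages.flatMap(({ url, anchorsJson }) =>
    parseAnchors(anchorsJson)
      .filter((anchor) => anchor !== '')
      .map((anchor) => ({ url, anchor }))
  );
}

// Hypothetical sample pages, not real HTTP Archive rows.
const samplePages = [
  { url: 'http://example.com/', anchorsJson: '["Home", "", "Read more"]' },
  { url: 'http://example.org/', anchorsJson: 'not json' },
];

const anchorRows = flattenAnchors(samplePages);
console.log(anchorRows);
```

A page whose value fails to parse contributes no rows at all, which matches how an empty array behaves under CROSS JOIN UNNEST.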

The resulting scratchspace table is huge: 3.4 GB and 67,485,195 rows. Working from this table, queries run much more quickly than they would if they had to do the JSON parsing at the same time. The queries are also much simpler to write. For example, to get the top 5 case-insensitive anchor values:

SELECT APPROX_TOP_COUNT(LOWER(anchor), 5) FROM `httparchive.scratchspace.anchors`
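To make the idea concrete, here is a small JavaScript sketch of a case-insensitive top-N count. Note the difference: APPROX_TOP_COUNT returns approximate results, while this sketch counts exactly, which is only practical for small in-memory samples. The function name and the sample values are hypothetical.

```javascript
// Exact top-N by lowercased value (the BigQuery function is approximate).
function topCount(values, n) {
  const counts = new Map();
  for (const value of values) {
    const key = value.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, n);
}

// Hypothetical anchors, not taken from the crawl.
const sampleAnchors = ['Home', 'home', 'Read More', 'read more', 'HOME', '0'];
const top2 = topCount(sampleAnchors, 2);
console.log(top2);
```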

That’s all we need. Before I show you the results, take a moment to think about what you might expect to see. I ran this Twitter poll for a couple of days and got 267 people’s guesses as to what the #1 anchor value is.

The majority of voters seem to agree that “read more” is the most popular, followed by “home”, with “0” and “►” far behind. How well does their intuition align with the data?

The Results

Rank  Anchor      Volume    Distinct URLs    Avg
------------------------------------------------
1     ►          241,729           2,078     116
2     0          180,349          21,272       8
3     read more  165,620           1,693       6
4     2          150,310          89,651       1
5     home       148,120           2,136       1

Believe it or not the triangle is the most popular anchor text. I know, I didn’t believe it either. So I tried to get a sense of the spread of the anchors by querying for the number of distinct URLs with each anchor. This query is a bit more complicated than the simple APPROX_TOP_COUNT above, but it gives more insight into the data:

SELECT
  top.value AS anchor,
  top.count AS volume,
  distinctUrls
FROM (
  SELECT
    APPROX_TOP_COUNT(LOWER(anchor), 100) AS tops
  FROM
    `httparchive.scratchspace.anchors`)
CROSS JOIN
  UNNEST(tops) AS top
LEFT JOIN (
  SELECT
    COUNT(DISTINCT url) AS distinctUrls,
    LOWER(anchor) AS anchor
  FROM
    `httparchive.scratchspace.anchors`
  GROUP BY
    anchor)
ON top.value = anchor
ORDER BY
  volume DESC

I manually added the “Avg” column, which is just the floor of the volume divided by the count of distinct URLs. See the Top 100 Anchor Text spreadsheet for more info.
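As a quick check of that arithmetic, this JavaScript sketch recomputes the “Avg” value for three of the rows, using the volume and distinct URL figures copied from the results table. The helper name is made up for illustration.

```javascript
// Floor of volume divided by distinct URLs, as described for the "Avg" column.
function avgPerPage(volume, distinctUrls) {
  return Math.floor(volume / distinctUrls);
}

// Figures copied from the results table in this post.
const tableRows = [
  { anchor: '►', volume: 241729, distinctUrls: 2078 },
  { anchor: '0', volume: 180349, distinctUrls: 21272 },
  { anchor: '2', volume: 150310, distinctUrls: 89651 },
];

const averages = tableRows.map((row) => avgPerPage(row.volume, row.distinctUrls));
console.log(averages); // [116, 8, 1]
```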

So it’s clear that ► occurs many times per page on average, about 116, way more than any other anchor. I took a sample of these pages and I laughed at what I found. I was expecting to see hundreds of media players, with ► being the play button. (HTTP Archive crawls a lot of spammy/questionable sites, but such is the web.) Instead, most of the pages were hosted on blogspot, the domain given for Blogger accounts, and the triangle was being used as an expandable link for blog archives. So blogs with many articles would have one of these triangles for each month and year with a post.

For example, http://android-er.blogspot.in/ has been posting since 2007, so that’s ~120 months of archives or 115 ► links. If all Blogger blogs include this archive widget by default, that feasibly explains why this specific anchor value is the most popular. Go figure!

So that still doesn’t explain the other unexpected result: “0” is the next most popular anchor text. It appears 8 times per page on average, while the other single-digit anchor texts only occur once. I can see 1-10 being popular because they could be used for pagination, but what’s so special about 0? So I flipped through some test cases and was caught by surprise again. As it turns out, many pages contain snippets of articles and include a link to the comments, where the anchor text is the number of comments. It makes sense that many articles would have 0 comments.

Comment anchors on indomoto.com

Ok so I can accept the fact that ► and 0 are popular values, but how is it that they’re more popular than “read more”? The answer to this is much simpler, I think, and that is fragmentation. There are many ways to say “read more”. For example, the Top 100 contains variations including:

“learn more” (74,043)
“more” (58,369)
“更多” and “更多>>” (“more” in Chinese, 46,402 combined)
“подробнее” (“more” in Russian, 32,197)
“read more »” (29,480)
“read more…” (17,730)
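The counts in this list can be totaled and compared against the gap between ► and “read more” in the results table. Here is a short JavaScript sketch of that sum; all figures are copied from this post and the variable names are only illustrative.

```javascript
// Variant counts as listed in this post.
const variants = [
  { text: 'learn more', count: 74043 },
  { text: 'more', count: 58369 },
  { text: '更多 / 更多>>', count: 46402 },
  { text: 'подробнее', count: 32197 },
  { text: 'read more »', count: 29480 },
  { text: 'read more…', count: 17730 },
];

const variantTotal = variants.reduce((sum, v) => sum + v.count, 0);

// Volume of ► minus volume of "read more", from the results table.
const gap = 241729 - 165620;

console.log(variantTotal, gap, variantTotal > gap); // 258221 76109 true
```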

When combined, these variations cover much more than the ~76,000 gap between “read more” and ►. You could say they’re the Ralph Nader of anchor texts. Al Gore didn’t win that year, so it’s only fair that “read more” shouldn’t either. Henceforth I’m officially declaring ► the winner of the prestigious title of Most Frequently Used Anchor Text. So, were you in the 11% who guessed correctly?

charlie.clark · September 18, 2017, 6:29pm

Hi Rick,
now I understand what you were trying to do with the data collection. As you note, there are a lot of symbols used to help signpost links for “more” (combining the two is “doubly determined” in semantics, making it “obvious”; the converse is the hamburger without a textual signifier). And you hit the first semantic roadblock in your first analysis with the triangle being used to toggle on a page and not load new content, so you should really be stripping these from your analysis…

I’ve worked with clients to try and force the use of such decoration through CSS only and that took several years. However, you really only need to change your heuristics slightly: “read more(\s+)?|learn more|more” or something similar and the more variants would come out handsomely on top.
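A rough JavaScript sketch of that heuristic, applied to counts quoted earlier in this thread, might look like the following. The leading ^ and trailing \b are assumptions added here so that unrelated words such as “moreover” are not swept in; they are not part of the pattern suggested above, and the function names are hypothetical.

```javascript
// Anchored variant of the suggested pattern (the anchoring is an assumption).
const MORE_PATTERN = /^(read more|learn more|more)\b/;

function isMoreVariant(anchor) {
  return MORE_PATTERN.test(anchor.trim().toLowerCase());
}

// Sum the volume of every anchor that matches the "more" family.
function moreFamilyVolume(rows) {
  return rows
    .filter((row) => isMoreVariant(row.anchor))
    .reduce((sum, row) => sum + row.volume, 0);
}

// Volumes copied from the first post in this thread.
const quotedRows = [
  { anchor: '►', volume: 241729 },
  { anchor: 'read more', volume: 165620 },
  { anchor: 'home', volume: 148120 },
  { anchor: 'learn more', volume: 74043 },
  { anchor: 'more', volume: 58369 },
  { anchor: 'read more »', volume: 29480 },
  { anchor: 'read more…', volume: 17730 },
];

const moreVolume = moreFamilyVolume(quotedRows);
console.log(moreVolume, moreVolume > 241729); // 345242 true
```

This only covers the English variants; the Chinese and Russian ones would need their own alternatives in the pattern.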

paulcalvano · September 18, 2017, 10:31pm

Awesome analysis @rviscomi!

This made me curious about other single character anchors, so I tweaked your query a bit to only output anchor text where LENGTH(anchor)=1. It looks like the anchor #s go from 0, 2, 3, 1 and then 4, 5, 6, 7, 8, 9.
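For anyone trying the same filter outside BigQuery, here is a minimal JavaScript sketch. One caveat: a JavaScript string’s length property counts UTF-16 code units, so this version counts code points instead, which is closer to counting characters. The function name and sample values are hypothetical.

```javascript
// Count code points rather than UTF-16 code units, so characters outside
// the Basic Multilingual Plane still count as a single character.
function isSingleCharacter(anchor) {
  return [...anchor].length === 1;
}

// Hypothetical sample, not taken from the crawl.
const mixedAnchors = ['0', '►', 'x', 'home', '10', '»'];
const singleCharAnchors = mixedAnchors.filter(isSingleCharacter);
console.log(singleCharAnchors); // ['0', '►', 'x', '»']
```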

The tree map below summarizes this. The size of each box represents the number of anchors containing that text. The color of the box represents the number of distinct URLs, and 0-9 are clearly among the most frequent (as you said, it’s likely due to pagination). Most of the remaining top single character anchors consist of symbols used in navigation links (x + » X -) and Western alphanumeric characters, although there are many non-Western characters that make up the lower ~11K single character anchors in the lower right…

rviscomi · September 19, 2017, 4:52pm

Hah that’s neat, thanks for sharing that @paulcalvano! Interesting to see the variations of x’s, which I assume are used to close dialogs.

Good point. In the Lighthouse audit we’re definitely going to look only at anchors that are intended to navigate somewhere, not ones that are just used for JavaScript. That aspect is important from an SEO perspective since the content of the anchor text will be describing where the user is going rather than what they’re doing. This analysis is more of a fun look into all the ways anchors are being used in the wild, without so much care for SEO-correctness.

simevidas · September 27, 2017, 1:17pm

FWIW, regarding Blogger’s use of ► links, it’s now actually possible to implement their treeview using nested <details> elements (no JavaScript required): https://output.jsbin.com/semoqa/quiet. I hope <details> can eventually eradicate ► links.

rviscomi · September 27, 2017, 5:23pm

@simevidas that’s awesome! Today I learned something new

Thanks for sharing.
