Recently, in another project, the topic of descriptive anchor text came up. Specifically, we wanted to run an SEO audit on any given web page and call out links whose text could be more descriptive to assist search crawlers. That raises the question: what are examples of nondescript anchor text?
To answer this question, I instrumented HTTP Archive’s WebPageTest agents with a one-time custom metric that collects every anchor’s innerText on the page. The September 1 crawl’s HAR data includes this metric under the name _anchor_text.
To start working with the data, I created a new scratchspace table, anchors, and populated it with every single anchor value and the URL of the page on which it appeared. Here’s the query used to build the table:
CREATE TEMPORARY FUNCTION parseJson(anchors STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
anchors = JSON.parse(anchors);
if (!Array.isArray(anchors)) return [];
return anchors;
} catch (e) {
return [];
}
""";
SELECT
url,
anchor
FROM (
SELECT
url,
parseJson(TRIM(JSON_EXTRACT(payload, "$._anchor_text"), '"')) AS anchors
FROM
`httparchive.har.2017_09_01_chrome_pages`)
CROSS JOIN
UNNEST(anchors) AS anchor
WHERE
anchor != ''
The inner select statement uses a JS function to help parse the JSON-encoded array of anchor text out of the also-JSON-encoded HAR payload. Note that we’re only looking at the desktop tests’ results. The mobile data is available in the har.2017_09_01_android_pages table for the curious. CROSS JOIN UNNEST(anchors) AS anchor is a really useful way to convert a single row with repeated fields into multiple rows of fields. Finally, in the outer select statement, one anchor at a time is selected along with its corresponding URL (of the page, not the anchor’s href value) and empty anchors are excluded.
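To make the parse-then-flatten step concrete, here is a minimal Python sketch of the same logic, not the query itself. It assumes each page arrives as a (url, raw) pair where raw is the JSON-encoded anchor array. The names parse_anchors and flatten and the sample pages are hypothetical.

```python
import json

def parse_anchors(raw):
    """Return the decoded anchor list, or [] if raw is not a JSON array."""
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        # Mirrors the catch block in the JS function: bad input yields no anchors.
        return []
    return value if isinstance(value, list) else []

def flatten(pages):
    """Yield one (url, anchor) row per non-empty anchor, like CROSS JOIN UNNEST."""
    for url, raw in pages:
        for anchor in parse_anchors(raw):
            if anchor != "":  # same filter as WHERE anchor != ''
                yield (url, anchor)

# Hypothetical sample input: one valid page, one malformed, one non-array.
pages = [
    ("http://example.com/", '["Home", "", "Read more"]'),
    ("http://example.org/", "not json"),
    ("http://example.net/", '{"a": 1}'),
]
rows = list(flatten(pages))
print(rows)
```

Only the first page contributes rows here. The malformed and non-array payloads fall through to an empty list, which is the same defensive behavior the JS function provides.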
The resulting scratchspace table is huge: 3.4 GB and 67,485,195 rows. Queries against this table run much more quickly than ones that have to do the JSON parsing at the same time, and they are also much simpler to write. For example, to get the top 5 case-insensitive anchor values:
SELECT APPROX_TOP_COUNT(LOWER(anchor), 5) FROM `httparchive.scratchspace.anchors`
That’s all we need. Before I show you the results, take a moment to think about what you might expect to see. I ran this Twitter poll for a couple of days and got 267 people’s guesses as to what the #1 anchor value is.
The majority of voters seem to agree that “read more” is the most popular, followed by “home”, with “0” and “►” far behind. How well does their intuition align with the data?
The Results
Rank  Anchor     Volume   Distinct URLs  Avg
--------------------------------------------
1     ►          241,729          2,078  116
2     0          180,349         21,272    8
3     read more  165,620          1,693    6
4     2          150,310         89,651    1
5     home       148,120          2,136    1
Believe it or not, the triangle is the most popular anchor text. I know, I didn’t believe it either. So I tried to get a sense of the spread of the anchors by querying for the number of distinct URLs with each anchor. This query is a bit more complicated than the simple APPROX_TOP_COUNT above, but it gives more insight into the data:
SELECT
top.value AS anchor,
top.count AS volume,
distinctUrls
FROM (
SELECT
APPROX_TOP_COUNT(LOWER(anchor), 100) AS tops
FROM
`httparchive.scratchspace.anchors`)
CROSS JOIN
UNNEST(tops) AS top
LEFT JOIN (
SELECT
COUNT(DISTINCT url) AS distinctUrls,
LOWER(anchor) AS anchor
FROM
`httparchive.scratchspace.anchors`
GROUP BY
anchor)
ON top.value = anchor
ORDER BY
volume DESC
I manually added the “Avg” column, which is just the floor of the volume divided by the count of distinct URLs. See the Top 100 Anchor Text spreadsheet for more info.
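For readers who want to see the three columns computed together, here is a small Python sketch of the same idea. It is an illustration, not the query above: it counts exactly, whereas APPROX_TOP_COUNT is an approximate aggregate. The function name anchor_stats and the sample rows are hypothetical.

```python
from collections import defaultdict

def anchor_stats(rows, n):
    """Return the top n (anchor, volume, distinct_urls, avg) tuples by volume."""
    volume = defaultdict(int)
    urls = defaultdict(set)
    for url, anchor in rows:
        key = anchor.lower()      # case-insensitive, like LOWER(anchor)
        volume[key] += 1
        urls[key].add(url)        # a set keeps only distinct page URLs
    stats = [
        (key, volume[key], len(urls[key]), volume[key] // len(urls[key]))
        for key in volume
    ]
    return sorted(stats, key=lambda s: s[1], reverse=True)[:n]

# Hypothetical (url, anchor) rows in the shape of the scratchspace table.
rows = [
    ("a.example", "►"), ("a.example", "►"), ("a.example", "►"),
    ("b.example", "►"), ("b.example", "Home"), ("c.example", "home"),
]
print(anchor_stats(rows, 2))

# The "Avg" column is floor division; checking the first two published rows:
print(241_729 // 2_078)   # ► → 116
print(180_349 // 21_272)  # 0 → 8
```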
So it’s clear that ► occurs many times per page on average, about 116, way more than any other anchor. I took a sample of some of these pages and I laughed at what I found. I was expecting to see hundreds of media players, with ► being the play button. (HTTP Archive crawls a lot of spammy/questionable sites, but such is the web.) Instead, most of the pages were hosted on blogspot, the domain given for Blogger accounts, and the triangle was being used as an expandable link for blog archives. So blogs with many articles would have one of these triangles for each month and year with a post.
For example, http://android-er.blogspot.in/ has been posting since 2007, so that’s ~120 months of archives, or 115 ► links. If all Blogger blogs include this archive widget by default, that plausibly explains why this specific anchor value is the most popular. Go figure!
That still doesn’t explain the other unexpected result: “0” is the next most popular anchor text. It appears 8 times per page on average, while the other single-digit anchor texts only occur once. I can see 1-10 being popular because they could be used for pagination, but what’s so special about 0? So I flipped through some test cases and was caught by surprise again. As it turns out, many pages contain snippets of articles and include a link to the comments, where the anchor text is the number of comments. It makes sense that many articles would have 0 comments.
Comment anchors on indomoto.com
OK, so I can accept the fact that ► and 0 are popular values, but how is it that they’re more popular than “read more”? The answer to this is much simpler, I think, and that is fragmentation. There are many ways to say “read more”. For example, the Top 100 contains variations including:
- “learn more” (74,043)
- “more” (58,369)
- “更多” and “更多>>” (“more” in Chinese, 46,402 combined)
- “подробнее” (“more” in Russian, 32,197)
- “read more »” (29,480)
- “read more…” (17,730)
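As a quick check of the arithmetic, the counts listed above can be summed and compared with the gap between “read more” and ► in the results table. This is a small Python sketch using only numbers quoted in this post. The variable names are my own.

```python
# Volumes from the results table.
TRIANGLE = 241_729
READ_MORE = 165_620

# Variation counts as listed above.
variations = {
    "learn more": 74_043,
    "more": 58_369,
    "更多 / 更多>>": 46_402,
    "подробнее": 32_197,
    "read more »": 29_480,
    "read more…": 17_730,
}

gap = TRIANGLE - READ_MORE          # how far "read more" trails ►
combined = sum(variations.values()) # total volume of the listed variations

print(gap)                   # 76109
print(combined)              # 258221
print(READ_MORE + combined)  # 423841
```

The listed variations alone add up to more than three times the gap, so pooling them with “read more” would put it well ahead of ►.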
When combined, these variations cover much more than the ~76,000 gap between “read more” and ►. You could say they’re the Ralph Nader of anchor texts. Al Gore didn’t win that year, so it’s only fair that “read more” shouldn’t either. Henceforth I’m officially declaring ► the winner of the prestigious title of Most Frequently Used Anchor Text. So, were you in the 11% who guessed correctly?