No JSON requests 2022-2024

I’m investigating the evolution of the web over time for my master’s thesis. One of the things I wanted to map out was the percentage of each type of request per year.

When constructing this graph I realized there were exactly 0 requests with type ‘json’ found in the whole dataset during the period 2022-2024. I specifically tested this for the dates 2022-06-01, 2023-01-01, 2023-06-01, 2024-01-01, 2024-02-01, 2024-06-01 using the query below:

SELECT COUNT(1)
FROM `httparchive.crawl.requests`
WHERE type = 'json' AND date = '2024-02-01'

I have 2 questions regarding this:

  1. Is there a specific reason for the missing json requests in these years?
  2. Are there any other requests/request-types missing that I should know about for my thesis?

Thank you in advance for your answer.

Before the November 2024 crawl, which includes this change, JSON didn’t have its own type and was treated as a script type.
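If you want to see the effect of that change in the data, a rough sketch like the one below compares the share of each type in one crawl before the change and one after. It only uses the table and the `type` and `date` columns from your query; the date '2024-11-01' is an assumption for the November 2024 crawl, and the aliases `requests` and `pct` are just illustrative names.

```sql
-- Share of each request type per crawl, on either side of the change.
-- Assumption: the November 2024 crawl is stored under date = '2024-11-01'.
SELECT
  date,
  type,
  COUNT(0) AS requests,
  -- Percentage of all requests in the same crawl date
  ROUND(100 * COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY date), 2) AS pct
FROM `httparchive.crawl.requests`
WHERE date IN ('2024-06-01', '2024-11-01')
GROUP BY date, type
ORDER BY date, requests DESC
```

If the reclassification works the way I described, json should only show up as a type in the later crawl, with the script share dropping accordingly. The table is large, so it is worth checking the estimated bytes processed before running this.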

As you’ve no doubt noticed, querying for a specific type is a lot quicker (and easier!) than looking across all requests, which is why we added the json type in the first place.

We’re continually enhancing the dataset as we find more improvements such as this, but don’t always backport the fixes since it may cause confusion to those rerunning queries.

We don’t explicitly list every single change in a changelog (though the major ones are listed here), so it’s really a matter of looking at the git history when you spot an issue like this. This file’s history is probably of the most interest and you can see I also improved our media detection in some changes there.

Hi,

Thank you for your quick response. That file’s history is indeed useful to view for changes after 2024. Where was this code located before 2024?

In my data I see that JSON had its own type from 2016-01-01 until 2022-01-01. As you mentioned this made it quicker and easier to query so I was wondering why it was removed.

Oh that’s interesting!

The data pipeline has been through several iterations - many before my time. This included a Java one, which was then migrated to Python, which in 2022 was migrated to a new code base, before in 2024 moving to the current agent-based one.

Most of these changes were done for reasons of scale, as the dataset grew and grew and the older technologies struggled to cope.

So it’s entirely possible we missed something during one of those migrations and so lost some data. Having a quick look for JSON I don’t see any specific logic for that.

However, we also migrated the data model in that time, and I think type may have been a new field added in 2022. So it’s also possible that when we migrated all the pre-2022 data to the new data model, we handled this better: IIRC that backfill was done after the previously mentioned change in 2024, so it basically used the new 2024 logic.