Estimating size of scraped content + important images

I am a postdoctoral fellow working on a research project. We will be scraping approximately 24.6 million webpages, and I have some questions related to that task.

  1. We are trying to estimate the storage necessary to store the raw HTML we download. We are also potentially interested in downloading the “important images” from those 24.6 million URLs. When I say “important images”, I’m referring to images that would be considered content, rather than those that are there for branding or stylistic/aesthetic purposes.

The text users see when browsing a URL is contained in the HTML, and according to httparchive’s Page Weight report, the median is only 27.2 KB per page. Across 24.6 million pages, that comes to a very manageable 0.67 TB. As an upper bound, I wanted to calculate how much space we would need to store all the images as well as all the HTML. But the fact that the individual element sizes don’t add up to the total page size (see my previous topic, “Total size doesn’t equal size of elements”) made me uncertain how to interpret the image numbers. I suspect the image figure is the median image bytes transferred on pages that have at least one image, i.e. it excludes pages with no images at all. If that’s the case, then I can’t simply add “all HTML” plus “all images” as 27.2 KB + 982.3 KB = 1009.5 KB per page, or approximately 24.8 TB in total, which may be more storage than we can get (see the quick calculation sketched after this list). What would be the proper way to make this estimate using httparchive’s data? I’m willing to sign up for BigQuery access if necessary, but it would be nice not to have to jump through those hoops.

  2. A number of the pages we’re scraping have paywalls or GDPR (cookie) consent screens that we need to get past, and we are currently implementing a solution with Selenium that “clicks” the necessary buttons to accept cookies and/or log in past the paywall. In most cases we can use Selenium’s built-in methods to find the button we want, wait until it is interactable, and then click it or type our credentials into it. But in some cases I seem to recall that this won’t work, and I was wondering what I should base a fixed wait time on. I see a few options in your reports, but Time to Interactive (TTI) seems most appropriate. Unfortunately I only see a TTI chart for mobile, and we’re looking at desktop browsing data. Is there another way of estimating the wait time that you would suggest?
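
For reference, here is the back-of-the-envelope arithmetic behind the numbers above; the per-page figures are the Page Weight medians, and treating them as if they applied uniformly to our 24.6 million URLs is exactly the assumption I’m unsure about:

```python
# Rough storage estimate, assuming the Page Weight medians apply to every page.
N_PAGES = 24_600_000
HTML_KB_PER_PAGE = 27.2    # median HTML bytes per page (Page Weight report)
IMG_KB_PER_PAGE = 982.3    # median image bytes per page (interpretation unclear)

html_tb = N_PAGES * HTML_KB_PER_PAGE * 1e3 / 1e12                       # KB -> TB
total_tb = N_PAGES * (HTML_KB_PER_PAGE + IMG_KB_PER_PAGE) * 1e3 / 1e12  # KB -> TB

print(f"HTML only:     {html_tb:.2f} TB")   # ~0.67 TB
print(f"HTML + images: {total_tb:.2f} TB")  # ~24.8 TB
```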

I’m very interested to see what you find in your research!!

Storing images would be a big expense. Some are outright enormous.

```sql
#standardSQL
SELECT
  SUM(respSize) AS total_img_bytes
FROM
  `httparchive.summary_requests.2019_10_01_*`
WHERE
  type = 'image'
```

This results in 19,854,236,980,243 bytes (19.9 TB). HTML response sizes are negligible by comparison.

We use WebPageTest to run our tests, which waits for the network to go quiet before ending the test. @patmeenan, who created WebPageTest and helps maintain HTTP Archive, could elaborate.

What are you using to do the actual scraping? If you use something like Puppeteer, it has options for waiting for the page to finish loading (the onload event plus X seconds of network quiet). If you’re using Selenium, it’s more complicated because you don’t get the network activity. Puppeteer would also make it relatively easy to walk the DOM for visible images.

For the HTML, you’ll need to decide if you want the HTML as-delivered or if you want the DOM serialized after the page has finished loading (with dynamically injected content) - depending on what you are trying to do with the content.

For the basic sites that don’t have a blocking GDPR page or a paywall, we’re just using Python requests. For paywalled and GDPR sites, we’re currently using Selenium. I haven’t worked much with either Selenium or Puppeteer, so I’m personally open to switching if need be.

I ended up giving up on trying to estimate the storage needed for all the images, as the total was prohibitive. Instead, I assumed there would be one or two contentful images per page and that we would figure out a way to identify them. I also assumed we could use an image library to scale down any image that is too large and store the smaller version. With those assumptions, the estimate came down to a size we could reasonably expect to get approved.
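
To make that concrete, the kind of thing I have in mind looks roughly like the sketch below; the size threshold, output dimensions, and JPEG quality are placeholders, and treating “visible and rendered at least 200×200 pixels” as a proxy for “contentful” is just a heuristic we would still need to validate:

```python
from io import BytesIO

import requests
from PIL import Image
from selenium.webdriver.common.by import By


def grab_content_images(driver, min_px=200, max_side=1024):
    """Rough heuristic: keep visible <img> elements rendered at least
    min_px x min_px, then downscale and re-encode before storing."""
    saved = []
    for img in driver.find_elements(By.TAG_NAME, "img"):
        if not img.is_displayed():
            continue
        size = img.size  # rendered size in CSS pixels
        if size["width"] < min_px or size["height"] < min_px:
            continue
        src = img.get_attribute("src")
        if not src or src.startswith("data:"):
            continue
        resp = requests.get(src, timeout=10)
        im = Image.open(BytesIO(resp.content)).convert("RGB")
        im.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
        buf = BytesIO()
        im.save(buf, format="JPEG", quality=85)
        saved.append((src, buf.getvalue()))
    return saved
```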

I am interested in what you’re saying about HTML as-delivered vs. the DOM serialized after the page has finished loading. I was aware that content is dynamic and gets generated throughout the loading process, but I admit that when I use requests or Selenium, I’m not clear on which of the two I’m getting. Since our goal is to capture the political/news content our study participants are reading, and we usually seem to get the text of the article, I haven’t been worrying about it.
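
For what it’s worth, my current understanding of the difference between our two code paths is roughly the following (the URL is just a placeholder, and I’m assuming a Chrome driver):

```python
import requests
from selenium import webdriver

url = "https://example.com/article"  # placeholder

# As-delivered HTML: the raw response body, before any JavaScript has run.
as_delivered = requests.get(url, timeout=10).text

# Serialized DOM: the document as it stands after scripts have modified it.
driver = webdriver.Chrome()
driver.get(url)
serialized_dom = driver.page_source  # snapshot of the current DOM
driver.quit()
```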

Selenium does seem to have some functions that let you wait until a specific element is loaded and interactable before, for instance, clicking on it. It’s been a week or two since I looked at the code (other urgent tasks have come up), but I recall there were some things we were trying to do for which those functions wouldn’t work, which is why I was asking about TTI.
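
Concretely, the Selenium waiting functions I mean are the explicit waits; a minimal sketch (the selector and the 15-second timeout are made up for the example) would be:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/article")  # placeholder URL

# Wait up to 15 s for the consent button to become clickable, then click it.
# The CSS selector is just an example; in practice it varies per site.
wait = WebDriverWait(driver, 15)
button = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button.accept-cookies"))
)
button.click()
```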