I am a postdoctoral fellow working on a research project. We will be scraping approximately 24.6 million webpages, and I have some questions related to that task.
- We are trying to estimate how much storage we need for the raw HTML we download. We are also potentially interested in downloading the “important images” from those 24.6 million URLs. By “important images”, I mean images that would be considered content, rather than those that are there for branding or stylistic/aesthetic purposes.
The text users see when browsing a URL is contained in the HTML, and according to httparchive’s Page Weight report, the median page has only 27.2 KB of HTML. Across 24.6 million pages, that comes to a very manageable 0.67 TB. As an upper bound, I wanted to calculate how much space we would need to store all the images and all the HTML. But the fact that the individual element sizes don’t add up to the total size (see my previous topic, “Total size doesn’t equal size of elements”) made me uncertain how to interpret the image size numbers. I thought that perhaps the image figure represents the median number of image bytes transferred on webpages that have any image bytes (that is, it excludes webpages with no images). If that’s the case, then I can’t simply add “all HTML” and “all images”: 27.2 KB of HTML + 982.3 KB of images = 1009.5 KB per page, or approximately 24.8 TB in total, which may be more than we can store. What would be the proper way to do this estimate using httparchive’s data? I’m willing to sign up and access the data through BigQuery if necessary, but it would be nice not to have to jump through those hoops.
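
For reference, here is the back-of-the-envelope arithmetic behind those two numbers, as a minimal sketch. It assumes decimal units (1 TB = 10^9 KB) and simply multiplies the medians quoted above by the page count; the constant and function names are my own placeholders.

```python
# Rough storage estimate: per-page median size times number of pages.
N_PAGES = 24_600_000      # URLs we plan to scrape
HTML_KB = 27.2            # median HTML KB/page quoted from the Page Weight report
IMAGE_KB = 982.3          # median image KB/page quoted from the Page Weight report
KB_PER_TB = 1000 ** 3     # assumption: decimal units, 1 TB = 10^9 KB


def total_tb(kb_per_page: float, n_pages: int = N_PAGES) -> float:
    """Scale a per-page size in KB to a corpus-wide total in TB."""
    return kb_per_page * n_pages / KB_PER_TB


html_tb = total_tb(HTML_KB)
upper_tb = total_tb(HTML_KB + IMAGE_KB)

print(f"HTML only:     {html_tb:.2f} TB")   # 0.67 TB
print(f"HTML + images: {upper_tb:.1f} TB")  # 24.8 TB
```

I realize that multiplying a median by the page count is only a rough proxy, since the true total depends on the mean page size, which is part of why I am unsure how to read the image figure.
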
- A number of the pages we’re scraping have paywalls or GDPR (cookie) consent pages that we need to get past, and we are currently implementing a solution that uses Selenium to “click” the necessary buttons to accept cookies and/or log in past the paywall. In most cases we can use Selenium’s built-in methods to find the button we want, wait until it is interactable, and then click it or enter our information into it. But I seem to recall that in some cases this won’t work, so we would need to pick a wait time ourselves, and I was wondering what data I should base it on. I see a few options, but TTI (Time to Interactive) seems most appropriate. Unfortunately, I only see a TTI chart for mobile, and we’re looking at desktop browsing data. Is there another way of estimating the wait time that you would suggest?
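
To make the question concrete: for the cases where the built-in waits don’t apply, the fallback I have in mind is a polling loop with an upper bound on the wait. Below is a minimal, library-independent sketch. `poll_until` is a hypothetical helper of my own, not part of Selenium, and the 15 s and 0.5 s defaults are placeholders rather than measured values; the timeout is exactly the number I am asking how to choose.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(condition: Callable[[], Optional[T]],
               timeout_s: float = 15.0,
               interval_s: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        time.sleep(interval_s)


# Stand-in for "is the consent button ready yet?": succeeds on the third check.
checks = {"count": 0}


def fake_button_ready() -> Optional[str]:
    checks["count"] += 1
    return "accept-button" if checks["count"] >= 3 else None


button = poll_until(fake_button_ready, timeout_s=2.0, interval_s=0.01)
print(button, checks["count"])  # accept-button 3
```

In the real scraper, the condition would be whatever check tells us the page has finished rendering the consent or login form, and the timeout would ideally come from something like a desktop TTI distribution.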