How to best filter out adult sites

I’d like to remove obvious adult sites from my analysis. Searching these forums for “adult” doesn’t turn up anything helpful, and the _adult_site column in summary_pages is always false (e.g. in summary_pages.2020_03_01_mobile).

My best approach so far has just been to remove records with REGEXP_CONTAINS(url, r"porn|xxx|adult"). Is there a better way? Maybe a table I’m missing?
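For reference, here is a minimal sketch of how that filter could slot into a full query. The `httparchive` project prefix and the keyword list are assumptions on my part, and matching against `NET.HOST(url)` rather than the whole URL is just one variation I've been considering, not an established method:

```sql
-- Sketch only: the `httparchive` project prefix and the keyword list are assumptions.
SELECT
  COUNT(*) AS total_pages,
  COUNTIF(NOT is_adult_guess) AS pages_kept
FROM (
  SELECT
    url,
    -- Match keywords against the hostname only, case-insensitively,
    -- so a path such as /adult-education/ does not trigger the filter.
    REGEXP_CONTAINS(NET.HOST(url), r'(?i)porn|xxx|adult') AS is_adult_guess
  FROM
    `httparchive.summary_pages.2020_03_01_mobile`
)
```

This is still a crude heuristic: a hostname that merely contains "adult" is flagged too, and adult sites whose domains contain none of the keywords slip through.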

Good question. We don’t currently have a way to detect adult sites; the _adult_site flag has been deprecated for a while.

Our philosophy has been that adult sites are just another kind of content on the web, so we don’t try to omit them from our routine analyses. In the spirit of better understanding the web, though, it might still be interesting to slice by adult websites and see if there’s any difference in the way they’re built/experienced, so I could see the argument for better detection.

Generally, when we want to improve our detection methodology, we try to bake that in upstream in the Wappalyzer project. I wonder if they’d be open to taking this on as a new category of websites?

Thanks for the Wappalyzer callout. I’ll check that out!