Keeping `/etc/mime.types` up to date


IANA keeps track of MIME types but not file extensions.
In the absence of an authoritative directory of file extensions, I thought I’d use HTTP Archive to check if /etc/mime.types is up to date.
I got a list of MIME types and file extensions from httparchive:runs.latest_requests easily enough, but quickly discovered it’s very noisy.
Undaunted, I tried using the frequency of requests for each MIME type and file extension to identify good /etc/mime.types candidates.
My theory was that higher frequency would correlate with better candidates. Also, the best file extension for a given MIME type would be the one with the most requests, and vice versa.
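As a minimal sketch of that theory, the snippet below picks, for each MIME type, the file extension with the most requests, and for each file extension, the MIME type with the most requests. The counts are made up for illustration and are not HTTP Archive data; `pair_counts` and `best_partner` are hypothetical names.

```python
# Made-up request counts per (MIME type, file extension) pair,
# for illustration only; not real HTTP Archive data.
pair_counts = {
    ('text/vtt', 'vtt'): 900,
    ('text/vtt', 'txt'): 40,
    ('text/plain', 'txt'): 5000,
    ('text/plain', 'vtt'): 300,
}


def best_partner(counts, key_index):
  """Map each key (MIME type if key_index is 0, file extension if 1)
  to the partner it shares the most requests with."""
  best = {}
  for pair, n in counts.items():
    key, partner = pair[key_index], pair[1 - key_index]
    if key not in best or n > best[key][1]:
      best[key] = (partner, n)
  return {key: partner for key, (partner, n) in best.items()}


best_ext = best_partner(pair_counts, 0)
best_mime_type = best_partner(pair_counts, 1)

# A pair is a mutual best match when each side picks the other
for mime_type, ext in best_ext.items():
  if best_mime_type[ext] == mime_type:
    print(mime_type, ext)
```

With noisy data a mutual best match alone is unlikely to be enough, which is where the request frequencies come in.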
For a while I played around manually with various equations based on the frequency and similarity of requests for each MIME type and file extension.
Then I decided to apply regression to identify the “best” weighting of frequency and similarity, using the frequency and similarity of existing /etc/mime.types entries as training data.
Here, in the end, is what I cooked up:

#!/usr/bin/env python3

import mimetypes
import numpy as np
from gcloud import bigquery
from svmutil import svm_predict, svm_train

# Load /etc/mime.types
mimetypes.init()

client = bigquery.Client()

# For each MIME type, count the number of requests
query = client.run_sync_query(
  '''
    SELECT LOWER(mimeType) AS mimeType, COUNT(*)
    FROM httparchive:runs.latest_requests
    WHERE mimeType <> ""
    GROUP BY mimeType
  ''',
)
query.run()

n_mime_type = dict(query.rows)

# For each file extension, count the number of requests
query = client.run_sync_query(
  '''
    SELECT LOWER(ext) AS ext, COUNT(*)
    FROM httparchive:runs.latest_requests
    WHERE ext <> ""
    GROUP BY ext
  ''',
)
query.run()

n_ext = dict(query.rows)

# For each MIME type and file extension pair,
# count the number of requests
query = client.run_sync_query(
  '''
    SELECT LOWER(mimeType) AS mimeType, LOWER(ext) AS ext, COUNT(*)
    FROM httparchive:runs.latest_requests
    WHERE mimeType <> ""
      AND ext <> ""
    GROUP BY mimeType, ext
  ''',
)
query.run()

x_train = []
y = []

x_predict = []

predict_mime_type = []
predict_ext = []

for mime_type, ext, n in query.rows:
  # For each MIME type and file extension pair, choose as features the
  # pair's frequency and the similarity to its MIME type and to its file
  # extension, in terms of number of requests
  x = (n, n / n_mime_type[mime_type], n / n_ext[ext])

  # Positive examples are where mimetypes already maps the pair's file
  # extension to its MIME type.  Negative examples are where mimetypes
  # maps the file extension to a different MIME type.  Otherwise if the
  # file extension is previously unknown, predict whether it belongs in
  # /etc/mime.types based on our model of existing pairs' frequencies
  # and similarities.
  value = mimetypes.types_map.get('.' + ext)
  if value is None:
    value = mimetypes.common_types.get('.' + ext)
    if value is None:
      x_predict.append(x)

      predict_mime_type.append(mime_type)
      predict_ext.append(ext)

      continue

  x_train.append(x)
  y.append(value == mime_type)

# Scale the features to [0,1], using the range of the training examples
# (examples to predict may therefore fall outside [0,1])
x_train = np.array(x_train)
x_predict = np.array(x_predict)

x_min = x_train.min(axis=0)
x_max = x_train.max(axis=0)

x_train = (x_train - x_min) / (x_max - x_min)
x_predict = (x_predict - x_min) / (x_max - x_min)

x_train = x_train.tolist()
x_predict = x_predict.tolist()

# Apply regression ('-s 3' selects epsilon-SVR)
model = svm_train(y, x_train, '-s 3')

# svm_predict expects labels; the true values are unknown, so pass dummies
y = [0] * len(x_predict)

p_labels, p_acc, p_vals = svm_predict(y, x_predict, model)

# Sort by the regression result and print
width = list(map(len, predict_mime_type))
width = min(max(width), int(2 * np.median(width)))
for score, mime_type, ext in sorted(zip(p_labels, predict_mime_type, predict_ext)):
  print('{:f}  {:{}}  {}'.format(score, mime_type, width, ext))
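
One detail of the scaling step is worth spelling out: because the minimum and maximum come from the training examples only, examples to predict can land outside [0,1]. A minimal sketch with made-up feature rows (not real data) shows the effect. Clipping with `np.clip` is one possible remedy; it is an assumption here, not something the script above does.

```python
import numpy as np

# Made-up feature rows, for illustration only
x_train = np.array([[10.0, 0.2, 0.1],
                    [50.0, 0.8, 0.9]])
x_predict = np.array([[80.0, 0.5, 0.05]])

x_min = x_train.min(axis=0)
x_max = x_train.max(axis=0)

# Same formula as the script: scale by the training range
scaled = (x_predict - x_min) / (x_max - x_min)
print(scaled)

# Optional remedy (an assumption, not in the script): clip to [0,1]
clipped = np.clip(scaled, 0.0, 1.0)
print(clipped)
```

Whether clipping helps or hurts the ranking would need checking against the real data.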

Regression did automatically learn, from the existing /etc/mime.types entries, the positive correlations between frequency and similarity and good candidates. However, the results are only about as good as what I got through manual experimentation, not much better.
Here’s a sample of the output:

[...]
0.527885  text/vtt            vtt
0.533889  content/unknown     rez
0.583505  application/lrc     lrc
0.583505  text/prs.lines.tag  tag
0.689945  audio/x-aac         aac
0.757967  application/ttml+xml  dfxp
0.777684  application/smil    fmil
0.798960  image-png           ihax
0.893182  application/x-sdch-dictionary  sdch
0.907889  font/woff2          woff2
0.975929  application/x-protobuf  pbf
0.990286  model/vnd.collada+xml  dae
1.041315  application/xaml+xml  xaml
1.041315  text/x-script       mem
1.041315  .aff                aff
1.041315  application/x-silverlight-app  xap
1.041315  binary/octet        abst
1.041315  image/jpeg, image/gif  serv
1.041315  application/pgp-encrypted  htp
1.041315  application/fsb     fsb

It does successfully identify text/vtt as a candidate; however, it’s outranked by what I imagine are bad MIME types, like:

  • image/jpeg, image/gif
  • .aff
  • image-png

I conclude that, for the URLs in HTTP Archive and the features I’ve chosen (frequencies and similarities), there’s no accurate way to distinguish between good /etc/mime.types candidates and popular pages with bad MIME types.
Can anyone suggest any improvements?