IANA keeps track of MIME types but not file extensions.
In the absence of an authoritative directory of file extensions, I thought I’d use HTTP Archive to check whether /etc/mime.types is up to date.
I got a list of MIME types and file extensions from httparchive:runs.latest_requests easily enough, but quickly discovered that it’s very noisy.
Undaunted, I tried using the frequency of requests for each MIME type and file extension to identify good /etc/mime.types candidates.
My theory was that higher frequency would correlate with better candidates, and that the best file extension for a given MIME type would be the one with the most requests, and vice versa.
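To make that theory concrete, here is a minimal sketch using made-up request counts. The numbers and the helper names `pair_features` and `best_ext_per_mime_type` are illustrative assumptions, not HTTP Archive data, and the totals here are derived from the pair counts alone, which is a simplification.

```python
from collections import defaultdict

# Hypothetical (mime_type, ext) -> request counts, for illustration only
pair_counts = {
    ('text/vtt', 'vtt'): 120,
    ('text/vtt', 'txt'): 5,
    ('text/plain', 'txt'): 900,
    ('text/plain', 'vtt'): 40,
}


def pair_features(pair_counts):
    """Return (frequency, share of MIME type, share of extension) per pair."""
    n_mime_type = defaultdict(int)
    n_ext = defaultdict(int)
    for (mime_type, ext), n in pair_counts.items():
        n_mime_type[mime_type] += n
        n_ext[ext] += n
    return {
        (mime_type, ext): (n, n / n_mime_type[mime_type], n / n_ext[ext])
        for (mime_type, ext), n in pair_counts.items()
    }


def best_ext_per_mime_type(pair_counts):
    """For each MIME type, pick the extension with the most requests."""
    best = {}
    for (mime_type, ext), n in pair_counts.items():
        if mime_type not in best or n > best[mime_type][1]:
            best[mime_type] = (ext, n)
    return {mime_type: ext for mime_type, (ext, _) in best.items()}


print(pair_features(pair_counts)[('text/vtt', 'vtt')])  # (120, 0.96, 0.75)
print(best_ext_per_mime_type(pair_counts))
```

With these toy counts, vtt accounts for 96% of text/vtt requests, and text/vtt accounts for 75% of vtt requests, so the pair looks like a plausible candidate under the theory.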
For a while I played around manually with various equations based on the frequency and similarity of requests for each MIME type and file extension.
Then I decided to apply regression to identify the “best” proportion of frequency and similarity, based on the frequency and similarity of existing /etc/mime.types entries.
Here, in the end, is what I cooked up:
#!/usr/bin/env python3
import mimetypes
import numpy as np
from gcloud import bigquery
from svmutil import *
# Load /etc/mime.types
mimetypes.init()
client = bigquery.Client()
# For each MIME type, count the number of requests
query = client.run_sync_query(
    '''
    SELECT LOWER(mimeType) AS mimeType, COUNT(*)
    FROM httparchive:runs.latest_requests
    WHERE mimeType <> ""
    GROUP BY mimeType
    ''',
)
query.run()
n_mime_type = dict(query.rows)
# For each file extension, count the number of requests
query = client.run_sync_query(
    '''
    SELECT LOWER(ext) AS ext, COUNT(*)
    FROM httparchive:runs.latest_requests
    WHERE ext <> ""
    GROUP BY ext
    ''',
)
query.run()
n_ext = dict(query.rows)
# For each MIME type and file extension pair,
# count the number of requests
query = client.run_sync_query(
    '''
    SELECT LOWER(mimeType) AS mimeType, LOWER(ext) AS ext, COUNT(*)
    FROM httparchive:runs.latest_requests
    WHERE mimeType <> ""
    AND ext <> ""
    GROUP BY mimeType, ext
    ''',
)
query.run()
x_train = []
y = []
x_predict = []
predict_mime_type = []
predict_ext = []
for mime_type, ext, n in query.rows:
    # For each MIME type and file extension pair, choose as features the
    # pair's frequency and the similarity to its MIME type and to its file
    # extension, in terms of number of requests
    x = (n, n / n_mime_type[mime_type], n / n_ext[ext])
    # Positive examples are where mimetypes already maps the pair's file
    # extension to its MIME type. Negative examples are where mimetypes
    # maps the file extension to a different MIME type. Otherwise if the
    # file extension is previously unknown, predict whether it belongs in
    # /etc/mime.types based on our model of existing pairs' frequencies
    # and similarities.
    value = mimetypes.types_map.get('.' + ext)
    if value is None:
        value = mimetypes.common_types.get('.' + ext)
        if value is None:
            x_predict.append(x)
            predict_mime_type.append(mime_type)
            predict_ext.append(ext)
            continue
    x_train.append(x)
    y.append(value == mime_type)
# Scale the features to [0,1]
x_train = np.array(x_train)
x_predict = np.array(x_predict)
x_min = x_train.min(axis=0)
x_max = x_train.max(axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_predict = (x_predict - x_min) / (x_max - x_min)
x_train = x_train.tolist()
x_predict = x_predict.tolist()
# Apply regression
model = svm_train(y, x_train, '-s 3')
y = [0] * len(x_predict)
p_labels, p_acc, p_vals = svm_predict(y, x_predict, model)
# Sort by the regression result and print
width = list(map(len, predict_mime_type))
width = min(max(width), int(2 * np.median(width)))
for score, mime_type, ext in sorted(zip(p_labels, predict_mime_type, predict_ext)):
    print('{:f} {:{}} {}'.format(score, mime_type, width, ext))
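One detail of the scaling step in the script is worth trying in isolation: the minimum and maximum come from the training pairs only, so pairs being predicted can land outside [0,1], and a feature that is constant across the training pairs would divide by zero. The following is a minimal sketch of the same min-max scaling with a guard for that case. The function name `min_max_scale` and the choice to map a constant column to 0 are my assumptions, not part of the script above.

```python
import numpy as np


def min_max_scale(x_train, x_predict):
    """Scale both arrays by the column-wise range of x_train."""
    x_train = np.asarray(x_train, dtype=float)
    x_predict = np.asarray(x_predict, dtype=float)
    x_min = x_train.min(axis=0)
    span = x_train.max(axis=0) - x_min
    # Assumption: a column that is constant in the training data is mapped
    # to 0 instead of dividing by zero
    span[span == 0] = 1.0
    return (x_train - x_min) / span, (x_predict - x_min) / span


# Toy data: the second feature is constant in the training set
train, predict = min_max_scale([[10, 0.5], [30, 0.5]],
                               [[20, 0.5], [50, 0.5]])
print(train.tolist())    # [[0.0, 0.0], [1.0, 0.0]]
print(predict.tolist())  # [[0.5, 0.0], [2.0, 0.0]]
```

The second prediction row scales to 2.0 in its first feature because its raw count is larger than anything seen in training.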
Regression did automatically learn, from the existing /etc/mime.types entries, the positive correlations between frequency and similarity and good candidates. However, the results are only about as good as those I got through manual experimentation, not much better.
Here’s a sample of the output:
[...]
0.527885 text/vtt vtt
0.533889 content/unknown rez
0.583505 application/lrc lrc
0.583505 text/prs.lines.tag tag
0.689945 audio/x-aac aac
0.757967 application/ttml+xml dfxp
0.777684 application/smil fmil
0.798960 image-png ihax
0.893182 application/x-sdch-dictionary sdch
0.907889 font/woff2 woff2
0.975929 application/x-protobuf pbf
0.990286 model/vnd.collada+xml dae
1.041315 application/xaml+xml xaml
1.041315 text/x-script mem
1.041315 .aff aff
1.041315 application/x-silverlight-app xap
1.041315 binary/octet abst
1.041315 image/jpeg, image/gif serv
1.041315 application/pgp-encrypted htp
1.041315 application/fsb fsb
It does successfully identify text/vtt as a candidate. However, it’s buried after what I imagine are bad MIME types, like:
image/jpeg, image/gif
.aff
image-png
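These three examples are not even well formed as "type/subtype" strings, so one possible pre-filter (not something the script above does) is a purely syntactic check. The sketch below is based on my reading of the restricted-name grammar in RFC 6838 section 4.2, and the helper name `is_well_formed` is hypothetical. It only catches malformed values: a well-formed but unregistered type such as content/unknown still passes, so it would not remove the noise on its own.

```python
import re

# A "restricted-name" as I read RFC 6838 section 4.2: a letter or digit,
# followed by up to 126 letters, digits, or any of ! # $ & - ^ _ . +
_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}"
_MIME_TYPE = re.compile(r"{0}/{0}".format(_NAME))


def is_well_formed(mime_type):
    """Return True if mime_type looks like a syntactically valid type/subtype."""
    return _MIME_TYPE.fullmatch(mime_type) is not None


for candidate in ('text/vtt', 'image/jpeg, image/gif', '.aff', 'image-png',
                  'content/unknown'):
    print(candidate, is_well_formed(candidate))
```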
I conclude that, for the URLs in HTTP Archive and the features I’ve chosen (frequencies and similarities), there’s no accurate way to distinguish between good /etc/mime.types candidates and popular pages with bad MIME types.
Can anyone suggest any improvements?