A CLI for finding similar documents in static site generators

created 2021-04-27

I recently wrote a small CLI that outputs related entries for secon.dev. While doing that, I realized that most static site generators build HTML from article files on disk, typically Markdown (.md) or HTML (.html). The same idea should therefore be useful outside my own site, so I released it as a CLI for finding similar documents.

https://github.com/hotchpotch/similar-documents-cli
- Install it with pip install -U similar-documents

If you pass the files whose related entries you want to infer, the CLI outputs, in JSON, the most related files for each input file. The article files for this site, secon.dev, are not public, so as an example I tried inferring related entries from the Markdown articles in the source code for r7kamura.com, which r7kamura publishes.

$ time similar-documents --debug -k 3 -t japanese ~/src/github.com/r7kamura/r7kamura.com/articles/*.md > r7kamura_com_similar_articles.json
files to texts 951 documents
calc tfidf...
calc similarity...
assign similarity score
similar-documents --debug -k 3 -t japanese  >   8.03s user 3.98s system 383% cpu 3.131 total

It took about 3.1 seconds of wall-clock time to infer related articles for 951 posts on my Ryzen 9 3900X machine. The JSON includes file paths from my machine, so I cleaned it up a little. Each key is an article path, and its value is an array of [related article, score] pairs in descending score order.

cat r7kamura_com_similar_articles.json | jq . | sd '/home/yu1/src/github.com/r7kamura/' 'https://' |sd '.md"' '"' > converted.json
cat converted.json
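
The same cleanup can be done without jq and sd. Here is a minimal Python sketch, assuming the JSON shape described above (keys are paths, values are [path, score] pairs). The function names clean_path, clean_similarities, and convert_file are hypothetical and are not part of the CLI.

```python
import json

def clean_path(path, prefix, replacement="https://", suffix=".md"):
    """Swap a local directory prefix for a URL scheme and drop the extension."""
    if path.startswith(prefix):
        path = replacement + path[len(prefix):]
    if path.endswith(suffix):
        path = path[: -len(suffix)]
    return path

def clean_similarities(data, prefix):
    """Apply clean_path to every key and related path, keeping scores as-is."""
    return {
        clean_path(key, prefix): [
            [clean_path(related, prefix), score] for related, score in entries
        ]
        for key, entries in data.items()
    }

def convert_file(src, dst, prefix):
    """Read the CLI output from src and write the cleaned JSON to dst."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(clean_similarities(data, prefix), f, indent=2, ensure_ascii=False)
```

Calling convert_file("r7kamura_com_similar_articles.json", "converted.json", "/home/yu1/src/github.com/r7kamura/") should produce the same kind of output as the shell pipeline.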

Here are a few excerpts from the JSON. For an article about duct rails, other duct-rail-related articles appear with high scores.

  "https://r7kamura.com/articles/2021-02-05-switchbot-hub-mini-on-rails": [
    [
      "https://r7kamura.com/articles/2020-12-19-google-home-mini-on-rails",
      0.6502251932677562
    ],
    [
      "https://r7kamura.com/articles/2021-01-18-nature-remo-on-rails",
      0.6088665752039284
    ],
    [
      "https://r7kamura.com/articles/2016-12-12-h",
      0.33070364498269256
    ]
  ],

For an article about the game Atelier Ryza, another Ryza article, a 2020 games article, and an FF13 article appear. Looking at the scores, the other Ryza article is clearly the closest one: it scores about 0.46, while the other two stay below 0.2.

  "https://r7kamura.com/articles/2021-02-13-atelier-ryza": [
    [
      "https://r7kamura.com/articles/2020-01-19-atelier-ryza",
      0.4632359977711961
    ],
    [
      "https://r7kamura.com/articles/2020-12-31-games-2020",
      0.17984491640184092
    ],
    [
      "https://r7kamura.com/articles/2021-01-30-final-fantasy-13",
      0.15056225381780178
    ]
  ],

For an article about washing machine cleaning, articles about bathtub detergent and drain cleaning are inferred as related.

  "https://r7kamura.com/articles/2021-02-19-laundry-cleaning": [
    [
      "https://r7kamura.com/articles/2020-11-02-lookplus",
      0.39103024934082137
    ],
    [
      "https://r7kamura.com/articles/2020-10-12-ember-restored",
      0.3759286934329018
    ],
    [
      "https://r7kamura.com/articles/2014-08-31-h",
      0.33743028929351304
    ]
  ],

With output like this, a single command can generate JSON for related entries. If a static site build reads that JSON, related-article features should be fairly easy to add to static site generators.
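
As a rough sketch of what the build side might look like, the following reads the JSON and renders a list of related links for one article. The names load_similarities and related_links_html are hypothetical, and the min_score cutoff is my own assumption, not a CLI feature; it is there only because low-scoring entries, like those under 0.2 in the Ryza example, may not be worth showing.

```python
import json
from html import escape

def load_similarities(json_path):
    """Load the {article: [[related, score], ...]} mapping written by the CLI."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)

def related_links_html(similarities, article, min_score=0.2):
    """Render a <ul> of related articles for one key, skipping weak matches."""
    items = [
        f'<li><a href="{escape(path, quote=True)}">{escape(path)}</a></li>'
        for path, score in similarities.get(article, [])
        if score >= min_score
    ]
    if not items:
        return ""  # nothing related enough, so emit no list at all
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

A real generator would look up each related article's title instead of printing its path, but the lookup by key is the same.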

If you have an always-on server to work with, something like Elasticsearch's More Like This query should produce more accurate related entries. But for static site generators, there is value in a command that can run casually at build time without maintaining any external state.

Technical Notes

Nothing complicated is happening. It uses the kind of document similarity method that appears in introductory machine learning material: count terms, calculate TF-IDF, and find similar documents by cosine similarity. For Japanese tokenization it uses MeCab through fugashi, a Python wrapper that is easy to use and makes dictionaries easy to install. TF-IDF and cosine similarity are handled entirely by scikit-learn. It is a classical method, but in practice it gives fairly reasonable related articles.
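
To make the steps concrete, here is a dependency-free sketch of the same idea: count terms, weight them by TF-IDF, and rank by cosine similarity. It is not the CLI's code, which delegates these steps to scikit-learn. The toy documents are made up, whitespace splitting stands in for MeCab tokenization, and the smoothed IDF formula is an assumption modeled on scikit-learn's default.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build L2-normalized TF-IDF vectors (as dicts) from tokenized documents."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF, ln((1 + N) / (1 + df)) + 1: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in token_lists:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(a, b):
    """For unit-length sparse vectors, cosine similarity is the dot product."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def top_k_similar(names, token_lists, k=3):
    """For each document, return the k most similar other documents."""
    vectors = tfidf_vectors(token_lists)
    result = {}
    for i, name in enumerate(names):
        scored = [
            (names[j], cosine(vectors[i], vectors[j]))
            for j in range(len(names))
            if j != i
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[name] = scored[:k]
    return result

# Toy documents, already "tokenized" by whitespace.
docs = {
    "ryza-1": "atelier ryza game alchemy".split(),
    "ryza-2": "atelier ryza sequel game".split(),
    "cleaning": "bathtub detergent cleaning".split(),
}
related = top_k_similar(list(docs), list(docs.values()), k=2)
print(related["ryza-1"])
```

The pairwise loop is quadratic in the number of documents, which is fine for a sketch. A matrix-based implementation such as scikit-learn's is what keeps a run over hundreds of posts in the range of seconds.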

At the moment, .md and .html files are converted to text through parsers for their formats, and all other files are treated as plain text. In TF-IDF, terms that appear across many documents receive lower scores. So if every file uses the same particular format, words specific to that format should have limited effect on the score, even though converting to clean text is still preferable. That is why this simple approach seems to work reasonably well.
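
The effect on format-specific words can be illustrated with a little arithmetic. This sketch assumes scikit-learn's default smoothed IDF, ln((1 + N) / (1 + df)) + 1. Whether the CLI keeps that default is not something I state here, and the document frequencies below are illustrative rather than measured.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as ln((1 + N) / (1 + df)) + 1, scikit-learn's default smoothed form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 951  # number of articles in the example run
cases = [("in every file", 951), ("in half the files", 475), ("in 3 files", 3)]
for label, doc_freq in cases:
    print(f"{label:>17}: idf = {smoothed_idf(n_docs, doc_freq):.2f}")
```

A term that appears in every file gets the minimum weight of 1.0, while a term in only 3 of the 951 files is weighted more than six times as heavily. The shared term is not zeroed out, which is why the effect is limited rather than absent.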