Mass proofreading with LanguageTool

I’ve experimented a bit with using the CLI version of LanguageTool (FOSS proofreading) to detect errors in item descriptions. So far I like it; I got rid of most false positives by disabling unnecessary rules. Tested on items/tools.json and tool_armor.json. About a dozen errors were found in just these two files, mostly wrong articles.

I’ve found tools/json_tools/util.py, which may help with collecting all the texts to check.

But first I’d like some feedback on the general idea. Should I (or we, if someone wants to help, see below) proceed and make some translators work a bit more?

Currently I just dump all texts into one stream and feed it to LT. There are no in-place corrections, so in the current state changes need to be applied manually.

Proof of concept code:

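import argparse
import json

# Print every item description from a JSON file, separated by blank lines,
# so the output can be piped into LanguageTool.
PARSER = argparse.ArgumentParser(
    description='Dump item descriptions from a JSON file for proofreading.')
PARSER.add_argument(
    '-f', '--filename', required=True, help='JSON file to read')
ARGS = PARSER.parse_args()

with open(ARGS.filename, encoding='utf-8') as filehandler:
    data = json.load(filehandler)

for item in data:
    line = item.get('description')
    # Some descriptions are objects that wrap the actual text.
    if isinstance(line, dict):
        line = line.get('description')
    # Skip items without a description instead of printing "None".
    if line is None:
        continue
    print(line, end='\n\n')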

python3 ~/cdda/texts.py -f in.json | java -jar languagetool-commandline.jar -l en --json -d EN_QUOTES,ENGLISH_WORD_REPEAT_BEGINNING_RULE,CD_NN,DASH_RULE,USE_TO_VERB,THE_WORSE_OF,PUNCTUATION_PARAGRAPH_END --disablecategories REDUNDANCY,TYPOGRAPHY,STYLE - | python3 -m json.tool > ~/tmpfs/out.json

#!/bin/bash
# Run LanguageTool on every JSON file under "$1"; open files with matches in $EDITOR.
RULES=EN_QUOTES,ENGLISH_WORD_REPEAT_BEGINNING_RULE,CD_NN,DASH_RULE,USE_TO_VERB,THE_WORSE_OF,PUNCTUATION_PARAGRAPH_END,MASS_AGREEMENT,UNIT_SPACE,EN_DIACRITICS_REPLACE,WORD_CONTAINS_UNDERSCORE
OUT=/tmpfs/out.json

# Read NUL-separated names on fd 3 so paths with spaces work and stdin stays free for $EDITOR.
while IFS= read -r -d '' file <&3; do
  echo "$file"
  python3 ~/cdda/texts.py -f "$file" \
    | java -jar languagetool-commandline.jar -l en --json -d "$RULES" \
        --disablecategories REDUNDANCY,TYPOGRAPHY,STYLE - \
    | python3 -m json.tool > "$OUT" || continue
  # An empty "matches" list means LT found nothing in this file.
  if grep -qF '"matches": []' "$OUT"; then
    echo "continuing"
    continue
  fi
  echo "matches found"
  $EDITOR "$file" &
  $EDITOR "$OUT"
done 3< <(find "$1" -name "*.json" -print0)
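
Since the texts go to LT as one stream and corrections are applied by hand, it may help to know which item a reported match belongs to. Below is a minimal sketch of one way to do that, not part of the proof of concept above: it records the start offset of every description in the stream and looks a match offset up with bisect. The names build_stream and locate are hypothetical, and it assumes the "offset" LT reports is a character offset into the piped text (which should hold for plain ASCII descriptions).

```python
import bisect


def build_stream(descriptions):
    """Join descriptions with blank lines and record where each one starts."""
    separator = "\n\n"
    starts = []
    position = 0
    for text in descriptions:
        starts.append(position)
        position += len(text) + len(separator)
    return separator.join(descriptions), starts


def locate(offset, starts):
    """Return the index of the description that contains the stream offset."""
    return bisect.bisect_right(starts, offset) - 1


# Made-up descriptions, for illustration only.
descriptions = ["A hammer.", "An pair of gloves.", "A saw."]
stream, starts = build_stream(descriptions)

# A match reported at offset 11 falls inside the second description.
index = locate(11, starts)
print(index, descriptions[index])
```

The index can then be used to look up the item in the original JSON list, as long as the stream was built from the same items in the same order.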

Sure, make pull requests with suggested changes to fix typos and errors.

We don’t want to regress translations right before a release, but we can merge them right afterwards.

Many suggestions were left out; I tried fixing only the ones I was sure about.
The most numerous ones, which I haven’t checked yet:

  • adding space between measurement and units in ammo (“9 mm” vs “9mm”)
  • adding diacritics in foreign words
  • adding commas after “Typically”, “Still”, etc.
  • making invisible-in-game descriptions all start with upper case just to appease LT
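
To see which kinds of suggestions are the most numerous before deciding what to tackle, the matches can be counted per rule. This is only a rough sketch: count_by_rule is a hypothetical helper, the sample report is hand-written, and it assumes the --json output has a top-level "matches" list in which every match carries a "rule" object with an "id".

```python
from collections import Counter


def count_by_rule(report):
    """Count LanguageTool matches per rule ID in a parsed JSON report."""
    return Counter(match["rule"]["id"] for match in report.get("matches", []))


# Hand-written sample in the assumed report layout, not real LT output.
sample_report = {
    "matches": [
        {"rule": {"id": "UNIT_SPACE"}},
        {"rule": {"id": "EN_DIACRITICS_REPLACE"}},
        {"rule": {"id": "UNIT_SPACE"}},
    ]
}

for rule_id, count in count_by_rule(sample_report).most_common():
    print(rule_id, count)
```

For a real run, the report would come from json.load on the out.json file instead of the sample dictionary.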

For anyone interested: this PR https://github.com/CleverRaven/Cataclysm-DDA/pull/42585 was done by pasting the output of python3 table.py -f csv description into the LT Desktop GUI, with some mass auto-replacements like removing blank lines. It took a considerable amount of time and several GiB of RAM to process all descriptions at once.
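
One possible way around processing everything at once is to feed LT the descriptions in smaller batches. The sketch below only shows the splitting step; batches and max_chars are hypothetical names, the size limit is arbitrary, and whether smaller inputs actually reduce LT’s memory use here is an assumption I have not measured.

```python
def batches(texts, max_chars=50000):
    """Yield lists of texts whose combined length does not exceed max_chars.

    A single text longer than max_chars still gets a batch of its own.
    """
    batch, size = [], 0
    for text in texts:
        # Start a new batch when adding this text would pass the limit.
        if batch and size + len(text) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += len(text)
    if batch:
        yield batch


# Tiny limit so the split is visible with made-up input.
for group in batches(["aaaa", "bbbb", "cc"], max_chars=8):
    print(group)
```

Each batch could then be joined with blank lines and checked separately, at the cost of starting LT once per batch.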