I’ve experimented a bit with using the CLI version of LanguageTool (FOSS proofreading tool) to detect errors in item descriptions. So far I like it: I got rid of most false positives by disabling unnecessary rules. Tested on items/tools.json and tool_armor.json, it found about a dozen errors in just these two files, mostly wrong articles.
I’ve found tools/json_tools/util.py, which may help with extracting all the texts to check.
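I haven’t wired util.py in yet, but judging by how the other json_tools scripts call it, the dump step could probably shrink to something like the sketch below (import_data and its signature are my assumption here, not a verified API):

# Sketch only: assumes util.py exposes import_data(), walking data/json
# and returning (list_of_objects, errors) -- not verified.
from util import import_data

data, errors = import_data()
for item in data:
    desc = item.get('description')
    if isinstance(desc, dict):
        desc = desc.get('description')
    if desc:
        print(desc, end='\n\n')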
But first I’d like some feedback on the general idea. Should I (or we, if someone wants to help; see below) proceed and make some translators work a bit more?
Currently I just dump all the texts into one stream and feed it to LT; there are no in-place corrections, so in the current state the changes need to be applied manually.
Proof-of-concept code:
import argparse
import json

PARSER = argparse.ArgumentParser(
    description='Dump item descriptions from a JSON file, one per paragraph.')
PARSER.add_argument(
    '-f', '--filename', required=True, help='JSON file to read')
ARGS = PARSER.parse_args()

with open(ARGS.filename) as filehandler:
    data = json.load(filehandler)

for item in data:
    line = item.get('description')
    # some descriptions are objects rather than plain strings
    if isinstance(line, dict):
        line = line.get('description')
    if line:  # skip items without a description instead of printing "None"
        print(line, end='\n\n')
python3 ~/cdda/texts.py -f in.json \
    | java -jar languagetool-commandline.jar -l en --json \
        -d EN_QUOTES,ENGLISH_WORD_REPEAT_BEGINNING_RULE,CD_NN,DASH_RULE,USE_TO_VERB,THE_WORSE_OF,PUNCTUATION_PARAGRAPH_END \
        --disablecategories REDUNDANCY,TYPOGRAPHY,STYLE - \
    | python3 -m json.tool > ~/tmpfs/out.json
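For reference, LT’s JSON output looks roughly like this for one of the wrong-article hits (abridged by hand, so treat the exact values as illustrative); the offset/length/replacements fields are what an eventual automatic fixer would key on, though they point into the dumped stream rather than the source file:

{
    "matches": [
        {
            "message": "Use \"an\" instead of \"a\" if the following word starts with a vowel sound.",
            "replacements": [{"value": "an"}],
            "offset": 16,
            "length": 1,
            "rule": {"id": "EN_A_VS_AN", "category": {"id": "MISC"}}
        }
    ]
}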
#!/bin/bash
# For each JSON file under $1: dump its descriptions, run them through LT,
# and open the file alongside the LT report whenever anything matched.
for file in $(find "$1" -name "*.json"); do
    echo "$file"
    python3 ~/cdda/texts.py -f "$file" \
        | java -jar languagetool-commandline.jar -l en --json \
            -d EN_QUOTES,ENGLISH_WORD_REPEAT_BEGINNING_RULE,CD_NN,DASH_RULE,USE_TO_VERB,THE_WORSE_OF,PUNCTUATION_PARAGRAPH_END,MASS_AGREEMENT,UNIT_SPACE,EN_DIACRITICS_REPLACE,WORD_CONTAINS_UNDERSCORE \
            --disablecategories REDUNDANCY,TYPOGRAPHY,STYLE - \
        | python3 -m json.tool > /tmpfs/out.json || continue
    # grep for an empty matches array; "continue" has to run at loop
    # level, not inside a subshell, to actually skip to the next file
    if grep -q '"matches": \[\]' /tmpfs/out.json; then
        echo "no matches, continuing"
        continue
    fi
    echo "matches found"
    $EDITOR "$file" &
    $EDITOR /tmpfs/out.json
done
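Usage is e.g. bash check_descriptions.sh data/json/items (the script name is arbitrary): it backgrounds the JSON file in $EDITOR, blocks on the LT report, and moves on to the next file once the report is closed.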