Hello, this is my third article on Habr; previously I wrote about the ALM language model. Now I want to introduce ASC, a typo correction system built on top of ALM.
Yes, there are a great many typo correction systems, each with its own strengths and weaknesses. Among the open-source ones I would single out JamSpell as the most promising, and that is the one we will compare against. There is also a similar system from DeepPavlov, which many readers may think of, but I never managed to get along with it.
Feature list:
- Correction of errors in words that differ by a Levenshtein distance of up to 4.
- Correction of typos in words (insertion, deletion, substitution, and transposition of characters).
- Yofication (restoring the letter ё), taking context into account.
- Capitalization of the first letter of a word (for proper names and titles), taking context into account.
- Splitting merged words into separate words, taking context into account.
- Analysis of a text without modifying the original text.
- Searching a text for errors, typos, and incorrect context.
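To make the first two features concrete, here is a minimal sketch of an edit distance that counts insertions, deletions, substitutions, and transpositions of adjacent characters. It is the textbook "optimal string alignment" variant, not ASC's internal implementation (which the article does not show); the names `edit_distance` and `within_limit` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: each insertion, deletion,
    substitution, or swap of two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def within_limit(word: str, candidate: str, limit: int = 4) -> bool:
    """True if the candidate is close enough to be considered a correction."""
    return edit_distance(word, candidate) <= limit


print(edit_distance("hte", "the"))
print(within_limit("speling", "spelling"))
```

A real corrector would not compare a word against the whole vocabulary this way; the distance only serves as the acceptance criterion for candidates found by an index.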
Supported operating systems:
- MacOS X
- FreeBSD
- Linux
The system is written in C++11, and there is a port for Python3.
Ready-made dictionaries
| Name | Size (GB) | RAM (GB) | N-gram size | Language |
|---|---|---|---|---|
| wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
| wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
| mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
| wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
| wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
Testing
Data from the Dialog21 2016 "typo correction" competition was used to test the system. The pre-trained binary dictionary wittenbell-3-middle.asc was used for the tests.
| Test | Precision | Recall | F-measure |
|---|---|---|---|
| Typo correction mode | 76.97 | 62.71 | 69.11 |
| Error correction mode | 73.72 | 60.53 | 66.48 |
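The last column of the table is consistent with the usual F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the scoring script itself is not reproduced in the article) and recomputes the column from the other two; `f_measure` is a hypothetical helper name.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the table above
for name, p, r in [("typo correction mode", 76.97, 62.71),
                   ("error correction mode", 73.72, 60.53)]:
    print(f"{name}: F = {f_measure(p, r):.2f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high score by being precise while missing most errors.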
I don't think any further data is needed; anyone who wishes can repeat the tests, and all the materials used in testing are listed below.
Materials used in testing
- test.txt - text to be tested
- correct.txt - text with the correct variants
- evaluate.py - Python3 script for scoring the correction results
Now it is interesting to compare the typo correction systems themselves under equal conditions: we will train two different spell checkers on the same text data and run the test.
For the comparison, let's take the typo correction system I mentioned above, JamSpell.
ASC vs JamSpell
Installation
ASC
$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
JamSpell
$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
Training
ASC
train.json
{
"ext": "txt",
"size": 3,
"alter": {"":""},
"debug": 1,
"threads": 0,
"method": "train",
"allow-unk": true,
"reset-unk": true,
"confidence": true,
"interpolate": true,
"mixed-dicts": true,
"only-token-words": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"corpus": "./texts/correct.txt",
"w-bin": "./dictionary/3-middle.asc",
"w-vocab": "./train/lm.vocab",
"w-arpa": "./train/lm.arpa",
"mix-restwords": "./similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
Python3
import asc
asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
def statusArpa1(status):
print("Build arpa", status)
def statusArpa2(status):
print("Write arpa", status)
def statusVocab(status):
print("Write vocab", status)
def statusIndex(text, status):
print(text, status)
def status(text, status):
print(text, status)
asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
asc.buildArpa(statusArpa1)
asc.writeArpa("./train/lm.arpa", statusArpa2)
asc.writeVocab("./train/lm.vocab", statusVocab)
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
JamSpell
$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin
Testing
ASC
spell.json
{
"debug": 1,
"threads": 0,
"method": "spell",
"spell-verbose": true,
"confidence": true,
"mixed-dicts": true,
"asc-split": true,
"asc-alter": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"asc-wordrep": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"r-bin": "./dictionary/3-middle.asc"
}
$ ./asc -r-json ./spell.json
Python3
import asc

asc.setAlmV2()
asc.setThreads(0)

# Correction options
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)

def status(text, status):
    print(text, status)

# Load the binary dictionary
asc.loadIndex("./dictionary/3-middle.asc", "", status)

# Correct the test file line by line and write the result
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')
for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])
f2.close()
f1.close()
JamSpell
For JamSpell the test was run through a small C++ application rather than through Python:
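#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <jamspell/spell_corrector.hpp>

// If BOOST is used for the conversion
#ifdef USE_BOOST_CONVERT
    #include <boost/locale/encoding_utf.hpp>
// Otherwise use the standard library
#else
    #include <codecvt>
#endif

using namespace std;

/**
 * convert Convert a wide string to UTF-8
 * @param  str wide string to convert
 * @return     UTF-8 encoded string
 */
const string convert(const wstring & str){
    // Conversion result
    string result = "";
    // Convert only a non-empty string
    if(!str.empty()){
// If BOOST is used
#ifdef USE_BOOST_CONVERT
        using boost::locale::conv::utf_to_utf;
        // Convert to UTF-8
        result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
        // UTF-8 conversion facet
        using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
        wstring_convert <convert_type, wchar_t> conv;
        // wstring_convert <codecvt_utf8 <wchar_t>> conv;
        // Convert to UTF-8
        result = conv.to_bytes(str);
#endif
    }
    return result;
}
/**
 * convert Convert a UTF-8 string to a wide string
 * @param  str UTF-8 encoded string to convert
 * @return     wide string
 */
const wstring convert(const string & str){
    // Conversion result
    wstring result = L"";
    // Convert only a non-empty string
    if(!str.empty()){
// If BOOST is used
#ifdef USE_BOOST_CONVERT
        using boost::locale::conv::utf_to_utf;
        // Convert from UTF-8
        result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
        // wstring_convert <codecvt_utf8 <wchar_t>> conv;
        wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
        // Convert from UTF-8
        result = conv.from_bytes(str);
#endif
    }
    return result;
}
/**
 * safeGetline Read one line, accepting \n, \r\n and \r line endings
 * @param  is input stream
 * @param  t  string that receives the line
 * @return    the input stream
 */
istream & safeGetline(istream & is, string & t){
    // Clear the output string
    t.clear();
    istream::sentry se(is, true);
    streambuf * sb = is.rdbuf();
    for(;;){
        int c = sb->sbumpc();
        switch(c){
            case '\n': return is;
            case '\r':
                if(sb->sgetc() == '\n') sb->sbumpc();
                return is;
            case streambuf::traits_type::eof():
                // Signal end of file only when nothing was read
                if(t.empty()) is.setstate(ios::eofbit);
                return is;
            default: t += (char) c;
        }
    }
}
/**
 * main Correct the test file with JamSpell and write the result
 */
int main(){
    // Create the spell corrector
    NJamSpell::TSpellCorrector corrector;
    // Load the trained model
    corrector.LoadLangModel("model.bin");
    // Open the file with the test text
    ifstream file1("./test_data/test.txt", ios::in);
    // If the input file is open
    if(file1.is_open()){
        // Current line and its corrected version
        string line = "", res = "";
        // Open the output file
        ofstream file2("./test_data/output.txt", ios::out);
        // If the output file is open
        if(file2.is_open()){
            // Read the input file to the end
            while(file1.good()){
                // Read the next line
                safeGetline(file1, line);
                // If the line is not empty, correct it
                if(!line.empty()){
                    // Run the correction
                    res = convert(corrector.FixFragment(convert(line)));
                    // If a result was obtained, write it out
                    if(!res.empty()){
                        // Add a line break
                        res.append("\n");
                        // Write the result to the output file
                        file2.write(res.c_str(), res.size());
                    }
                }
            }
            // Close the output file
            file2.close();
        }
        // Close the input file
        file1.close();
    }
    return 0;
}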
$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test
$ ./bin/test
Results
Getting the results
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
ASC
| Precision | Recall | F-measure |
|---|---|---|
| 92.13 | 82.51 | 87.05 |
JamSpell
| Precision | Recall | F-measure |
|---|---|---|
| 77.87 | 63.36 | 69.87 |
One of the key features of ASC is learning from dirty data. It is practically impossible to find publicly available text corpora that are free of errors and typos. A lifetime is not enough to fix terabytes of data by hand, yet the problem has to be dealt with somehow.
The training principle I propose
- Build a language model from the dirty data
- Remove all rare words and N-grams from the resulting language model
- Add individual words so that the typo correction system works more correctly
- Build the binary dictionary
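The first three steps can be illustrated on a toy scale. The sketch below counts N-grams in a noisy corpus, drops everything seen fewer than `min_count` times, and keeps explicitly whitelisted single words. It is a simplification under stated assumptions: ALM prunes by a weight threshold rather than a raw count, and `count_ngrams`, `prune`, and the sample sentences are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count all N-grams of length n in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count, keep=frozenset()):
    """Drop rare N-grams; single words listed in `keep` survive regardless."""
    return Counter({
        gram: count
        for gram, count in counts.items()
        if count >= min_count or (len(gram) == 1 and gram[0] in keep)
    })


# "sta" is a typo that occurs once; "dog" is rare but known to be correct
corpus = ["the cat sat", "the cat sat", "the cat sta", "the dog sat"]
unigrams = prune(count_ngrams(corpus, 1), min_count=2, keep={"dog"})
bigrams = prune(count_ngrams(corpus, 2), min_count=2)

print(sorted(unigrams))
print(sorted(bigrams))
```

The idea is that typos are individually rare even in a dirty corpus, so a frequency cut removes most of them, while the whitelist protects rare but valid words from the same cut.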
Let's get started
Suppose we have several corpora on different subjects; it is more logical to train on them separately and then merge the results.
Building the corpus with ALM
collect.json
{
"size": 3,
"debug": 1,
"threads": 0,
"ext": "txt",
"method": "train",
"allow-unk": true,
"mixed-dicts": true,
"only-token-words": true,
"smoothing": "wittenbell",
"locale": "en_US.UTF-8",
"w-abbr": "./output/alm.abbr",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"w-words": "./output/words.txt",
"corpus": "./texts/corpus",
"abbrs": "./abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./collect.json
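- size: use N-grams of length 3
- debug: show a progress indicator while the data is processed
- threads: use all available cores
- ext: extension of the corpus files that are suitable for training
- allow-unk: allow the 〈unk〉 token in the language model
- mixed-dicts: allow correcting words that contain look-alike letters from other alphabets
- only-token-words: collect N-grams only from sequences of normal words, not from every token as is
- smoothing: the wittenbell smoothing algorithm (it is not used at this stage, but some smoothing algorithm must be specified)
- locale: environment locale (optional)
- w-abbr: save the collected suffixes of numeric abbreviations
- w-map: save the sequence map as an intermediate result
- w-vocab: save the collected vocabulary
- w-words: save the list of collected unique words (just in case)
- corpus: directory with the text data of the corpus
- abbrs: commonly used abbreviations to take into account during training
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- mix-restwords: file with look-alike characters of different languages
- alphabet: the alphabet to use (it must always be the same one)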
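The mixed-dicts and mix-restwords settings deal with words in which a Latin letter was typed in place of a visually identical Cyrillic one. As a rough illustration only, the sketch below replaces such letters when a word mixes both alphabets. The letter pairs are assumed visual matches, the function names are hypothetical, and this is not how ASC itself performs the substitution.

```python
# Latin letters and the Cyrillic letters they are commonly confused with
# (assumed pairs, chosen by visual similarity)
LOOKALIKES = {
    'p': 'р', 'c': 'с', 'o': 'о', 't': 'т', 'k': 'к', 'e': 'е',
    'a': 'а', 'h': 'н', 'x': 'х', 'b': 'в', 'm': 'м',
}


def is_cyrillic(ch: str) -> bool:
    """True for lowercase Russian letters."""
    return 'а' <= ch <= 'я' or ch == 'ё'


def fix_mixed_word(word: str) -> str:
    """Replace Latin look-alikes, but only in words that mix both alphabets."""
    lower = word.lower()
    has_cyrillic = any(is_cyrillic(c) for c in lower)
    has_latin = any('a' <= c <= 'z' for c in lower)
    if not (has_cyrillic and has_latin):
        return word  # pure Latin or pure Cyrillic words are left alone
    return ''.join(LOOKALIKES.get(c, c) for c in lower)


# "\u043co\u043bo\u043ao" is the Russian word for "milk" typed with Latin "o"
print(fix_mixed_word("\u043co\u043bo\u043ao"))
print(fix_mixed_word("table"))
```

Restricting the replacement to mixed words matters: a legitimate English word such as "table" would otherwise be turned into a meaningless Cyrillic string.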
Python
import alm
# N- 3
alm.setSize(3)
#
alm.setThreads(0)
# ( )
alm.setLocale("en_US.UTF-8")
# ( )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# <unk>
alm.setOption(alm.options_t.allowUnk)
#
alm.setOption(alm.options_t.mixDicts)
# N- โ
alm.setOption(alm.options_t.tokenWords)
# wittenbell ( , - )
alm.init(alm.smoothing_t.wittenBell)
# , , (, , ...)
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
abbr = abbr.replace("\n", "")
alm.addAbbr(abbr)
f.close()
#
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addGoodword(word)
f.close()
#
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addBadword(word)
f.close()
def status(text, status):
print(text, status)
def statusWords(status):
print("Write words", status)
def statusVocab(status):
print("Write vocab", status)
def statusMap(status):
print("Write map", status)
def statusSuffix(status):
print("Write suffix", status)
#
alm.collectCorpus("./texts/corpus", status)
#
alm.writeWords("./output/words.txt", statusWords)
#
alm.writeVocab("./output/alm.vocab", statusVocab)
#
alm.writeMap("./output/alm.map", statusMap)
#
alm.writeSuffix("./output/alm.abbr", statusSuffix)
,
Pruning the collected corpus with ALM
prune.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "vprune",
"vprune-wltf": -15.0,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./corpus1/alm.map",
"r-vocab": "./corpus1/alm.vocab",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./prune.json
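- size: use N-grams of length 3
- debug: show a progress indicator while the vocabulary is pruned
- allow-unk: allow the 〈unk〉 token in the language model
- vprune-wltf: minimum allowed weight of a word in the vocabulary (everything below it is removed)
- locale: environment locale (optional)
- smoothing: the wittenbell smoothing algorithm (it is not used at this stage, but some smoothing algorithm must be specified)
- r-map: sequence map to load
- r-vocab: vocabulary to load
- w-map: where to save the resulting sequence map
- w-vocab: where to save the resulting vocabulary
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- alphabet: the alphabet to use (it must always be the same one)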
Python
import alm
# N- 3
alm.setSize(3)
#
alm.setThreads(0)
# ( )
alm.setLocale("en_US.UTF-8")
# ( )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# <unk>
alm.setOption(alm.options_t.allowUnk)
# wittenbell ( , - )
alm.init(alm.smoothing_t.wittenBell)
#
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addGoodword(word)
f.close()
#
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addBadword(word)
f.close()
def statusPrune(status):
print("Prune data", status)
def statusReadVocab(text, status):
print("Read vocab", text, status)
def statusWriteVocab(status):
print("Write vocab", status)
def statusReadMap(text, status):
print("Read map", text, status)
def statusWriteMap(status):
print("Write map", status)
#
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
#
alm.readMap("./corpus1/alm.map", statusReadMap)
#
alm.pruneVocab(-15.0, 0, 0, statusPrune)
#
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
#
alm.writeMap("./output/alm.map", statusWriteMap)
Merging the collected data with ALM
merge.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "merge",
"mixed-dicts": "true",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-words": "./texts/words",
"r-map": "./corpus1",
"r-vocab": "./corpus1",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./merge.json
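- size: use N-grams of length 3
- debug: show a progress indicator while the data is loaded
- allow-unk: allow the 〈unk〉 token in the language model
- mixed-dicts: allow correcting words that contain look-alike letters from other alphabets
- locale: environment locale (optional)
- smoothing: the wittenbell smoothing algorithm (it is not used at this stage, but some smoothing algorithm must be specified)
- r-words: directory with files of individual words to add
- r-map: directory with the sequence map files, collected and pruned at the previous steps
- r-vocab: directory with the vocabulary files, collected and pruned at the previous steps
- w-map: where to save the resulting sequence map
- w-vocab: where to save the resulting vocabulary
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- alphabet: the alphabet to use (it must always be the same one)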
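Conceptually, merging per-topic data amounts to summing the counts of identical entries. The sketch below shows this for word counts only, using made-up topic vocabularies; it is an illustration of the idea, not of ALM's vocabulary or map file formats, and `merge_vocabs` is a hypothetical helper.

```python
from collections import Counter


def merge_vocabs(*vocabs):
    """Sum word counts from several vocabularies without modifying the inputs."""
    merged = Counter()
    for vocab in vocabs:
        merged.update(vocab)
    return merged


# Hypothetical word counts collected separately from two topic corpora
news = Counter({"market": 120, "bank": 80, "goal": 3})
sport = Counter({"goal": 95, "match": 60, "bank": 2})

merged = merge_vocabs(news, sport)
print(merged.most_common(3))
```

A word that is rare in one topic ("goal" in the news corpus) can be frequent in another, which is one reason to prune each corpus on its own terms before merging.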
Python
import alm
# N- 3
alm.setSize(3)
#
alm.setThreads(0)
# ( )
alm.setLocale("en_US.UTF-8")
# ( )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# <unk>
alm.setOption(alm.options_t.allowUnk)
#
alm.setOption(alm.options_t.mixDicts)
# wittenbell ( , - )
alm.init(alm.smoothing_t.wittenBell)
#
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addGoodword(word)
f.close()
#
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addBadword(word)
f.close()
#
f = open('./texts/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
alm.addWord(word)
f.close()
def statusReadVocab(text, status):
print("Read vocab", text, status)
def statusWriteVocab(status):
print("Write vocab", status)
def statusReadMap(text, status):
print("Read map", text, status)
def statusWriteMap(status):
print("Write map", status)
#
alm.readVocab("./corpus1", statusReadVocab)
#
alm.readMap("./corpus1", statusReadMap)
#
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
#
alm.writeMap("./output/alm.map", statusWriteMap)
Training the language model with ALM
train.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"reset-unk": true,
"interpolate": true,
"method": "train",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./output/alm.map",
"r-vocab": "./output/alm.vocab",
"w-arpa": "./output/alm.arpa",
"w-words": "./output/words.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./train.json
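- size: use N-grams of length 3
- debug: show a progress indicator while the language model is trained
- allow-unk: allow the 〈unk〉 token in the language model
- reset-unk: reset the frequency of the 〈unk〉 token in the language model
- interpolate: use interpolation in the calculations
- locale: environment locale (optional)
- smoothing: the wittenbell smoothing algorithm
- r-map: sequence map, collected at the previous steps
- r-vocab: vocabulary, collected at the previous steps
- w-arpa: ARPA file to which the trained language model is written
- w-words: file for the list of unique words (just in case)
- alphabet: the alphabet to use (it must always be the same one)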
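For readers unfamiliar with the smoothing and interpolate settings, here is a minimal sketch of interpolated Witten-Bell smoothing for bigrams in its textbook form: P(w|h) = (c(h,w) + T(h)·P(w)) / (c(h) + T(h)), where T(h) is the number of distinct words seen after h. This is not ALM's implementation, which handles longer N-grams and the 〈unk〉 token; the class name and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


class WittenBellBigram:
    """Bigram model with interpolated Witten-Bell smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()          # c(w)
        self.bigrams = Counter()           # c(h, w)
        self.context = Counter()           # c(h): how often h was followed by a word
        self.followers = defaultdict(set)  # distinct words seen after h
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens[1:])  # <s> is never predicted
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.context[prev] += 1
                self.followers[prev].add(word)
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        """Maximum-likelihood unigram probability."""
        return self.unigrams[word] / self.total

    def prob(self, prev, word):
        """P(word | prev), backing off to the unigram for unseen histories."""
        seen = self.context[prev]
        types = len(self.followers[prev])
        if seen == 0:
            return self.unigram_prob(word)
        return (self.bigrams[(prev, word)] + types * self.unigram_prob(word)) / (seen + types)


model = WittenBellBigram(["the cat sat", "the cat ran", "the dog sat"])
print(model.prob("the", "cat"))  # seen bigram
print(model.prob("the", "sat"))  # unseen bigram still gets some probability
```

The more distinct continuations a history has, the more probability mass is moved to the lower-order distribution, which is what lets unseen but plausible word pairs receive a non-zero score.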
Python
import alm
# N- 3
alm.setSize(3)
#
alm.setThreads(0)
# ( )
alm.setLocale("en_US.UTF-8")
# ( )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# <unk>
alm.setOption(alm.options_t.allowUnk)
# <unk>
alm.setOption(alm.options_t.resetUnk)
#
alm.setOption(alm.options_t.mixDicts)
#
alm.setOption(alm.options_t.interpolate)
# wittenbell ( , - )
alm.init(alm.smoothing_t.wittenBell)
def statusReadVocab(text, status):
print("Read vocab", text, status)
def statusReadMap(text, status):
print("Read map", text, status)
def statusBuildArpa(status):
print("Build ARPA", status)
def statusWriteMap(status):
print("Write map", status)
def statusWriteArpa(status):
print("Write ARPA", status)
def statusWords(status):
print("Write words", status)
#
alm.readVocab("./output/alm.vocab", statusReadVocab)
#
alm.readMap("./output/alm.map", statusReadMap)
#
alm.buildArpa(statusBuildArpa)
# ARPA
alm.writeArpa("./output/alm.arpa", statusWriteArpa)
#
alm.writeWords("./output/words.txt", statusWords)
Training the ASC spell checker
train.json
{
"size": 3,
"debug": 1,
"threads": 0,
"confidence": true,
"mixed-dicts": true,
"method": "train",
"alter": {"":""},
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"w-bin": "./dictionary/3-single.asc",
"r-abbr": "./output/alm.abbr",
"r-vocab": "./output/alm.vocab",
"r-arpa": "./output/alm.arpa",
"abbrs": "./texts/abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alters": "./texts/alters/yoficator.txt",
"upwords": "./texts/words/upp",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
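- size: use N-grams of length 3
- debug: show a progress indicator while the dictionary is built
- threads: use all available cores
- confidence: load the data from ARPA as is, without re-tokenization
- mixed-dicts: allow correcting words that contain look-alike letters from other alphabets
- alter: alternative letters (here the pair е and ё)
- locale: environment locale (optional)
- smoothing: the wittenbell smoothing algorithm (it is not used at this stage, but some smoothing algorithm must be specified)
- pilots: pilot words (single-letter words that exist in the language)
- w-bin: where to save the binary dictionary
- r-abbr: suffixes of numeric abbreviations, collected at the previous steps
- r-vocab: vocabulary, collected at the previous steps
- r-arpa: language model in ARPA format, built at the previous step
- abbrs: commonly used abbreviations to take into account during training
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- alters: file of words with alternative letters
- upwords: words that are always written with a capital letter (proper names and the like)
- mix-restwords: file with look-alike characters of different languages
- alphabet: the alphabet to use (it must always be the same one)
- bin-code: language code of the dictionary
- bin-name: name of the dictionary
- bin-author: author of the dictionary
- bin-copyright: copyright holder of the dictionary
- bin-contacts: contact details of the dictionary author
- bin-lictype: license type of the dictionary
- bin-lictext: license text of the dictionary
- embedding-size: size of the embedding
- embedding: character embedding map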
Python
import asc
# N- 3
asc.setSize(3)
#
asc.setThreads(0)
# ( )
asc.setLocale("en_US.UTF-8")
#
asc.setOption(asc.options_t.uppers)
# <unk>
asc.setOption(asc.options_t.allowUnk)
# <unk>
asc.setOption(asc.options_t.resetUnk)
#
asc.setOption(asc.options_t.mixDicts)
# ARPA - ,
asc.setOption(asc.options_t.confidence)
# ( )
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# ( )
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
#
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
#
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
asc.addGoodword(word)
f.close()
#
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
word = word.replace("\n", "")
asc.addBadword(word)
f.close()
#
f = open('./output/alm.abbr')
for word in f.readlines():
word = word.replace("\n", "")
asc.addSuffix(word)
f.close()
# , (, , ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
abbr = abbr.replace("\n", "")
asc.addAbbr(abbr)
f.close()
# , (, , ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
asc.addUWord(word)
f.close()
#
asc.addAlt("", "")
# , ( )
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
words = words.replace("\n", "")
words = words.split('\t')
asc.addAlt(words[0], words[1])
f.close()
def statusIndex(text, status):
print(text, status)
def statusBuildIndex(status):
print("Build index", status)
def statusArpa(status):
print("Read arpa", status)
def statusVocab(status):
print("Read vocab", status)
# ARPA
asc.readArpa("./output/alm.arpa", statusArpa)
#
asc.readVocab("./output/alm.vocab", statusVocab)
#
asc.setCode("RU")
#
asc.setLictype("MIT")
#
asc.setName("Russian")
#
asc.setAuthor("You name")
#
asc.setCopyright("You company LLC")
#
asc.setLictext("... License text ...")
#
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# ( , )
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
#
asc.buildIndex(statusBuildIndex)
#
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
I understand that not everyone can train their own binary dictionary; it requires text corpora and significant computing resources. For that reason, ASC is also able to work with just a single ARPA file as its main dictionary.
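For orientation, an ARPA file is a plain-text format: a `\data\` header with N-gram counts, then one `\N-grams:` section per order, where each line holds a log10 probability, the words, and an optional backoff weight. The sketch below is a minimal reader for that layout, written only to show what such a file contains; it is not the loader ASC uses, and `parse_arpa` and the sample data are hypothetical.

```python
def parse_arpa(text):
    """Return {order: {ngram_tuple: (log10_prob, backoff)}} from ARPA text."""
    model = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # skip blank lines and the header
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\2-grams:"
            order = int(line[1:line.index("-")])
            model[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional and defaults to 0.0
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        model[order][words] = (logprob, backoff)
    return model


# A tiny made-up model in ARPA layout
SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-0.5\t<s>\t-0.3",
    "-1.0\thello\t-0.2",
    "-1.2\t</s>",
    "",
    "\\2-grams:",
    "-0.4\t<s> hello",
    "",
    "\\end\\",
])

model = parse_arpa(SAMPLE)
print(model[1][("hello",)])
print(model[2][("<s>", "hello")])
```

Because the format is text, an ARPA model trained with another toolkit can in principle be inspected or filtered with a few lines of code before it is handed to a spell checker.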
Usage example
spell.json
{
"ad": 13,
"cw": 38120,
"debug": 1,
"threads": 0,
"method": "spell",
"alter": {"":""},
"asc-split": true,
"asc-alter": true,
"confidence": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"mixed-dicts": true,
"asc-wordrep": true,
"spell-verbose": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"upwords": "./texts/words/upp",
"r-arpa": "./dictionary/alm.arpa",
"r-abbr": "./dictionary/alm.abbr",
"abbrs": "./texts/abbrs/abbrs.txt",
"alters": "./texts/alters/yoficator.txt",
"mix-restwords": "./similars/letters.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./spell.json
Python
import asc
#
asc.setThreads(0)
#
asc.setOption(asc.options_t.uppers)
#
asc.setOption(asc.options_t.ascSplit)
#
asc.setOption(asc.options_t.ascAlter)
#
asc.setOption(asc.options_t.ascESplit)
#
asc.setOption(asc.options_t.ascRSplit)
#
asc.setOption(asc.options_t.ascUppers)
#
asc.setOption(asc.options_t.ascHyphen)
#
asc.setOption(asc.options_t.ascWordRep)
#
asc.setOption(asc.options_t.mixDicts)
# ARPA - ,
asc.setOption(asc.options_t.confidence)
# ( )
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# ( )
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
#
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
#
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
asc.addGoodword(word)
f.close()
#
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
word = word.replace("\n", "")
asc.addBadword(word)
f.close()
#
f = open('./output/alm.abbr')
for word in f.readlines():
word = word.replace("\n", "")
asc.addSuffix(word)
f.close()
# , (, , ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
abbr = abbr.replace("\n", "")
asc.addAbbr(abbr)
f.close()
# , (, , ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
word = word.replace("\n", "")
asc.addUWord(word)
f.close()
#
asc.addAlt("", "")
# , ( )
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
words = words.replace("\n", "")
words = words.split('\t')
asc.addAlt(words[0], words[1])
f.close()
def statusArpa(status):
print("Read arpa", status)
def statusIndex(status):
print("Build index", status)
# ARPA
asc.readArpa("./dictionary/alm.arpa", statusArpa)
# (38120 13 )
asc.setAdCw(38120, 13)
# ( , )
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
#
asc.buildIndex(statusIndex)
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')
for line in f1.readlines():
res = asc.spell(line)
f2.write("%s\n" % res[0])
f2.close()
f1.close()
P.S. For those who do not want to build and train anything at all, I have put up a web version of ASC. Also keep in mind that a typo correction system is not omniscient, and it is impossible to feed the entire Russian language into it. ASC will not correct just any text; it has to be trained separately for each subject area.