How many foreign tourists are there in your city? In mine there are few, but they exist. As a rule, they stand lost in the middle of the street and repeat a single word, some place name. Passers-by try to explain with gestures where to go, and when that fails ("me no understand you") they take the tourists by the hand and lead them to their destination. Oddly enough, the destination is usually within a five-minute walk, i.e. these tourists still have a rough idea of the city. Perhaps they navigate by paper maps.
How often have you personally found yourself in such a situation, in an unfamiliar city in another country?
The arrival of smartphones and navigation apps has solved many problems. Hooray: you can see your location, find where you need to go, work out the direction, and even plan a route.
Only one problem remains: every street in the app is labelled in local hieroglyphs in the local dialect. That is fine if the host country uses the Latin alphabet, since every smartphone has a Latin keyboard and the world is used to it, and even then I felt uncomfortable with the diacritics of the Czech alphabet. I can only imagine the pain and suffering of foreigners looking at Cyrillic; look at faux Cyrillic and you will understand. In their place, I would write names and addresses in Latin letters, trying to reproduce the sound: a phonetic search.
In this article I will describe how to implement the Soundex phonetic search algorithms on the Sphinx search engine. Transliteration alone will not be enough here, although we cannot do without it. The resulting configuration file is available as a GitHub Gist.
Introduction
, , -, , , Sphinx Search.
, , , .. , - Sphinx.
, , , , , . , , .
, . Soundex Metaphone, . Soundex , Metaphone .
, Sphinx Soundex, , . , , . .. . .
. , : « » – , , « », , . , , , , , .
, Soundex, , , NYSIIS, Daitch-Mokotoff.
To connect to Sphinx over SphinxQL, use the mysql client:
mysql -h 127.0.0.1 -P 9306 --default-character-set=utf8
Sphinx, , Sphinx Search, , , . .
Soundex
. , Sphinx Search, , , .. .
, : , – . .
– , Sphinx .
, , , , , : . – , - , , – . " ", . , , , .
regexp_filter = (|) => a
regexp_filter = (|) =>
, – , GitHub Gist.
soundex :
morphology = soundex
, , Sphinx Soundex.
, , Sphinx. -. - , , . . «», «», - , «Lenina», «ulitsa Lenina».
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | lenin | l500 |
| 2 | lenina | l500 |
| 3 | lenina | l500 |
| 4 | lennina | l500 |
| 5 | lenin | l500 |
+------+-----------+------------+
, tokenized , . normalized, Sphinx , , morphology. 'Lenina' l500, '' l500, , - , . Lennina, Lenena, Lennona. , , .
, :
mysql> select * from STREETS where match('Lenena');
+------+--------------------------------------+-----------+--------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------+
| 387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 | | |
+------+--------------------------------------+-----------+--------------+
Sphinx , . . , :
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+----------------+------------+
| qpos | tokenized | normalized |
+------+----------------+------------+
| 1 | plekhanovskaja | p42512 |
| 2 | plechanovskaya | p42512 |
| 3 | plehanovskaja | p4512 |
| 4 | plekhanovska | p42512 |
+------+----------------+------------+
plehanovskaja -
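The normalized values in the two tables above are consistent with a Soundex variant that keeps the first letter, skips vowels and H, W, Y without resetting the previous code, does not truncate the result, and pads it with zeros to four characters. The sketch below is inferred from that output only; it is an assumption, not Sphinx's actual implementation, and the name `soundex_like` is illustrative.

```python
# Classic Soundex consonant classes; vowels and h, w, y carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex_like(word: str) -> str:
    """Soundex variant inferred from the CALL KEYWORDS output (an assumption)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    out = [word[0]]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch)
        if code is None:
            continue  # uncoded letter: skipped, and it does NOT reset prev
        if code != prev:
            out.append(code)
        prev = code
    # Pad short keys with zeros to four characters; longer keys are kept whole.
    return "".join(out).ljust(4, "0")


for name in ["lenin", "lennina", "plekhanovskaja", "plehanovskaja"]:
    print(name, soundex_like(name))
```

Under these assumptions `plehanovskaja` loses the `2` that `k` or `c` contributes in the other spellings, which is why it ends up with a different key.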
. Sphinx . , CALL QSUGGEST:
mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
+----------------+----------+------+
| suggest | distance | docs |
+----------------+----------+------+
| plekhanovskaja | 1 | 1 |
| petrovskaja | 4 | 1 |
+----------------+----------+------+
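The distance column appears to be an edit distance between the query word and each suggestion. A minimal sketch, assuming plain Levenshtein distance (the function name is illustrative), reproduces both values in the table above:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the letters match
            ))
        prev = cur
    return prev[-1]


print(levenshtein("plehanovskaja", "plekhanovskaja"))
print(levenshtein("plehanovskaja", "petrovskaja"))
```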
, , . .. .
, :
min_infix_len = 2
suggest tokenized, .. , . , Soundex , QSUGGEST .
- :
mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
mysql> select * from STREETS where match('30 ');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
, .
Soundex
. , , , .
.
Sphinx index
, , , . , Sphinx , . .. , regexp_filter
, regexp_filter
.
morphology = soundex
– , . , .
Sphinx , , ! . RE2.
, : regexp_filter = \A(A|a) => a
, 0.
regexp_filter = \B(A|a) => 0
regexp_filter = \B(Y|y) => 0
...
, regexp_filter = \B(Y|y) =>
, - . , «» «Veelkaseem» .
mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v738 | v738 |
| 2 | v738 | v738 |
+------+-----------+------------+
- :
mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v738 | v738 |
| 2 | v0730308 | v0730308 |
+------+-----------+------------+
, H W .
, , /, H W, . .
regexp_filter = 0+ => 0
regexp_filter = 1+ => 1
...
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | l8 | l8 |
| 2 | l8 | l8 |
| 3 | l8 | l8 |
| 4 | l8 | l8 |
| 5 | l8 | l8 |
+------+-----------+------------+
mysql> select * from STREETS where match('Lenina');
+------+--------------------------------------+-----------+--------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------+
| 387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 | | |
+------+--------------------------------------+-----------+--------------+
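The net effect of the regexp_filter chain can be sketched outside Sphinx. The digit classes below are inferred from the CALL KEYWORDS output in this section, and they happen to match the commonly cited Refined Soundex table. The table, the function name `refined_key`, and the choice to drop vowels rather than code them as 0 are all assumptions made for illustration.

```python
import re

# Digit classes inferred from the output in this section (assumption).
REFINED = {}
for digit, letters in enumerate(
    ["bp", "fv", "cks", "gj", "qxz", "dt", "l", "mn", "r"], start=1
):
    for ch in letters:
        REFINED[ch] = str(digit)


def refined_key(word: str) -> str:
    """Keep the first letter, code the rest, collapse repeated digits."""
    word = word.lower()
    if not word.isalpha():
        return word  # tokens such as house numbers are left untouched
    head, tail = word[0], word[1:]
    # Consonants become digits; vowels and h, w, y are dropped entirely.
    digits = "".join(REFINED.get(ch, "") for ch in tail)
    # Same effect as the "N+ => N" filters: squeeze runs of one digit.
    digits = re.sub(r"(\d)\1+", r"\1", digits)
    return head + digits


for name in ["Lenina", "Lennina", "Veelkaseem", "Plekhanovskaja"]:
    print(name, refined_key(name))
```

Because the key is no longer padded or built from only six broad classes, names that differ by one consonant class get different keys, at the cost of being less forgiving of misspellings.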
, . , tokenized , soundex-. QSUGGEST . - , – . ngram_chars. .
:
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | p738234 | p738234 |
| 2 | p73823 | p73823 |
| 3 | p78234 | p78234 |
| 4 | p73823 | p73823 |
+------+-----------+------------+
, , QSUGGEST :
mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
Empty set (0.00 sec)
mysql> CALL QSUGGEST('p73823', 'STREETS');
Empty set (0.00 sec)
mysql> CALL QSUGGEST('p78234', 'STREETS');
Empty set (0.00 sec)
, , , . , , . . , «30 »:
mysql> call keywords('30 let Podedy', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 30 | 30 |
| 2 | l6 | l6 |
| 3 | p6 | p6 |
+------+-----------+------------+
mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
:
mysql> select * from STREETS where match('');
+------+--------------------------------------+--------------+----------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+--------------+----------------------+
| 873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 | | 30 |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd | | 50 |
+------+--------------------------------------+--------------+----------------------+
, , , .
NYSIIS
. «» - . «» , , - , .
(?i) .
, . :
regexp_filter = (?i)\b(mac) => mcc
regexp_filter = (?i)(ee)\b => y
: H, W
regexp_filter = (?i)(a|e|i|o|u|y)h => \1
regexp_filter = (?i)(a|e|i|o|u|y)w => \1a
regexp_filter = (?i)\B(e|i|o|u) => a
regexp_filter = (?i)\B(q) => g
S
regexp_filter = (?i)s\b =>
AY Y
A
, , !!!
, - , , , CALL QSUGGEST.
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | lanan | lanan |
| 2 | lanan | lanan |
| 3 | lanan | lanan |
| 4 | lannan | lannan |
| 5 | lanan | lanan |
+------+-----------+------------+
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+---------------+---------------+
| qpos | tokenized | normalized |
+------+---------------+---------------+
| 1 | plachanavscaj | plachanavscaj |
| 2 | plachanavscay | plachanavscay |
| 3 | plaanavscaj | plaanavscaj |
| 4 | plachanavsc | plachanavsc |
+------+---------------+---------------+
, CALL QSUGGEST Plehanovskaja, plaanavscaj:
mysql> CALL QSUGGEST('plaanavscaj', 'STREETS');
+---------------+----------+------+
| suggest | distance | docs |
+---------------+----------+------+
| paanarscaj | 2 | 1 |
| plachanavscaj | 2 | 1 |
| latavscaj | 3 | 1 |
| sladcavscaj | 3 | 1 |
| pacravscaj | 3 | 1 |
+---------------+----------+------+
. - .
paanarscaj →
plachanavscaj →
latavscaj →
sladcavscaj →
pacravscaj →
- , . - . , . , , .
Daitch-Mokotoff Soundex
, , Soundex.
. , « », , , - , , - .
, .
.
, .. :
regexp_filter = (?i)\b(au) => 0
regexp_filter = (?i)(a|e|i|o|u|y)(au) => \17
, \B ,
regexp_filter = (?i)au =>
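To see how the three AU filters interact, here is a rough Python equivalent. It assumes the filters run in the order shown and that RE2 reads `\17` as group 1 followed by a literal 7; Python's `re` needs `\g<1>7` to express the same thing. The names `AU_RULES` and `apply_rules` are illustrative.

```python
import re

# (pattern, replacement) pairs mirroring the three regexp_filter lines.
AU_RULES = [
    (r"(?i)\b(au)", "0"),                    # AU at the start of a word
    (r"(?i)(a|e|i|o|u|y)(au)", r"\g<1>7"),   # AU right after a vowel
    (r"(?i)au", ""),                         # any remaining AU is not coded
]


def apply_rules(text: str, rules=AU_RULES) -> str:
    """Apply each substitution to the whole string, in declaration order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text


for word in ["aurora", "eau", "paul"]:
    print(word, apply_rules(word))
```

The order matters: if the catch-all rule came first, the two more specific rules would never match.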
– - :
regexp_filter = (?i)j => 1
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 866 | 866 |
| 2 | 866 | 866 |
| 3 | 866 | 866 |
| 4 | 8666 | 8666 |
| 5 | 866 | 866 |
+------+-----------+------------+
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 7856745 | 7856745 |
| 2 | 7856745 | 7856745 |
| 3 | 786745 | 786745 |
| 4 | 7856745 | 7856745 |
+------+-----------+------------+
, QSUGGEST . .
mysql> select * from STREETS where match('Veelkaseem'); show meta;
+------+--------------------------------------+--------------+----------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+--------------+----------------------+
| 873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 | | 30 |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd | | 50 |
+------+--------------------------------------+--------------+----------------------+
2 rows in set (0.00 sec)
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.000 |
| keyword[0] | 78546 |
| docs[0] | 2 |
| hits[0] | 2 |
+---------------+-------+
, , - .
Soundex, , Soundex NYSIIS, CALL QSUGGEST, Sphinx , NYSIIS -. Soundex Daitch-Mokotoff Soundex, , , , 1286 , , - . :
mysql> call keywords(' ', 'STREETS', 0);
+------+------------+------------+
| qpos | tokenized | normalized |
+------+------------+------------+
| 1 | vorovskogo | v612 |
| 2 | verbovaja | v612 |
+------+------------+------------+
Soundex, :
mysql> call keywords(' ', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v9234 | v9234 |
| 2 | v9124 | v9124 |
+------+-----------+------------+
, . , Soundex:
mysql> select * from STREETS where match('');
+------+--------------------------------------+-----------+--------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------------------+
| 12 | 0278d3ee-4e17-4347-b128-33f8f62c59e0 | | |
+------+--------------------------------------+-----------+--------------------------+
.
QSUGGEST, . , . , – .
, , : Soundex . - , , - , , Sphinx.
, , , Soundex Daitch-Mokotoff - , . NYSIIS , , , .
sphinx-3.3.1, 2.1.1-beta, . Manticore. Manticore Search, . , , .
, . , .
P.S.
, . Metaphone . , , . :
-
????
PROFIT