Hello, Habr! Today's post is the final part of the series Clustering and Classification of Big Text Data Using Machine Learning in Java. It continues the first and second articles.
This article describes the system architecture, the algorithm, and the visual results. All theoretical and algorithmic details can be found in the first two articles.
The system architecture can be divided into two main parts: the web application, and the data clustering and classification software.
The machine learning software algorithm consists of 3 main parts:
1. natural language processing:
   - tokenization;
   - lemmatization;
   - stop-word removal;
   - word frequency;
2. clustering methods:
   - TF-IDF;
   - SVD;
   - finding cluster groups;
3. classification method: Aylien API.
Natural language processing
The algorithm starts by reading arbitrary text data. Since our system is an electronic library, most of the books are in PDF format. You can read about the implementation and the details of the NLP processing here.
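To make the preprocessing steps concrete, here is a minimal sketch of tokenization, stop-word removal, and word-frequency counting in plain Java. It is only an illustration under stated assumptions: the class name `TextPreprocessor`, the tiny `STOP_WORDS` set, and the "split on non-letters" rule are hypothetical and not taken from the project. Lemmatization is left out because it needs an external NLP library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TextPreprocessor {

    // Hypothetical, deliberately tiny stop list; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "of", "and", "to", "in", "is");

    // Lowercase the text, split on anything that is not a letter, drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Count how often each token occurs.
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }
}
```

Calling `TextPreprocessor.wordFrequency(TextPreprocessor.tokenize(text))` on the extracted text of a book would give the per-word counts that the later steps can build on.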
Below is a comparison of the results of running the lemmatization and stemming algorithms:
: 4173415 : 88547 : 82294
Text after lemmatization:
characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
Text after stemming:
character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
tf-idf . HashMap, - , - -.
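The text here mentions tf-idf together with a `HashMap`. As a rough illustration of how the two fit together, the sketch below keeps one `HashMap` per document, with the term as the key and its tf-idf weight as the value. The class name `TfIdf`, the method `compute`, and the specific weighting (relative term frequency multiplied by the natural log of the inverse document frequency) are assumptions; the project may well use a different variant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdf {

    // Returns one map per document: key = term, value = tf-idf weight.
    static List<Map<String, Double>> compute(List<List<String>> docs) {
        // Document frequency: in how many documents each term appears.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts inside this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```

With this weighting, a term that occurs in every document gets a weight of zero, which is the usual reason tf-idf pushes down words that do not help tell documents apart.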
, , tf-idf. :
-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997
SVD .
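For reference, the SVD step can be written in the standard truncated form. The symbols below are notation introduced here for illustration, not taken from the article: $A$ is the tf-idf matrix, $k$ is the number of components kept, and each object ends up described by $k$ coordinates. The three-column rows shown above would be consistent with $k = 3$, but that is an assumption.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
A \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{n \times k},
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0 .
```

The rows of $U_k \Sigma_k$ (or of $V_k \Sigma_k$, depending on whether rows of $A$ are documents or terms) are the low-dimensional points that a clustering method can then work on.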
, . โ , . OrientDB , OrientDB . OrientDB , , , . . .
โ . , , DBSCAN. . . r=0.007. 562 80.000 , . , .
max(D) โ , . n -
, . โ , โ
, . 4-. ( > nt)
Nโ - , S โ .
, .
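Since this part names DBSCAN and a value r=0.007, here is a compact, brute-force sketch of the textbook DBSCAN procedure in plain Java. It assumes that r plays the role of the neighborhood radius (`eps` below); the class name `Dbscan`, the `minPts` parameter, and the Euclidean distance are illustrative choices, not the project's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class Dbscan {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    // Returns a label per point: 0, 1, 2, ... for clusters, NOISE for outliers.
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may still become a border point later
                continue;
            }
            // Point i is a core point: grow a new cluster from it.
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = clusterId; // border point
                if (labels[j] != UNVISITED) continue;
                labels[j] = clusterId;
                List<Integer> jNeighbors = regionQuery(points, j, eps);
                if (jNeighbors.size() >= minPts) queue.addAll(jNeighbors); // j is also core
            }
            clusterId++;
        }
        return labels;
    }

    // Indices of all points within eps of points[idx], including idx itself.
    static List<Integer> regionQuery(double[][] points, int idx, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < points.length; k++) {
            if (distance(points[idx], points[k]) <= eps) result.add(k);
        }
        return result;
    }

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }
}
```

A call such as `Dbscan.cluster(points, 0.007, 3)` would label each row of the reduced matrix. The brute-force neighbor search is quadratic in the number of points, so for large collections a spatial index would be the more practical choice.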
Classification method - Aylien API
Aylien API . API json , . API . 9 , . POST API:
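import com.aylien.textapi.parameters.ClassifyByTaxonomyParams;
import com.aylien.textapi.responses.TaxonomyCategory;
import com.aylien.textapi.responses.TaxonomyClassifications;
import com.orientechnologies.orient.core.sql.executor.OResult;
import com.orientechnologies.orient.core.sql.executor.OResultSet;

// Fetch the text of every document in the cluster; the parameter avoids SQL injection.
String queryText = "select DocText from documents where clusters = ?";
try (OResultSet resultSet = database.query(queryText, cluster)) {
    while (resultSet.hasNext()) {
        OResult result = resultSet.next();
        // Read the property directly instead of parsing result.toString().
        Object docText = result.getProperty("DocText");
        String textDoc = String.valueOf(docText).replaceAll("[<>{}]", "").toLowerCase();
        keywords.add(textDoc.replace("\n", ""));
    }
}
// Send the collected text to Aylien for classification against the IAB QAG taxonomy.
ClassifyByTaxonomyParams.Builder classifyByTaxonomybuilder = ClassifyByTaxonomyParams.newBuilder();
classifyByTaxonomybuilder.setText(keywords.toString());
classifyByTaxonomybuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomybuilder.build());
// Keep the label of every category returned for this cluster.
for (TaxonomyCategory c : response.getCategories()) {
    clusterUpdate.add(c.getLabel());
}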
GET, :
- โ . , . - , . Vaadin Flow:
"Technology & Computing":
, . . , , . . . . : .
, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..
, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .
Aylien API, . , 100 . , , , k-, . , .