Clustering and Classifying Big Text Data with Machine Learning in Java. Article #3: Architecture / Results

Hello, Habr! Today is the final part of the series on Clustering and Classification of Big Text Data Using Machine Learning in Java. This article is a continuation of the first and second articles.









This article describes the system architecture, the algorithm, and the visual results. All the details of the theory and the algorithms can be found in the first two articles.









The system architecture can be divided into two main parts: the web application and the data clustering and classification software.









The algorithm of the machine learning software consists of 3 main parts:

  1. natural language processing;
    1. tokenization;
    2. lemmatization;
    3. stop list;
    4. word frequency;
  2. clustering methods;
    1. TF-IDF;
    2. SVD;
    3. finding cluster groups;
  3. classification method - Aylien API.
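The natural language processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the class name `NlpSketch`, the tiny stop list, and the splitting rule are all assumptions, and lemmatization is left out because it needs a dictionary-based library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class NlpSketch {
    // Hypothetical, deliberately tiny stop list; a real one is much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "of", "and", "to", "is", "in");

    // Tokenization: lower-case the text and split on anything that is not a letter.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Stop list: drop tokens that carry little meaning on their own.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    // Word frequency: count how often each remaining token occurs.
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The design space of rendered robot faces: a robot face is critical.";
        System.out.println(wordFrequency(removeStopWords(tokenize(text))));
    }
}
```

In a full pipeline a lemmatizer would sit between tokenization and the stop list, so that forms such as "faces" and "face" are counted together.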





Natural language processing

The algorithm starts by reading arbitrary text data. Since our system is an electronic library, most of the books are in PDF format. You can read about the implementation and the details of the NLP processing here.





Below is a comparison of running the lemmatization and stemming algorithms:





  : 4173415
    : 88547
    : 82294
      
      











Sample output after lemmatization:





characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
      
      



Sample output after stemming:





character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
      
      











tf-idf . HashMap, - , - -.
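As a rough illustration of computing tf-idf weights with a HashMap, here is a minimal sketch. The class name `TfIdfSketch` and the exact weighting (term count divided by document length, multiplied by the natural logarithm of the number of documents over the document frequency) are assumptions chosen for the example; the variant described in the earlier articles may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {
    // Returns one map per document: term -> tf-idf weight.
    // Assumed weighting: tf = count / document length, idf = ln(N / document frequency).
    static List<Map<String, Double>> tfIdf(List<List<String>> documents) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> doc : documents) {
            for (String term : new HashSet<>(doc)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> weights = new ArrayList<>();
        for (List<String> doc : documents) {
            // Raw term counts inside this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            // Combine term frequency and inverse document frequency.
            Map<String, Double> docWeights = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = entry.getValue() / (double) doc.size();
                double idf = Math.log(
                        documents.size() / (double) documentFrequency.get(entry.getKey()));
                docWeights.put(entry.getKey(), tf * idf);
            }
            weights.add(docWeights);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
                List.of("robot", "face", "design"),
                List.of("robot", "survey", "result"));
        System.out.println(tfIdf(documents));
    }
}
```

With this weighting a term that occurs in every document gets weight 0, which is why words shared by the whole corpus do not help to separate clusters.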










tf-idf:









, , tf-idf. :





-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997

      
      











SVD   .





, .  โ€“ , . OrientDB , OrientDB . OrientDB , , , . . .














โ€“ . , , DBSCAN. . . r=0.007. 562 80.000 , . , .





r = max(D) / n









   max(D)  โ€’ , . n -
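The radius formula r = max(D) / n can be sketched in code as below. This is only an illustration under stated assumptions: D is taken to be the set of pairwise Euclidean distances between the points in the reduced space, and n is taken to be the number of points. The class `RadiusSketch` and its method names are hypothetical. The `neighbours` method shows the kind of radius query that DBSCAN relies on.

```java
import java.util.ArrayList;
import java.util.List;

final class RadiusSketch {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // max(D): the largest distance over all pairs of points.
    static double maxDistance(double[][] points) {
        double max = 0.0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                max = Math.max(max, distance(points[i], points[j]));
            }
        }
        return max;
    }

    // r = max(D) / n, with n ASSUMED to be the number of points.
    static double radius(double[][] points) {
        return maxDistance(points) / points.length;
    }

    // Indices of all other points that lie within distance r of points[index].
    static List<Integer> neighbours(double[][] points, int index, double r) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (i != index && distance(points[index], points[i]) <= r) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The first three coordinate rows from the matrix shown earlier in the article.
        double[][] points = {
            {-0.0031139399383999997, 0.023330604746, -1.3650204652799997E-4},
            {-0.038380206566, 0.00104373247064, 0.056140327901},
            {-0.006980774822399999, 0.073057418689, -0.0035209342337999996}
        };
        double r = radius(points);
        System.out.println("r = " + r);
        System.out.println("neighbours of point 0: " + neighbours(points, 0, r));
    }
}
```

Computing all pairwise distances is quadratic in the number of points, so for a large collection the maximum would normally be estimated or computed once and cached.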






















, . 4-. ( > nt)





nt = N / S

Nโ€’ - , S โ€’ .














Classification method - Aylien API





Aylien API . API json , . API . 9 , . POST API:





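    // Fragment: `database`, `cluster`, `keywords`, `client` and `clusterUpdate` are defined elsewhere.
    // Collect the text of every document in the given cluster (parameterized query).
    String queryText = "select DocText from documents where clusters = ?";
    try (OResultSet resultSet = database.query(queryText, cluster)) {
        while (resultSet.hasNext()) {
            OResult result = resultSet.next();

            // Lower-case first so the "doctext:" label is matched regardless of its case,
            // then strip the record markup that toString() adds around the value.
            String textDoc = result.toString()
                    .toLowerCase()
                    .replaceAll("[<>{}|]", "")
                    .replaceAll("doctext:", "");
            keywords.add(textDoc.replaceAll("\\n", ""));
        }
    }

    // Classify the collected text against the IAB QAG taxonomy.
    ClassifyByTaxonomyParams.Builder classifyByTaxonomyBuilder = ClassifyByTaxonomyParams.newBuilder();
    classifyByTaxonomyBuilder.setText(keywords.toString());
    classifyByTaxonomyBuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
    TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomyBuilder.build());

    // Store the label of every category returned for the cluster.
    for (TaxonomyCategory c : response.getCategories()) {
        clusterUpdate.add(c.getLabel());
    }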

      
      







GET, :































Web application





The web application is built with Vaadin Flow:






















“Technology & Computing”:
































, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..





, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .





Aylien API, . , 100 . , , , k-, . , .







