Analysis of an article on how to extract meaning from embeddings

tl;dr: A simplified walkthrough of a paper in which the author offers two interesting theorems and, building on them, finds a way to extract hidden sense vectors from an embedding matrix. It includes a guide to reproducing the results. The notebook is available on github.



Introduction



In this article I want to talk about one remarkable thing discovered by the researcher Sanjeev Arora in the paper Linear Algebraic Structure of Word Senses, with Applications to Polysemy. It is one of a series of papers in which he tries to give a theoretical grounding to the properties of word embeddings. In this work, Arora assumes that simple embeddings such as word2vec or GloVe actually encode several senses of a single word, and he offers a way to recover them. Throughout the article I will try to stick to the original examples.



More formally, let $\upsilon_{tie}$ denote the embedding vector of the word tie, which can mean a knot or a necktie, or can be the verb "to tie". Arora suggests that this vector can be written as the following linear combination



$$\upsilon_{tie} \approx \alpha_1\upsilon_{tie_1} + \alpha_2\upsilon_{tie_2} + \alpha_3\upsilon_{tie_3} + \ldots$$



where $\upsilon_{tie_n}$ is one of the possible senses of the word tie and the $\alpha$ are coefficients. Let's try to figure out how this comes about.



Theory






A short note on Arora's theory



Since Arora's original work is considerably more involved than this, I have not prepared a complete review. Still, we will take a brief look at what it is like.



So, Arora proposes that any text is produced by a generative model. As the model runs, a word $w$ is generated at each time step $t$. The model consists of a context vector and the embedding vectors $\upsilon_w$.




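At time step $t$, the probability that the word $w$ is generated is given by



$$P(w|c_t) = \frac{1}{Z_c}\exp\langle c_t, \upsilon_w\rangle$$



where $c_t$ is the context vector at time $t$, $\upsilon_w$ is the embedding vector of the word $w$, and $Z_c = \sum_w \exp\langle c, \upsilon_w\rangle$ is the partition function.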
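
As an illustration of the log-linear formula above, here is a minimal NumPy sketch. The toy vocabulary and the random vectors are made up for the example and do not come from the paper; the only thing it shows is how the inner products are turned into a probability distribution by dividing by $Z_c$.

```python
import numpy as np

def word_probabilities(context, embeddings):
    """Return P(w | c) for every word: exp(<c, v_w>) / Z_c."""
    scores = embeddings @ context           # inner products <c, v_w>
    scores = scores - scores.max()          # shift for numerical stability
    unnormalized = np.exp(scores)
    return unnormalized / unnormalized.sum()    # division by Z_c

rng = np.random.default_rng(0)
vocab = ["tie", "knot", "suit", "match"]        # hypothetical toy vocabulary
embeddings = rng.normal(size=(len(vocab), 8))   # one random row per word
context = rng.normal(size=8)                    # random context vector c_t

probs = word_probabilities(context, embeddings)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

The word whose embedding has the largest inner product with the context vector gets the highest probability.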






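Theorem 1



Assume the generative model described above, and let $s$ be a window of $n$ words. Then there is a linear transformation $A$ such that



$$\upsilon_w \approx A\,\mathbb{E}\left[\frac{1}{n}\sum_{w_i \in s}\upsilon_{w_i} \,\middle|\, w \in s\right]$$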



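Take a word $w$ and the set $S$ of all windows that contain it. Averaging the window vectors $\upsilon_s$ over $s \in S$ gives a vector $u$. The theorem says that $u$ can be mapped to $\upsilon_w$ by the matrix $A$. Among other things, this can be used to obtain embeddings for out-of-vocabulary words.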



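For the window vectors, SIF embeddings are used. The SIF embedding $\upsilon_{SIF}$ of a text of $k$ words is the average of the embeddings of its words $w_n$, each weighted by its TF-IDF score:



$$\upsilon_{SIF} = \frac{1}{k}\sum_{n=1}^{k}\upsilon_n \cdot tf\_idf(w_n)$$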
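
The weighted average in the formula above can be sketched in a few lines of NumPy. The vectors and the TF-IDF weights below are hypothetical values chosen so that the arithmetic is easy to follow; in a real setting the weights would come from a TF-IDF model fitted on a corpus.

```python
import numpy as np

def weighted_average_embedding(word_vectors, weights):
    """Compute (1/k) * sum_n v_n * weight(w_n) for a text of k words."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    k = len(word_vectors)
    # scale each word vector by its weight, sum them, divide by the word count
    return (word_vectors * weights[:, None]).sum(axis=0) / k

vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])        # toy embeddings of a 3-word text
tfidf = np.array([0.5, 2.0, 1.0])       # hypothetical TF-IDF weights

v_sif = weighted_average_embedding(vectors, tfidf)
print(v_sif)   # [0.5 1. ]
```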






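Let us see how this works in practice. For words $w$ with embeddings $\upsilon_w$, the matrix $A$ is found as follows:



  1. Sample a set of paragraphs from a corpus; let $V$ be their vocabulary.
  2. For each $w \in V$, compute the SIF embedding of the 20-word window centered on every occurrence of $w$. This gives each $w \in V$ a sequence of vectors $(\nu_w^1, \nu_w^2, \ldots, \nu_w^n)$, where $n$ is the number of occurrences of $w$.
  3. Compute $u_w$, the average of the SIF embeddings, for each $w \in V$: $u_w = \frac{1}{n}\sum_{t=1}^{n}\nu_w^t$.
  4. Solve the regression $\arg\min_A \sum_w \|A u_w - \upsilon_w\|_2^2$.
  5. Take the embedding induced from the SIF vectors: $\upsilon_w = A u_w$.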
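
Steps 4 and 5 of the list above are an ordinary least-squares problem. The sketch below uses synthetic data in place of real SIF averages: `A_true`, `U` and `V` are all made up, and the only point is to show how the regression for $A$ can be solved with `numpy.linalg.lstsq` and then checked with cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_words = 5, 200

A_true = rng.normal(size=(d, d))     # hidden transformation (toy data only)
U = rng.normal(size=(n_words, d))    # rows play the role of the averages u_w
V = U @ A_true.T + 0.01 * rng.normal(size=(n_words, d))   # rows play v_w

# Step 4: min_A sum_w ||A u_w - v_w||^2.  In row form this is U @ A.T ~ V,
# so lstsq returns A transposed.
A_T, *_ = np.linalg.lstsq(U, V, rcond=None)
A = A_T.T

# Step 5: induced embeddings v_w ~ A u_w, compared with the targets.
induced = U @ A.T
cos = np.sum(induced * V, axis=1) / (
    np.linalg.norm(induced, axis=1) * np.linalg.norm(V, axis=1)
)
print("mean cosine similarity:", cos.mean())
```

In the experiment reported in the article, the same kind of average cosine similarity is what the table of results measures, although there $A$ is fitted on one part of the vocabulary and evaluated on another.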


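This was checked experimentally: $A$ is fitted on 1/3 of the words and evaluated on the remaining 2/3. The results are shown in the table below.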



#paragraphs       250k    500k    750k    1 million
cos similarity    0.94    0.95    0.96    0.96


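Theorem 2



Suppose a word $w$ has two senses, $s_1$ and $s_2$, and let $\upsilon_w$ be its embedding. Imagine that every occurrence of the word in the corpus is replaced by a sense-specific token, for example tie_1 or tie_2, one for each sense.

Training on such a corpus would give the sense embeddings $\upsilon_{s_1}$ and $\upsilon_{s_2}$. The theorem states that $\|\upsilon_w - \bar{\upsilon}_w\| \to 0$, where



$$\bar{\upsilon}_w = \frac{f_1}{f_1+f_2}\upsilon_{s_1} + \frac{f_2}{f_1+f_2}\upsilon_{s_2} = \alpha\upsilon_{s_1} + \beta\upsilon_{s_2}$$



and $f_1$ and $f_2$ are the numbers of occurrences of $s_1$ and $s_2$ in the corpus.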
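
To make the frequency weighting concrete, here is a tiny worked example. The two sense vectors and the counts 300 and 100 are invented for illustration: with these counts the coefficients are 0.75 and 0.25, so the combined vector lies closer to the more frequent sense.

```python
import numpy as np

def combine_senses(sense_vectors, frequencies):
    """Return the coefficients f_i / sum(f) and the weighted sum of sense vectors."""
    freqs = np.asarray(frequencies, dtype=float)
    coeffs = freqs / freqs.sum()
    return coeffs, coeffs @ np.asarray(sense_vectors, dtype=float)

v_s1 = np.array([1.0, 0.0])   # hypothetical vector of sense s1
v_s2 = np.array([0.0, 1.0])   # hypothetical vector of sense s2

coeffs, v_w = combine_senses([v_s1, v_s2], [300, 100])
print(coeffs)   # [0.75 0.25]
print(v_w)      # [0.75 0.25]
```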






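How can this be done? Suppose we are given a set of vectors in $d$ dimensions and two integers $k$ and $m$ with $k < m$. We want to find vectors $A_1, A_2, \ldots, A_m$ such that



$$\upsilon_w = \sum_{j=1}^{m}\alpha_{w,j}A_j + \mu_w$$



where at most $k$ of the coefficients $\alpha_{w,j}$ are nonzero and $\mu_w$ is an error vector. This amounts to minimizing



$$\sum_w \left\|\upsilon_w - \sum_{j=1}^{m}\alpha_{w,j}A_j\right\|_2^2$$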



Here $k$ is the sparsity parameter and $m$ is the number of so-called atoms. This problem can be solved with the k-SVD algorithm.
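
To give a feel for the sparse-coding half of the problem, below is a small sketch of orthogonal matching pursuit in plain NumPy. It is not k-SVD itself: the dictionary here is random and fixed, whereas k-SVD also updates the atoms. The function name `omp` and all the toy sizes are assumptions made for this example; it only shows how a vector is approximated with at most $k$ atoms.

```python
import numpy as np

def omp(dictionary, x, k):
    """Greedy sparse coding: approximate x with at most k rows of `dictionary`."""
    residual = x.copy()
    support = []
    coefs = np.zeros(0)
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(dictionary @ residual)))
        if j in support:
            break
        support.append(j)
        # refit the coefficients of all selected atoms by least squares
        atoms = dictionary[support].T            # shape (d, len(support))
        coefs, *_ = np.linalg.lstsq(atoms, x, rcond=None)
        residual = x - atoms @ coefs
    alpha = np.zeros(dictionary.shape[0])
    alpha[support] = coefs
    return alpha

rng = np.random.default_rng(0)
m, d, k = 50, 20, 3
D = rng.normal(size=(m, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm atoms A_1..A_m

true_alpha = np.zeros(m)
true_alpha[[4, 17, 31]] = [1.5, -2.0, 0.7]
x = true_alpha @ D                               # a vector built from 3 atoms

alpha = omp(D, x, k)
print("atoms used:", np.flatnonzero(alpha))
print("reconstruction error:", np.linalg.norm(x - alpha @ D))
```

The returned `alpha` plays the role of one row of coefficients $\alpha_{w,j}$: it has at most $k$ nonzero entries, and `alpha @ D` is the sparse approximation of the input vector.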








import numpy as np

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')


1. Gensim

We load the GloVe embeddings, specifically the 300-dimensional ones.



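# Convert the GloVe text file to word2vec format so that gensim can load it.
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)


# Written for gensim 3.x; in gensim 4 `index2word` is called `index_to_key`.
embeddings = model.wv

index2word = embeddings.index2word   # list of words, row-aligned with the vectors
embedds = embeddings.vectors         # embedding matrix, one row per word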


print(embedds.shape)


(400000, 300)


The vocabulary holds 400000 words.



2. k-svd

For this we use the ksvd package.



!pip install ksvd
from ksvd import ApproximateKSVD


Requirement already satisfied: ksvd in /home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3)
Requirement already satisfied: numpy in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5)
Requirement already satisfied: scikit-learn in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1)


We set the number of atoms to 2000 and the sparsity to 5.

Note: the dictionary loaded below was fitted on 10000 embeddings.



%%time
# %%time (cell magic) times the whole cell; a bare %time line times nothing.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors
# fit the dictionary of atoms, then sparse-code every embedding against it
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
# .files lists the stored array names; keys() is not indexable in recent NumPy versions
dictionary = dictionary[dictionary.files[0]]


#print(gamma.shape)
print(dictionary.shape)


(2000, 300)


#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)
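
Besides the dictionary, the sparse codes in `gamma` can be inspected directly. The sketch below assumes that `gamma` has one row per word and one column per atom, which is how I understand the output of `transform`; the helper `atoms_of_word` and the toy row are hypothetical. It lists the atoms that have a nonzero coefficient for a word, largest magnitude first.

```python
import numpy as np

def atoms_of_word(gamma_row):
    """Return (atom_index, coefficient) pairs with nonzero coefficient,
    sorted by decreasing absolute value."""
    idx = np.flatnonzero(gamma_row)
    order = np.argsort(-np.abs(gamma_row[idx]))
    return [(int(i), float(gamma_row[i])) for i in idx[order]]

# hypothetical sparse code of one word over a 6-atom dictionary
toy_gamma_row = np.array([0.0, 0.8, 0.0, -0.3, 0.0, 1.2])

word_atoms = atoms_of_word(toy_gamma_row)
print(word_atoms)   # [(5, 1.2), (1, 0.8), (3, -0.3)]
```

With the real matrices one would pass `gamma[index2word.index('tie')]` and then look up each returned atom with `similar_by_vector`.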


3.






embeddings.similar_by_vector(dictionary[1354,:])


[('slave', 0.8417330980300903),
 ('slaves', 0.7482961416244507),
 ('plantation', 0.6208109259605408),
 ('slavery', 0.5356900095939636),
 ('enslaved', 0.4814416170120239),
 ('indentured', 0.46423888206481934),
 ('fugitive', 0.4226764440536499),
 ('laborers', 0.41914862394332886),
 ('servitude', 0.41276970505714417),
 ('plantations', 0.4113745093345642)]


embeddings.similar_by_vector(dictionary[1350,:])


[('transplant', 0.7767853736877441),
 ('marrow', 0.699995219707489),
 ('transplants', 0.6998592615127563),
 ('kidney', 0.6526087522506714),
 ('transplantation', 0.6381147503852844),
 ('tissue', 0.6344675421714783),
 ('liver', 0.6085026860237122),
 ('blood', 0.5676015615463257),
 ('heart', 0.5653558969497681),
 ('cells', 0.5476219058036804)]


embeddings.similar_by_vector(dictionary[1546,:])


[('commons', 0.7160810828208923),
 ('house', 0.6588335037231445),
 ('parliament', 0.5054076910018921),
 ('capitol', 0.5014163851737976),
 ('senate', 0.4895153343677521),
 ('hill', 0.48859673738479614),
 ('inn', 0.4566132128238678),
 ('congressional', 0.4341348707675934),
 ('congress', 0.42997264862060547),
 ('parliamentary', 0.4264637529850006)]


embeddings.similar_by_vector(dictionary[1850,:])


[('okano', 0.2669774889945984),
 ('erythrocytes', 0.25755012035369873),
 ('windir', 0.25621023774147034),
 ('reapportionment', 0.2507009208202362),
 ('qurayza', 0.2459488958120346),
 ('taschen', 0.24417680501937866),
 ('pfaffenbach', 0.2437630295753479),
 ('boldt', 0.2394050508737564),
 ('frucht', 0.23922981321811676),
 ('rulebook', 0.23821482062339783)]


Now let us look at the atoms nearest to the words "tie" and "spring".



itie = index2word.index('tie')
ispring = index2word.index('spring')

tie_emb = embedds[itie]
string_emb = embedds[ispring]


simlist = []

# scipy's cosine() is a distance (1 - similarity), so smaller means closer
for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, tie_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
# indices of the 15 atoms nearest to "tie"
six_atoms_ind = [ins[1] for ins in simlist[:15]]

# print the nearest words of each of those atoms
for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring
Atom #58: game play match matches games played playing tournament players stadium
Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2
Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held
Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define
Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals
Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions
Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans
Atom #281: black wearing man pair white who girl young woman big
Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes
Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled
Atom #98: first team start final second next time before test after
Atom #1455: after later before when then came last took again but
Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed
Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore
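
The search loop used above computes one cosine distance per atom in Python. The same ranking can be obtained with a single matrix product; the sketch below is one possible vectorized version, tried on a made-up two-dimensional dictionary. The name `nearest_atoms` is hypothetical, and the function works on plain NumPy arrays such as `dictionary` and `tie_emb`.

```python
import numpy as np

def nearest_atoms(dictionary, word_vector, top_n=15):
    """Return indices of the top_n atoms with the highest cosine similarity
    to word_vector, together with all similarities."""
    atom_norms = np.linalg.norm(dictionary, axis=1)
    sims = dictionary @ word_vector / (atom_norms * np.linalg.norm(word_vector))
    return np.argsort(-sims)[:top_n], sims

toy_dictionary = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0],
                           [-1.0, 0.0]])     # four toy atoms
toy_word = np.array([2.0, 0.1])              # toy word vector

idx, sims = nearest_atoms(toy_dictionary, toy_word, top_n=2)
print(idx)   # [0 2]
```

Sorting by decreasing similarity gives the same order as sorting by increasing cosine distance.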


simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:15]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre
Atom #1335: last ago year months years since month weeks week has
Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing
Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture
Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral
Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june
Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony
Atom #98: first team start final second next time before test after
Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday
Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic





Now let us try the same for Russian. We take fastText embeddings from RusVectores, of dimension 300.



fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')


embeddings = fasttext_model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors


embedds.shape


(164996, 300)


%%time
# %%time (cell magic) times the whole cell; a bare %time line times nothing.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
# only the first 10000 embeddings are used here
embedding_trans = embeddings.vectors[:10000]
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
# .files lists the stored array names; keys() is not indexable in recent NumPy versions
dictionary = dictionary[dictionary.files[0]]


embeddings.similar_by_vector(dictionary[1024,:], 20)


[('', 0.6854609251022339),
 ('', 0.6593252420425415),
 ('', 0.6360634565353394),
 ('', 0.5998549461364746),
 ('', 0.5971367955207825),
 ('', 0.5862340927124023),
 ('', 0.5788886547088623),
 ('', 0.5788123607635498),
 ('', 0.5623885989189148),
 ('', 0.5610565543174744),
 ('', 0.5551878809928894),
 ('', 0.551397442817688),
 ('', 0.5356274247169495),
 ('', 0.531707227230072),
 ('', 0.5174376368522644),
 ('', 0.5131562948226929),
 ('', 0.5120065212249756),
 ('', 0.5077806115150452),
 ('', 0.5074601173400879),
 ('', 0.5068254470825195)]


embeddings.similar_by_vector(dictionary[1582,:], 20)


[('', 0.45191124081611633),
 ('', 0.4515378475189209),
 ('', 0.4478364586830139),
 ('', 0.4280813932418823),
 ('', 0.41220104694366455),
 ('', 0.40772825479507446),
 ('', 0.4047147035598755),
 ('', 0.4030646085739136),
 ('', 0.39368513226509094),
 ('', 0.39012178778648376),
 ('', 0.3866344690322876),
 ('', 0.37968817353248596),
 ('', 0.3728911876678467),
 ('', 0.3663109242916107),
 ('', 0.3640827238559723),
 ('', 0.3474290072917938),
 ('', 0.3473641574382782),
 ('', 0.3468908369541168),
 ('', 0.34586742520332336),
 ('', 0.34555742144584656)]


embeddings.similar_by_vector(dictionary[500,:], 20)


[('', 0.6874514222145081),
 ('-', 0.5172050595283508),
 ('', 0.46720415353775024),
 ('', 0.44713956117630005),
 ('', 0.4144558310508728),
 ('', 0.40545403957366943),
 ('', 0.4030636250972748),
 ('-', 0.4016447067260742),
 ('', 0.38331469893455505),
 ('', 0.37292781472206116),
 ('', 0.3625457286834717),
 ('', 0.35121074318885803),
 ('', 0.3504621088504791),
 ('', 0.34097471833229065),
 ('', 0.33320850133895874),
 ('', 0.3277249336242676),
 ('', 0.3266661763191223),
 ('', 0.31865227222442627),
 ('::', 0.30150306224823),
 ('', 0.2975207567214966)]


itie = index2word.index('')
ispring = index2word.index('')

tie_emb = embedds[itie]
string_emb = embedds[ispring]


simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #185:          
Atom #1217:         - 
Atom #1213:          
Atom #1978:          
Atom #1796:          
Atom #839:          
Atom #989:          
Atom #414:          
Atom #1140:       -   
Atom #878:          


simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, tie_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #883:          -
Atom #40:          
Atom #215:          
Atom #688:          
Atom #386:          
Atom #676:          
Atom #414:          
Atom #127:          
Atom #592:          
Atom #703:    - -     


#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
#np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)





UPD: knagaev .



