Collecting training data for NLP tasks

Choosing the source and implementation tools

As the source of information I decided to use habr.com, a collective blog with elements of a news site (it publishes news, analytical articles, and articles on information technology, business, the internet, and so on). On this resource all materials are divided into categories (hubs), of which the main ones alone number 416. Each item can belong to one or more categories.

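The collection (parsing) code is written in Python. The development environment is a Jupyter notebook on Google Colab. Main libraries:

BeautifulSoup – parsing html / xml;
Requests – sending http requests;
Re – regular expressions;
Pandas – storing and processing tabular data.

tqdm and ratelim were also used (for the progress bar and for limiting the request rate).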

First, the base URL of the posts and the number of posts to scan are set:

mainUrl = 'https://habr.com/ru/post/'
postCount = 10000

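Each post is requested and parsed in a separate function. The request is wrapped in try… except so that HTTP errors raised by requests are handled: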

import re
import pandas as pd
import ratelim
import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

@ratelim.patient(1, 1)  # at most 1 call per 1 second; extra calls wait
def get_post(postNum):
    currPostUrl = mainUrl + str(postNum)
    try:
        response = requests.get(currPostUrl)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        # Row: post number, URL, then the fields returned by executePost.
        dataList = [postNum, currPostUrl, *executePost(response)]
        habrParse_df.loc[len(habrParse_df)] = dataList
    except requests.exceptions.HTTPError:
        # Missing or inaccessible posts are skipped.
        pass

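The decorator limits the request rate: ratelim.patient(1, 1) allows one call per second and makes extra calls wait. Inside try, the page is requested and, if the request succeeds, parsed and added to the table; posts that return an HTTP error are skipped.

executePost is the function that extracts the required fields from the page.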
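To make the behaviour of a "patient" rate limit concrete, here is a minimal standard-library sketch of a decorator that sleeps rather than fails when the call budget is used up. It is only an illustration of the idea, not ratelim's actual implementation; the names patient_limit and fetch are hypothetical, and the short 0.05-second period is chosen only so the demo finishes quickly.

```python
import functools
import time


def patient_limit(max_calls, period):
    """Allow at most max_calls per period (seconds); sleep instead of failing."""
    def decorator(func):
        call_times = []  # monotonic timestamps of recent calls

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Forget calls that have left the sliding window.
            while call_times and now - call_times[0] >= period:
                call_times.pop(0)
            if len(call_times) >= max_calls:
                # Wait until the oldest remembered call leaves the window.
                time.sleep(period - (now - call_times[0]))
                call_times.pop(0)
            call_times.append(time.monotonic())
            return func(*args, **kwargs)

        return wrapper
    return decorator


@patient_limit(1, 0.05)  # hypothetical demo: 1 call per 0.05 s
def fetch(post_num):
    return post_num


start = time.monotonic()
results = [fetch(n) for n in range(3)]
elapsed = time.monotonic() - start
```

Three calls under this limit take at least two waiting periods, which is why a run over thousands of posts at one request per second takes hours rather than minutes.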

def executePost(page):
    soup = bs(page.text, 'html.parser')
    # Post title
    title = soup.find('meta', property='og:title')
    title = str(title).split('="')[1].split('" ')[0]
    # Post body (kept as HTML, newlines replaced with spaces)
    post = str(soup.find('div', id="post-content-body"))
    post = re.sub('\n', ' ', post)
    # Number of comments
    num_comment = soup.find('span', id='comments_count').text
    num_comment = int(re.sub('\n', '', num_comment).strip())
    # Info panel with the post statistics
    info_panel = soup.find('ul', attrs={'class' : 'post-stats post-stats_post js-user_'})
    # Post rating: neutral, positive or negative counter
    try:
        rating = int(info_panel.find('span', attrs={'class' : 'voting-wjt__counter js-score'}).text)
    except (AttributeError, ValueError):
        rating = info_panel.find('span', attrs={'class' : 'voting-wjt__counter voting-wjt__counter_positive js-score'})
        if rating:
            # Strip the leading plus sign
            rating = int(re.sub(r'\+', '', rating.text))
        else:
            rating = info_panel.find('span', attrs={'class' : 'voting-wjt__counter voting-wjt__counter_negative js-score'}).text
            rating = - int(re.sub('–', '', rating))
    # Upvotes and downvotes, taken from the counter's title attribute.
    # The separator was lost in the source; 'и' (Cyrillic "and") is assumed.
    vote = info_panel.find_all('span')[0].attrs['title']
    rating_upVote = int(vote.split(':')[1].split('и')[0].strip().split('↑')[1])
    rating_downVote = int(vote.split(':')[1].split('и')[1].strip().split('↓')[1])
    # Number of bookmarks
    bookmk = int(info_panel.find_all('span')[1].text)
    # Number of views (kept as text)
    views = info_panel.find_all('span')[3].text
    return title, post, num_comment, rating, rating_upVote, rating_downVote, bookmk, views
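The chained split calls in executePost depend on the exact wording of the counter's title attribute. Below is a minimal sketch of a more tolerant alternative using regular expressions. It assumes only that the up and down counts follow the ↑ and ↓ arrows, as the split calls imply; parse_votes, parse_rating and the sample strings are hypothetical and not part of the original code.

```python
import re

# Up count after "↑", down count after "↓"; the text between them is ignored.
VOTES_RE = re.compile(r"↑\s*(\d+).*?↓\s*(\d+)")


def parse_votes(title_text):
    """Return (upvotes, downvotes), or None if the pattern is not found."""
    match = VOTES_RE.search(title_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_rating(text):
    """Convert '+12', '–3' (en dash) or '0' to a signed integer."""
    cleaned = text.strip().replace("–", "-").replace("−", "-")
    return int(cleaned)


# Made-up sample strings in the assumed format.
sample_votes = parse_votes("Всего голосов 27: ↑25 и ↓2")
sample_ratings = [parse_rating(s) for s in ("+12", "–3", "0")]
```

Because int accepts a leading plus sign, only the dash variants need normalizing before conversion.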

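The BeautifulSoup object is created with soup = bs(page.text, 'html.parser'). The required elements are then located with find / find_all (by tag name and html attributes). Which tags and attributes to look for is determined by inspecting the html code of the page.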
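As an illustration of find on a small, self-contained page, the sketch below extracts the title, text and comment count from a made-up HTML fragment. The fragment only mimics the tags and ids used in executePost and is not real habr.com markup; extract_basic_fields is a hypothetical helper. Unlike executePost, it reads the og:title value through the tag's content attribute and returns the post as plain text rather than HTML.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the tags and ids searched for in executePost.
SAMPLE_HTML = """
<html><head><meta property="og:title" content="Sample post title"></head>
<body>
<div id="post-content-body">First paragraph.
Second paragraph.</div>
<span id="comments_count">
  12
</span>
</body></html>
"""


def extract_basic_fields(html):
    soup = BeautifulSoup(html, "html.parser")
    # Attribute access avoids splitting the tag's string representation.
    meta = soup.find("meta", property="og:title")
    title = meta["content"] if meta is not None else None
    # get_text drops the markup and trims surrounding whitespace.
    body = soup.find("div", id="post-content-body")
    text = body.get_text(" ", strip=True) if body is not None else None
    counter = soup.find("span", id="comments_count")
    num_comments = int(counter.get_text(strip=True)) if counter is not None else None
    return title, text, num_comments


title, text, num_comments = extract_basic_fields(SAMPLE_HTML)
```

Checking each find result for None before using it keeps a single missing element from raising AttributeError for the whole page.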

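The function is called in a loop over post numbers (identifiers); in this case, the first 10 thousand. tqdm displays the progress.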

for pc in tqdm(range(postCount)):
    postNum = pc + 1
    get_post(postNum)

The collected rows are held in a pandas DataFrame.
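The definition of habrParse_df is not shown in the code above. A minimal sketch of how such a table might be created, filled and saved follows. The column names are assumptions derived from the variable names in get_post, the sample row is made up, and writing to an in-memory buffer stands in for writing to a file.

```python
import io

import pandas as pd

# Assumed column names, in the same order as dataList in get_post.
COLUMNS = [
    "postNum", "url", "title", "post", "numComment",
    "rating", "ratingUp", "ratingDown", "bookMark", "views",
]

habrParse_df = pd.DataFrame(columns=COLUMNS)

# Made-up row, appended the same way get_post does it.
habrParse_df.loc[len(habrParse_df)] = [
    1, "https://habr.com/ru/post/1", "Sample title", "<div>Sample text</div>",
    3, 5, 6, 1, 2, "1.2k",
]

# Serialize to CSV; a file path could be passed to to_csv instead of a buffer.
buffer = io.StringIO()
habrParse_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Appending with loc is convenient for a one-off script; for much larger runs, collecting rows in a list and building the DataFrame once at the end is usually faster.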

As a result, I obtained a dataset containing the text of articles from habr.com along with additional information: title, link to the article, number of comments, rating, number of bookmarks, and number of views.

In the future, the resulting dataset can be enriched with additional data and used for training various language models, text classification, and so on.
