Morphological analysis, word analysis, and word clouds

Environment setup

This program can be run in Google Colaboratory.

Install the libraries the program needs.

!pip install janome
!pip install wikipedia
!pip install japanize-matplotlib
!apt-get -y install fonts-ipafont-gothic

(1) Check how various sentences are split into words (wakati-gaki).

The sentence assigned to bunsyou is split into tokens. The commands below can be run in Google Colaboratory.

- bunsyou=""#@param{type:"string"}

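# Split the sentence into tokens (wakati-gaki)
from janome.tokenizer import Tokenizer
t = Tokenizer()
bunsyou=""#@param{type:"string"}
# Print each token: its surface form followed by its part of speech and other fields
for token in t.tokenize(bunsyou):
    print(token)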
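
To make the printed output easier to read: Janome's documentation shows each token printed as the surface form, a tab, and then comma-separated fields that begin with the part of speech. The sketch below splits one such line using only the standard library. The helper name `parse_token_line` and the sample line are assumptions for illustration and are not part of the original notebook.

```python
def parse_token_line(line):
    """Split one printed Janome token line into (surface, fields)."""
    # The surface form comes before the tab; the remaining fields are comma-separated.
    surface, _, rest = line.partition("\t")
    fields = rest.split(",")
    return surface, fields

# Sample line in the format shown in Janome's documentation (assumed here)
sample = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ"
surface, fields = parse_token_line(sample)
print(surface)    # surface form
print(fields[0])  # top-level part of speech
print(fields[6])  # base form
```

In the notebook itself the same information is available directly as attributes such as `token.surface` and `token.part_of_speech`, so parsing the printed string is only useful for understanding the output.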

(2) Load the text of a Wikipedia page related to a specified keyword.

Wikipedia is searched for pages related to the word given in keyword, and the page selected by keyword_index from the search results is loaded.

# Load text from Wikipedia
import wikipedia
keyword = ""#@param{type:"string"}
wikipedia.set_lang("ja")
search_response = wikipedia.search(keyword)
# List the candidate pages together with their index numbers
for i, sr in enumerate(search_response):
    print(i, sr)
# Index of the page to load; 0 (the first result) is only a placeholder default
keyword_index=0#@param{type:"number"}
print('Page to load:',search_response[keyword_index])
page_data = wikipedia.page(search_response[keyword_index])
docs=page_data.content
docs
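
Indexing `search_response` with keyword_index fails with an unhelpful error when the search returns nothing or the index is out of range. A minimal sketch of a guard follows; the function name `choose_title` and the sample result list are hypothetical and stand in for the real search results.

```python
def choose_title(results, index):
    """Return results[index], raising a clear error if the choice is invalid."""
    if not results:
        raise ValueError("the search returned no pages; try another keyword")
    if not 0 <= index < len(results):
        raise IndexError(f"keyword_index must be between 0 and {len(results) - 1}")
    return results[index]

# Hypothetical list standing in for wikipedia.search(keyword)
example_results = ["形態素解析", "自然言語処理", "わかち書き"]
print(choose_title(example_results, 1))
```

In the notebook, `choose_title(search_response, keyword_index)` could replace the direct `search_response[keyword_index]` lookup.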

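(3) Visualize how often words of a specified part of speech occur, trying different parts of speech (noun, verb, adjective, adverb, and so on). You also specify how many words to display.

From the text in docs, the words whose part of speech matches hinshi are collected, and the c_kosuu most frequent ones are shown as a bar chart.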

# Count word frequencies
import collections
import re
import matplotlib.pyplot as plt
import japanize_matplotlib  # enables Japanese text in matplotlib labels
# Part of speech to extract; '\u540D\u8A5E' is 名詞 (noun)
hinshi='\u540D\u8A5E'#@param{type:"string"}
# Number of most frequent words to display
c_kosuu=30#@param{type:"integer"}
# Keep the surface form of every token whose part of speech starts with hinshi
word=[]
for token in t.tokenize(docs):
    if re.match(hinshi, token.part_of_speech):
        word.append(token.surface)
c=collections.Counter(word).most_common(c_kosuu)
values, counts = zip(*c)
print(values)
print(counts)
plt.bar(values,counts)
plt.show()
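
The filtering step relies on `re.match`, which only matches at the start of the part-of-speech string, so 動詞 (verb) does not also pick up 助動詞 (auxiliary verb). The sketch below shows this together with `Counter.most_common` on a few hand-written pairs. The sample data and the helper name `count_by_pos` are assumptions for illustration; real tokens come from Janome.

```python
import re
from collections import Counter

# Hypothetical (surface, part_of_speech) pairs in Janome's comma-separated style
tokens = [
    ("猫", "名詞,一般,*,*"),
    ("が", "助詞,格助詞,一般,*"),
    ("走る", "動詞,自立,*,*"),
    ("猫", "名詞,一般,*,*"),
    ("た", "助動詞,*,*,*"),
    ("犬", "名詞,一般,*,*"),
]

def count_by_pos(pairs, hinshi, top_n):
    """Count surfaces whose part-of-speech string starts with `hinshi`."""
    words = [surface for surface, pos in pairs if re.match(hinshi, pos)]
    return Counter(words).most_common(top_n)

nouns = count_by_pos(tokens, "名詞", 2)
verbs = count_by_pos(tokens, "動詞", 2)
print(nouns)  # the two most frequent nouns with their counts
print(verbs)  # only 走る; the auxiliary verb た is not matched
```

Note that if no token matches the chosen part of speech, `most_common` returns an empty list, and unpacking it with `zip(*c)` in the notebook cell raises an error.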

(4) Visualize the frequency of words of the specified part of speech as a word cloud.

# Visualize word frequencies (word cloud)
from wordcloud import WordCloud
text=" ".join(word)
fpath = '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf'
wordcloud = WordCloud(font_path=fpath,width=800, height=400).generate(text)
plt.figure(figsize=(40,30))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
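
The cell above joins the words with spaces and lets WordCloud count them again. As an alternative sketch, the counts can be computed explicitly and passed to `generate_from_frequencies`, which skips WordCloud's own text splitting. The function names `word_frequencies` and `draw_cloud` and the sample word list are assumptions for illustration. The wordcloud import is placed inside the function so that the counting part runs even where that library is not installed.

```python
from collections import Counter

def word_frequencies(words):
    """Return a {word: count} dictionary for a list of words."""
    return dict(Counter(words))

def draw_cloud(frequencies, font_path):
    """Build a word cloud from explicit frequencies (requires the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily; only needed when drawing
    cloud = WordCloud(font_path=font_path, width=800, height=400)
    return cloud.generate_from_frequencies(frequencies)

# Hypothetical stand-in for the `word` list built in step (3)
sample_words = ["猫", "犬", "猫", "鳥"]
freqs = word_frequencies(sample_words)
print(freqs)
```

In the notebook, `plt.imshow(draw_cloud(word_frequencies(word), fpath))` would then take the place of the `generate(text)` call.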

Health Sciences University of Hokkaido, Information Center
