n-gram and Co-occurrence Network (1)

Environment setup

This notebook can be run on Google Colaboratory.

Install the libraries required by this program.

#Environment setup
!pip install janome
!pip install wikipedia
!pip install japanize-matplotlib
!pip install nlplot

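(1) Load the text of a Wikipedia page related to the specified keyword.

The notebook searches Wikipedia for pages related to the word given in keyword, then loads the page selected by keyword_index from the search results.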

#Load text from Wikipedia
from janome.tokenizer import Tokenizer
t = Tokenizer()
import wikipedia
# Search keyword: "在宅医療" (home medical care)
keyword = "在宅医療"#@param{type:"string"}
wikipedia.set_lang("ja")
search_response = wikipedia.search(keyword)
# List the search results together with their index
for i, sr in enumerate(search_response):
    print(i, sr)
keyword_index=0#@param{type:"number"}
print('Page to load:',search_response[keyword_index])
page_data = wikipedia.page(search_response[keyword_index])
docs=page_data.content
# Remove line breaks and split the page into sentences at the full stop '。'
docT=docs.replace('\n','').split('。')
#Tokenize the loaded text, keeping nouns only
import re
word=[]
for d in docT:
    dd=""
    for token in t.tokenize(d):
        # part_of_speech starts with '名詞' for nouns
        if re.match('名詞', token.part_of_speech):
            dd+=" "+token.surface
    word.append(dd)
import pandas as pd
# columns must be a list; a set such as {"text"} is rejected by recent pandas
df=pd.DataFrame(word,columns=["text"])
print(df)
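
The sentence-splitting step in the cell above can be looked at on its own. The following is only a minimal sketch: the helper name split_sentences and the sample text are hypothetical, and unlike the cell above it also drops the empty fragment that split('。') leaves after the final full stop.

```python
def split_sentences(text):
    """Split Japanese text into sentences at the full stop '。'."""
    # Remove line breaks first, as the cell above does
    flat = text.replace('\n', '')
    # split() leaves an empty string after the final '。'; drop such fragments
    return [s for s in flat.split('。') if s.strip()]

# Hypothetical sample text, not taken from Wikipedia
sample = "在宅医療は自宅で行う医療である。\n訪問診療が含まれる。"
sentences = split_sentences(sample)
print(sentences)
```

Each element of the result corresponds to one row of df after noun extraction.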

(2) Visualize how often consecutive words (nouns) occur while varying n, where an n-gram is a run of n consecutive units.

The value of ngram is used as n.

#Display n-grams
import nlplot
npt = nlplot.NLPlot(df, target_col='text')
ngram=1#@param{type:"integer"}
npt.bar_ngram(
    ngram=ngram,
    top_n=20,
    width=800,
    height=600,)
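
To see what the bar chart is counting, the n-gram tally can be sketched in plain Python. This is an illustration under assumptions, not nlplot's implementation: the function name count_ngrams and the sample rows are hypothetical, and n-grams are counted within each row (sentence) only.

```python
from collections import Counter

def count_ngrams(docs, n):
    """Count runs of n consecutive words within each space-separated document."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        # Slide a window of length n over the word list
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical noun sequences in the same format as the 'text' column
sample_docs = [" 在宅 医療 患者", " 在宅 医療 医師", " 患者 医師"]
unigrams = count_ngrams(sample_docs, 1)
bigrams = count_ngrams(sample_docs, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

With n=1 this is a plain word count; raising n shows which nouns tend to appear next to each other.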

(3) Configure the stopwords (removal of words that occur too frequently).

The top_n most frequent words are removed.

#Stopword settings (removal of words that occur too frequently)
top_n=0#@param{type:"integer"}
stopwords = npt.get_stopword(top_n=top_n, min_freq=0)
print(stopwords)
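
The effect of top_n and min_freq can be sketched without nlplot. The code below assumes that the top_n most frequent words, plus any word occurring min_freq times or fewer, are treated as stopwords; that is a reading of the parameter names rather than a statement about the library's internals, and pick_stopwords and the sample rows are hypothetical.

```python
from collections import Counter

def pick_stopwords(docs, top_n, min_freq):
    """Return the top_n most frequent words plus words seen min_freq times or fewer."""
    counts = Counter(w for doc in docs for w in doc.split())
    common = {w for w, _ in counts.most_common(top_n)}
    rare = {w for w, c in counts.items() if c <= min_freq}
    return common | rare

# Hypothetical noun sequences in the same format as the 'text' column
sample_docs = [" 医療 在宅 患者", " 医療 在宅", " 医療 医師"]
stop = pick_stopwords(sample_docs, top_n=1, min_freq=0)
print(stop)
```

Under this assumption, top_n=0 with min_freq=0 yields no stopwords, so every noun stays in the analysis.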

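(4) Set the parameter (min_edge_frequency) used to compute co-occurring words (words that frequently appear together with a given keyword). As a guide, aim for a node_size (circles) of about 50 and an edge_size (lines) of about 100.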

#Co-occurrence settings
min_edge_frequency=2#@param{type:"integer"}
npt.build_graph(stopwords=stopwords,min_edge_frequency=min_edge_frequency)
# Show every row of the node and edge tables together with their shapes
display(
    npt.node_df.head(npt.node_df.shape[0]), npt.node_df.shape,
    npt.edge_df.head(npt.edge_df.shape[0]), npt.edge_df.shape
)
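
How min_edge_frequency changes the number of nodes and edges can be sketched as follows. This is a simplified illustration, not nlplot's code: it assumes two words co-occur when they appear in the same row (sentence), and cooccurrence_edges and the sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs, stopwords, min_edge_frequency):
    """Count word pairs sharing a row and keep pairs seen at least min_edge_frequency times."""
    edges = Counter()
    for doc in docs:
        # Unique non-stopword words of one row, sorted so each pair has one canonical order
        words = sorted(set(doc.split()) - set(stopwords))
        edges.update(combinations(words, 2))
    return {pair: c for pair, c in edges.items() if c >= min_edge_frequency}

# Hypothetical noun sequences in the same format as the 'text' column
sample_docs = [" 在宅 医療 患者", " 在宅 医療 医師", " 患者 医師"]
edges = cooccurrence_edges(sample_docs, stopwords=[], min_edge_frequency=2)
# Nodes are the words that still have at least one edge
nodes = {w for pair in edges for w in pair}
print(edges)
print(len(nodes), len(edges))
```

Raising min_edge_frequency discards weak pairs, so both counts shrink; lowering it has the opposite effect.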

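(5) While adjusting the parameter (min_edge_frequency), draw the co-occurrence network for the Wikipedia page on "在宅医療" (home medical care), and describe in sentences what you were able to read from the result. Then compare it with what you learn by reading the Wikipedia page directly, and describe in sentences the important content that could not be read from the co-occurrence network.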

#Draw the co-occurrence network
npt.co_network(
    title='Co-occurrence network',
    width=800,
    height=600,)

Health Sciences University of Hokkaido, Information Center
