Extracting characteristic words from articles collected with Fess using a Python implementation of tf-idf

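Preparation

First, install the Elasticsearch client for Python. The code below also uses janome for morphological analysis, so install it at the same time.

pip install elasticsearch janome

Next, have Fess crawl a website and collect articles.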

$ curl -XGET http://localhost:9201/_cat/indices/fess.20180701\*\?v
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   fess.20180701 uclmK-llQcafFyuecoaaaA   5   0        116           21     14.7mb         14.7mb

116 articles have been collected so far, so I'd like to use them to pick out the characteristic words of each article.

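What is tf-idf

tf-idf is one of the methods for evaluating the importance of words contained in a document, and is mainly used in fields such as information retrieval and topic analysis. tf-idf is computed from two measures: tf (Term Frequency) and idf (Inverse Document Frequency).
For topic analysis, Latent Dirichlet Allocation (LDA) is another well-known method, but tf-idf is easy to understand and easy to implement, so I'd like to try it first.
tf-idf can be split into tf and idf, and the tf-idf value is the product of the two. tf is the frequency with which a given word appears within a given document, and can be written as:
$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{s \in d} n_{s,d}}$$
Here n_{t,d} is the number of times word t appears in document d. The denominator is the total number of occurrences of all words in document d, and the numerator is the number of occurrences of word t in document d, so the ratio is the frequency of word t within document d.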

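idf can be written as follows, where the numerator N is the total number of documents and the denominator df(t) is the number of documents in which word t appears. idf is therefore an inverse document frequency: the more rarely a word appears across the documents, the larger its value. (The implementation below adds 1 to this value.)
$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$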

tf-idf multiplies these two values, so a word that appears many times in a document but rarely in other documents is treated as a topic of that document.
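As a quick check of the definitions above, the following minimal sketch computes tf, idf, and their product for hand-picked counts. The function names and the numbers are illustrative assumptions, not values from the Fess data, and the natural logarithm is assumed for idf.

```python
import math

def term_frequency(count_in_doc, total_words_in_doc):
    """tf: occurrences of the word divided by all word occurrences in the document."""
    return count_in_doc / total_words_in_doc

def inverse_document_frequency(num_docs, docs_with_word):
    """idf: log of (number of documents / number of documents containing the word)."""
    return math.log(num_docs / docs_with_word)

# Hypothetical counts: the word appears 3 times in a 10-word document,
# and in only 1 of 4 documents overall.
tf = term_frequency(3, 10)
idf_rare = inverse_document_frequency(4, 1)
# A word that appears in every document gets idf = log(1) = 0.
idf_common = inverse_document_frequency(4, 4)

print(tf * idf_rare)    # rare word: positive score
print(tf * idf_common)  # word found in every document: 0.0
```

With the plain log form, a word found in every document scores exactly 0, which is one reason an offset such as +1 is sometimes added to idf.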

Implementation in Python

The Python implementation turned out as follows.

from elasticsearch import Elasticsearch
import functools as f
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import UnicodeNormalizeCharFilter
from janome.tokenfilter import CompoundNounFilter, POSStopFilter, LowerCaseFilter
import math

es = Elasticsearch("http://127.0.0.1:9201")

def scroll(index, doc_type, query_body, page_size=100, scroll='2m'):
    """Yield (total_pages, page_counter, hits_in_page, page) for each page of results."""
    # Note: doc_type and an integer hits.total match older Elasticsearch versions;
    # from 7.x on, hits.total is an object with a 'value' field.
    page = es.search(index=index, doc_type=doc_type, scroll=scroll, size=page_size, body=query_body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    total_pages = math.ceil(scroll_size / page_size)
    page_counter = 0
    # Start scrolling
    while scroll_size > 0:
        # Number of hits returned by the most recent request
        scroll_size = len(page['hits']['hits'])
        if scroll_size == 0:
            break
        yield total_pages, page_counter, scroll_size, page
        # Get the next page, keeping the scroll context alive for the requested duration
        page = es.scroll(scroll_id=sid, scroll=scroll)
        page_counter += 1
        # Update the scroll ID
        sid = page['_scroll_id']

def wordToDicGen(tokenizer, char_filters, token_filters, stop_words):
    # Build the analyzer once and reuse it for every document
    analyzer = Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)

    def wordToDic(text):
        """Return a dict mapping each base form in text to its number of occurrences."""
        dic = dict()
        for token in analyzer.analyze(text):
            if token.base_form in stop_words:
                continue
            dic[token.base_form] = dic.get(token.base_form, 0) + 1
        return dic
    return wordToDic

class TfIdf():
    def __init__(self):
        # Per-instance state, so separate TfIdf objects do not share counts
        self.doc_num = 0            # number of documents registered
        self.word_counter = dict()  # word -> number of documents containing it

    def addDocument(self, wordCountDic):
        self.doc_num += 1
        for key in wordCountDic:
            self.word_counter[key] = self.word_counter.get(key, 0) + 1

    def answer(self, wordCountDic):
        ans = dict()
        # Total number of word occurrences in this document (0 for an empty document)
        document_words_sum = f.reduce(lambda x, y: x + y, wordCountDic.values(), 0)
        for word in wordCountDic:
            tf = self.tf(document_words_sum, wordCountDic[word])
            idf = self.idf(word)
            ans[word] = tf * idf
        return ans

    def tf(self, document_words_sum, word_num):
        return word_num / document_words_sum

    def idf(self, word):
        # Natural logarithm; the +1 keeps words found in every document from scoring 0
        return math.log(self.doc_num / self.word_counter[word]) + 1

char_filters = [UnicodeNormalizeCharFilter()]
tokenizer = Tokenizer()
# Drop symbols (記号), particles (助詞), and auxiliary verbs (助動詞)
token_filters = [CompoundNounFilter(), POSStopFilter(['記号', '助詞', '助動詞']), LowerCaseFilter()]
stop_words = set(["する", "れる", "こと", "いる", "行う", "できる", "場合", "の", "ない", "みる", "使う", "より", "なる"])
wordToDic = wordToDicGen(tokenizer, char_filters, token_filters, stop_words)


tfidf = TfIdf()
index = 'fess.20180701'
doc_type = 'doc'
query = { "query": { "match_all": {} }}
page_size = 30

# Pass 1: register every document so that the document frequencies are known
for total_pages, page_counter, page_items, page_data in scroll(index, doc_type, query, page_size=page_size):
    for data in page_data['hits']['hits']:
        tfidf.addDocument(wordToDic(data['_source']['content']))

page_size = 1

# Pass 2: print the 10 highest-scoring words for the first 10 documents
current_volume = 0
for total_pages, page_counter, page_items, page_data in scroll(index, doc_type, query, page_size=page_size):
    for data in page_data['hits']['hits']:
        print(data['_source']['url'])
        for k, v in sorted(tfidf.answer(wordToDic(data['_source']['content'])).items(), key=lambda x: -x[1])[:10]:
            print(k, v)
    current_volume += page_items
    if current_volume >= 10:
        break

Looking at each part in turn: data is read from Elasticsearch with the following function.

def scroll(index, doc_type, query_body, page_size=100, scroll='2m'):
    """Yield (total_pages, page_counter, hits_in_page, page) for each page of results."""
    # Note: doc_type and an integer hits.total match older Elasticsearch versions;
    # from 7.x on, hits.total is an object with a 'value' field.
    page = es.search(index=index, doc_type=doc_type, scroll=scroll, size=page_size, body=query_body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    total_pages = math.ceil(scroll_size / page_size)
    page_counter = 0
    # Start scrolling
    while scroll_size > 0:
        # Number of hits returned by the most recent request
        scroll_size = len(page['hits']['hits'])
        if scroll_size == 0:
            break
        yield total_pages, page_counter, scroll_size, page
        # Get the next page, keeping the scroll context alive for the requested duration
        page = es.scroll(scroll_id=sid, scroll=scroll)
        page_counter += 1
        # Update the scroll ID
        sid = page['_scroll_id']
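To see how this generator pages through results without a running Elasticsearch server, here is a minimal sketch that drives a similar loop against a stand-in client. `FakeClient`, `scroll_pages`, and the canned responses are hypothetical; they only mimic the response fields that the function reads (`_scroll_id`, `hits.total`, `hits.hits`) and are not part of the elasticsearch package.

```python
import math

class FakeClient:
    """Hypothetical stand-in that serves a fixed list of documents page by page."""
    def __init__(self, docs):
        self.docs = docs
        self.pos = 0
        self.size = 0

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.size]
        self.pos += len(hits)
        return {'_scroll_id': 'dummy', 'hits': {'total': len(self.docs), 'hits': hits}}

    def search(self, size, **kwargs):
        self.pos, self.size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

def scroll_pages(client, page_size):
    """Mirrors the scroll function above: yield pages until one comes back empty."""
    page = client.search(size=page_size, scroll='2m')
    total_pages = math.ceil(page['hits']['total'] / page_size)
    page_counter = 0
    while page['hits']['hits']:
        yield total_pages, page_counter, len(page['hits']['hits']), page
        page = client.scroll(scroll_id=page['_scroll_id'], scroll='2m')
        page_counter += 1

docs = [{'_source': {'content': 'doc %d' % i}} for i in range(7)]
page_sizes = [n for _, _, n, _ in scroll_pages(FakeClient(docs), page_size=3)]
print(page_sizes)  # [3, 3, 1]
```

Seven documents with a page size of 3 arrive as pages of 3, 3, and 1 hits, and the loop stops when the next page is empty.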

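Next, the following function runs morphological analysis with janome and builds a dictionary with each word as the key and the number of times it occurs as the value.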

def wordToDicGen(tokenizer, char_filters, token_filters, stop_words):
    # Build the analyzer once and reuse it for every document
    analyzer = Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)

    def wordToDic(text):
        """Return a dict mapping each base form in text to its number of occurrences."""
        dic = dict()
        for token in analyzer.analyze(text):
            if token.base_form in stop_words:
                continue
            dic[token.base_form] = dic.get(token.base_form, 0) + 1
        return dic
    return wordToDic
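The counting step itself does not depend on janome. As a minimal sketch, the same word-to-count dictionary can be built with `collections.Counter`; here a plain list of already-tokenized base forms stands in for the Analyzer output, and both the token list and the `count_words` name are illustrative assumptions.

```python
from collections import Counter

def count_words(base_forms, stop_words):
    """Count each base form, skipping stop words."""
    return Counter(w for w in base_forms if w not in stop_words)

# Hypothetical token stream standing in for the base forms janome would return.
tokens = ["原子力発電所", "する", "中国", "原子力発電所", "こと"]
counts = count_words(tokens, stop_words={"する", "こと"})
print(dict(counts))  # {'原子力発電所': 2, '中国': 1}
```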

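tf-idf is computed with the following class. Its member variables hold the number of documents and a dictionary of how many documents each word appears in, both of which are needed for idf. After addDocument has been called for every document, answer returns a dictionary with each word as the key and its tf-idf as the value.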

class TfIdf():
    def __init__(self):
        # Per-instance state, so separate TfIdf objects do not share counts
        self.doc_num = 0            # number of documents registered
        self.word_counter = dict()  # word -> number of documents containing it

    def addDocument(self, wordCountDic):
        self.doc_num += 1
        for key in wordCountDic:
            self.word_counter[key] = self.word_counter.get(key, 0) + 1

    def answer(self, wordCountDic):
        ans = dict()
        # Total number of word occurrences in this document (0 for an empty document)
        document_words_sum = f.reduce(lambda x, y: x + y, wordCountDic.values(), 0)
        for word in wordCountDic:
            tf = self.tf(document_words_sum, wordCountDic[word])
            idf = self.idf(word)
            ans[word] = tf * idf
        return ans

    def tf(self, document_words_sum, word_num):
        return word_num / document_words_sum

    def idf(self, word):
        # Natural logarithm; the +1 keeps words found in every document from scoring 0
        return math.log(self.doc_num / self.word_counter[word]) + 1
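The two-pass flow described above (register every document first, then score each one) can be sketched on toy data as follows. The three word-count dictionaries are made up for illustration, and `document_frequency` and `score` are hypothetical helper names; the idf here includes the same +1 offset as the class above.

```python
import math
from collections import Counter

# Hypothetical word counts for three tiny documents.
docs = [
    {"reactor": 3, "china": 1, "report": 1},
    {"malware": 4, "file": 2, "report": 1},
    {"camera": 2, "police": 1, "report": 1},
]

# Pass 1: count, for every word, how many documents contain it.
document_frequency = Counter()
for word_counts in docs:
    document_frequency.update(word_counts.keys())

def score(word_counts):
    """Pass 2: tf-idf per word, with idf = log(N / df) + 1."""
    total = sum(word_counts.values())
    n_docs = len(docs)
    return {
        word: (count / total) * (math.log(n_docs / document_frequency[word]) + 1)
        for word, count in word_counts.items()
    }

for word_counts in docs:
    ranked = sorted(score(word_counts).items(), key=lambda kv: -kv[1])
    print(ranked[0][0])  # the top-scoring word of each document
```

Because "report" appears in all three documents, its idf is only the +1 offset, so it ranks below the words that are specific to each document.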

With this in place, computing topic words with tf-idf for the articles collected by Fess gave the following. The results look reasonably plausible.

https://gigazine.net/news/20180706-china-ap1000-epr/
原子力産業 0.023317488109853554
遅れる 0.01886515867249546
ap1000 0.018473965793423773
新型原子炉 0.017488116082390166
原子力発電所 0.017488116082390166
wh 0.017488116082390166
東芝 0.017488116082390166
アメリカ 0.015243532167490574
中国 0.014023731901107834
最 0.013274455410293236
https://gigazine.net/news/20180706-virus-desides-mining-or-ransomware/
ファイル 0.03633063623601737
カスペルスキー 0.03265374682807244
マイニングマルウェア 0.026122997462457952
コンピューター 0.021134973450343948
侵入 0.019592248096843467
身代金 0.019592248096843467
暗号化 0.017287991110333946
ランサムウェア 0.016922932055221898
マルウェア 0.015427253535004832
攻撃 0.015087760100809063
https://gigazine.net/news/20180706-london-police-facial-recognition/
afs 0.03595993869441478
運用 0.024244676575199242
犯罪容疑者 0.023973292462943184
イギリス 0.019090283152333857
98% 0.01797996934720739
テスト 0.01797996934720739
ロンドン警視庁 0.01797996934720739
警察 0.016303568082312883
顔認識機能 0.01581388440795756
監視カメラ 0.014546805945119546
https://gigazine.net/news/20180706-mit-cheetah-progress/
ボストン・ダイナミクス 0.045936847833184545
...... 0.025717530837217364
障害物 0.02525171162947315
cheetah 0.024033452280351385
足 0.018582746117517982
飛び越える 0.017226317937444207
走行 0.017226317937444207
ジャンプ 0.016543522070547958
3世代 0.01515102697768389
抜群 0.01515102697768389
https://gigazine.net/news/20180706-customers-cancel-subscriptions-online-law/
サービス 0.026082951604456672
オンライン上 0.023532066221293925
提供 0.0195622137033425
明確 0.019038764427150325
カリフォルニア州 0.017649049665970442
cancel 0.017649049665970442
自動更新型 0.017649049665970442
提示 0.017649049665970442
had 0.017649049665970442
to 0.016061938498019724
https://gigazine.net/news/20180705-mystery-of-kentucky-meat-shower/
ハゲワシ 0.03448705409254564
肉 0.029324503159288345
gohde氏 0.022991369395030426
motherboard 0.01724352704627282
瓶 0.01724352704627282
降る 0.01583149139611712
謎 0.015215504663540665
仮説 0.015166162868770486
雨事件 0.013214519403227284
? 0.011876498424664378
https://gigazine.net/news/20180705-mario-macaroni-and-cheese/
マカロニ&チーズ 0.06726319468020031
マリオ 0.05691501088324641
恐竜 0.03767518086505781
スーパーマリオワールド 0.03104455139086168
2dアニメーション 0.03104455139086168
ヨッシー 0.025870459492384732
舌 0.025870459492384732
クッパ 0.025870459492384732
少年 0.023564545845250764
パッケージ 0.02093065603614323
https://gigazine.net/news/20180701-gigazine-manga/
描く 0.024681107815303693
こちら 0.021314858009726056
募集要項 0.01687809246714378
小物 0.016689166616697216
イラスト 0.01656713578847448
gigazineマンガ大賞 0.014635680832971138
gigazineシークレットクラブ 0.014600558020194318
最後 0.01387370921092746
1話 0.01387370921092746
事 0.013502473973715024
https://gigazine.net/news/20180705-kfc-red-hot-crispy-twister/
レッドホットクリスピー 0.040414592434020194
レッドホットツイスター 0.03730577763140326
ケンタッキー 0.0335512088288156
野菜 0.02249811226206441
食べる 0.02121156478528745
衣 0.018741336127382296
マヨソース 0.018741336127382296
骨 0.01865288881570163
サクサク 0.018300659361172145
辛口チキン 0.01554407401308469
https://gigazine.net/news/20180705-virtue-labeling/
ラベル付け 0.10242459318533002
プラシーボ効果 0.035625945455766965
ラベル 0.03525076400535431
研究 0.032018474274988856
寄付 0.027042079442640712
美徳ラベリング 0.02437724672237786
人 0.020651131333006677
もの 0.018290081388115286
効果 0.01818100961931576
ミラー氏 0.017812972727883482