I really like Shoten, and when I get home I watch recorded episodes endlessly. Temporarily freed from my natural language processing research, I took a break to laugh at ...
Chachara~ Charara **Cha** **Cha!!!**
inside-head
Please answer: (enter a Shoten-style answer)
Yamada-kun! Take one cushion from Sanyutei Enraku-san!
Puff! (Inspirational man)
・ A long-running entertainment program
・ Famous for its ogiri segment, in which professional rakugo storytellers give witty answers to a prompt
・ A good answer earns a cushion (zabuton); a flop or a rude answer gets cushions taken away
・ Collect 10 cushions to win a fabulous prize
When you enter a Shoten-style answer, the program predicts and displays:
・ Which respondent the answer most resembles
・ How many cushions it would earn
From the past broadcast contents for 2011 published by NTV ([http://www.ntv.co.jp/sho-ten/02_week/kako_2011.html](http://www.ntv.co.jp/sho-ten/02_week/kako_2011.html)), I recorded:
・ Answer
・ Respondent
・ Increase / decrease of cushions
Answers from anyone other than the six main respondents (announcer ogiri, young performers' ogiri, etc.) were excluded.
A total of 1773 answers were collected. Surprisingly, the counts did not differ much between respondents, at about 330 answers per person.
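As a rough sketch of how records like these might be organized, the snippet below keeps only the main respondents and counts answers per person. The `Answer` record type, its field names, and the sample rows are all hypothetical; the real dataset and its column names are not shown in this post.

```python
from collections import Counter
from typing import NamedTuple


class Answer(NamedTuple):
    """One hypothetical record: answer text, respondent, cushion change."""
    text: str
    speaker: str
    zabuton: int


def keep_main_respondents(records, main_speakers):
    """Drop answers given by anyone outside the main respondents."""
    return [r for r in records if r.speaker in main_speakers]


# Made-up placeholder rows, only to show the shape of the data.
records = [
    Answer("sample answer 1", "Enraku", -1),
    Answer("sample answer 2", "Koyuza", 0),
    Answer("sample answer 3", "Announcer", 1),
    Answer("sample answer 4", "Enraku", 2),
]
main = {"Enraku", "Koyuza", "Kikuo"}  # illustrative subset, not the full list of six

filtered = keep_main_respondents(records, main)
per_speaker = Counter(r.speaker for r in filtered)
print(per_speaker)
```

Counting per respondent like this is one way to check how evenly the answers are spread before training.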
I removed symbols and stray whitespace from each answer, stripped emoji using the emoji package, converted full-width alphanumerics to half-width with mojimoji, and lowercased the text.
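The cleaning steps can be sketched with the standard library alone. Note the assumptions: this version swaps in `unicodedata.normalize("NFKC", ...)` for mojimoji (NFKC also widens half-width katakana, which mojimoji with `kana=False` does not do) and it does not strip emoji. The function name `clean_answer` is made up for illustration.

```python
import re
import unicodedata


def clean_answer(text: str) -> str:
    """Strip URLs, whitespace, and ASCII symbols, then normalize width and case."""
    text = re.sub(r"https?://[!-~]+", "", text)     # drop URLs (ASCII run after the scheme)
    text = unicodedata.normalize("NFKC", text)      # full-width ASCII -> half-width (stand-in for mojimoji)
    text = re.sub(r"\s+", "", text)                 # drop all whitespace
    text = re.sub(r"[!-/:-@\[-`{-~]+", "", text)    # drop ASCII punctuation and symbols
    return text.lower()


print(clean_answer("Ｈｅｌｌｏ，　ＷＯＲＬＤ！ http://example.com/x"))
```

Normalizing before removing symbols means full-width punctuation is folded to ASCII first and then removed by the same pattern.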
I converted each answer into a 200-dimensional vector with Word2Vec. The words in an answer were vectorized using the Japanese Wikipedia entity vectors. (I had doubts about using a Wikipedia-trained model for Shoten answers, which are full of colloquial expressions, but I could not beat the convenience of a ready-made pretrained model.) The average of the word vectors in an answer was used as that answer's vector.
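The averaging step can be illustrated with a tiny example. This is only a sketch: it uses plain Python lists and a made-up 3-dimensional lookup table (`toy_vectors`) in place of the 200-dimensional pretrained model, and the helper name `average_vector` is hypothetical.

```python
def average_vector(words, vectors, dim):
    """Mean of the vectors of known words; zero vector if none are known."""
    total = [0.0] * dim
    known = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:      # out-of-vocabulary word: skip it
            continue
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    if known == 0:
        return total
    return [t / known for t in total]


# Toy 3-dimensional "embeddings" (made up; the post uses 200 dimensions).
toy_vectors = {
    "cushion": [1.0, 0.0, 2.0],
    "laugh": [3.0, 2.0, 0.0],
}
print(average_vector(["cushion", "laugh", "unknown"], toy_vectors, 3))
# -> [2.0, 1.0, 1.0]
```

Words missing from the model are skipped, so the divisor is the number of words actually found, not the length of the answer.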
I built the classifiers with random forests, which are attractive because they are cheap to compute. The parameters were tuned with GridSearchCV over the following ranges:
・ Maximum depth: 1 to 10
・ Number of decision trees: 1 to 1000
After the search, the estimator with the highest accuracy is extracted and saved with pickle.
gridsearch.py
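import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# kotae_vector (answer vectors), shoten (the dataset) and grid_param_mori() are not shown in this post.
grid_mori_speaker = GridSearchCV(RandomForestClassifier(), grid_param_mori(), cv=10, scoring='accuracy', verbose=3, n_jobs=-1)
grid_mori_speaker.fit(kotae_vector, shoten.speaker)
# Keep the estimator with the best cross-validated accuracy and save it.
grid_mori_speaker_best = grid_mori_speaker.best_estimator_
with open('shoten_speaker_RF.pickle', mode='wb') as fp:
    pickle.dump(grid_mori_speaker_best, fp)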
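The helper `grid_param_mori()` is not shown in the post. A minimal guess at what it might return, assuming the stated search ranges, is sketched below. `max_depth` and `n_estimators` are real `RandomForestClassifier` parameters, but the specific `n_estimators` values are assumed sample points, since the post only gives the range 1 to 1000.

```python
def grid_param_mori():
    """Hypothetical search grid for RandomForestClassifier.

    max_depth covers 1-10 as stated in the text; the n_estimators values
    are assumed sample points inside the stated 1-1000 range.
    """
    return [
        {
            "max_depth": list(range(1, 11)),
            "n_estimators": [1, 10, 100, 1000],  # assumed sample points
        }
    ]


grid = grid_param_mori()
n_candidates = len(grid[0]["max_depth"]) * len(grid[0]["n_estimators"])
print(n_candidates)  # number of parameter combinations; each is fitted once per CV fold
```

With a grid like this and `cv=10`, the search fits every combination ten times, which is why a coarse set of tree counts is usually preferable to trying every value from 1 to 1000.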
The same procedure is repeated for the number of cushions, and that classifier is also saved as a pickle file.
By the way, the best accuracy was 0.25 for respondent classification and 0.50 for the number of cushions. That is still quite low, so I'd like to improve steps (2) to (3) to raise the accuracy.
Next, create a program that lets you type in a sentence and displays the classification results. All it does is load the pickled classifiers, feed in the sentence vector, and print the predictions.
shoten.py
#!/usr/bin/env python
# coding: utf-8
import numpy as np
import re
import emoji
import mojimoji
import MeCab
from gensim.models import KeyedVectors
import pickle
mecab = MeCab.Tagger("")#If you use Neologd dictionary, please enter the path
model_entity = KeyedVectors.load_word2vec_format("entity_vector.model.bin",binary = True)
with open('shoten_speaker_RF.pickle', mode='rb') as f:
    speaker_clf = pickle.load(f)
with open('shoten_zabuton_RF.pickle', mode='rb') as f:
    zabuton_clf = pickle.load(f)
    
def text_to_vector(text , w2vmodel,num_features):
    # Clean the raw answer text before tokenizing.
    kotae = text
    kotae = kotae.replace(',','、')
    kotae = kotae.replace('\n','')  # newlines
    kotae = kotae.replace('\t','')  # tabs
    kotae = re.sub(r'\s','',kotae)  # any remaining whitespace
    kotae = re.sub(r'^@.[\w]+','',kotae)  # leading @mention
    kotae = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+','',kotae)  # URLs
    kotae = re.sub(r'[!-/:-@[-`{-~ ]+','',kotae)  # half-width (ASCII) symbols
    kotae = re.sub(r'[：-＠，［］★☆“”．・]+','',kotae)  # full-width symbols; this character class is a best-effort reconstruction
    kotae = mojimoji.zen_to_han(kotae,kana = False)  # full-width alphanumerics -> half-width (kana untouched)
    kotae = kotae.lower()
    kotae = emoji.replace_emoji(kotae, replace='')  # strip emoji; needs emoji >= 2.0 (older versions exposed emoji.UNICODE_EMOJI)
    # Tokenize with MeCab and keep every surface form except symbols and BOS/EOS markers.
    kotae_node = mecab.parseToNode(kotae)
    kotae_line = []
    while kotae_node:
        meta = kotae_node.feature.split(",")
        # '記号' is the part-of-speech tag for symbols in the IPA dictionary (and NEologd).
        if meta[0] not in ('記号', 'BOS/EOS'):
            kotae_line.append(kotae_node.surface)
        kotae_node = kotae_node.next
    # Average the vectors of the words the model knows.
    feature_vec = np.zeros(num_features, dtype = "float32")
    word_count = 0
    for word in kotae_line:
        try:
            feature_vec = np.add(feature_vec,w2vmodel[word])
            word_count += 1
        except KeyError :
            # Skip out-of-vocabulary words.
            pass
    # Divide once, after the loop; the zero vector is returned if no word matched.
    if word_count > 0:
        feature_vec = np.divide(feature_vec,word_count)
    return feature_vec.tolist()
def zabuton_challenge(insert_text):
    vector = np.array(text_to_vector(insert_text,model_entity,200)).reshape(1,-1)
    # Predict once and reuse the results.
    speaker = str(speaker_clf.predict(vector)[0])
    zabuton = zabuton_clf.predict(vector)[0]
    if zabuton == 0:
        print(speaker + "-san gets no cushion")
    elif zabuton > 0:
        # Positive change: cushions are awarded.
        print("Yamada-kun! Give " + speaker + "-san " + str(zabuton) + " cushion(s)!")
    elif zabuton < 0:
        # Negative change: cushions are taken away (shown as a positive count).
        print("Yamada-kun! Take " + str(zabuton * -1) + " cushion(s) from " + speaker + "-san!")
    else:
        print("Yamada-kun! Take all the cushions of the developer who made the classifier that gives an error!")
        
if __name__ == "__main__":
    while True:
        text = input("Please answer:")
        zabuton_challenge(text)
Please forgive the lack of comments. The body of text_to_vector() is adapted from code in a blog post (sorry, I have lost track of the source).
Running shoten.py lets you type in answers. (It takes a while to load the pickle files first, though ...)
As test data, let's try entering the answers to the 1st question of the 2395th broadcast, aired on December 29, 2012.

Only Koyuza, Enraku, and Kikuo appear in this output, but depending on the answer the other three respondents show up as well. The correct-answer rate is poor because the classifiers themselves are inaccurate. I also suspect that nobody was awarded a cushion because more than half of the collected data had a cushion change of 0.
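To make the class-imbalance point concrete, here is a small sketch of a majority-class baseline: the accuracy obtained by always predicting the most common label. The labels are made up for illustration and the function name is hypothetical. If more than half of the real labels are 0, a classifier that always answers 0 already scores above 0.5, so the reported 0.50 may largely reflect predicting 0.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Made-up cushion changes, only to illustrate the effect of class imbalance.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, -1, 2]
print(majority_baseline_accuracy(toy_labels))  # 0.6
```

Comparing a classifier's accuracy against this baseline is a quick way to tell whether it has learned anything beyond the label distribution.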
・ Collect more data (the source site has answers from 2011 through April 2014; I want to gather more and win on sheer volume ~~it's a huge hassle~~)
・ Use a corpus that handles colloquial expressions well (I could not find anything other than the Wikipedia corpus, so if you know of one, please let me know)
・ Change the classification algorithm (I'm thinking of trying BERT, since I expect to use it in my research)