3. Natural language processing with Python 2-2. Co-occurrence network [mecab-ipadic-NEologd]

Task

**1. Preparation of text data**

⑴ Reading text data

from google.colab import files
uploaded = files.upload()


with open('20200926_suga_un.txt', mode='rt', encoding='utf-8') as f:
    sugatxt = f.read()


⑵ Data cleaning

# Delete unnecessary characters and symbols
def clean(text):
    text = text.replace("\n", "")
    text = text.replace("\u3000", "")  # full-width space
    text = text.replace("「", "")
    text = text.replace("」", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("、", "")
    return text

text = clean(sugatxt)

# Split into sentences at the Japanese full stop
lines = text.split("。")
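The cleaning-and-splitting steps above can be sketched on a short sample string (the sample sentence below is made up for illustration, not taken from the speech):

```python
# Minimal sketch of the cleaning pipeline: strip noise characters,
# then split the text into sentences at the Japanese full stop.
def clean(text):
    for ch in ["\n", "\u3000", "「", "」", "(", ")", "、"]:
        text = text.replace(ch, "")
    return text

sample = "「国際社会」は(いま)、試練の時です。\n共に乗り越えましょう。"
lines = clean(sample).split("。")
print(lines)
# → ['国際社会はいま試練の時です', '共に乗り越えましょう', '']
```

Note the trailing empty string: because the text ends with 「。」, `split("。")` leaves an empty element at the end, which is harmless here since it yields no nouns.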


**2. Creating co-occurrence data**

⑶ Installation of MeCab and mecab-ipadic-NEologd

# MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null

# mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1

# Avoid a mecabrc lookup error with a symbolic link
!ln -s /etc/mecabrc /usr/local/etc/mecabrc

⑷ Create an instance by specifying mecab-ipadic-NEologd

# Check the dictionary path
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"


import MeCab

path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
m_neo = MeCab.Tagger(path)

⑸ Create a sentence-based noun list

# Stopwords: digits, kanji numerals, counters, and generic nouns to exclude
stopwords = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0", 
             "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", 
             "一", "二", "三", "四", "五", "六", "七", "八", "九", "〇", 
             "年", "月", "日", "次", "割", "回", "的", "症", "以上", "以下", "周年", "件", "毎",
             "の", "もの", "こと", "よう", "様", "ため", "がち", "これ", "それ", "あれ", "誰", 
             "*", ",", ","]
noun_list = []

for line in lines:
    result = []
    # Each MeCab output line is "surface\tfeature1,feature2,..."
    for parsed in m_neo.parse(line).splitlines():
        cols = parsed.split("\t")
        if len(cols) == 2:
            features = cols[1].split(',')
            # Keep the base form (index 6) of nouns ("名詞") not in the stopword list
            if (features[0] == "名詞") and (features[6] not in stopwords):
                result.append(features[6])
    noun_list.append(result)
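The per-token parsing inside the loop can be seen on a single output line. The sample line below is illustrative of the ipadic feature layout (`surface\tPOS,POS-sub1,...,base-form,reading,pronunciation`), not captured output:

```python
# One MeCab (ipadic-format) output line: surface form, a tab, then
# a comma-separated feature string.
line = "感染症\t名詞,一般,*,*,*,*,感染症,カンセンショウ,カンセンショー"

surface, feature_str = line.split("\t")
features = feature_str.split(",")

print(features[0])  # part of speech → 名詞
print(features[6])  # base form → 感染症
```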


⑹ Generation of co-occurrence data

import itertools                 # combinatoric iterator building blocks
from collections import Counter  # dict subclass for counting hashable items

# Generate a noun-pair list per sentence
pair_list = []
for n in noun_list:
    if len(n) >= 2:  # a pair needs at least two nouns
        lt = list(itertools.combinations(n, 2))
        pair_list.append(lt)

# Flatten the noun-pair list
all_pairs = []
for p in pair_list:
    all_pairs.extend(p)

# Count the frequency of each noun pair
cnt_pairs = Counter(all_pairs)
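The pair counting above can be sketched on toy data (the sample nouns below are made up for illustration):

```python
import itertools
from collections import Counter

# Two "sentences", each a list of extracted nouns
noun_list = [["感染症", "対策", "支援"], ["感染症", "支援"]]

# Within each sentence, every unordered pair of nouns co-occurs once
all_pairs = []
for n in noun_list:
    if len(n) >= 2:
        all_pairs.extend(itertools.combinations(n, 2))

cnt_pairs = Counter(all_pairs)
print(cnt_pairs.most_common(1))
# → [(('感染症', '支援'), 2)]
```

The pair appearing in both sentences gets count 2, which is exactly the edge weight used for the network later.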


**3. Draw network diagram**

⑺ Creation of drawing data

import numpy as np

# Take the top 30 most frequent pairs
top30 = sorted(cnt_pairs.items(), key=lambda x: x[1], reverse=True)[:30]

# Convert to a 2D array of (node1, node2, weight) rows
result = []
for key, value in top30:
    temp = list(key)
    temp.append(value)
    result.append(temp)

data = np.array(result)
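The top-k extraction and row conversion can be checked on a toy counter (the pairs and counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical co-occurrence counts
cnt = Counter({("a", "b"): 5, ("a", "c"): 3, ("b", "c"): 1})

# Sort by count descending and keep the top 2
top2 = sorted(cnt.items(), key=lambda x: x[1], reverse=True)[:2]

# Flatten each ((n1, n2), count) item into an [n1, n2, count] row
rows = [[k[0], k[1], v] for k, v in top2]
print(rows)
# → [['a', 'b', 5], ['a', 'c', 3]]
```

Each row matches the (node, node, weight) triple format that `add_weighted_edges_from` expects.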


⑻ Import of visualization library

import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline 

# Library that makes matplotlib support Japanese labels
!pip install japanize-matplotlib
import japanize_matplotlib

⑼ Visualization by NetworkX

# Generate a graph object
G = nx.Graph()

# Load the weighted edge data
G.add_weighted_edges_from(data)

# Draw the graph
plt.figure(figsize=(10, 10))
nx.draw_networkx(G,
                 node_shape="s",
                 node_color="chartreuse",
                 node_size=800,
                 edge_color="gray",
                 font_family="IPAexGothic")  # Japanese font

plt.show()


Compound words that mecab-ipadic-NEologd keeps as a single token while standard MeCab splits them apart:

| mecab-ipadic-NEologd | MeCab standard |
|:--|:--|
| "感染症" (infectious disease) | "感染" + "症" |
| "途上国" (developing country) | "途上" + "国" |
| "東南アジア諸国連合" (ASEAN) | "東南アジア" + "諸国" + "連合" |
| "人間の安全保障" (human security) | "人間" + "の" + "安全" + "保障" |
