This is a memo for myself as I read Introduction to Natural Language Processing Applications in 15 Steps. This time I note my own key points from Step 02 in Chapter 2.
- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 (both Client and Server)
I created a simple dialogue agent in the previous chapter, but it could not treat similar sentences in the same way, and it picked up things that are not really important as features, such as certain words (particles, etc.) and superficial differences (uppercase versus lowercase letters of the alphabet). In this step we learn the following techniques and apply them to the dialogue agent.
Preprocessing means formatting the text appropriately before it goes into the text classification process.
# Cannot be treated as the same sentence
Do you like Python
do you like python
# The commonality of particles and auxiliary verbs becomes a feature
# label,sentence
0,I like you
1,I like ramen!
# ↓
# The sentence "I like ramen" may be judged as label=0 (it shares particles and auxiliary verbs with "I like you")
# Semantically, we want it judged as label=1
The process of absorbing fluctuations in notation and unifying text into a single notation is called string normalization. The goal is to obtain the same tokenization results, and therefore the same BoW, even when the notation fluctuates. Rough normalization is done with neologdn, and the normalization it lacks (lowercasing and Unicode normalization) is handled separately.
There is a library called neologdn that bundles multiple normalization processes. It implements the normalization used when generating data for NEologd, one of the MeCab dictionaries. The advantages of neologdn are that it is easy to use, since the normalization processes are consolidated into a single function, and that it is fast, since it is implemented in C.
Example of use
import neologdn
print(neologdn.normalize(<Sentence>))
neologdn.normalize does not include converting the alphabet between lowercase and uppercase. To absorb this notational fluctuation, use Python's built-in str methods .lower() and .upper() to unify the text to lowercase or uppercase.
However, **for proper nouns and the like, the distinction between lowercase and uppercase letters may be important, so handle them appropriately**.
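For illustration, here is a minimal sketch that combines neologdn with .lower(); the sample string is my own, and the exact output may differ depending on the neologdn version.

import neologdn

text = 'Ｐｙｔｈｏｎがすごーーーい'  # full-width alphabet and repeated long vowel marks
normalized = neologdn.normalize(text)
print(normalized)          # expected: 'Pythonがすごーい' (full-width -> half-width, long vowels shortened)
print(normalized.lower())  # expected: 'pythonがすごーい' (case unified separately, since neologdn does not lowercase)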
Unicode is now widely used as the de facto standard character encoding. Even for the same character "デ", the single precomposed character "デ" and the combination of the base character "テ" with the combining voiced sound mark are treated as different strings as-is, so the BoW results naturally end up different.
In Unicode, characters are represented by **code points** (written in hexadecimal). Characters and code points can be converted to each other with Python's built-in functions ord() and chr().
Unicode and code point examples
>>> hex(ord('あ'))
'0x3042'
>>> chr(0x3042)
'あ'
# By the way, decimal notation also works
>>> ord('あ')
12354
>>> chr(12354)
'あ'
Next, for the character "デ", check the code points of the single precomposed character and of the combined string (base character + combining character).
Checking the code points of "デ"
# single character
>>> chr(0x30C7)
'デ'
# combined string
>>> chr(0x30C6)
'テ'
>>> chr(0x3099)
'゙'
>>> chr(0x30C6) + chr(0x3099)
'デ'
As seen above, the same character can be expressed in multiple ways. Unicode addresses this problem by **defining sets of code point sequences that should be treated as the same character**. This is called Unicode equivalence, and there are the following two kinds.
- Canonical equivalence
  - Characters with the same appearance and function are treated as equivalent
  - e.g. "デ" and "テ" + "゙" (combining voiced sound mark)
- Compatibility equivalence
  - Characters that may differ in appearance or function but are based on the same character are treated as equivalent
  - Includes canonical equivalence
  - e.g. "テ" and "ﾃ" (full-width and half-width)
Unicode normalization decomposes and composes precomposed characters based on this equivalence, and there are the following four forms ("Canonical" refers to canonical equivalence, "Compatibility" to compatibility equivalence).

- NFD (Normalization Form Canonical Decomposition): decomposition based on canonical equivalence
- NFC (Normalization Form Canonical Composition): canonical decomposition, then canonical composition
- NFKD (Normalization Form Compatibility Decomposition): decomposition based on compatibility equivalence
- NFKC (Normalization Form Compatibility Composition): compatibility decomposition, then canonical composition
When actually performing Unicode normalization, you need to **decide which normalization form to use according to the problem your application handles and the nature of the data**.
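For reference, here is a small REPL sketch using Python's standard unicodedata module (the sample strings are my own):

>>> import unicodedata
>>> unicodedata.normalize('NFC', '\u30C6\u3099')  # "テ" + combining voiced sound mark
'デ'
>>> unicodedata.normalize('NFC', 'ﾃﾞ')  # half-width katakana is not unified by canonical equivalence
'ﾃﾞ'
>>> unicodedata.normalize('NFKC', 'ﾃﾞ')  # compatibility equivalence also maps half-width to full-width
'デ'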
Converting a word whose form has changed through conjugation back to the form listed as a dictionary headword is called lemmatization (headword conversion). However, at this point "本を読んだ" and "本を読みました" (both "I read a book") still do not yield the same features. By removing stop words, covered in the next section, they can be treated as the same features.
本を読んだ
本を読みました
↓ tokenization + lemmatization
本 を 読む だ
本 を 読む ます た
Lemmatization is similar to the normalization described above in that it absorbs notational fluctuations, but since it corrects the output of word segmentation, it is often written together with the tokenization process.
If you use node.feature obtained from MeCab's parseToNode, you can get the **base form** from the 7th comma-separated element (features[6]).
However, **for words whose base form is not registered in the dictionary (the field is "*"), the surface form is used instead**.
**BOS/EOS** is a pseudo word in MeCab's output that marks the beginning or end of a sentence, so it should not be included in the tokenization result.
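Putting these points together, a minimal sketch of lemmatization with MeCab might look like the following (it assumes MeCab with an IPAdic-style dictionary, where the base form is the 7th comma-separated field; the sample sentence is my own):

import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('本を読んだ')
tokens = []
while node:
    features = node.feature.split(',')
    if features[0] != 'BOS/EOS':  # skip the pseudo words for the beginning and end of the sentence
        # use the base form if it is registered, otherwise fall back to the surface form
        token = features[6] if features[6] != '*' else node.surface
        tokens.append(token)
    node = node.next
print(tokens)  # expected with IPAdic: ['本', 'を', '読む', 'だ']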
With the tokenization result from the previous section, the tokens match up to "本 を 読む", but after that "だ" and "ます た" differ, so the BoW also differs. These words do not significantly affect the meaning of the sentence, and keeping them in the vocabulary is also undesirable in terms of memory and storage efficiency.
Prepare a list of words to exclude in advance, as shown below, and filter them with an if statement. In some cases you can obtain a ready-made stop word list from the net, such as SlothLib.
~~
stop_words = ['て', 'に', 'を', 'は', 'です', 'ます']
~~
if token not in stop_words:
    result.append(token)
Particles and auxiliary verbs are important parts of speech for composing sentences, but they are not needed to express the meaning of a sentence (for the dialogue agent, to obtain the features needed for class ID classification), so they can be removed based on their part of speech.
~~
if features[0] not in ['助詞', '助動詞']:  # particles and auxiliary verbs
~~
As in the previous section, numbers and dates/times are important for the sentence itself, but they often contribute little to expressing its meaning, so they are replaced with a specific placeholder string.
# Before conversion
I bought 1 egg
I bought 2 eggs
I bought 10 eggs
# After conversion
I bought SOMENUMBER eggs
I bought SOMENUMBER eggs
I bought SOMENUMBER eggs
- Although the information about the number of eggs is lost, the meaning "I bought eggs" is preserved, and the differences in the number are unified.
- Include a half-width space before and after "SOMENUMBER" so that it is not merged with the surrounding characters during word segmentation.
- You would get the same result if "SOMENUMBER" were kept together with the half-width spaces as part of the vocabulary, but avoid this because the number of dimensions increases by one unnecessarily.
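The replacement code itself is not shown in this section; as one possible approach, here is a small regex-based sketch (the helper name replace_numbers and the handling of full-width digits are my own assumptions, and the book may implement this differently, e.g. by checking the part of speech during tokenization):

import re

def replace_numbers(text):
    # Replace runs of half-width or full-width digits with the placeholder,
    # padded with half-width spaces so it stays a separate token after segmentation.
    return re.sub(r'[0-9０-９]+', ' SOMENUMBER ', text)

print(replace_numbers('卵を10個買った'))  # '卵を SOMENUMBER 個買った'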
As mentioned at the beginning, the techniques learned in this chapter are applied to the dialogue agent as follows.
~~
# Improved _tokenize()
def _tokenize(self, text):
    text = unicodedata.normalize('NFKC', text)  # Unicode normalization
    text = neologdn.normalize(text)  # normalization with neologdn
    text = text.lower()  # lowercase the alphabet

    node = self.tagger.parseToNode(text)
    result = []
    while node:
        features = node.feature.split(',')

        if features[0] != 'BOS/EOS':
            if features[0] not in ['助詞', '助動詞']:  # stop word removal by part of speech (particles, auxiliary verbs)
                token = features[6] \
                    if features[6] != '*' \
                    else node.surface  # lemmatization
                result.append(token)

        node = node.next

    return result
Execution result
# Fix the module name loaded in evaluate_dialogue_agent.py
from dialogue_agent import DialogueAgent
↓
from dialogue_agent_with_preprocessing import DialogueAgent
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.43617021
- Normal implementation (Step01): 37.2%
- With preprocessing added (Step02): 43.6%