This is a memo for myself as I read Introduction to Natural Language Processing Applications in 15 Steps. This time I note my own key points from Step 02 in Chapter 2.
- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 (both Client and Server)
I created a simple dialogue agent in the previous chapter, but it could not treat similar sentences in the same way, and it picked up things that are not really important as features, such as certain words (particles, etc.) and superficial differences (uppercase versus lowercase letters of the alphabet). In this step we learn the following techniques and apply them to the dialogue agent.
Preprocessing means formatting the text appropriately before it goes into the text classification process.
# Cannot be treated as the same sentence
Do you like Python
do you like python
# The commonality of particles and auxiliary verbs becomes a feature
# label,sentence
0,I like you
1,I like ramen!
# ↓
# The sentence "I like ramen" may be judged as label=0 (it shares particles and auxiliary verbs with "I like you")
# Semantically, we want it judged as label=1
The process of absorbing fluctuations in notation and unifying text into a single notation is called string normalization. The goal is to obtain the same tokenization results, and therefore the same BoW, even when the notation fluctuates. Rough normalization is done with neologdn, and the normalization it lacks (lowercasing and Unicode normalization) is handled separately.
There is a library called neologdn that bundles multiple normalization processes. It implements the normalization used when generating data for NEologd, one of the MeCab dictionaries. The advantages of neologdn are that it is easy to use, since the normalization processes are consolidated into a single function, and that it is fast, since it is implemented in C.
Example of use
import neologdn
print(neologdn.normalize(<Sentence>))
neologdn.normalize does not include converting the alphabet between lowercase and uppercase. To absorb this notational fluctuation, use Python's built-in str methods .lower() and .upper() to unify the text to lowercase or uppercase.
However, **for proper nouns and the like, the distinction between lowercase and uppercase letters may be important, so handle them appropriately**.
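For illustration, here is a minimal sketch that combines neologdn with .lower(); the sample string is my own, and the exact output may differ depending on the neologdn version.

import neologdn

text = 'Ｐｙｔｈｏｎがすごーーーい'  # full-width alphabet and repeated long vowel marks
normalized = neologdn.normalize(text)
print(normalized)          # expected: 'Pythonがすごーい' (full-width -> half-width, long vowels shortened)
print(normalized.lower())  # expected: 'pythonがすごーい' (case unified separately, since neologdn does not lowercase)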
Unicode is now widely used as the de facto standard character encoding. Even for the same character "デ", the single precomposed character "デ" and the combination of the base character "テ" with the combining voiced sound mark are treated as different strings as-is, so the BoW results naturally end up different.
In Unicode, characters are represented by **code points** (written in hexadecimal). Characters and code points can be converted to each other with Python's built-in functions ord() and chr().
Unicode and code point examples
>>> hex(ord('あ'))
'0x3042'
>>> chr(0x3042)
'あ'
# By the way, decimal notation also works
>>> ord('あ')
12354
>>> chr(12354)
'あ'
Next, for the character "デ", check the code points of the single precomposed character and of the combined string (base character + combining character).
Checking the code points of "デ"
# single character
>>> chr(0x30C7)
'デ'
# combined string
>>> chr(0x30C6)
'テ'
>>> chr(0x3099)
'゙'
>>> chr(0x30C6) + chr(0x3099)
'デ'
As seen above, the same character can be expressed in multiple ways. Unicode addresses this problem by **defining sets of code point sequences that should be treated as the same character**. This is called Unicode equivalence, and there are the following two kinds.
- Canonical equivalence
  - Characters with the same appearance and function are treated as equivalent
  - e.g. "デ" and "テ" + "゙" (combining voiced sound mark)
- Compatibility equivalence
  - Characters that may differ in appearance or function but are based on the same character are treated as equivalent
  - Includes canonical equivalence
  - e.g. "テ" and "ﾃ" (full-width and half-width)
Unicode normalization decomposes and composes precomposed characters based on this equivalence, and there are the following four forms ("Canonical" refers to canonical equivalence, "Compatibility" to compatibility equivalence).

- NFD (Normalization Form Canonical Decomposition): decomposition based on canonical equivalence
- NFC (Normalization Form Canonical Composition): canonical decomposition, then canonical composition
- NFKD (Normalization Form Compatibility Decomposition): decomposition based on compatibility equivalence
- NFKC (Normalization Form Compatibility Composition): compatibility decomposition, then canonical composition
When actually performing Unicode normalization, you need to **decide which normalization form to use according to the problem your application handles and the nature of the data**.
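For reference, here is a small REPL sketch using Python's standard unicodedata module (the sample strings are my own):

>>> import unicodedata
>>> unicodedata.normalize('NFC', '\u30C6\u3099')  # "テ" + combining voiced sound mark
'デ'
>>> unicodedata.normalize('NFC', 'ﾃﾞ')  # half-width katakana is not unified by canonical equivalence
'ﾃﾞ'
>>> unicodedata.normalize('NFKC', 'ﾃﾞ')  # compatibility equivalence also maps half-width to full-width
'デ'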
Converting a word whose form has changed through conjugation back to the form listed as a dictionary headword is called lemmatization (headword conversion). However, at this point "本を読んだ" and "本を読みました" (both "I read a book") still do not yield the same features. By removing stop words, covered in the next section, they can be treated as the same features.
本を読んだ
本を読みました
↓ tokenization + lemmatization
本 を 読む だ
本 を 読む ます た
Lemmatization is similar to the normalization described above in that it absorbs notational fluctuations, but since it corrects the output of word segmentation, it is often written together with the tokenization process.
If you use node.feature obtained from MeCab's parseToNode, you can get the **base form** from the 7th comma-separated element (features[6]).
However, **for words whose base form is not registered in the dictionary (the field is "*"), the surface form is used instead**.
**BOS/EOS** is a pseudo word in MeCab's output that marks the beginning or end of a sentence, so it should not be included in the tokenization result.
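Putting these points together, a minimal sketch of lemmatization with MeCab might look like the following (it assumes MeCab with an IPAdic-style dictionary, where the base form is the 7th comma-separated field; the sample sentence is my own):

import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('本を読んだ')
tokens = []
while node:
    features = node.feature.split(',')
    if features[0] != 'BOS/EOS':  # skip the pseudo words for the beginning and end of the sentence
        # use the base form if it is registered, otherwise fall back to the surface form
        token = features[6] if features[6] != '*' else node.surface
        tokens.append(token)
    node = node.next
print(tokens)  # expected with IPAdic: ['本', 'を', '読む', 'だ']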
With the tokenization result from the previous section, the tokens match up to "本 を 読む", but after that "だ" and "ます た" differ, so the BoW also differs. These words do not significantly affect the meaning of the sentence, and keeping them in the vocabulary is also undesirable in terms of memory and storage efficiency.
Prepare a list of words to exclude in advance, as shown below, and filter them with an if statement. In some cases you can obtain a ready-made stop word list from the net, such as SlothLib.
~~
stop_words = ['て', 'に', 'を', 'は', 'です', 'ます']
~~
if token not in stop_words:
    result.append(token)
Particles and auxiliary verbs are important parts of speech for composing sentences, but they are not needed to express the meaning of a sentence (for the dialogue agent, to obtain the features needed for class ID classification), so they can be removed based on their part of speech.
~~
if features[0] not in ['助詞', '助動詞']:  # particles and auxiliary verbs
~~
As in the previous section, numbers and dates/times are important for the sentence itself, but they often contribute little to expressing its meaning, so they are replaced with a specific placeholder string.
# Before conversion
I bought 1 egg
I bought 2 eggs
I bought 10 eggs
# After conversion
I bought SOMENUMBER eggs
I bought SOMENUMBER eggs
I bought SOMENUMBER eggs
- Although the information about the number of eggs is lost, the meaning "I bought eggs" is preserved, and the differences in the number are unified.
- Include a half-width space before and after "SOMENUMBER" so that it is not merged with the surrounding characters during word segmentation.
- You would get the same result if "SOMENUMBER" were kept together with the half-width spaces as part of the vocabulary, but avoid this because the number of dimensions increases by one unnecessarily.
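The replacement code itself is not shown in this section; as one possible approach, here is a small regex-based sketch (the helper name replace_numbers and the handling of full-width digits are my own assumptions, and the book may implement this differently, e.g. by checking the part of speech during tokenization):

import re

def replace_numbers(text):
    # Replace runs of half-width or full-width digits with the placeholder,
    # padded with half-width spaces so it stays a separate token after segmentation.
    return re.sub(r'[0-9０-９]+', ' SOMENUMBER ', text)

print(replace_numbers('卵を10個買った'))  # '卵を SOMENUMBER 個買った'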
As mentioned at the beginning, the techniques learned in this chapter are applied to the dialogue agent as follows.
~~
# Improved _tokenize()
def _tokenize(self, text):
    text = unicodedata.normalize('NFKC', text)  # Unicode normalization
    text = neologdn.normalize(text)  # normalization with neologdn
    text = text.lower()  # lowercase the alphabet

    node = self.tagger.parseToNode(text)
    result = []
    while node:
        features = node.feature.split(',')

        if features[0] != 'BOS/EOS':
            if features[0] not in ['助詞', '助動詞']:  # stop word removal by part of speech (particles, auxiliary verbs)
                token = features[6] \
                    if features[6] != '*' \
                    else node.surface  # lemmatization
                result.append(token)

        node = node.next

    return result
Execution result
# Fix the module name loaded in evaluate_dialogue_agent.py
from dialogue_agent import DialogueAgent
↓
from dialogue_agent_with_preprocessing import DialogueAgent
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.43617021
- Normal implementation (Step01): 37.2%
- With preprocessing added (Step02): 43.6%