When I tried to run morphological analysis on a Japanese sentence with Juman++ via PyKNP, I got this error ...!
ValueError: invalid literal for int() with base 10: 'input'
(How to use JUMAN and Juman++ from Python is omitted here.)
File "/$HOME/.pyenv/versions/anaconda3-2019.03/lib/python3.6/site-packages/pyknp/juman/morpheme.py", line 143, in _parse_spec
self.hinsi_id = int(parts[4])
ValueError: invalid literal for int() with base 10: 'input'
For a start, I searched for the error message and found this article:
- Symbols to be careful about when using JUMAN from PyKNP - On the cause of and countermeasures for the ValueError when playing with JUMAN++ -- EnsekiTT Blog
According to it, half-width spaces and some half-width symbols cause this error.
So I replaced all half-width characters with full-width characters.
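The replacement step can be sketched roughly as follows. This is a minimal standard-library sketch, not the code from the article above. The helper name `to_full_width` is made up for illustration, and it only covers the ASCII range (half-width katakana is not handled).

```python
# Hypothetical helper: convert half-width ASCII characters to full-width ones.
# ASCII space (0x20) maps to the ideographic space (U+3000); the printable
# ASCII characters '!'..'~' (0x21-0x7E) map to U+FF01-U+FF5E (offset 0xFEE0).
HALF_TO_FULL = {0x20: 0x3000}
HALF_TO_FULL.update({code: code + 0xFEE0 for code in range(0x21, 0x7F)})


def to_full_width(text: str) -> str:
    """Return text with half-width ASCII replaced by full-width forms."""
    return text.translate(HALF_TO_FULL)


# Japanese characters pass through unchanged; only ASCII is converted.
converted = to_full_width("PyKNP で解析 #1")
print(converted)
```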
Even after converting every half-width character to full-width, the same error kept appearing. Apparently the cause was different from the situation in the article above.
So I ran it under pdb and checked the contents of the variable parts at the time of the error ~~should have done that from the start~~.
(Pdb) parts
['InvalidParameter:', 'byte', 'size', 'of', 'input', 'string', '(4302)', 'is', 'greater', 'than', 'maximum', 'allowed', '(4096)']
(So when an error occurs, the error message itself ends up in the list that is supposed to hold the analysis result ...)
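Why this surfaces as a ValueError about 'input' can be reproduced in a few lines. This is a simplified illustration, not pyknp's actual parsing code. It assumes only what the traceback and the pdb output show: the line is split on spaces and the element at index 4 is passed to int().

```python
# Simplified illustration (NOT pyknp's real implementation): the error text
# returned by the analyzer is treated like a normal space-separated result line.
error_line = (
    "InvalidParameter: byte size of input string (4302) "
    "is greater than maximum allowed (4096)"
)
parts = error_line.split(" ")

try:
    # In a normal result line, index 4 would hold a numeric part-of-speech ID.
    hinsi_id = int(parts[4])
except ValueError as exc:
    message = str(exc)

print(parts[4])
print(message)
```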
Apparently **the size (number of bytes) of the input string was too large**. **The input string seems to be limited to 4096 bytes in total**, so it is better to keep each input within that limit.
I was in the middle of building a dataset to feed to BERT, so sentences that are too long simply get skipped! ~~UTF-8 uses a different number of bytes depending on the character, so cutting them is a hassle~~
Detect sentences larger than 4096 bytes with the following condition and apply some workaround (split or skip).
It gets the number of bytes in the string text and compares it with the limit.
if len(text.encode('utf-8')) > 4096:
See here for how to get the number of bytes, rather than the number of characters, in a string:
- Python string length and number of bytes by encoding -- Memoize2
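If skipping is not acceptable, the "split" workaround could look roughly like the sketch below. It is only a sketch under assumptions: `utf8_len` and `split_by_bytes` are hypothetical helper names, and 4096 is the limit taken from the error message above. It cuts purely on byte size, never in the middle of a character, but it may cut in the middle of a word or sentence, which can change the analysis result near the boundary.

```python
from typing import List

MAX_BYTES = 4096  # limit reported in the Juman++ error message


def utf8_len(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def split_by_bytes(text: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Split text into chunks whose UTF-8 size is at most max_bytes each."""
    chunks = []
    current = ""
    current_size = 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        # Start a new chunk if adding this character would exceed the limit.
        if current and current_size + ch_size > max_bytes:
            chunks.append(current)
            current, current_size = "", 0
        current += ch
        current_size += ch_size
    if current:
        chunks.append(current)
    return chunks


text = "あ" * 2000  # 2000 characters, 3 bytes each in UTF-8
if utf8_len(text) > MAX_BYTES:
    pieces = split_by_bytes(text)
else:
    pieces = [text]
print(len(pieces), [utf8_len(p) for p in pieces])
```

Each piece can then be passed to the analyzer separately. Splitting at sentence boundaries such as "。" first would probably give cleaner results.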
Combined with the article introduced above, the causes of this error when using Juman++ from PyKNP were:
- Half-width spaces
- Some half-width symbols
- **An input string larger than 4096 bytes**