I made a package that can compare morphological analyzers with Python

Content of this article

Package features

How to use

Preparation

A Makefile is set up in the GitHub repository.

If `make` works in your environment, you can also install manually; please refer to this section for installation instructions.

Sample code

The samples are shown for Python 3.x. If you want to see samples for Python 2.x, see the [example code](https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/examples/examples.py).

The part-of-speech tag sets are summarized in detail on this page. The Juman/Juman++ tag set is also described there, so if you want to do part-of-speech filtering with Juman/Juman++, please switch to that tag set.

By the way, you can also use the neologd dictionary with Juman/Juman++. Please see this article: I made a script to make the neologd dictionary usable in juman/juman++.

The only difference between the MeCab, Juman/Juman++, and Kytea wrappers is the class you call; they all inherit from the same common class.

Morphological analysis with MeCab

This section introduces usage as of version 1.3.1.

import JapaneseTokenizer
# Choose the dictionary type: "neologd", "all", "ipadic", "user", or "" can be selected.
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
# Define the parts of speech you want to keep (IPA tags: proper nouns and independent adjectives).
pos_condition = [('名詞', '固有名詞'), ('形容詞', '自立')]

sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東のイスラム共和制国家。ペルシア、ペルシャともいう。"
# Morpheme splitting, part-of-speech filtering, and listing, all in one line
print(mecab_wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())

Then the result looks like this:

['イラン・イスラム共和国', 'イラン', '西アジア', '中東', 'イスラム共和制', 'ペルシア', 'ペルシャ']
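Conceptually, each POS condition is a tuple matched against the head of a token's POS hierarchy, so a two-level condition keeps only tokens whose first two POS levels match, while a one-level condition keeps a whole category. A minimal plain-Python sketch of this idea, using IPA-dictionary-style tags (名詞/固有名詞 = noun/proper noun); the function and data here are hypothetical, not the package's actual implementation:

```python
def pos_matches(pos_tuple, conditions):
    """True if pos_tuple begins with any of the condition tuples."""
    return any(pos_tuple[:len(cond)] == cond for cond in conditions)

# Hypothetical (surface, POS-tuple) pairs in IPA-dictionary tag style.
tokens = [("イラン", ("名詞", "固有名詞")),
          ("は", ("助詞", "係助詞")),
          ("美しい", ("形容詞", "自立"))]
pos_condition = [("名詞", "固有名詞"), ("形容詞", "自立")]

print([surface for surface, pos in tokens if pos_matches(pos, pos_condition)])
# → ['イラン', '美しい']
```

A single-element condition such as `('名詞',)` matches every noun subtype, which is the form used later for Kytea.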

Morphological analysis with Juman / Juman++

It's basically the same as MeCab; only the class you call is different.

For Juman

from JapaneseTokenizer import JumanWrapper
tokenizer_obj = JumanWrapper()
# Define the parts of speech you want to keep (Juman tags: proper nouns, place names, organization names, common nouns).
pos_condition = [('名詞', '固有名詞'), ('名詞', '地名'), ('名詞', '組織名'), ('名詞', '普通名詞')]

sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東のイスラム共和制国家。ペルシア、ペルシャともいう。"
# Morpheme splitting, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
['イラン', 'イスラム', '共和', '国', '通称', 'イラン', '西', 'アジア', '中東', 'イスラム', '共和', '制', '国家', 'ペルシア', 'ペルシャ']

For Juman++

from JapaneseTokenizer import JumanppWrapper
tokenizer_obj = JumanppWrapper()
# Define the parts of speech you want to keep (Juman tags: proper nouns, place names, organization names, common nouns).
pos_condition = [('名詞', '固有名詞'), ('名詞', '地名'), ('名詞', '組織名'), ('名詞', '普通名詞')]

sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東のイスラム共和制国家。ペルシア、ペルシャともいう。"
# Morpheme splitting, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
['イラン', 'イスラム', '共和国', '通称', 'イラン', '西', 'アジア', '中東', 'イスラム', '共和国', '国家', 'ペルシア', 'ペルシャ']

In practice, for text as clean as Wikipedia, Juman and Juman++ do not differ that much. With Juman++, only the first call is a bit slow, because loading the model file into memory takes time. From the second call onward this slowness disappears, because the wrapper talks to a process that is kept running.

Morphological analysis with Kytea

Everything is the same as with MeCab and Juman except the class name.

from JapaneseTokenizer import KyteaWrapper
tokenizer_obj = KyteaWrapper()
# Define the parts of speech you want to keep (here: all nouns).
pos_condition = [('名詞',)]

sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東のイスラム共和制国家。ペルシア、ペルシャともいう。"
# Morpheme splitting, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())

Development history

Previously, I posted an article in which I wrote a binding-style wrapper for MeCab and was rather pleased with myself. That was fine as far as it went, since I built it for my own needs. Afterwards, though, I came to think __"I want people to be able to easily try comparing morphological analyzers"__, and that led me to build this package.

Reason 1: Everyone just uses MeCab, right?

This is just my own circle, but the mood seems to be "Morphological analysis? MeCab for now. Is there anything else?"

Searching on Qiita, there are 347 hits for mecab, but only 17 for juman and just 3 for kytea.

Certainly, I think MeCab is very good software. But "I don't know anything else, so MeCab is the only option, right?" is a different matter, I think.

That's why the first motivation was to make the point that __"there are options other than MeCab"__.

Reason 2: People from abroad have no way to know, right?

Recently, I have been going to a community of Python users from abroad who live in Japan.

They are interested in Japanese text processing, but they don't know which morphological analyzer is right for them.

They look things up, but they don't really understand the differences, so they end up saying rather confused things.

Below is some of the mysterious logic I have heard so far.

I came to think that such mysterious logic appears because the information is not organized, making comparison impossible.

Organizing all the information is difficult, but making comparison easier is possible. That's why I built this. I also wrote all the documentation in English, hoping the information would reach as many people as possible.

Development policy

As common as possible

I designed the wrappers to share as much structure as possible, including the interface. The class that executes processing and the data class are common to all analyzers.
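This kind of sharing can be pictured as an abstract base class plus one common result class that every concrete wrapper returns. A simplified, runnable sketch of the design idea (the class names here are hypothetical, not the package's actual classes, and a whitespace splitter stands in for a real analyzer):

```python
from abc import ABC, abstractmethod


class TokenizedResult:
    """Common data class that every wrapper returns."""

    def __init__(self, tokens):
        self.tokens = tokens  # list of surface strings


class BaseWrapper(ABC):
    """Common parent; concrete wrappers differ only in how they invoke the analyzer."""

    @abstractmethod
    def tokenize(self, sentence):
        ...


class WhitespaceWrapper(BaseWrapper):
    """Stand-in 'analyzer' so the sketch runs without MeCab/Juman/Kytea installed."""

    def tokenize(self, sentence):
        return TokenizedResult(sentence.split())


result = WhitespaceWrapper().tokenize("a b c")
print(result.tokens)  # → ['a', 'b', 'c']
```

Because every wrapper returns the same data class, downstream code does not care which analyzer produced the tokens.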

As simple a syntax as possible

I designed the syntax so that you can "code preprocessing as fast as possible." The result is an interface that handles morpheme splitting and part-of-speech filtering in one line.
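The one-liner works because each step returns an object exposing the next step (tokenize → filter → convert_list_object). A minimal runnable sketch of such a fluent chain, with hypothetical names and a toy tokenizer rather than the package's real code:

```python
class FilteredObject:
    def __init__(self, pairs):
        self._pairs = pairs  # kept (surface, pos) pairs

    def convert_list_object(self):
        return [surface for surface, _pos in self._pairs]


class TokenizedSentence:
    def __init__(self, pairs):
        self._pairs = pairs  # all (surface, pos) pairs

    def filter(self, pos_condition):
        kept = [(s, p) for s, p in self._pairs if p in pos_condition]
        return FilteredObject(kept)


class ToyTokenizer:
    """Stand-in tokenizer: tags every whitespace token as ('語',)."""

    def tokenize(self, sentence):
        return TokenizedSentence([(tok, ('語',)) for tok in sentence.split()])


print(ToyTokenizer().tokenize("a b").filter([('語',)]).convert_list_object())
# → ['a', 'b']
```

Each intermediate object owns exactly one responsibility, which is what makes the chain both short and readable.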


If you like it, please give a Star ☆ to the GitHub repository :bow_tone1:

I am also looking for people to improve it together. I would like to add support for other analyzers as well, such as RakutenMA, ChaSen ...
