NER stands for Named Entity Recognition, a natural language processing task that finds names in text (people, organizations, locations, and so on) and labels them by type. Stanford NER Tagger is a tool for this task. This time, I will train a model with it myself.
First, download an example of training data. https://github.com/synalp/NER/blob/master/corpus/CoNLL-2003/eng.train
Then download Stanford NER Tagger. https://nlp.stanford.edu/software/CRF-NER.shtml#Download
Then install a JDK, since Stanford NER runs on Java.
apt install default-jdk
Convert the downloaded eng.train into a two-column, tab-separated file (token, then label), with a blank line between sentences.
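out = []
with open("eng.train", "r") as f:
    for line in f:
        line = line.split()
        if len(line) > 2:
            # Keep the token and its NER tag; drop the I-/B- prefixes.
            out.append(str(line[0])+"\t"+str(line[-1]).replace("I-","").replace("B-","")+"\n")
        else:
            # Blank lines separate sentences.
            out.append("\n")
# Open in write mode; the default read mode cannot be written to.
with open("train.tsv", "w") as f:
    f.write(''.join(out))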
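To see what this conversion produces, here is a small sketch that applies the same logic to a few in-memory rows instead of the file. The function name conll_to_tsv and the sample rows are illustrative assumptions; the rows follow the four-column layout (token, POS tag, chunk tag, NER tag) used by eng.train.

```python
def conll_to_tsv(lines):
    """Reduce four-column CoNLL rows to 'token<TAB>label' rows."""
    out = []
    for line in lines:
        cols = line.split()
        if len(cols) > 2:
            # Last column is the NER tag; strip the I-/B- prefixes.
            label = cols[-1].replace("I-", "").replace("B-", "")
            out.append(f"{cols[0]}\t{label}\n")
        else:
            # An empty row marks a sentence boundary.
            out.append("\n")
    return "".join(out)


# Illustrative sample rows, not the full corpus.
sample = [
    "EU NNP I-NP I-ORG\n",
    "rejects VBZ I-VP O\n",
    "German JJ I-NP I-MISC\n",
    "\n",
    "Peter NNP I-NP I-PER\n",
]
print(conll_to_tsv(sample))
```

Each token ends up on its own line with a single label, which matches the map = word=0,answer=1 setting used for training.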
Next, write the training settings as a properties file:
trainFile = train.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
Save this as a file named train.prop.
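Before training, it can help to confirm that the file reads the way you expect. The following is a minimal sketch and not part of Stanford NER: parse_props and REQUIRED are hypothetical names, REQUIRED lists only the three keys this walkthrough relies on, and the reader handles only simple key = value lines like the ones above (no line continuations or other Java properties features).

```python
# Keys this walkthrough depends on (an assumption, not an official list).
REQUIRED = ("trainFile", "serializeTo", "map")


def parse_props(text):
    """Read simple 'key = value' lines into a dict."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so 'word=0,answer=1' stays intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Inline excerpt of the settings; in practice, read train.prop instead.
sample = """\
trainFile = train.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useWord=true
"""
props = parse_props(sample)
missing = [key for key in REQUIRED if key not in props]
print(props["map"], missing)
```

To check the real file, pass open("train.prop").read() to parse_props instead of the inline excerpt.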
Then run the training from the directory that contains stanford-ner.jar:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop
When training finishes, the model is written to ner-model.ser.gz, the path given by serializeTo.
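If you prefer to drive the training from a script, the same command can be launched with Python's subprocess module. This is a sketch under assumptions: build_train_command and train are hypothetical helper names, and the default paths assume the jar, train.prop, and the output model all sit in the current directory.

```python
import subprocess
from pathlib import Path


def build_train_command(jar="stanford-ner.jar", prop="train.prop"):
    """Return the training command shown above as an argument list."""
    return [
        "java", "-cp", jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop,
    ]


def train(jar="stanford-ner.jar", prop="train.prop", model="ner-model.ser.gz"):
    """Run training and report whether the model file now exists."""
    # check=True raises CalledProcessError if java exits with an error.
    subprocess.run(build_train_command(jar, prop), check=True)
    return Path(model).exists()


# Only print the command here; call train() once the files are in place.
print(" ".join(build_train_command()))
```

Passing the command as a list avoids shell quoting problems if a path contains spaces.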
You can use the model from Python through NLTK's StanfordNERTagger wrapper.
import nltk
from nltk.tag.stanford import StanfordNERTagger
sent = "Balack Obama kills people by AK47"
model = "./ner-model.ser.gz"
jar = "./stanford-ner.jar"
tagger = StanfordNERTagger(model, jar, encoding='utf-8')
print(tagger.tag(sent.split()))
[output]
[('Balack', 'PER'),
('Obama', 'PER'),
('kills', 'O'),
('people', 'O'),
('by', 'O'),
('AK47', 'O')]
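The tagger returns one label per token, so a multi-word name comes back as several tagged tokens. A possible way to merge them into whole entities is sketched below; group_entities is a hypothetical helper, not part of NLTK. Because the I-/B- prefixes were removed from the training data, this approach cannot tell apart two different entities of the same type that appear side by side, and it will merge them.

```python
from itertools import groupby


def group_entities(tagged):
    """Merge consecutive tokens that share a non-'O' label."""
    entities = []
    # groupby yields runs of adjacent pairs with the same label.
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != "O":
            text = " ".join(token for token, _ in run)
            entities.append((text, label))
    return entities


# The tagger output shown above.
tagged = [
    ("Balack", "PER"), ("Obama", "PER"), ("kills", "O"),
    ("people", "O"), ("by", "O"), ("AK47", "O"),
]
print(group_entities(tagged))  # [('Balack Obama', 'PER')]
```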
[0] https://nlp.stanford.edu/software/crf-faq.html#a
[1] https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486