When you start working on natural language processing with deep learning, you keep running into an unfamiliar term: Embedding.
Translated literally into Japanese, it just means **embedding**.
~~That tells me nothing~~ I wasn't sure what it meant, so I looked it up.
Converting natural language into a form that can be computed is what's called embedding. In most cases, it refers to **the operation of converting words, sentences, and so on into vector representations**.
There are two main reasons for doing this.
First, most current machine learning algorithms are simply not designed to handle string types, so text has to be converted into something that can be computed.
Second, beyond just making text computable, a well-designed vector representation can capture the characteristics of words and sentences in the vectors themselves.
For example, if **words with similar meanings are mapped to vectors that are close to each other**, the (rough) meaning of a word can be expressed through the distance and similarity between vectors.
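As a toy illustration (the 2-D vectors below are made up for this sketch and do not come from any real model), "close" and "far" here are just ordinary vector math:

```python
import numpy as np

# Made-up toy vectors, purely to illustrate "similar meaning -> close vectors"
cat = np.array([0.9, 0.8])
dog = np.array([0.85, 0.75])  # similar meaning, placed close to "cat"
car = np.array([-0.7, 0.6])   # unrelated meaning, placed far away

print(np.linalg.norm(cat - dog))  # small distance between similar words
print(np.linalg.norm(cat - car))  # much larger distance to the unrelated word
```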
That's the idea, but it doesn't really sink in until you run something, so let's write some code.
Let's do the embedding with **Word2Vec**, which is easy to use through a library called gensim. I used a pre-trained model as-is (it is loaded in the full code at the end of this post).
To start, let's try embedding "Good morning" and "Good evening".
print(model["Good morning"])
# [ 0.36222297 -0.5308175 0.97112703 -0.50114137 -0.41576928 1.7538059
# -0.17550747 -0.95748925 -0.9604152 -0.0804095 -1.160322 0.22136442
# ...
print(model["Good evening"])
# [-0.13505702 -0.11360763 0.00522657 -0.01382224 0.03126004 0.14911242
# 0.02867801 -0.02347831 -0.06687803 -0.13018233 -0.01413341 0.07728481
# ...
You can see that the string has been converted to a vector.
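Just to see what we actually got back (this check is not in the original post, but gensim's KeyedVectors lookup returns a plain numpy array):

```python
vec = model["Good morning"]
print(type(vec))          # <class 'numpy.ndarray'>
print(vec.shape)          # a fixed-length vector; its length equals model.vector_size
print(model.vector_size)  # dimensionality determined by the pre-trained model
```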
..."Okay, and?" I could almost hear someone saying that, so let's check whether these vectors really do capture meaning.
Let's use cosine similarity, which is often used when computing document similarity. Cosine similarity takes values between -1 and 1, and **the closer it is to 1, the more similar the two vectors are**.
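Concretely, the cosine similarity of two vectors a and b is their dot product divided by the product of their norms; this is exactly what the cos_similarity function in the full code at the end computes:

$$
\mathrm{cos\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$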
First, let's look at the similarity between **"Good morning" and "Good evening"**.
print(cos_similarity(model["Good morning"],model["Good evening"]))
# 0.8513177
The score came out as 0.85 ... Seems pretty close.
Next, let's look at a pair of words whose meanings should be far apart.
print(cos_similarity(model["Good morning"],model["Hijiki"]))
# 0.17866151
The score is about 0.18... so **"Good morning" and "Hijiki"** really can be said to be far apart.
At least from this quick experiment, the vectors do seem to carry meaning.
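As a side note (not something the original code uses), gensim's KeyedVectors can do these comparisons directly, so you don't have to hand-roll cosine similarity; whether a particular word is in the vocabulary depends on the pre-trained model:

```python
# Same calculation as the hand-written cos_similarity
print(model.similarity("Good morning", "Good evening"))

# The words whose vectors are closest to "Good morning" in this model
print(model.most_similar("Good morning", topn=5))
```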
With that, I feel like I've got a rough picture of what Embedding is. I hear that the embeddings from BERT, which caused quite a buzz a while back, are really good, so I'd like to try those next.
For reference, here is the full code used in this post.

import numpy as np
import gensim

# Load the pre-trained Word2Vec model
model_path = "entity_vector/entity_vector.model.bin"
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)

# Cosine similarity
def cos_similarity(a, b):
    return np.dot(a, b) / (np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b)))

print(model["Good morning"])
print(model["Good evening"])
print(cos_similarity(model["Good morning"], model["Good evening"]))
print(cos_similarity(model["Good morning"], model["Hijiki"]))