WordNet is a lexical database of words in the English language. NLTK's WordNetLemmatizer() looks words up in WordNet.

Contents

What is WordNet
Word2Vec
What is WordNet
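
As noted in the opening sentence, NLTK's WordNetLemmatizer() relies on WordNet to map inflected forms back to dictionary lemmas. A minimal usage sketch (the WordNet data must be downloaded once via nltk.download("wordnet")):

```python
# Minimal sketch: the lemmatizer returns a base form that exists in WordNet.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("corpora"))           # -> "corpus"
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run" (pos hints the part of speech)
```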

Word2Vec

Word2vec takes a large corpus of text and represents each word in a fixed vocabulary as a vector. The algorithm then moves through each position t in the text, which has a center word c and context words o. The similarity of the vectors for c and o is used to compute the probability of o given c (or vice versa), and the word vectors are adjusted to maximize this probability.
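
To make that concrete, here is a toy numerical sketch (not from any library) of how the probability of a context word o given a center word c is computed from vector similarity; the vocabulary and vectors are made up, and the names v_c and u_o for center and context vectors follow the usual skip-gram formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary (assumption)
dim = 8                                          # embedding dimension
V = {w: rng.normal(size=dim) for w in vocab}     # center-word vectors v_c
U = {w: rng.normal(size=dim) for w in vocab}     # context-word vectors u_o

def p_context_given_center(o, c):
    """P(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)."""
    scores = np.array([U[w] @ V[c] for w in vocab])
    probs = np.exp(scores - scores.max())        # numerically stable softmax
    probs /= probs.sum()
    return probs[vocab.index(o)]

print(p_context_given_center("cat", "sat"))
```

Training nudges V and U so that this probability becomes high for word pairs that actually co-occur in the corpus.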

To achieve the best Word2vec results, stop words (words with a very high frequency of occurrence, in English: a, the, of, then) are removed from the dataset. This improves model accuracy and shortens training time. In addition, negative sampling is used: for each training example the weights are updated for the correct label and for only a small random sample of incorrect (negative) labels, rather than for the entire vocabulary.
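
As a rough sketch of this setup with the gensim library (parameter names follow gensim 4.x; the two-sentence corpus is only a stand-in, and nltk.download("stopwords") is needed once beforehand):

```python
from gensim.models import Word2Vec
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))           # a, the, of, then, ...
raw_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]
# Drop high-frequency stop words before training.
sentences = [[w for w in s if w not in stop] for s in raw_sentences]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every word in this tiny corpus
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    negative=5,        # number of negative samples per training example
)
print(model.wv.most_similar("cat"))
```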

Skip-Gram: a context window of k words around a center word is considered. The neural network receives the center word as input and is trained to predict the words surrounding it in that window. As a result, if two words regularly share a similar context in the corpus, they end up with similar vectors.

Continuous Bag of Words (CBOW): many sentences from the corpus are taken, and every time the algorithm sees a word, its neighboring words are collected. These context words are fed into the neural network, which predicts the word at the center of that context. Each (context, center word) pair gives one training instance, and over the corpus thousands of such instances make up the dataset for the neural network (a toy sketch of how both schemes slice a sentence into training examples is given below). After training, the output of the encoded hidden layer serves as the embedding for a given word. Because the network is trained on a large number of sentences, words that appear in similar contexts are assigned similar vectors.

The main complaint about Skip-Gram and CBOW is that they are window-based models, which make inefficient use of the co-occurrence statistics of the corpus and can therefore give suboptimal results.
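
The sketch mentioned above, showing how the two window-based schemes turn one sentence into training examples (the sentence and window radius are arbitrary choices for illustration):

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
k = 2  # context window radius

skipgram_pairs = []   # (center word, context word) pairs
cbow_pairs = []       # (context words, center word) pairs
for i, center in enumerate(sentence):
    lo, hi = max(0, i - k), min(len(sentence), i + k + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    cbow_pairs.append((context, center))
    skipgram_pairs.extend((center, c) for c in context)

print(skipgram_pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[1])       # (['the', 'brown', 'fox'], 'quick')
```

In both cases these pairs are what the neural network actually trains on; the direction of prediction (center to context, or context to center) is the only difference between the two schemes.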