
Finding Synonyms and Analogies

In the “Implementation of Word2vec” section, we trained a word2vec word embedding model on a small-scale data set and searched for synonyms using the cosine similarity of word vectors. In practice, word vectors pre-trained on a large-scale corpus can often be applied to downstream natural language processing tasks. This section will demonstrate how to use these pre-trained word vectors to find synonyms and analogies. We will continue to apply pre-trained word vectors in later chapters.

Using Pre-trained Word Vectors

MXNet’s contrib.text package provides functions and classes related to natural language processing (see the GluonNLP tool package [1] for more details). Next, let us look at the names of the pre-trained word embeddings it currently provides.

In [1]:
from mxnet import nd
from mxnet.contrib import text

text.embedding.get_pretrained_file_names().keys()
Out[1]:
dict_keys(['glove', 'fasttext'])

Given the name of a word embedding, we can see which pre-trained models it provides. The models may differ in word vector dimension and in the data sets used for pre-training.

In [2]:
print(text.embedding.get_pretrained_file_names('glove'))
['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt', 'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt', 'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt']
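
Similarly, we can list the pre-trained files provided for fastText in the same way (a short sketch; the exact file names printed depend on the installed package version):

# List the pre-trained files provided for the fastText embedding.
print(text.embedding.get_pretrained_file_names('fasttext'))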

The general naming convention for pre-trained GloVe models is “model.(data set.)number of words in data set.word vector dimension.txt”. For more information, please refer to the GloVe and fastText project sites [2,3]. Below, we use 50-dimensional GloVe word vectors pre-trained on a Wikipedia subset. The corresponding word vectors are automatically downloaded the first time we create a pre-trained word vector instance.

In [3]:
glove_6b50d = text.embedding.create(
    'glove', pretrained_file_name='glove.6B.50d.txt')

Print the dictionary size. The dictionary contains 400,000 words and a special unknown token.

In [4]:
len(glove_6b50d)
Out[4]:
400001
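
With the package’s default settings, the special unknown token is '<unk>' and occupies index 0 of the dictionary (this is an assumption about the defaults; adjust if a different unknown token was configured):

# The special unknown token; with the default settings this is '<unk>'.
glove_6b50d.idx_to_token[0]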

We can use a word to get its index in the dictionary, or we can get the word from its index.

In [5]:
glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]
Out[5]:
(3367, 'beautiful')
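
We can also look up the word vector itself with get_vecs_by_tokens, which we will rely on below; for glove.6B.50d.txt each vector should have 50 dimensions:

# Look up the vector for 'beautiful'; its shape should be (1, 50).
glove_6b50d.get_vecs_by_tokens(['beautiful']).shape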

Applying Pre-trained Word Vectors

Below, we demonstrate the application of pre-trained word vectors, using GloVe as an example.

Seeking Synonyms

Here, we re-implement the algorithm used to search for synonyms by cosine similarity introduced in the “Implementation of Word2vec” section. In order to reuse the logic for seeking the \(k\) nearest neighbors when seeking analogies, we encapsulate this part of the logic separately in the knn (\(k\)-nearest neighbors) function.

In [6]:
def knn(W, x, k):
    # Cosine similarity between the query vector x and every row vector of W.
    cos = nd.dot(W, x.reshape((-1,))) / (
        nd.sum(W * W, axis=1).sqrt() * nd.sum(x * x).sqrt())
    # Indices of the k largest cosine similarities, plus the similarity values.
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]
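
As a quick sanity check (a minimal sketch, not part of the original notebook), we can compute the cosine similarity between two individual word vectors directly; the value should match what get_similar_tokens prints for “chip” and “chips” below:

# Cosine similarity between the vectors for 'chip' and 'chips'.
a = glove_6b50d.get_vecs_by_tokens(['chip']).reshape((-1,))
b = glove_6b50d.get_vecs_by_tokens(['chips']).reshape((-1,))
(nd.dot(a, b) / (a.norm() * b.norm())).asscalar()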

Then, we search for synonyms through the pre-trained word vector instance embed.

In [7]:
def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed.get_vecs_by_tokens([query_token]), k+2)
    for i, c in zip(topk[2:], cos[2:]):  # Remove input words and unknown words.
        print('cosine sim=%.3f: %s' % (c, (embed.idx_to_token[i])))

The dictionary of the pre-trained word vector instance glove_6b50d, created earlier, contains 400,000 words and a special unknown token. Excluding the input word and unknown words, we search for the three words whose meanings are most similar to “chip”.

In [8]:
get_similar_tokens('chip', 3, glove_6b50d)
cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics

Next, we search for the synonyms of “baby” and “beautiful”.

In [9]:
get_similar_tokens('baby', 3, glove_6b50d)
cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl
In [10]:
get_similar_tokens('beautiful', 3, glove_6b50d)
cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful

Seeking Analogies

In addition to seeking synonyms, we can also use pre-trained word vectors to seek analogies between words. For example, “man”:“woman”::“son”:“daughter” is an example of an analogy: “man” is to “woman” as “son” is to “daughter”. The problem of seeking analogies can be defined as follows: for four words in the analogical relationship \(a : b :: c : d\), given the first three words, \(a\), \(b\), and \(c\), we want to find \(d\). Assume the word vector for the word \(w\) is \(\text{vec}(w)\). To solve the analogy problem, we need to find the word vector that is most similar to the result of \(\text{vec}(c)+\text{vec}(b)-\text{vec}(a)\).

In [11]:
def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 2)
    return embed.idx_to_token[topk[1]]  # Remove unknown words.

Verify the “male-female” analogy.

In [12]:
get_analogy('man', 'woman', 'son', glove_6b50d)
Out[12]:
'daughter'

“Capital-country” analogy: “beijing” is to “china” as “tokyo” is to what? The answer should be “japan”.

In [13]:
get_analogy('beijing', 'china', 'tokyo', glove_6b50d)
Out[13]:
'japan'

“Adjective-superlative adjective” analogy: “bad” is to “worst” as “big” is to what? The answer should be “biggest”.

In [14]:
get_analogy('bad', 'worst', 'big', glove_6b50d)
Out[14]:
'biggest'

“Present tense verb-past tense verb” analogy: “do” is to “did” as “go” is to what? The answer should be “went”.

In [15]:
get_analogy('do', 'did', 'go', glove_6b50d)
Out[15]:
'went'

Summary

  • Word vectors pre-trained on a large-scale corpus can often be applied to downstream natural language processing tasks.
  • We can use pre-trained word vectors to seek synonyms and analogies.

Problems

  • Test the fastText results. It is worth mentioning that fastText provides pre-trained Chinese word vectors (pretrained_file_name='wiki.zh.vec'); a loading sketch follows this list.
  • If the dictionary is extremely large, how can we improve the synonym and analogy search speed?
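
For the first problem, a minimal sketch of how one might load the fastText Chinese word vectors through contrib.text (the file is downloaded automatically on first use and can be large):

# Create a pre-trained fastText Chinese word vector instance.
fasttext_zh = text.embedding.create(
    'fasttext', pretrained_file_name='wiki.zh.vec')

The resulting instance can then be passed as the embed argument to get_similar_tokens and get_analogy, just like glove_6b50d.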

Reference

[1] GluonNLP tool package. https://gluon-nlp.mxnet.io/

[2] GloVe project website. https://nlp.stanford.edu/projects/glove/

[3] fastText project website. https://fasttext.cc/
