In 2013, Mikolov et al. published ‘Distributed Representations of Words and Phrases and their Compositionality‘, a paper about a new approach to representing words as dense vectors. This was an improvement over the alternative of representing words as one-hot vectors, because dense vector embeddings encode some of the meaning of the words they represent. In other terms, words with similar meanings are close to each other in the embedding’s vector space. For example, “blue” would be close to “red” but far from “cat”. A commonly used name for their approach is word2vec.
Even more surprisingly, the embedding seems to be able to encode some word analogies. For example, word2vec(“king”) – word2vec(“man”) + word2vec(“woman”) is very close to word2vec(“queen”).
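As a quick illustration, here is a minimal sketch using gensim and its downloader for pretrained word2vec vectors. The model name is one of gensim’s standard downloadable models; the exact similarity values depend on the vectors you load.

```python
# Minimal sketch: exploring word2vec similarities and analogies with gensim.
# Assumes gensim is installed and the pretrained model below is available via
# gensim's downloader; any word2vec-style KeyedVectors works the same way.
import gensim.downloader as api

# Load pretrained word vectors (word2vec trained on Google News).
wv = api.load("word2vec-google-news-300")

# Words with similar meaning are close in the vector space.
print(wv.similarity("blue", "red"))   # comparatively high
print(wv.similarity("blue", "cat"))   # comparatively low

# The famous analogy: king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```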
However, word2vec was only designed for words. Several approaches aim to extend the idea of mapping text to fixed-length embeddings from words to larger structures like phrases, sentences, paragraphs or even whole documents. This article gives a (probably incomplete) overview of methods for generating sentence embeddings that cluster sentences with similar meanings.
Approaches for sentence embeddings
Le and Mikolov, 2014, Distributed Representations of Sentences and Documents [gensim doc2vec]
The inventor of word2vec was also among the first to work on sentence embeddings. Their paper ‘Distributed Representations of Sentences and Documents‘ presents a method that learns a vector representing a phrase / sentence / paragraph / document. This vector is optimized to predict the words of the text it represents. The resulting vectors can be used for tasks like paraphrase detection, document clustering, information retrieval, summarization, etc.
There is a Python implementation called Doc2Vec in gensim. You can train models on an arbitrary text corpus and experiment with different parameters, as sketched below.
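A minimal sketch of training Doc2Vec with gensim follows. The toy corpus and hyperparameters are placeholders; useful embeddings require a much larger corpus and some tuning.

```python
# Minimal sketch: training a Doc2Vec model with gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the cat sat on the mat",
    "a dog barked at the mailman",
    "the kitten rested on the rug",
]

# Each document gets a unique tag; gensim learns one vector per tag.
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(corpus)]

model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=100)

# Infer a vector for a new, unseen sentence.
new_vector = model.infer_vector("the cat lay on the rug".split())

# Find the training documents most similar to the new sentence
# (gensim 4.x API; older versions use model.docvecs instead of model.dv).
print(model.dv.most_similar([new_vector], topn=2))
```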
Kiros et al., 2015, Skip-Thought Vectors
Skip-Thought Vectors are another method for mapping sentences to vectors. A recurrent encoder (the paper uses GRUs) reads the word sequence of a sentence, and its output is the sentence embedding. During training, this embedding is fed into two recurrent decoders, one that predicts the previous sentence and one that predicts the next sentence. A very good explanation of their training approach can be found here.
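The following is a highly simplified PyTorch sketch of the skip-thought idea, not the authors’ implementation: the decoders here only receive the sentence embedding as their initial hidden state, which simplifies the conditional GRU used in the paper, and the vocabulary size, dimensions and toy batch are placeholder assumptions.

```python
# Simplified skip-thought sketch: a GRU encoder produces the sentence embedding,
# and two GRU decoders are trained to predict the previous and next sentence.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM = 10000, 300, 600

class SkipThought(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.encoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.decode_prev = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.decode_next = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, sentence, prev_sentence, next_sentence):
        # Encode the middle sentence; the final hidden state is the embedding.
        _, h = self.encoder(self.embed(sentence))  # h: (1, batch, HID_DIM)

        # Both decoders start from the sentence embedding and predict the
        # words of the previous and next sentence, respectively.
        prev_out, _ = self.decode_prev(self.embed(prev_sentence), h)
        next_out, _ = self.decode_next(self.embed(next_sentence), h)
        return self.out(prev_out), self.out(next_out), h.squeeze(0)

model = SkipThought()
# Toy batch of word-id sequences (batch size 2, sequence length 5).
s      = torch.randint(0, VOCAB_SIZE, (2, 5))
s_prev = torch.randint(0, VOCAB_SIZE, (2, 5))
s_next = torch.randint(0, VOCAB_SIZE, (2, 5))
prev_logits, next_logits, sentence_embedding = model(s, s_prev, s_next)
print(sentence_embedding.shape)  # torch.Size([2, 600])
```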
Kusner et al., 2015, From Word Embeddings To Document Distances
This approach is a bit different from the others, as it does not compute sentence embeddings directly. Instead, it uses word embeddings to compare two sentences in terms of similarity. They introduce a distance metric called Word Mover’s Distance (WMD), which matches the words of two input sentences so that the total distance between the embeddings of the matched word pairs is minimized.
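gensim exposes this metric as wmdistance on word vectors. A minimal sketch follows; the pretrained model name is an assumption, and wmdistance needs gensim’s optional optimal-transport dependency (POT) installed.

```python
# Minimal sketch: comparing sentences with Word Mover's Distance in gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # any word embedding model works

sentence_1 = "obama speaks to the media in illinois".split()
sentence_2 = "the president greets the press in chicago".split()
sentence_3 = "the cat sat on the mat".split()

# Lower distance means more similar sentences.
print(wv.wmdistance(sentence_1, sentence_2))  # should be comparatively small
print(wv.wmdistance(sentence_1, sentence_3))  # should be larger
```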
More to come.