Lemmatization is the process of converting words (e.g. in a sentence) to their lemma, i.e. their dictionary base form, while respecting their context. For example, the sentence “You are not better than me” would become “You be not good than me”. This is useful for NLP preprocessing, for example when training doc2vec models. The Python module nltk.stem contains a class called WordNetLemmatizer. In order to use it, one must provide both the word and its part-of-speech tag (adjective, noun, verb, …) because lemmatization is highly dependent on context.
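To see why the POS tag matters, here is a minimal sketch using [python]WordNetLemmatizer[/python] directly (assuming the WordNet corpus has already been downloaded via [python]nltk.download('wordnet')[/python]):

[python]
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# Without an explicit POS tag, lemmatize() treats the word as a noun
print(lemmatizer.lemmatize("better"))               # better
# With the correct tag, the word is reduced to its actual lemma
print(lemmatizer.lemmatize("better", wordnet.ADJ))  # good
print(lemmatizer.lemmatize("are", wordnet.VERB))    # be
[/python]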
To my knowledge, there is no pre-defined function that takes a whole sentence and outputs the lemmatized sentence. Therefore I came up with this code:
[python]
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def nltk2wn_tag(nltk_tag):
    # Map Penn Treebank tags (as produced by nltk.pos_tag) to WordNet POS constants
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    # Tokenize and POS-tag the sentence, then convert the tags to WordNet POS tags
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)
    res_words = []
    for word, tag in wn_tagged:
        if tag is None:
            # No WordNet POS tag available: keep the word as-is
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))
    return " ".join(res_words)
[/python]
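Note that this assumes the required NLTK resources (the punkt tokenizer models, the averaged_perceptron_tagger tagger and the wordnet corpus) have already been fetched with [python]nltk.download()[/python]. A quick usage example:

[python]
# One-time downloads (resource names may differ slightly between NLTK versions)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

print(lemmatize_sentence("You are not better than me"))
# You be not good than me
[/python]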
The first step is to convert the sentence into a list of tuples, where each tuple contains a word and its part-of-speech tag. Since [python]WordNetLemmatizer[/python] expects a different kind of POS tag, we have to convert the tags generated by [python]nltk.pos_tag()[/python] to those expected by [python]WordNetLemmatizer.lemmatize()[/python]. This is done in [python]nltk2wn_tag()[/python].
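For the example sentence, the intermediate result looks roughly like this (the exact tags depend on the tagger model, so treat this as an illustration rather than guaranteed output):

[python]
nltk.pos_tag(nltk.word_tokenize("You are not better than me"))
# [('You', 'PRP'), ('are', 'VBP'), ('not', 'RB'),
#  ('better', 'JJR'), ('than', 'IN'), ('me', 'PRP')]

nltk2wn_tag('VBP')  # 'v' (wordnet.VERB)
nltk2wn_tag('JJR')  # 'a' (wordnet.ADJ)
[/python]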
Some POS tags have no WordNet equivalent (pronouns, prepositions, determiners, …); for words with such tags the lemmatized form would not differ from the original word anyway. For these, [python]nltk2wn_tag()[/python] returns None and [python]lemmatize_sentence()[/python] simply copies them from the input to the output sentence.
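For instance, pronouns and prepositions such as “me” or “than” receive Penn Treebank tags (PRP, IN) that [python]nltk2wn_tag()[/python] cannot map to a WordNet POS, so they end up in the output unchanged:

[python]
nltk2wn_tag('PRP')             # None – no WordNet POS for pronouns
lemmatize_sentence("than me")  # 'than me' – both words copied as-is
[/python]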