Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs.
These "word classes" are not just the idle invention of grammarians, but are useful categories for many language processing tasks.
This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not lists of words.

Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first.
To begin with, we construct a list of bigrams whose members are themselves word-tag pairs. Note that the items being counted in the frequency distribution are word-tag pairs.
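The idea can be sketched in a few lines of plain Python. A tiny hand-tagged sentence stands in here for a real tagged corpus such as Brown, and `collections.Counter` stands in for NLTK's frequency distribution; the words and tags are invented for illustration:

```python
from collections import Counter

# A tiny hand-tagged sample standing in for a real tagged corpus
# (tags follow the universal tagset: DET, ADJ, NOUN, VERB, ...).
tagged = [('The', 'DET'), ('quick', 'ADJ'), ('fox', 'NOUN'),
          ('saw', 'VERB'), ('the', 'DET'), ('lazy', 'ADJ'),
          ('dog', 'NOUN')]

# Bigrams whose members are themselves word-tag pairs.
word_tag_pairs = list(zip(tagged, tagged[1:]))

# For each bigram whose second member is a noun, count the tag
# of the preceding word.
noun_preceders = Counter(a[1] for (a, b) in word_tag_pairs
                         if b[1] == 'NOUN')

print(noun_preceders.most_common())  # adjectives precede both nouns here
```

Sorting the counts with `most_common()` gives the parts of speech that precede a noun, most frequent first.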
In this section, we will see how to represent such mappings in Python.
A tagged word is an association between a word and a part-of-speech tag.
A word frequency table allows us to look up a word and find its frequency in a text collection.
In all these cases, we are mapping from names to numbers, rather than the other way around as with a list.
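For instance, a word frequency table is naturally represented as a Python dictionary, keyed by word rather than by integer position. A minimal sketch, with an invented text sample and `collections.Counter` (a dictionary subclass) doing the counting:

```python
from collections import Counter

# An invented text sample, split into words.
text = "the cat sat on the mat the end".split()

# Counter maps each word (a name) to its frequency (a number).
freq = Counter(text)

print(freq['the'])          # look up a word, get back its count
print(freq.most_common(1))  # the single most frequent word
```

Unlike a list, which maps integer indices to items, the dictionary here lets us go directly from a word to its count.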
Note that part-of-speech tags have been converted to uppercase; this has been standard practice since the Brown Corpus was published.
Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan.
By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag.
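For example, the tagged token fly/NN corresponds to the tuple ('fly', 'NN'). NLTK provides a helper for building such tuples from the string form; a minimal pure-Python equivalent might look like this (the function name mirrors NLTK's, but this version is a sketch, not the library's implementation):

```python
def str2tuple(s, sep='/'):
    """Convert a string like 'fly/NN' into the tuple ('fly', 'NN').

    Split on the LAST separator so words that themselves contain '/'
    survive intact, and uppercase the tag, matching the convention
    noted above.
    """
    word, _, tag = s.rpartition(sep)
    return (word, tag.upper())

print(str2tuple('fly/NN'))   # ('fly', 'NN')
print(str2tuple('1/2/cd'))   # ('1/2', 'CD')
```

Splitting on the last separator is the design choice that makes fractions and dates written with slashes come through unscathed.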