1) Using one of the corpora from the last lab:
a) Calculate the average number of tokens per sentence.
b) Using the same or a different corpus, determine which category has the longest sentences on average, and which has the shortest.
2) Download your own corpus from https://www.gutenberg.org/
a) How many sentences are in the document (use NLTK to split the sentences)? How does this differ from the number of lines in the file (readlines)?
b) After tokenizing the sentences, find 3 errors and describe why you think each error might have occurred. What in the algorithm might have gone wrong?