Exercise 0 (ungraded). Look up and read the translation of lorem ipsum!
Data cleaning. Like most data in the real world, this dataset is noisy. It has both uppercase and lowercase letters, words have repeated letters, and there are all sorts of non-alphabetic characters.
For our analysis, we should keep all the letters and spaces (so we can identify distinct words), but we should ignore case and ignore repetition within a word.
For example, the eighth word of this text is “error.” As an itemset, it consists of the three unique letters, {e, o, r}.
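As a quick illustration of the itemset idea, Python's built-in set collapses repeated letters automatically. This is only a sketch; the variable names word and itemset are chosen here for illustration and are not part of the assignment.

```python
# Treat a single (already lowercase) word as an itemset of its letters.
word = "error"

# set() keeps one copy of each distinct character.
itemset = set(word)

# Sort only for a stable display order; sets themselves are unordered.
print(sorted(itemset))  # ['e', 'o', 'r']
```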
Exercise 1 (normalize_string_test: 2 points). Complete the following function, normalize_string(s). The input s is a string (str object). The function should return a new string with (a) all characters converted to lowercase and (b) all non-alphabetic, non-whitespace characters removed.
Clarification. Scanning the sample text, latin_text, you may see things that look like special cases. For instance, inci[di]dunt and [do]. For these, simply remove the non-alphabetic characters and only separate the words if there is explicit whitespace.
For instance, inci[di]dunt would become incididunt (as a single word) and [do] would become do as a standalone word because the original string has whitespace on either side. A period or comma without whitespace would, similarly, just be treated as a non-alphabetic character inside a word unless there is explicit whitespace. So e pluribus.unum basium would become e pluribusunum basium even though your common-sense understanding might separate pluribus and unum.
Hint. Treat anything “whitespace-like” as a whitespace character. That is, consider not just regular spaces but also tabs, newlines, and perhaps others. To detect whitespace easily, look for a “high-level” function that can help you do so rather than checking for literal space characters.
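One possible way to meet the Exercise 1 specification is sketched below. It assumes that the built-in string methods str.isalpha() and str.isspace() are acceptable as the “high-level” checks the hint alludes to; note that str.isalpha() also accepts non-ASCII letters, which may or may not match the grader's definition of “alphabetic.”

```python
def normalize_string(s):
    """Lowercase s and drop every character that is neither a letter nor whitespace.

    Sketch only: relies on str.isalpha() / str.isspace(), which is an
    assumption about what counts as alphabetic and whitespace-like.
    """
    assert isinstance(s, str)
    # Keep letters and any whitespace-like character (space, tab, newline, ...).
    return "".join(c for c in s.lower() if c.isalpha() or c.isspace())


# The special cases from the clarification above:
print(normalize_string("inci[di]dunt [do]"))       # incididunt do
print(normalize_string("e pluribus.unum basium"))  # e pluribusunum basium
```

Whitespace is passed through unchanged here, so tabs and newlines in the input remain tabs and newlines in the output.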
Exercise 2 (get_normalized_words_test: 1 point). Implement the following function, get_normalized_words(s). It takes as input a string s (i.e., a str object). It should return a list of the words in s, after normalization per the definition of normalize_string(). (That is, the input s may not be normalized yet.)
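A minimal sketch of Exercise 2 follows. It assumes that splitting on runs of whitespace with str.split() (called with no arguments) is the intended notion of a “word.” A compact normalize_string() is repeated so that the block runs on its own; it is one possible implementation, not necessarily the reference one.

```python
def normalize_string(s):
    # Assumed helper: lowercase, then keep only letters and whitespace.
    return "".join(c for c in s.lower() if c.isalpha() or c.isspace())


def get_normalized_words(s):
    """Return the list of words in s after normalization."""
    assert isinstance(s, str)
    # split() with no arguments splits on any run of whitespace
    # and discards empty strings, so repeated spaces are harmless.
    return normalize_string(s).split()


# Made-up sample input (not taken from the dataset):
print(get_normalized_words("E pluribus.unum  basium\n[do]"))
# ['e', 'pluribusunum', 'basium', 'do']
```

Normalizing before splitting matters: punctuation that stands alone between spaces, such as a lone hyphen, disappears entirely instead of producing an empty word.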