Hands-on Project: Word Count
August 02, 2018
In this project, you’ll write a program to calculate and display the number of times each word appears in Life, the Universe and Everything (or any other book). In the process, you’ll practice almost all the Python skills you have acquired over the course, including command-line arguments, file input-output, strings and string methods, functions, lists and dictionaries, custom sorting, loops and if-statements. Let’s get started! (Sidenote: Life, the Universe and Everything is the third book in The Hitchhiker’s Guide to the Galaxy science fiction series by Douglas Adams.)
Note: You need to have Python installed on your computer to be able to do this project.
Overview
Although we provide guidance and instructions, you’ll be writing all the code for this project yourself. Your program should read text from a file and output the number of times each word appears in that file.
Here’s a sample of the expected interaction. The following command:
python3 wordcount.py hitch3.txt
should produce two output files, most_popular.txt and alphabetical.txt:
most_popular.txt
the 3076
and 1599
of 1490
to 1371
a 1344
he 1248
it 1061
was 917
... more lines ...
robot 55
any 54
made 54
will 54
eyes 53
how 53
too 53
anything 51
galaxy 51
mind 51
round 51
got 50
nothing 50
rather 50
right 50
being 49
sky 49
... many more lines ...
alphabetical.txt
' 11
'cos 2
'em 1
'strue 1
- 83
--indeed 1
1 1
10 1
108 4
11 1
... more lines ...
about 173
above 20
abrupt 1
abruptly 1
absence 2
absolute 5
absolutely 2
abstractedly 1
... many more lines ...
Keep reading for more detailed instructions. If you feel confident, try downloading the data and working without the rest of the instructions (or using as little of them as needed). Once you’re done with coding, take the quiz!
Guidelines
- Step 1: Download the data
- You can download the data from the following URL: hitch3.txt. This file contains a plain-text version of Life, the Universe and Everything, the third book in The Hitchhiker’s Guide to the Galaxy science fiction series. (A sample is included below.)
- Tip: I suggest creating another file, say hitch3small.txt, which only has the first 50 lines or so. It will make it easier to print out what your code is doing and look at the output.
- Step 2: Get the filename from the command line and read the input
- Use the sys module to get the filename from the command line, and then read the file. Relevant tutorials:
- Python3 Modules and Command-line execution
- Python3 Sorting and File input-output
- Tip: After every step, keep printing out the values of your intermediate variables to check if everything is working as you expect it to.
- Step 3: Split the text into a list of words
- For this exercise, we’ll define a word as any sequence consisting of letters (a-z, A-Z), digits (0-9), apostrophes (') or hyphens (-). For example, "You're a jerk, Dent," it said simply. has the following words: You're, a, jerk, Dent, it, said and simply.
- You might find it helpful to define is_legal(chr) which returns True if chr is among the characters mentioned above.
- Relevant tutorial: Python3 Lists and Loops
- Step 4: Count the number of occurrences of each word.
- When counting, convert words to lowercase. So Hello, HELLO and hello are all counted towards hello.
- Relevant tutorial: Python3 Dictionaries and Tuples
- Step 5: Sort the items by word count and output to most_popular.txt
- Most popular first. Break ties by alphabetical ordering. In the example above, robot comes before any because the word count for robot is higher. But any comes before made because they have the same word count, and any is earlier in alphabetical order.
- Relevant tutorial: Python3 Sorting and File input-output
- Step 6: Sort the words by alphabetical order and output to alphabetical.txt
- Step 7: Double check everything works as expected and take the quiz!
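If you get stuck on Steps 3 to 6, here is one possible sketch of the core logic. It is not the official solution (that comes with the quiz). The helper names split_words, count_words, most_popular, alphabetical and format_lines are made up for illustration; only is_legal comes from the guidelines above, and its parameter is called ch here to avoid shadowing Python's built-in chr. The sketch assumes the text has already been read into a string.

```python
import string

# Characters allowed inside a word, per Step 3: letters, digits, ' and -
LEGAL_CHARS = set(string.ascii_letters + string.digits + "'-")


def is_legal(ch):
    """Return True if ch may be part of a word."""
    return ch in LEGAL_CHARS


def split_words(text):
    """Split text into maximal runs of legal characters (Step 3)."""
    words = []
    current = []
    for ch in text:
        if is_legal(ch):
            current.append(ch)
        elif current:
            # An illegal character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        # Do not lose a word that runs up to the end of the text.
        words.append("".join(current))
    return words


def count_words(words):
    """Count occurrences of each word, ignoring case (Step 4)."""
    counts = {}
    for word in words:
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


def most_popular(counts):
    """Highest count first; ties broken alphabetically (Step 5)."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def alphabetical(counts):
    """(word, count) pairs in alphabetical order of the word (Step 6)."""
    return sorted(counts.items())


def format_lines(items):
    """Turn (word, count) pairs into lines such as 'the 3076'."""
    return [f"{word} {count}" for word, count in items]


sample = '"You\'re a jerk, Dent," it said simply.'
print(split_words(sample))
# ["You're", 'a', 'jerk', 'Dent', 'it', 'said', 'simply']
```

Writing the lines from format_lines(most_popular(counts)) to most_popular.txt and those from format_lines(alphabetical(counts)) to alphabetical.txt would then cover the output part of Steps 5 and 6. The sort key (-count, word) sorts by descending count and breaks ties alphabetically in a single call to sorted.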
Sample of hitch3.txt
The file begins as follows:
Douglas Adams
Life, the Universe, and Everything
=================================================================
Douglas Adams The Hitch Hiker's Guide to the Galaxy
Douglas Adams The Restaurant at the End of the Universe
Douglas Adams Life, the Universe, and Everything
Douglas Adams So long, and thanks for all the fish
=================================================================
Life, the universe and everything
for Sally
=================================================================
Chapter 1
The regular early morning yell of horror was the sound of Arthur
Dent waking up and suddenly remembering where he was.
It wasn't just that the cave was cold, it wasn't just that it was
damp and smelly. It was the fact that the cave was in the middle
of Islington and there wasn't a bus due for two million years.
...
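To follow the tip from Step 1, a smaller test file can be cut from the start of the book. The sketch below is one way to do it; the function name make_small_copy is made up for illustration, and it assumes the downloaded file can be read as UTF-8 text.

```python
from itertools import islice


def make_small_copy(src_path, dst_path, num_lines=50):
    """Copy the first num_lines lines of src_path into dst_path."""
    # Assumption: the input is UTF-8 encoded plain text.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after num_lines lines without reading the whole file.
        dst.writelines(islice(src, num_lines))


# Example call (assumes hitch3.txt is in the current directory):
# make_small_copy("hitch3.txt", "hitch3small.txt")
```

If the file has fewer than num_lines lines, the copy simply contains all of them.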
Solution
The solution to this project is included at the end of the quiz.