Assignments‎ > ‎

HW1b: Scala programming, text manipulation, and regular expressions

DUE: September 7, 2011 by 11am (submission on Blackboard)

Your starting point is a set of stub programs that you can download from the course website in the file These are incomplete implementations that handle a few things for you (such as code to test your code), and have comments embedded in them to help you solve the problems.

Your submission will be a zipped up file with your implementations for each problem in it. Submit it on Blackboard as <lastname>_<firstname> (tgz is fine too).

Note: there will be some new concepts (e.g. Scala functions) on this homework that were not covered in the tutorials or in class. I have provided hints and guidelines for you to be able to use them, but you should not feel shy to ask for help. A big part of programming is trying to figure out some new things based on what you already know, so give it a try, but at the end of the day, I’m here to help smooth things out for you.

If any of the instructions don’t make sense to you, please get in touch with the instructor right away.

1. List transformations and computations.

For this problem, you need to process a string from the command line, and do some simple transformations using Lists. Modify the stub file lengthSort.scala.

(a) Split the string from the command line (which you get as args(0)) and sort it. The sorting order is first according to length of each token and then alphabetically by the token (so “to” is before “and”, which is before “the”). Print out the list in this sorted order as a single line, with each token converted to the format word:length, e.g. the:3. (See example output below.)

Tip: The split method on Strings returns an Array, and you may want to work with a List instead. Just use toList on the Array to convert it to a List, e.g. “a:b:c:d”.split(“:”).toList.

(b) Working from the sorted list you obtained in (a), now output the list using a similar format, but remix it so that you have:
  • the first group of elements are the tokens of length 3 or less, in ascending order
  • the next group of elements are the tokens of length greater than 3 and less than 11, in descending order
  • the last group of elements are the tokens of length 11 or more, in ascending order
See the example output for guidance on formatting, which is similar to (a).

Example output.

$ scala lengthSort.scala "Introduction to Computational Linguistics introduces the most important data structures and algorithmic techniques underlying computational linguistics"

Sorted: to:2 and:3 the:3 data:4 most:4 important:9 introduces:10 structures:10 techniques:10 underlying:10 Linguistics:11 algorithmic:11 linguistics:11 Introduction:12 Computational:13 computational:13

Remixed: to:2 and:3 the:3 underlying:10 techniques:10 structures:10 introduces:10 important:9 most:4 data:4 Linguistics:11 algorithmic:11 linguistics:11 Introduction:12 Computational:13 computational:13

Hint: You’ll want to use pairs (Tuple2) for a number of reasons. One of the biggest advantages is that they sort first on the first element and then on the second element.

Note: You must make sure to enclose the argument to lengthSort.scala in quotes so that the entire argument is available in the first element of the args array.

2. Text manipulation
Here is a text passage from H.G.Wells, The War of the Worlds (from Project Gutenberg,

After the glimpse I had had of the Martians emerging from the cylinder in which they had come to the earth from their planet, a kind of fascination paralysed my actions. I remained standing knee-deep in the heather, staring at the mound that hid them. I was a battleground of fear and curiosity.

I did not dare to go back towards the pit, but I felt a passionate longing to peer into it. I began walking, therefore, in a big curve, seeking some point of vantage and continually looking at the sand heaps that hid these new-comers to our earth. Once a leash of thin black whips, like the arms of an octopus, flashed across the sunset and was immediately withdrawn, and afterwards a thin rod rose up, joint by joint, bearing at its apex a circular disk that spun with a wobbling motion. What could be going on there?

For this problem, you will perform series of manipulations and operations on this passage. Modify the stub program wotw.scala, looking at the comments in it for guidance. See the output at the end of this question for guidance on what your program’s output should be.

Note: even though this problem is a bit lengthy in its description, the Scala solutions can be expressed in a single line of code for every part! (But it’s okay if you use multiple lines, and in fact I recommend doing so for part (d).)

(a) In the passage, how many word tokens are there that end with the letters “ing”? (Note that we are looking for words ending in “ing”, not just occurrences of the letter sequence “ing” anywhere in a word. You do not need to deal with punctuation. Just count only the words that actually end in “ing”, ignoring those that end in “ing,” or “ing.”.) Print it out as specified in the output below.

Hint: Use the String method split to split a string of text into individual words (and split on sequences of one or more whitespace characters), the string method endsWith to test whether a string ends with a particular substring, and the List method count to find how many words pass that test.

Note: The result of split is an Array, not a List. For present purposes, this will behave similarly to a List, so you don’t need to change anything. However, if you do want a list, call the method toList, e.g. “a:b:c:d”.split(“:”).toList.

(b) Make a list chopped that contains all the words from the above passage, but with the last two letters cut off. Print out a string that concatenates the chopped tokens using a space as the separator, as below. (This is a very crude first stab at making a stemmer, a program that cuts off suffixes and leaves only word stems. We will see improved implementations later.)

Hint: Do this using map on the result of splitting the string from part (a). Also, recall the slice method, which excises a sublist from a List, e.g.:

scala> val foo = List(3,4,0,32,3,9,13)
foo: List[Int] = List(3, 4, 0, 32, 3, 9, 13)

scala> foo.slice(2,5)
res21: List[Int] = List(0, 32, 3)

Strings are also a List-like data structure (they are sequences of characters) and the slice method works for them too.

Note: It’s fine to work with either List[String] or Array[String].

(c ) A common item of interest when we get to language models is the bigram word probability of one word given that another precedes it, such as the probability of seeing the word “the” next given that you’ve just seen the word “at” in a sentence like “I’ll see you at … “ This value is expressed as P(the|at). At its most basic level, this is a simple thing to compute. Given a corpus of texts, the probability is calculated as:

 P(the|at) = count(at,the) / count(at)

In words: out of all the times I’ve seen the word “at”, how many times was it followed by the word “the”? Write the code to compute this from the WotW passage, and print out the result, as below.

Hint: There is a very useful function on lists called sliding, which gives you a sliding window of the values in the list. This is easy to see in an example:

scala> "a b c d".split(" ").sliding(2).toList
res0: List[Array[java.lang.String]] = List(Array(a, b), Array(b, c), Array(c, d))

The argument to sliding is an Int that specifies how many items will be in each window. With such a list of two-element Arrays in hand, you can now use the count function with an appropriately defined predication to compute the numerator of the probability.

(d) Identify how many words in the War of the Worlds passage are “one syllable words.” For this problem, use the (imperfect) heuristic that one syllable words consist of zero or more consonants, followed by one or more vowels, followed by zero or more consonants (i.e., there is only one group of vowels in the word). In answering this question, you should not distinguish between capitalized and non-capitalized words. The vowels to use are a, e, i, o, and u. All other lowercase letters are to be considered consonants.

For this part, use a regular expression on the raw text String, and output the number of one syllable words. You should express the above definition of a vowel pattern, and use \b to indicate a word boundary. Don’t forget to handle capitalization (so “cat”, “Cat” and “CAT” are all one syllable words.)

Hint: Use the method findAllIn of Regex to obtain all the matches as a List.

(e) Now, identify how many one syllable words (using the same definition as for (d) above) there are in the passage, but this time working with the passage as a List of Strings split on whitespace. You should get the same answer as for (d).

Goal output: (with ### substituted for actual numerical values)

$ scala wotw.scala

Part (a)
Number of words ending in 'ing': ###

Part (b)
Aft t glimp  h h  t Martia emergi fr t cylind  whi th h co  t ear fr the plane  ki  fascinati paralys  action  remain standi knee-de  t heathe stari  t mou th h the  w  battlegrou  fe a curiosit  d n da   ba towar t pi b  fe  passiona longi  pe in i  beg walkin therefor   b curv seeki so poi  vanta a continual looki  t sa hea th h the new-come  o eart On  lea  th bla whip li t ar   octopu flash acro t suns a w immediate withdraw a afterwar  th r ro u joi  join beari  i ap  circul di th sp wi  wobbli motio Wh cou  goi  ther

Part (c)
P(the|at) = ###

Part (d)
Number of one syllable words on full String: ###

Part (e)
Number of one syllable words on split String: ###

3. Basic regular expressions
Do problem 2.1 on pages 42-43 of Jurafsky and Martin. Put your solutions in the stub file regex.scala. As you work on them, run regex.scala and you’ll see that it is checking the accuracy of your regular expressions. See the comments in the stub file for details.

4. Regular expressions for capturing date expressions
There are a lot of different ways that you can express a particular date, e.g.:
  • September 7, 2011
  • 09-07-2011
  • 9/7/2011
  • the 7th of September, 2011
One thing one could do with a system that can reliably capture date expressions: to build an application that identifies meeting times in your emails and automatically transfers them to your calendar. (But note that such an application would have to be extremely reliable to be useful.)

In this problem you will use regular expressions to recognize dates, extract their components, and normalize them to a standard YEAR-MONTH-DAY format, e.g. 2011-9-7.

For this problem, modify the stub program dates.scala.

(a) Write a regex FullFormDate that covers date expressions of the form “MONTH DAY, YEAR”, where
  • MONTH is one of January, February, . . . , December
  • DAY is a number between 1 and 31 (You do not need to check that you don’t match February 31 and similar impossible dates.)
  • YEAR is a number between 1900 and 2099, so you only need to check years in the 1900s or 2000s
(b) Write a regex ShortDate that covers date expressions of the form M/D/Y and M-D-Y where
  • M is a number 1 to 12, with single digit numbers optionally preceded by 0, e.g. 01, 02.
  • D is a number 1 to 31, with single digit numbers optionally preceded by 0, e.g. 01, 02.
  • Y is a number 1900 to 2099
Note: You should not match patterns like M/D-Y or M-D/Y.

(c ) Write a regex OrdinalDate that covers date expressions of the form “the ORDINAL of MONTH, YEAR” where ORDINAL is one of 1st, 2nd, 3rd, 4th to 31st (with appropriate use of ‘st’, ‘nd’, ‘rd’, ‘th’).  Make sure not to accept 0th, 33rd, etc. MONTH and YEAR are as in part (a).

(d) Write a function normalizeDate that takes as input a date in any of the formats above and outputs it in normalized format YEAR-MONTH-DAY, e.g. “September 7, 2011”,  “09-07-2011” and “the 7th of September, 2011” all are normalized to 2011-9-7. Then, apply this function to the elements of the command line args array such that it is possible to run the program as following:

$ scala dates.scala "September 7, 2011" "12/27/1911" "4-18-2004" "the 21st of November, 1924"

You should test your program with other date expressions and make sure it does the right thing.
Jason Baldridge,
Aug 31, 2011, 3:28 PM