
HW4: Part-of-speech tagging

DUE: November 5, 2011 by 11am (on Blackboard)

This assignment involves creating and using part-of-speech taggers and working with some real data. In addition to giving you experience working with part-of-speech tags and taggers, the assignment is also designed to give you practical experience with programming in a more realistic environment than you’ve seen so far, including using a build system, accessing classes defined in an API, and working with file input/output.

Note: the description of the problems for this homework is somewhat long -- don’t let this scare you. In fact, much of the text is there to give you documentation and help you know what to do.

Unless you absolutely must, don’t print this page out -- not only will you save paper, it will be far easier to cut-and-paste commands that you need to run.

If any of the instructions or problem descriptions do not make sense to you, please get in touch with the instructor right away.


Getting Scalabha

You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment and to easily compile and run your own code. You can obtain it here:

Note: if you know how to use Mercurial (or are interested in using it), you can just clone the Scalabha repository.

You should unzip the downloaded archive and follow the instructions for installing it in the file scalabha-0.1.1/README. Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations that are relative to that directory.

Note: there will be a tutorial forthcoming shortly that describes how to do some basic things with Scalabha that will prepare you for this homework.

The data

You will work with English part-of-speech tagged datasets that are located in $SCALABHA_DIR/data/postag/english. Go to that directory and have a look at it. It’s the same dataset that I use in another homework about HMM tagging (for my Natural Language Processing class), which I adapted from Jason Eisner. Note that there is also a Czech part-of-speech tagged dataset (which could form an optional exercise).

The tags in the English dataset come from longer, more specific tag names stripped down to their first letters. For example, all kinds of nouns (formerly NN, NNS, NNP, NNPS) are simply tagged as N in this assignment. Using only the first letters reduces the number of tags, speeding things up. (However, it results in a couple of unnatural categories, C and P.)

C    Coordinating conjunction or Cardinal number
D    Determiner
E    Existential there
F    Foreign word
I    Preposition or subordinating conjunction
J    Adjective
L    List item marker (a., b., c., …) (rare)
M    Modal (could, would, must, can, might …)
N    Noun
P    Pronoun or Possessive ending ('s) or Predeterminer
R    Adverb or Particle
S    Symbol, mathematical (rare)
T    The word to
U    Interjection (rare)
V    Verb
W    wh-word (question word)
###  Boundary between sentences
:    Colon, semicolon, or dash
"    Quotation mark
$    Currency symbol

Setting up the code

Unlike previous assignments, where you wrote single-file Scala scripts, you will this time develop your solutions in the context of the Scalabha build system (which uses SBT, the Simple Build Tool). This is a new concept for most of the students in the class, but don’t worry: it will actually make things much easier for you. It will also help you acquire an important set of skills for real software development.

There are just two files in the homework bundle: a stub file Tagging.scala and an answers file hw4_answers.txt. Please modify these files when solving the problems; do not use different names for either. To prepare for working with the assignment, you should do the following steps:



$ unzip


 inflating: src/main/scala/icl/hw4/Tagging.scala  

 inflating: src/main/scala/icl/hw4/hw4_answers.txt  

$ scalabha build compile

The last command should end with a line starting with [success]. If you are having trouble with this, get in touch with the instructor right away.

Your implemented solutions will be done in Tagging.scala. Any portions of problems that begin with Question or request example output should go in hw4_answers.txt.

Tip: while you are working on your solutions, you should definitely take advantage of the ~compile command in SBT. It will compile your code automatically every time you save the file Tagging.scala.

Note that most of the commands suggested in the problems assume that you are running them in the $SCALABHA_DIR/data/postag/english directory.

Submitting your solutions

For submission, create a zip file with the name hw4_<lastname>_<firstname>.zip that contains your src/main/scala/icl/hw4 directory and its contents. Here are some example commands for doing this:


$ zip -r src/main/scala/icl/hw4/

 adding: src/main/scala/icl/hw4/ (stored 0%)

 adding: src/main/scala/icl/hw4/hw4_answers.txt (deflated 1%)

 adding: src/main/scala/icl/hw4/Tagging.scala (deflated 61%)

Make sure that your code compiles before you do this. If it does not, I’ll deduct 20 points right away and ask you to fix it and resubmit.

1. Input and output of tagged sentences (10 points)

In Tagging.scala, fill in the methods read and write of the TaggedFileHelper object.

Part (a)

The read function takes a file name that contains part-of-speech tagged words and produces a List of word/tag pairs. Here is the signature of the function:

def read (filename: String): List[(String, String)]

For example, given a file with the contents:

###/### When/W such/J claims/N and/C

the read function should return

List((###,###), (When,W), (such,J), (claims,N), (and,C))
Suggestion: you may want to define a regular expression as a field of TaggedFileHelper that can parse expressions like When/W into word and tag values.

Suggestion: use to read in the file.
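For instance, the parsing step could be sketched as follows. This is a minimal sketch: the regex and the parseLine helper are illustrative, not part of the stub.

```scala
// Split a token like "When/W" into word and tag at the *last* slash,
// so words that themselves contain a slash stay intact.
val WordTag = """(.+)/([^/]+)""".r

// Illustrative helper: parse one whitespace-separated line of word/tag tokens.
def parseLine(line: String): List[(String, String)] =
  line.split("\\s+").toList.collect {
    case WordTag(word, tag) => (word, tag)
  }

println(parseLine("###/### When/W such/J"))  // -> List((###,###), (When,W), (such,J))
```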

Part (b)

The write function takes a file name and a sequence of word/tag pairs and writes them to the given file name in the word/tag format. Basically, it does the reverse of the read function. Here is its signature:

def write (filename: String, wordTagSequence: List[(String, String)])

Suggestion: use the class to handle the output. Here's an example of it working in a related problem.

val words = List("Here","are","some","words",".")
val out = new"words.txt")
words.foreach(word => out.write(word + "\n"))
out.close()

If you fire up the Scala REPL and put this in, and then exit, you'll find that there is a file called words.txt in the same directory that has the following contents.

Here
are
some
words
.
Part (c)

Test that your implementations of read and write work by running TaggedFileTest’s main method. In the data/postag/english directory, do the following.

$ scalabha run icl.hw4.TaggedFileTest entrain entrain.tmp

You should see the following output:

List((###,###), (When,W), (such,J), (claims,N), (and,C))

Then, verify that the file that was written is the same as the original:

$ diff entrain entrain.tmp

If everything worked correctly, you should see no output from running the diff command. (In other words, the original file and the one you have created by reading and writing are exactly the same.)

Question: did you successfully complete this problem?

2. Using a tagger (20 points)

For this problem, you will enable a (very bad) rule-based tagger (which you will improve in problem 4) and a Hidden Markov Model to assign tags to words in a file.

Look at the TaggerRunner object in Tagging.scala. It has only a main function that is set up to allow different types of taggers to be created and used (which you’ll be filling out over the course of this homework). The first argument to that method (args(0)) is the type of tagger to be used; in the stub, only one type is supported: “ERB”, for English rule based tagger. The method begins like this:

def main (args: Array[String]) {

 val taggerType = args(0)

 val (tagger, evalFileName, outputFileName) = taggerType match {

     // The English rule-based tagger defined in this file.

     case "ERB" => (EnglishRuleBasedTagger, args(1), args(2))


So, if the first argument to TaggerRunner.main is “ERB”, then the tagger is EnglishRuleBasedTagger, the name of the file to evaluate on (to run the tagger on) is given by args(1), and the name of the file to output the tags assigned by the tagger is given by args(2).

What does all this mean? For the first, look at the EnglishRuleBasedTagger object in Tagging.scala. It is an object that extends RuleBasedTagger in the opennlp.scalabha.postag package. This means that EnglishRuleBasedTagger is an actual tagger instance that we can use to assign tags to words and sequences. Its rules are provided directly in the call to the RuleBasedTagger constructor (you’ll be expanding on those in problem 4).

The last two values are straightforward: evalFileName is the name of the file that contains the tokens you will assign tags to with the tagger, and outputFileName is the name of the file where these assigned tags will be written to.

Part (a)

The first thing you must do is enable the EnglishRuleBasedTagger to be used on the file input for evaluation. To begin, run the following command:

$ scalabha run icl.hw4.TaggerRunner ERB entrain entrain.out.erb

This says to invoke the main method of TaggerRunner and select the ERB tagger, using entrain as the evaluation file and entrain.out.erb as the output file. The problem is that there is no code to read in the word/tag sequence from entrain and output the results to entrain.out.erb. Fix this at the point indicated in the code stub for TaggerRunner.main. You will need to use TaggedFileHelper and the tag function of the tagger. Here’s an outline of what you must do:

  • read in the word/tag sequence from the evaluation file

  • strip off the gold tags that are included with the file -- this means you must take the word/tag sequence and get just the word sequence

  • use the tag function of the tagger to get the sequence of tags assigned by the tagger

  • put the word sequence and the assigned tag sequence together and output them
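The steps above can be sketched in code. The toy data and the stand-in tag function are assumptions made so the snippet runs on its own; in your solution the pairs come from, the tagger is the one selected in the match, and the result goes to TaggedFileHelper.write.

```scala
// Toy gold-tagged input standing in for
val goldPairs = List(("###", "###"), ("When", "W"), ("such", "J"))

// Stand-in for the tagger's tag function (here: everything gets "J",
// like the stub tagger's default tag).
def tag(words: List[String]): List[String] = => "J")

val words    =  // strip off the gold tags
val assigned = tag(words)             // tags assigned by the tagger
val output   =      // re-zip words with assigned tags

println(output)  // -> List((###,J), (When,J), (such,J))
```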

After you do this, you should get a file with the words tagged when you run the above command again. Verify that it was done correctly by using the following command and checking that it gives the same result.

$ head -10 entrain.out.erb











Part (b)

The Scalabha class opennlp.scalabha.postag.HmmTagger is an implementation of a Hidden Markov Model. You can look at the code for it by going to:


Enable the use of this tagger by providing code for the “HMM” case in TaggerRunner  such that you can run the HMM tagger as follows:

$ scalabha run icl.hw4.TaggerRunner HMM entrain entest entest.out.hmm

Here, entrain is the file that contains the word/tag sequences used to acquire the HMM’s parameters, entest is the file to assign tags to, and entest.out.hmm is the file to output the HMM’s assigned tags to.

To create an instance of HmmTagger, use the HmmTrainer object’s apply method, which has the following signature:

def apply (wordTagSequence: List[(String, String)], lambda: Double)

So, it conveniently takes a word/tag sequence of the kind that the method produces. The second argument is a small factor for smoothing the transition and emission distributions -- you can just provide the value 0.1 for that.
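Putting this together, the “HMM” case might look like the sketch below. HmmTrainer and read are stubbed locally so the snippet runs on its own; in your solution they are Scalabha's HmmTrainer and, and cmdArgs is main's args parameter.

```scala
// Local stand-ins so the sketch is self-contained; in the homework these
// are opennlp.scalabha.postag.HmmTrainer and
object HmmTrainer {
  def apply(wordTagSequence: List[(String, String)], lambda: Double): String =
    s"tagger trained on ${wordTagSequence.size} pairs (lambda = $lambda)"
}
def read(filename: String): List[(String, String)] =
  List(("###", "###"), ("When", "W"))

val cmdArgs = Array("HMM", "entrain", "entest", "entest.out.hmm")

val (tagger, evalFileName, outputFileName) = cmdArgs(0) match {
  // Train on args(1), evaluate on args(2), write output to args(3).
  case "HMM" => (HmmTrainer(read(cmdArgs(1)), 0.1), cmdArgs(2), cmdArgs(3))
}

println(evalFileName + " -> " + outputFileName)  // -> entest -> entest.out.hmm
```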

After you have enabled the HMM tagger, you should be able to run the above command. Note that it may take from 1-3 minutes to complete the tagging (feel free to tag the smaller file entrain4k while you are getting things to work). The output should match the following:

$ head -10 entest.out.hmm











Part (c)

Question: Briefly list any major challenges you experienced while doing this problem.

3. Scoring predicted tags versus the gold standard (20 points)

As we are creating and working with different taggers for this assignment, we’d like to be able to compare their performance. And, in the case of the improvements you’ll do to the EnglishRuleBasedTagger, you’ll need to see how your rules change performance on a development set as you make changes to the rules.

For this problem, modify the TagScorer object’s main method. After you’ve completed parts (a) and (b), you should obtain the following output when you score the tags assigned by the HMM tagger (from the previous question).

$ scalabha run icl.hw4.TagScorer entest entest.out.hmm

Accuracy = 94.11% (22538/23949)

Most common errors

Num    Word    G  A


18    that     D  I

18    about    R  I

16    Western  N  J

12    more     J  R

10    as       R  I

10    all      D  P

9     out      I  R

9     as       I  R

9     American N  J

8     that     D  W

Part (a)

Modify TagScorer’s main method to compute the per-token accuracy. This just means that you count how many tokens were assigned the correct tag and divide by the total number of tokens.

Note: the ### “words” are not really words, so you should not count them in this calculation.

Make sure that your output matches the accuracy given above, and in the format given. Note that the (22538/23949) part indicates that there were 22,538 correct assignments out of 23,949 tokens.
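The computation itself can be sketched on toy data; in your solution both sequences come from

```scala
// Gold and assigned word/tag sequences (toy data).
val gold     = List(("###", "###"), ("the", "D"), ("dog", "N"), ("ran", "V"))
val assigned = List(("###", "###"), ("the", "D"), ("dog", "J"), ("ran", "V"))

// Drop the ### boundary "words", then count matching tags.
val scored     = (gold zip assigned).filter { case ((word, _), _) => word != "###" }
val numCorrect = scored.count { case ((_, g), (_, a)) => g == a }
val accuracy   = numCorrect.toDouble / scored.size

println("Accuracy = %1.2f%% (%d/%d)".format(100 * accuracy, numCorrect, scored.size))
// -> Accuracy = 66.67% (2/3)
```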

Tip: to get a Double value like 0.9410831349951981 to print as 94.11, you can use the format command. Here’s an example:

scala> val accuracy = 0.9410831349951981

accuracy: Double = 0.9410831349951981

scala> println("%1.2f".format(100*accuracy))
94.11


Part (b)

Extend TagScorer’s main method to output the ten most common error types, in the format given above. Each error type is for a given word that should have a particular gold (G) label but which was assigned (A) an incorrect label by the tagger. The “Num” value is the number of tokens that had that particular error. For example, in the output given above, the word as was incorrectly labeled I 10 times when it should have been labeled R, and it was incorrectly labeled R 9 times when it should have been labeled I.
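One way to sketch the counting, again on toy data:

```scala
// (gold, assigned) token pairs, toy data.
val pairs = List(
  (("as", "R"), ("as", "I")), (("as", "R"), ("as", "I")),
  (("that", "D"), ("that", "I")), (("ok", "J"), ("ok", "J")))

// Keep only the mistagged tokens as (word, gold, assigned) triples.
val errors = pairs.collect { case ((w, g), (_, a)) if g != a => (w, g, a) }

// Count each error type and take the ten most frequent.
val topErrors = errors
  .groupBy(identity)
  .map { case (err, occurrences) => (err, occurrences.size) }
  .toList
  .sortBy { case (_, count) => -count }
  .take(10)

topErrors.foreach { case ((w, g, a), n) => println(s"$n\t$w\t$g\t$a") }
```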

Ensure that you get the same output as above when you run TagScorer on the output of the HMM.

Question: what is the scoring output when you train on entrain4k and evaluate on entest?

Part (c)

Question: Briefly list any major challenges you experienced while doing this problem.

4. Extend the EnglishRuleBasedTagger (25 points)

For this problem, you will modify the EnglishRuleBasedTagger by adding rules that provide tags for specific words and for words matching particular regular expressions. We begin with a bit of explanation.

Here’s the constructor for opennlp.scalabha.postag.RuleBasedTagger:

class RuleBasedTagger (
  exactMatchMap: Map[String, String],
  regexTagList: List[(Regex, String)],
  defaultTag: String
)
The first argument is a Map that associates word types with corresponding tags, for example to indicate that of has the preposition tag “I”, that and has the conjunction tag “C”, and so on. The next argument is a list of regular expressions, each of which is associated with a tag. The last is a default tag to assign to any tokens that aren’t in the exactMatchMap and that aren’t matched by any of the regular expressions.

The stub implementation of EnglishRuleBasedTagger provides some initial arguments to RuleBasedTagger that you will extend and change to improve its performance. Here’s what it looks like:

object EnglishRuleBasedTagger extends RuleBasedTagger (

  // The word-tag map pre-defined in the EnglishTagInfo object.

  List(
    // Label some words explicitly with a tag.
    ("(?i)said|say".r, "V"),
    // Label "interest" as a noun before it gets caught by the regex for -est ending words.
    ("(?i)interest".r, "N"),
    // Regex for labeling words ending "est" as adjectives.
    ("""(?i)(.{4,}est)""".r, "J")
  ),

  // The default tag.
  "J"
)
The first argument value is a map that is pre-defined in the EnglishTagInfo object in PosTagger.scala. Have a look at it: it handles the common closed-class parts-of-speech and many of the words that are members of them, including punctuation. You don’t need to change this. If there are any words missing that you’d like to have added, you can do so in the regex list.

The second argument value is an ordered list of regexes paired with parts-of-speech, a few of which have been defined for you as examples to get you started. Basically, the RuleBasedTagger checks this list in order, and the first regex to match an input word is given the tag associated with the regex. The examples above give you a sense of what each regex/tag pair might look like. Note that you can match exact words, e.g. with said, say, and interest. That can be handy for words that have a highly predominant part-of-speech, or which need to be “rescued” before being matched by a later regex. For example, interest will match the regex for capturing words ending in -est and assigning the adjective tag, so a word-specific regex that precedes the -est regex rescues it.

Note that the -est regex requires that there are at least four characters preceding -est: this stops it from matching words like test and feast. It’s a rough way of saying that we are looking to match -est when it has been used as an adjective-to-adjective derivational suffix, e.g. cool -> coolest.

Tip: the (?i) modifier tells the regex to ignore case, e.g. "(?i)interest".r matches interest, Interest, INTEREST, iNtEResT, and so on.

The final argument value indicates that any words not matching the word-tag map or the regex list will be assigned the adjective label J.

Part (a)

Extend and modify the stub definition to improve performance. Do this iteratively: update the rules, run the tagger on entrain, score the output, and then change the rules based on what you see. For starters, you’d do the following with the stub implementation:

$ scalabha run icl.hw4.TaggerRunner ERB entrain entrain.out.erb

$ scalabha run icl.hw4.TagScorer entrain entrain.out.erb

Accuracy = 52.90% (50748/95936)

Most common errors

Num    Word    G  A


485    million C  J

485    %       N  J

410    Mr.     N  J

259    year    N  J

256    company N  J

237    billion C  J

202    says    V  J

188    that    W  I

186    market  N  J

167    U.S.    N  J

Based on this output, you can start making changes to address the most common errors. Hint: for starters, consider whether the default tag should perhaps be something else. Also, make sure to handle number expressions with an appropriate regex.
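To make the mechanics concrete, here is a self-contained sketch of how first-match regex rules behave. The rules shown are illustrative examples, not a solution, and applyRules only mimics what RuleBasedTagger does internally.

```scala
// Illustrative regex/tag pairs; order matters, first match wins.
val extraRules = List(
  // Number expressions like 5, 3.14, or 1,200 -> cardinal (C).
  ("""-?[\d,]+(\.\d+)?""".r, "C"),
  // Words ending in -ly are usually adverbs.
  ("""(?i).+ly""".r, "R"),
  // Capitalized words are often proper nouns.
  ("""[A-Z].*""".r, "N"))

// First-match lookup with a fall-back default tag.
def applyRules(word: String, default: String): String =
  extraRules.collectFirst {
    case (regex, tag) if regex.pattern.matcher(word).matches => tag
  }.getOrElse(default)

println(applyRules("1,200", "N"))    // -> C
println(applyRules("quickly", "N"))  // -> R
println(applyRules("Western", "J"))  // -> N
```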

Iteratively refine your tagger until its performance is at least 80%. As a reference point, my implementation has 14 rules and gets 86% accuracy on entrain. Note: you should not apply the tagger to the entest file at this time.

Recommendation: provide a comment with each of your regular expressions similar to what is in the stub implementation. This will help me understand what you intend a rule to do, regardless of whether or not it actually does it.

Part (b)

Now that you have developed the tagger using some data to guide the creation of the rules and their ordering, you can see how it performs on a held-out set of data for evaluation. Without changing anything in your EnglishRuleBasedTagger, run it on entest and report the results you obtain from TagScorer. As a reference point, my tagger gets 85.6% accuracy on entest.

Include the TagScorer output on entest in your hw4_answers.txt file.

Part (c)

Question: Briefly list any major challenges you experienced while doing this problem.

5. Learned baseline taggers (25 points)

The rule-based tagger doesn’t use any statistics from entrain (though you did do error-driven development by hand using that file). The HMM does actually learn a probabilistic model from entrain, and achieves much higher accuracy. However, it is a somewhat complex model, so it is good to compare against much simpler baseline models, such as assigning every word the most frequent tag (as determined by the training data) or assigning each word its own most frequent label, to make sure that the complexity was worthwhile.

Part (a)

Create a single tag baseline implementation by creating a BaselineTagger that labels every word with the most frequent tag overall in the training data. Uncomment the “STB” case in TaggerRunner.main and modify it so that it returns a BaselineTagger as the tagger, and args(2) and args(3) as the eval and output files, respectively.

Construct the tagger via the following steps.

  • Modify the BaselineHelper.mftOverall function so that it identifies the most frequent tag overall in the given training set. (The stub just has it return the tag “###”.)

  • BaselineTagger takes a map from words to tags in its constructor, and it uses this map to assign tags deterministically to each word. Create a map that assigns the tag “###” to the “word” “###” and that uses the most frequent tag (as determined by mftOverall) as its default.

  • Construct a BaselineTagger using the map created according to the previous instruction; this is the tagger that the “STB” case in TaggerRunner.main should return.
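The core of mftOverall can be sketched on toy data; in the homework the pairs come from reading the training file.

```scala
// Toy training sequence of word/tag pairs.
val train = List(
  ("###", "###"), ("the", "D"), ("dog", "N"), ("barks", "V"), ("cat", "N"))

// Most frequent tag overall, ignoring the ### boundary tokens.
val mftOverall = train
  .collect { case (word, tag) if word != "###" => tag }
  .groupBy(identity)
  .maxBy { case (_, occurrences) => occurrences.size }
  ._1

println(mftOverall)  // -> N
```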

Check your single tag baseline implementation as follows:

$ scalabha run icl.hw4.TaggerRunner STB entrain entest entest.out.stb

$ scalabha run icl.hw4.TagScorer entest entest.out.stb

You should get an accuracy of near 30%. Include the output from the scoring in the answers file.

Part (b)

Create a most-frequent-tag-per-word baseline implementation by creating a BaselineTagger that labels every word with the tag that occurred most frequently with it in the training data. Uncomment the “PWB” case in TaggerRunner.main and modify it so that it returns a BaselineTagger as the tagger, and args(2) and args(3) as the eval and output files, respectively.

Implementing this will be very similar to the STB case of part (a), except that you should modify the BaselineHelper.mftPerWord function and use it to create the word-tag map for BaselineTagger. You should still use the most frequent tag overall as the default for the map.
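A sketch of building the per-word map, on toy data; the mostFrequent helper is illustrative, not part of the stub.

```scala
// Toy training data.
val train = List(
  ("the", "D"), ("a", "D"), ("run", "V"), ("run", "V"), ("run", "N"), ("the", "D"))

// Illustrative helper: most frequent element of a list of tags.
def mostFrequent(tags: List[String]): String =
  tags.groupBy(identity).maxBy(_._2.size)._1

// Per-word most frequent tag, defaulting to the overall most frequent tag.
val overallDefault = mostFrequent(

val wordTagMap = train
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, mostFrequent( }
  .withDefaultValue(overallDefault)

println(wordTagMap("run"))     // -> V
println(wordTagMap("unseen"))  // -> D
```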

Check your per-word baseline implementation as follows:

$ scalabha run icl.hw4.TaggerRunner PWB entrain entest entest.out.pwb

$ scalabha run icl.hw4.TagScorer entest entest.out.pwb

You should get an accuracy over 90%. Include the output from the scoring in the answers file.

Part (c)

The per-word baseline defaults to the most frequent tag overall, which means that it labels all words that weren’t seen in the training data (e.g. entrain) with the default label. However, the rule based tagger you created earlier can do more -- it can use the patterns you described to label words that weren’t seen before. For example, if there is a word troggiest in the eval file, the stub EnglishRuleBasedTagger will label it an adjective (J).

Construct a BaselineTagger that uses EnglishRuleBasedTagger as the default rather than the most frequent tag overall (as you did for part (b)).  Do this with the “RDB” case of TaggerRunner.main.

Note: this actually requires very little code, but might be a bit tricky for some students. With that in mind, here are some tips.

The default for a Map as supplied with withDefault is a function, not a single value. For example, you could have a Map[Int,Int] that has a default as follows:

scala> val foo = Map(1->2,5->6).withDefault(x=>x+1)

foo: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 5 -> 6)

scala> foo(3)

res1: Int = 4

That’s an odd Map, to be sure, but it demonstrates the point. So, what function do you want? Note that you can use the EnglishRuleBasedTagger object to tag a word by using EnglishRuleBasedTagger.tagWord.
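Here is a sketch of the idea, with a toy stand-in for EnglishRuleBasedTagger.tagWord so it runs on its own; your real fallback is the tagger object itself.

```scala
// Toy stand-in for EnglishRuleBasedTagger.tagWord.
def tagWord(word: String): String =
  if (word.endsWith("est")) "J" else "N"

// Tags for seen words, backing off to the rule-based tagger for unseen ones.
val seen = Map("the" -> "D", "run" -> "V")
val wordTagMap = seen.withDefault(word => tagWord(word))

println(wordTagMap("run"))        // -> V
println(wordTagMap("troggiest"))  // -> J
```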

Check your rule-default baseline as follows.

$ scalabha run icl.hw4.TaggerRunner RDB entrain entest entest.out.rdb

$ scalabha run icl.hw4.TagScorer entest entest.out.rdb

The accuracy I obtained is over 94% -- in fact it beats the performance of the HMM! Your accuracy could be higher or lower than that, depending on the quality of the rules you created in EnglishRuleBasedTagger.

Question: The rule-default tagger doesn’t use any context (e.g. previous or following word or tag), whereas HMMs assign tags as sequences. Why do you think it is possible that a rule-default tagger could beat an HMM? (Tip: look at and compare the output of both taggers, including running TagScorer with one as the "gold" and the other as the "assigned".)

Include the output from your rule-default baseline implementation in the answers file.

Note: though we haven’t used one here, there is a class of taggers (including maximum entropy Markov models and conditional random fields) that are trained in a way that allows them to take advantage of many more aspects of the input -- in fact, the kinds of rules you created for your rule-based tagger are used by these models as features, each assigned a weight, or importance, based on the training data. Typically, their performance is higher than what was obtained with the models/methods explored in this homework.

Part (d)

Question: Briefly list any major challenges you experienced while doing this problem.

EXTRA. Additional suggested exercises (not required)

If you want to go further, there are a number of things you could consider doing. If you do any of these, or think of something else, write down a brief description of what you did, what the results were, and any other relevant information. This could contribute to you getting a score greater than 85% since this would constitute an above-and-beyond component of completing the homework.

Create any additional solutions in the file ExtraTagging.scala.

A. Create a ContextualRuleBasedTagger implementation that uses a sliding window to tag words in addition to using a per-word RuleBasedTagger. For example, it might assign book the “V” tag if the previous word is a modal verb like “can” or “should”.

B. Do your own HMM implementation. Follow the directions for the homework available here:

Check your implementation against the one in Scalabha. Did you make it faster, or more accurate? (There are a number of optimizations that could be done to improve the Scalabha implementation by quite a bit.)

Note: contributions of improvements to the HMM implementation to Scalabha are certainly welcome!

C. Enable a maxent tagger to be used, e.g. the OpenNLP part-of-speech tagger, by creating Scala code to train and use it. (You’ll need to know Java to do this.)

D. Try out any mix of tagging experiments on the Czech data.

Copyright 2011 Jason Baldridge

The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to and to this original homework.

Please email Jason at with suggestions, improvements, extensions and bug fixes.

Jason Baldridge,
Oct 23, 2011, 10:10 PM