Assignments‎ > ‎

HW5: Grammar

Homework 5: Grammar


Due: Dec 7, 2011 by 11am (on Blackboard)


This assignment involves creating a lexicon for analyzing sentences using Combinatory Categorial Grammar. There is no programming required for the assignment, but you will create files containing lexicons and then use Scalabha as an application to parse sentences with the lexicons.


IMPORTANT: You need to get Scalabha version 0.2.0. The version that you used for the POS tagging homework will not work. Instructions for getting v 0.2.0 are provided below.


Note: the description of the problems for this homework is (again) somewhat long -- don’t let this scare you. In fact, much of the text is there to give you documentation and help you know what to do.

Unless you absolutely must, don’t print this page out -- not only will you save paper, it will be far easier to cut-and-paste commands that you need to run.


If any of instructions or problem descriptions do not make sense to you, please get in touch with the instructor right away.



Preliminaries

Getting Scalabha


You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment. You can obtain it here:


https://bitbucket.org/jasonbaldridge/scalabha/downloads/scalabha-0.2.0-src.zip


Note: if you know how to use Mercurial (or are interested in using it), you can just clone the Scalabha repository.


You should unzip the contents of scalabha-0.2.0-src.zip and follow the instructions for installing it in the file scalabha-0.2.0/README. Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations that are relative to that directory.


Note: See the Bcomposes tutorial on build systems and Scalabha for help with things. However, keep in mind that for this assignment you only need to compile Scalabha -- you will not be doing any actual coding.


Submitting your solutions


For submission, you will produce just two files.

  • A file that contains your written answers to questions in the homework. This can be submitted on Blackboard as a PDF file, or you can hand in a physical (hand-written) copy to my office by the due date. There are things like CCG derivations that will be much easier to do written out. It is fine to submit a PDF that is a scan of your hand-written solutions -- just make sure it is quite legible.

  • A file that contains a lexicon for covering a set of English sentences. Submit this as simple_english.txt on Blackboard.



1. Jabberwockish (20 points)

This question is based on the following lexicon for the (made up) language Jabberwockish. You will need the application and composition rules, but not type-raising.


toves := n

vorpal := n

vorpal := n\n

ny := np\n

raths := n

oof := (s\(s\np))/np

borogroves := n

borogroves := s/np

gimble := s\np

gimble := (s\np)\pp

outgrabe := s\np

outgrabe := (s\np)\np

yog := pp/np

yog := ((s\np)/(s\np))/np

mek := (np\np)/(s\np)

mek := (np\np)/(s\pp)



Part (a)


The lexicon above doesn’t have an entry for the word “slithy”, however you hear the Jabberwockish sentence “toves slithy ny gimble”.


Written answer: Determine a category for “slithy” that supports a derivation that produces an s, and provide the derivation.


Part (b)


Written answer: Analyze the Jabberwockish sentence “vorpal ny mek gimble oof raths ny outgrabe” by drawing one derivation for it. Your derivation must produce an s category.


Part (c)


Written answer: Based on the lexicon given above, fill in the remaining cells of the following CKY chart for the sentence “borogroves vorpal ny gimble”. Make sure to indicate which constituents in which cells were used to create a new constituent by drawing lines, for example, as in cell [0,2]. Don’t forget to consider the composition rules as well as the function application rules. If there are two ways to form the same category in a cell, you may include that category twice.




Note: you do not need to print this specific chart out -- you are welcome to draw it (neatly) on paper and fill it in. Give yourself plenty of space.



2. Parsing with Scalaba’s implementation of CCG. (15 points)

I’ve added CCG parsing capabilities in Scalabha that can take a lexicon like that given in the previous problem and parse sentences with it. This problem is mainly here to introduce you to using the Scalabha CCG parser and make sure you are able to use it to successfully parse some sentences given a lexicon.

Part (a)

Go to the following directory:

$SCALABHA_DIR/data/ccg/jabberwockish/

In that directory, you find the file jabberwockish.txt, which contains the lexicon from the previous problem. Open it up and have a look at it. Then run the following command (in that directory), and check that you get the same result:

$ scalabha ccg jabberwockish.txt vorpal ny gimble

Input: vorpal ny gimble
Result: Set(s)

The Scalabha CCG parser is invoked with the command scalabha ccg. The first argument (here jabberwockish.txt) provides the name of the file containing the lexicon to use. The remaining arguments are the tokens to parse. The result indicated is a Set of categories that were found in the upper right-most corner of the CKY chart (the spanning cell). Here we have just one category. However, looking at the grammar, there should be another analysis that results in s\n. What happened is that the default rules that are used in the parser are just forward and backward application; with just those two rules, we have a categorial grammar known as the AB calculus (for Ajdukiewicz/Bar-Hillel). To get more, you need to turn the harmonic and crossed composition rules on, using the -r option.

$ scalabha ccg -r All jabberwockish.txt vorpal ny gimble

Input: vorpal ny gimble
Result: Set(s, s\n)

Written answer: give the derivation for “vorpal ny gimble” that results in s\n.

Part (b)

It is possible to comment out a lexical entry in a lexicon file by putting a # at the start of the line. This allows you to easily try out alternatives while developing a lexicon.

Open up jabberwockish.txt and comment out the second entry for vorpal so that it looks like the following:

#vorpal := n\n

Save jabberwockish.txt and then run the command again (make sure to use -r All). You should get the following output:

$ scalabha ccg -r All jabberwockish.txt vorpal ny gimble

Input: vorpal ny gimble
Result: Set(s)

Written answer: did you get that output? (Yes or No)

Make sure to leave the second entry for vorpal commented out for now.

Part (c)

The Scalabha CCG parser is also able to take a testbed of sentences that can be used to check the coverage of a lexicon. This is done by using the -t option and giving the name of a file containing sentences. For example, in the same directory as jabberwockish.txt, there are is a testbed file, testbed_jabberwockish.txt, with the following contents:

borogroves toves vorpal ny
raths ny gimble
toves ny yog vorpal ny gimble
vorpal ny mek gimble oof raths ny outgrabe
*toves ny
*ny toves gimble

A * in front of a sentence indicates it should not receive a parse. We can test that the non-starred sentences get a parse that results in s, and that the starred ones do not result in s as follows.

$ scalabha ccg -r All -t testbed_jabberwockish.txt jabberwockish.txt

Running testbed: testbed_jabberwockish.txt

FAILURE: Set()
borogroves toves vorpal ny

SUCCESS: Set(s)
raths ny gimble

SUCCESS: Set(s)
toves ny yog vorpal ny gimble

SUCCESS: Set(s, s\np)
vorpal ny mek gimble oof raths ny outgrabe

SUCCESS: Set(np)
*toves ny

SUCCESS: Set()
*ny toves gimble

For each sentence, we get success or failure, where success means either it should have produced an s and it did or it should not have produced s and it did not; similarly, but conversely, for failure. For example, note that toves ny produced an np; since that is not an s, the fact that the string is starred means we succeeded.

The one failure is for the first sentence. The problem is that we commented out the second entry for vorpal, which is needed for that sentence. Uncomment that line in jabberwockish.txt so that both lexical entries for vorpal are active, save the file and then run the parser on the testbed again.

Written answer: Did you get a success for every sentence now?

Part (d)

Written answer: Which testbed sentences fail if you use the AB rules?

Part (e)

Written answer: The CCG parser implementation is done in the files in scalabha/src/main/scala/opennlp/scalabha/ccg. There are a lot of aspects of Scala programming in these that we have not covered in this course (unfortunately not enough time!). Look through those and describe at least two things that you understand about what the code is doing and how it relates to doing CCG analyses with paper-and-pencil.


3. Analyzing a set of sentences with CCG. (10 points)


English has an impoverished system for marking case that only shows up on pronouns. For example, I, he, and they have nominative case, while me, him, and them have accusative case. It also has very simple subject-verb agreement (e.g. I walk, he walks, *I walks, *he walk). This problem involves creating a lexicon that will handle some English transitive sentences and relative clauses while respecting case marking and subject-verb agreement.

Note: problem 4 involves an implementation of this problem using Scalabha’s CCG parser. You are welcome to do this problem and problem 4 in parallel -- effectively using the parser to check that your lexicon is working. However, if you prefer, you are welcome to work this problem out on paper and then do problem 4.

Also note that most of the points for the lexicon will be given as part of  problem 4. (The 10 points here are given primarily for the derivation requested below.)

Create a CCG lexicon that correctly includes or excludes the sentences in the following testbed.


she sees him

they see him

they see the woman

they see the women

I know her

I know the man whom she sees

I know the woman who sees him

the woman who sees him walks

the women who see him walk

*she sees he

*they sees him

*me know her

*I know the woman whom sees him

*the woman who sees him walk

*the woman who see him walks

*the women who sees him walk

*the women who see him walks


Note: The string “*the woman who sees him walk’ can be read as a noun phrase [the woman who [sees him walk]], but that is not intended here and should be ignored -- after all, you have not been asked to provide an analysis for main clauses like “I see him walk.” The intended non-reading is where the woman is doing the walking and happens to “see him.”


You can use feature structures in the categories (e.g., np[case=nom] instead of atomic labels like np_nom).  (though you can actually handle the effects of person and number with a single feature pernum.  


We didn’t talk about feature structures in class, but they are very simple for this particular use case, and the features case, person, and number are sufficient for this problem. For example, the word I is a first-person singular nominative pronoun, so it has the category np[case=nom,per=1,num=sg], which you can contrast with them (third-person plural accusative) with the category np[case=acc,per=3,num=pl]. Before you launch into using all three features, however, there is a standard simplification that works for ensuring subject-verb agreement in English (because of its impoverished morphology). Basically, you can use a single feature pernum that conflates person and number, so you have a value 3rdSg for words like she and woman, and non3rdSg for other nouns and noun phrases. As such, we have category assignments like the following:

  • I := np[case=nom, pernum=non3rdSg]

  • she := np[case=nom, pernum=3rdSg]

  • us := np[case=acc, pernum=non3rdSg]

  • you := np[pernum=non3rdSg]


Notice how you can be both nominative or accusative, so we don’t need to specify a value for case. This allows it to match with features structures that have case=nom or case=acc.


So how does that work? If we have a verb that requires a non-third person singular subject, like walk, we can use the category s\np[case=nom,pernum=non3rdSg]. When the backward application rule is used to combine the categories for you walk, it must check whether the category np[pernum=non3rdSg] of you unifies with (not equals) the argument np[case=nom,pernum=non3rdSg] of walk. Because you’s category does not have a value for case and because the value it has for pernum matches that of walk’s, it unifies, so the rule can apply and produce s.


In addition to leaving some values unspecified, you can use variables that will be bound during unification and allow a value to be passed on. For example, consider a phrase like tall man. Adjectives in English don’t need to agree on number with the nouns they modify, so we can have the category n[pernum=X]/n[pernum=X] for tall. Given the category n[pernum=3sg] for man, deriving the category for tall man involves forward application of n[pernum=X]/n[pernum=X] to n[pernum=3sg]; since n[pernum=X] unifies with n[pernum=3sg] (with the variable resolution X=3sg), the unification succeeds. The variable resolution is then applied to n[pernum=X] (the result of the forward application rule), giving n[pernum=3sg] as the resulting category.


You are of course welcome to do without feature structures if you wish; this will involve using multiple categories to handle the agreement, e.g. you’d need both n_sg/n_sg and n_pl/n_pl for tall. However, before go without features, check with the instructor if you have any doubts. It really is quite easy to use, and you don’t have to do any of the unification -- you just need to declare the features and let the parser handle the rest.


Tip: Since Scalabha’s CCG parser doesn’t support type-raising, you’ll need to add type-raised entries (e.g. of the form s\(s/np)) explicitly to your lexicon, as needed.


Written answer: Given the lexicon you have created, draw a derivation for the sentence I know the man whom she sees.


Note: you will submit your lexicon as a file, as part of the next problem, and as discussed in the preface to this homework.



4. Use the lexicon with Scalabha (30 points)

In this problem, you’ll “implement” the lexicon from problem 3 and test it using Scalabha’s CCG parser.


Part (a)


Write your lexical entries for she, sees, and him from the previous problem in the stub file simple_english.txt, and then test the lexicon on the sentence “she sees him. ” Doing so should look like this.


$ scalabha ccg simple_english.txt she sees him


Input: she sees him

Result: Set(s)


You’ll might find that you messed up on some of your entries and used bad formatting and such, like forgetting brackets or parentheses. Scalabha will give you error output showing you where such errors occur. For example, consider the following lexical entry:


sees := (s\np[case=nom)/np


It is missing a right bracket after nom. If this is in simple_english.txt, the following error is produced when the parser is run:


$ scalabha ccg simple_english.txt she sees him


Exception in thread "main" opennlp.scalabha.ccg.CatParserException:


Couldn't process the following entry:


sees := (s\np[case=nom)/np


Error: `]' expected but `)' found


Mostly the format is self-explanatory. A few things that should be mentioned are:

  • A value in a feature/value pair is interpreted as a variable if it starts with an uppercase letter, and as a constant if it starts with a lowercase letter or number. E.g. for case=X, X is a variable, and for case=nom, nom is a constant.

  • You can use the characters A-Z, a-z, 0-9, and underscore in attributes, variables and constants.

  • Categories must start with a lowercase letter; any other characters can be a-z, 0-9 or underscore. (You only need s, n and np for this problem if you use features, but the extra characters might be useful if you do not.)


If you have any problems getting your lexical entries to be accepted that you can’t figure out, get in touch with the instructor right away.


If you are curious to see how strings like (s\np)/np are turned into a Cat objects, see the file scalabha/src/main/scala/opennlp/scalabha/ccg/ExpressionParser.scala.


Written answer: Did you successfully get “she sees him” to parse?



Part (b)


You could expand your lexicon incrementally, testing it on other sentences as you did for part (a). However, it is easier to do this be working with a testbed. Start off by saving the testbed given in the previous problem as a text file, e.g. testbed_simple_english.txt. You should run the current lexicon on that testbed and see what succeeds and fails, and then expand the lexicon entry by entry until every sentence succeeds. Don’t forget that the default rules are just the AB ones.


Written answer: Did you successfully get your lexicon working for the entire testbed? Briefly describe any challenges you ran into while doing so, if any.


Tip: The parser will complain about missing entries in the lexicon for words in the testbed. You can start by adding all of the following placeholder entries as your starting point. The category x is meaningless (Scalabha interprets it as an atomic category, and thus happily accepts the entry, but won’t make any use of it.)


she := x

they := x

I := x

he := x

her := x

him := x

me := x

know := x

man := x

woman := x

women := x

see := x

sees := x

the := x

walk := x

walks := x

who := x

whom := x


Obviously, you’ll want to put your entries for she, sees, and him in place of the x categories. And, you will need multiple entries for some words -- this just ensures that each word in the testbed has some category.



5. Adding adverbs (25 points)

The sentences dealt with so far don’t handle adverbs. Let’s change that.


Part (a)


Expand simple_english.txt with one or more entries for the adverb usually. You should be able to cover the following testbed successfully:


usually she sees him

she usually sees him

she sees usually the man whom I know

she sees him usually

*she usually see him

*she see him usually

*me see him usually

I know the man whom she sees usually


Written answer: Did any of these end up being surprises to you, either because you expected them to parse and they didn’t or because you didn’t expect them to parse and they did?


Part (b)


Written answer: Your solution will have created an ambiguity for “I know the man whom she sees usually.” Provide a derivation for each interpretation.



EXTRA. Additional suggested exercises. (not required)

If you want to go further, there are a number of things you could consider doing. If you do any of these, or think of something else, write down a brief description of what you did, what the results were, and any other relevant information. This could contribute to you getting a score greater than 85% since this would constitute an above-and-beyond component of completing the homework.


A. (A fair amount of effort.) The parser doesn’t give very informative output about the analysis -- just a final result category is produced. Change this so that some history of the derivation is kept within the CKY chart entries, such that something like a derivation can be pulled out at the end.


B. (A fair amount of effort.) The CKY implementation is done via a mostly imperative fashion. Rewrite it using a functional style.


C. (A large amount of effort.) Add semantics, e.g. as lambda terms to the definitions of Cat, such that a parse output has both category and semantics. You’ll need to change a lot of things, including the core categories, the rules, the CKY algorithm, and the expression parser (such that they can be read in from a defined lexicon).


D. (Open-ended.) Look at the implementation and see if you can improve the design, e.g. how unification is handled, how categories and rules are defined, or anything else you can sink your teeth into.


Feel free to ask me for any help or guidance you need in doing these.




Copyright 2011 Jason Baldridge

The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original homework.

Please email Jason at jasonbaldridge@gmail.com with suggestions, improvements, extensions and bug fixes.

Comments