FST Exercise

Rule Ordering

Using FSA

On bulba type:

% fsa -tk

A GUI comes up.

There are two lines you can type into, labeled "Regex" and "String"

"Regex" accepts a regular expression that lets you define an FSA or FST by hand.

"String" accepts a string to be submitted to the current FSA by clicking on "Submit".

FSA
Manual

Manual

Example
Regexp

On the Regexp line type:

{[a,b,c]}

Your first FSA!

Notice the color-coding on the states. What does it mean?

Note that you MUST use the square brackets for concatentation (a Prolog "list"). Note that each element of the list is assumed to be an element of your "alphabet":

{[aa,ab,cc]}

Notice the machine has 'aa','ab' and 'cc' labeling transitions.. Probably not what you intended. No normal string will be recognized. Try 'aaabcc'.

Now try this as a short cut for spelling our a long sequence of transitions:

word(aabbcc)

Equivalent to:

[a,a,b,b,c,c]

Saving an
FSA

Click on File>Save as Postscript.

Save the file in your home directory, choosing a name with ".ps" extension (like ("abc.ps"). Your home printer will print a postscript file if you know how to send it to the printer directly from the printer manager.

NOTE: Before saving an fsa you can drag STATES to better looking locations with your mouse. Pulling states further apart often makes a diagram more readable.

NOTE: Read the manual for OTHER formats you can save fsa's in.

Viewing
Postscript
Files

Bring up an xterm window. Connect to your home directory if necessary.

Type

gv abc.ps &

"gv" is a program that views postscript files. "abc.ps" is the assumed name of the FSA you saved in the previous step. "&" backgrounds the job so that while viewing the file you still have the smae xterm window to talk to.

More
Regexps

Back in the FSA window, type:

{[a,b,c],[c,d,e]}

Now try:

{[a,b,c]*,[c,d,e]}

What does the color coding on State 0 mean?

Try:

{[a,b,c]+,[c,d,e]}

Some languages
to try:

Try inputting thse (expressed in the textbook's Reg exp notation) into FSA:

ab*c
(ab*c)+
a | (an) | (the)
the set of all strings from the alphabet a,b such that each a is immediately preceded and followed by a b.

Submitting
strings

Start out with the FSA for reg exp (4) from the preceding step. In the string window type:

bbbbab

Now check out the xterm Window you started "fsa" from.

You're looking at this:

?-

yes

"?-" is a Prolog prompt. "yes" is the Prolog representation of success. Test. Back in the fsa window, type:

bba

Inputting
FSTs

Try:

[(a:b)*,(b:a)*]

Now type the following in the string window:

aaa (check output)
 
bbbb (check output)
 
aaaanbbbbb 

bbbbaaaa

Try this short cut for making transducers:

[b,a,a,!]xx[1,2,2,!]

Now try the following very SIMPLE FST (in the REegexp window):

'4':four

This transduces the digit '4' into the word 'four'. Notice the digit has to be in single quotes.. Now try to write a compact regexp that does the following transductions:

4: four
41: forty one
42: forty two
43: forty three
44: forty four

It should recognize 41: forty one by the sequence:

4:forty 1:one Now add:

40: forty to the same language. You will need epsilon transitions. Here's an example of how to insert a "0":

[]:0 A possible answer!

Karttunen
Example

Lauri Karttunen is a Finnish linguist who has done much work on morphology and FSTs.

There is an example involving assimilation in Karttunen 91. See especially p. 4 where the discussion of kaNpan -> kampan -> kamman begins.

Note that several rules are discussed:

This issue is that they are ordered as shown and we want to model that ordering in our transducer.

Examples:

Upper	Lower	Rules used
Kampan	kamman	{(c) }
KaNpan	kamman	{(a), (c) }
KaNtan	kantan	{(b)}
Kanpan	kanpan	∅

Now consider another order:

Here's what happens

Upper	Lower	Rules used
kampan	kamman	{(c) }
kaNpan	kampan	{(a)}
kantan	kantan	{(b)}
kanpan	kanpan	∅

The change

(c) ordered last kaNpan -> kaNpan ∅

(c) ordered first kaNpan -> kamman {(a), (c) }

Note, for some background on the analysis of Canadian raising and flapping, see Canadian Raising, Opacity, and Rephonemicization, by William J. Idsardi, which reviews the Chomsky, Halle, Joos controversy surrounding these facts, and presents some alternative interpretations.

Malouf's
Twolevel
Rule implementation

Here is Rob's version of rule (a). for the fsa toolkit.

Note:

macro(nm, comp(pairs, 'N':m, [],   p: ?  ))

gives the rule the name "nm", says the feasible pairs of the machine are defined by the regex "pairs", and say 'N' is realized as 'm' when the left context is anything, and the right context on the UPPER (underlying) tape is 'p'.

The regex pairs is a set of pairs from the upper and lower language, intituitvely, all possible pairs. It is defined and named in the following line:

macro(pairs,{a..z,'N':m,'N':n,p:m}).

Note:

a..z

is short for

a:a,b:b,c:c,.... z:z

That is among the feasible pairs of our language: any lower case letter can be realized as itself.

Furthermore, 'N' must be realize either as 'n' or 'm', and 'p' may be realized as 'm'.

'N':m
'N':n
 p:m

The single quotes will be required for any CAPITAL letters; this is an artifact of Prolog syntax.

A key point to watch is 'p'. Notice that our language is defined so as to allow a 'p' to surface either as a 'p' or an 'm'. The set of feasible pairs must be defined to include all possible "rewrites". It is the job of the rules to make sure the reqrites happen only in the appropriate contexts.

Here is Rob's version of both rules.

How to load a file with both rules into the toolkit:

File > Load Aux [select /opt/fsa6/Examples/twolevel/kamman.pl]
Regex window: nm (the nm machine now gets drawn). This ONLY implements rules (a)and (b).
Type "kaNpan"
Check your output window and think about what happened. Is this correct? [The output window is the window you used to originally start 'fsa', the one you typed 'fsa -tk' into.]
Now type "kamman" in the regex window. This loads a machine that implements ALL the rules (a)-(c). Now type "kaNpan" in the STRING window and check the output window.

Assignment:
Problem 2 e-insertion

Answer the questions in boldface beflow. Turn in your answers. Also turn in a revised version of the file jurafsky.pl as described below.

e-insertion transducer

You can bring up a two level transducer version of Jurafsky's e-insertion rule einsertion transducer in fsa by doing the following:

Click on File>Loadaux. Load in the file:
There is another copy of the file here.
Type "einsertion" in the regex window. An fsa should be displayed.
In this implementation, to simplify things, I've used "+" for both morpheme and word boundaries.
To test the e insertion rule, type
into the string window. Check the output in the host xterm window. Q1: What is the output?
Try:
Q2: What is the output?
Revise the rule to handle the case of words ending in "ch" and "sh". Here are some hints.
1. The left and right contexts of rules are regular expressions (as notated in the fsa tool). You may want to go back and look at how regular expressions in the fsa work as we we practice them in our lab exercise
2. Obviously the left context of the the e-insertion rule needs to be generalized. Currently the left context is:
  This means an "x","s" or "z" concatenated with a "(+):0". "(+):0" is how the rule notation represents a morpheme boundary realized as the empty string, using a "0" to represent the empty string. Note that curly braces { } represent "or" and square brackets [ ] represent concatenation.

Problem 3
Flap Rule

Flap Problem

What to
hand
in

kamman.pl
1 page (or at most 2) answering the following questions about kamman.pl
- What happened when you had the "nm" machine loaded and you typed "kaNpan". Was this the correct behavior for a machine that only implemented rules (a) and (b)? Why?
- What happened when you had the "kamman" machine loaded and you typed "kaNpan". Was this the correct behavior for a machine that implemented rules (a), (b) and (c)?
- If you look at the rule
  all it says is that 'N" gets realized as 'm' if and only if the following environment is 'p' realized as anything (and the preceding environment is anything). Now on the face of it this sounds like it just implements rule (a). Is it correct to say that "nm" as defined here implements both rules (a) and (b)? How can you test to see if rule (b) is implemented? If rule (b) IS implemented, how has this happened?
- What would happen if we combined 'nm' with the following rule:
  In particular, what would we get if we loaded a machine that combined the 'nn' rule with the 'nm' rule and then typed 'kaNpan' to the String window? What about 'kaNnan'? What about 'kaNtan'? What about 'N'? You can do this two ways. You can engage if the pure pleasure of a priori reasoning, or you can make a copy of 'kamman.pl' in your home directory, edit it to make those changes, and then 'Load Aux' it into fsa to see what happens. For max educational benefit, do it both ways (a priori reasoning first; i.e., guess!). However you do it, report, by filling out the following table:
Explain what happened.
Jurafsky.pl. Follow the directions for Problem 2 above, the extended version of the e-insertion rule, and hand in a version of the solution.
flap_problem.pl. Follow the directions for Problem 3 above, the flap rule, and hand in a solution.