Tagging

Task

The Penn Treebank Tagset
A sample sentence tagged under four tagsets:

Sentence | CLAWS C5 | Brown | Penn | ICE
she      | PNP      | PPS   | PRP  | PRON(pers,sing)
was      | VBD      | BEDZ  | VBD  | AUX(pass,past)
told     | VVN      | VBN   | VBN  | V(ditr,edp)
that     | CJT      | CS    | IN   | CONJUNC(subord)
the      | AT0      | AT    | DT   | ART(def)
journey  | NN1      | NN    | NN   | N(com,sing)
might    | VM0      | MD    | MD   | AUX(modal,past)
kill     | VVI      | VB    | VB   | V(montr,infin)
her      | PNP      | PPO   | PRP  | PRON(pers,sing)
.        | PUN      | .     | .    | PUNC(per)
Many of these distinctions are much finer grained than the standard set of linguistic categories; some are arguably uselessly so.
Example: the Brown tagset and some of its descendants have distinct tags for the different forms of the verbs BE and HAVE (e.g. HVD = had).
Still, none of these tagsets is as fine-grained as a unification grammar with features.

Uses of Tagging

Well, why would you want to do this?
Tagged corpora and corpus sites.
Question: How was the BNC tagged?
Difficulty of Problem

Ambiguity

The null-hypothesis tagger: tag each word with its most frequent tag. This gets about 90% of words right.
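A minimal sketch of such a baseline in Python (the corpus format, function names, and the NN1 default are invented for illustration):

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        """tagged_corpus: iterable of (word, tag) pairs (an assumed format)."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        # Keep only the single most frequent tag for each word.
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_baseline(words, most_frequent_tag, default_tag="NN1"):
        """Tag every word with its most frequent training tag; unseen words get default_tag."""
        return [(w, most_frequent_tag.get(w, default_tag)) for w in words]

    # Toy usage with invented training data:
    model = train_baseline([("the", "AT0"), ("journey", "NN1"), ("might", "VM0"),
                            ("kill", "VVI"), ("her", "PNP"), ("the", "AT0")])
    print(tag_baseline(["the", "journey", "might", "kill", "her"], model))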
One major source of difficulty is tag ambiguity. Many tag ambiguities are quite systematic in ways that are particular to English. For instance, here is some information about tag ambiguities in the BNC, taken from the BNC website (bnc2error.htm#table2):
(a) Tag | (b) Single-tag count (out of 50,000 words) | (c) Ambiguity-tag count (out of 50,000 words) | (d) Ambiguity rate (%) = c / (b + c) | (e) 1st tag of ambiguity tag correct (% of all ambiguity tags) | (f) Error count | (g) Error rate (%) = f / b
AJ0 | 3412 | all 338 (AJ0-AV0 48, AJ0-NN1 209, AJ0-VVD 21, AJ0-VVG 28, AJ0-VVN 32) | 9.01% | 282 (83.43%) | 46 | 1.35%
AJC | 142 | | 0.0% | | 4 | 2.82%
AJS | 26 | | 0.0% | | 2 | 7.69%
AT0 | 4351 | | 0.0% | | 2 | 0.05%
AV0 | 2450 | all 45 (AV0-AJ0 45) | 1.80% | 37 (82.22%) | 57 | 2.33%
AVP | 379 | all 44 (AVP-PRP 44) | 10.40% | 34 (77.27%) | 6 | 1.58%
AVQ | 157 | all 10 (AVQ-CJS 10) | 5.99% | 10 (100.00%) | 9 | 5.73%
CJC | 1915 | | 0.0% | | 3 | 0.16%
CJS | 692 | all 39 (CJS-AVQ 26, CJS-PRP 13) | 5.34% | 30 (76.92%) | 18 | 2.60%
CJT | 236 | all 28 (CJT-DT0 28) | 10.61% | | 3 | 1.27%
CRD | 940 | all 1 (CRD-PNI 1) | 0.11% | 0 (0.00%) | 0 | 0.00%
DPS | 787 | | 0.0% | | 0 | 0.00%
DT0 | 1180 | all 20 (DT0-CJT 20) | 1.67% | 16 (80.00%) | 19 | 1.61%
DTQ | 370 | | 0.0% | | 0 | 0.00%
EX0 | 131 | | 0.0% | | 1 | 0.76%
ITJ | 214 | | 0.0% | | 2 | 0.93%
NN0 | 270 | | 0.0% | | 10 | 3.70%
NN1 | 7198 | all 514 (NN1-AJ0 130, NN1-NP0 92*, NN1-VVB 243, NN1-VVG 49) | 6.66% | 395 (76.84%) | 86 | 1.19%
NN2 | 2718 | all 55 (NN2-VVZ 55) | 1.98% | 48 (87.27%) | 30 | 1.10%
NP0 | 1385 | all 264 (NP0-NN1 264*) | 16.01% | 224 (84.84%) | 31 | 2.24%
ORD | 136 | | 0.0% | | 0 | 0.00%
PNI | 159 | all 8 (PNI-CRD 8) | 4.79% | 3 (37.50%) | 5 | 3.14%
PNP | 2646 | | 0.0% | | 0 | 0.00%
PNQ | 112 | | 0.0% | | 0 | 0.00%
PNX | 84 | | 0.0% | | 0 | 0.00%
POS | 217 | | 0.0% | | 5 | 2.30%
PRF | 1615 | | 0.0% | | 0 | 0.00%
PRP | 4051 | all 166 (PRP-AVP 132, PRP-CJS 34) | 3.94% | 154 (92.77%) | 24 | 0.59%
TO0 | 819 | | 0.0% | | 6 | 0.73%
UNC | 158 | | 0.0% | | 4 | 2.53%
VBB | 328 | | 0.0% | | 1 | 0.30%
VBD | 663 | | 0.0% | | 0 | 0.00%
VBG | 37 | | 0.0% | | 0 | 0.00%
VBI | 374 | | 0.0% | | 0 | 0.00%
VBN | 133 | | 0.0% | | 0 | 0.00%
VBZ | 640 | | 0.0% | | 4 | 0.63%
VDB | 87 | | 0.0% | | 0 | 0.00%
VDD | 71 | | 0.0% | | 0 | 0.00%
VDG | 10 | | 0.0% | | 0 | 0.00%
VDI | 36 | | 0.0% | | 0 | 0.00%
VDN | 20 | | 0.0% | | 0 | 0.00%
VDZ | 22 | | 0.0% | | 0 | 0.00%
VHB | 150 | | 0.0% | | 1 | 0.67%
VHD | 258 | | 0.0% | | 0 | 0.00%
VHG | 16 | | 0.0% | | 0 | 0.00%
VHI | 119 | | 0.0% | | 0 | 0.00%
VHN | 9 | | 0.0% | | 0 | 0.00%
VHZ | 116 | | 0.0% | | 1 | 0.86%
VM0 | 782 | | 0.0% | | 3 | 0.38%
VVB | 560 | all 84 (VVB-NN1 84) | 13.04% | 56 (66.67%) | 84 | 15.00%
VVD | 970 | all 90 (VVD-AJ0 11, VVD-VVN 79*) | 8.49% | 62 (58.89%) | 50 | 5.15%
VVG | 597 | all 132 (VVG-AJ0 83, VVG-NN1 49) | 18.11% | 112 (84.84%) | 9 | 1.51%
VVI | 1211 | | 0.0% | | 7 | 0.58%
VVN | 1086 | all 158 (VVN-AJ0 50, VVN-VVD 108*) | 12.70% | 113 (71.52%) | 27 | 2.49%
VVZ | 295 | all 26 (VVZ-NN2 26) | 8.10% | 14 (53.85%) | 11 | 3.73%
XX0 | 363 | | 0.0% | | 0 | 0.00%
ZZ0 | 75 | | 0.0% | | 3 | 4.00%
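As a sanity check on the column formulas, here is the arithmetic for the AJ0 row (a small Python sketch using the figures from the table above):

    b, c, f = 3412, 338, 46        # AJ0 row: single-tag count, ambiguity-tag count, error count
    ambiguity_rate = c / (b + c)   # column (d)
    error_rate = f / b             # column (g)
    print(f"{ambiguity_rate:.2%}  {error_rate:.2%}")   # prints: 9.01%  1.35%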
Another major source of difficulty, and a major contributor to the error rate of all known taggers, is unknown words. What tag do you give an unknown word?
Default strategy: look at what tag most unknown words get in some development test data, and use that tag for all unknown words.
What tag is that? Guess....
A possible augmentation for any tagger: a dictionary. [Reduces but does not eliminate unknown words.]
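One way such a dictionary fallback might be wired in (a Python sketch; the lexicon format, function name, and the NN1 default are assumptions, not a prescription):

    def tag_unknown(word, lexicon, default_tag="NN1"):
        """Fallback for a word never seen in training.
        lexicon: assumed external dictionary mapping word -> list of possible tags."""
        tags = lexicon.get(word)
        if tags:
            return tags[0]     # crude: a real tagger would disambiguate among these in context
        return default_tag     # the "guess" tag for words the dictionary also misses

    print(tag_unknown("perambulate", {"perambulate": ["VVI", "VVB"]}))   # -> VVI
    print(tag_unknown("zxqj", {}))                                       # -> NN1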
Reported error rates for some taggers:

Tagger | Error Rate | Note
Church HMM tagger | 1-5% | Depending on definition of "correct"
Garside et al. | 3-4% | Probabilistic plus idiom rules
DeRose | 3-4%, 5.6% | WSJ, other
Brill (initial) | 7.9% | "Simple" algorithm
Brill | 5% | 71 "patches" (rules)

(See Brill's tagger, below.)
Approaches

Two standard approaches:
- Rule-based
- Statistically based

Within rule-based we can distinguish two further types:
- Handwritten
- Machine-learned
Tagset Variation

Some tagsets are harder than others.

Tag Set | Basic Size | Total Tags
Brown | 87 | 179
Penn | 45 |
CLAWS1 | 132 |
CLAWS2 | 166 |
CLAWS c5 (BNC) | 62 |
London-Lund | 197 |

Brill says: "There are 192 tags in the Brown corpus, 96 of which occur more than 100 times."
BNC Tagset ("CLAWS", C5)
One potential task is to define a tagset that maximizes utility for parsing.
Brill's Tagger

Properties:
- Rule-based
- Rules are automatically learned.

A rule space. Possible rule forms:
- If a word is tagged a and it is in context C, then change that tag to b.
- If a word is tagged a and it has lexical property P, then change that tag to b.
- If a word is tagged a and a word in region R has lexical property P, then change that tag to b.

Possible patch templates (rule templates); a sketch of how one template can be applied follows this list. Change tag a to tag b when:
- The preceding (following) word is tagged z.
- The word 2 after (before) is tagged z.
- One of the two following (preceding) words is tagged z.
- One of the three following (preceding) words is tagged z.
- The preceding word is tagged z and the following word is tagged w.
- The preceding (following) word is tagged z and the word 2 before (after) is tagged w.
- The current word is (is not) capitalized.
- The previous word is (is not) capitalized.
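Here is a sketch of how one instance of these templates could be represented and applied (Python; the encoding and names are invented for illustration and are not Brill's actual implementation):

    def apply_prev_tag_rule(tagged, a, b, z):
        """One template instance: change tag a to tag b when the preceding word is tagged z.
        tagged: list of (word, tag) pairs; returns a new list."""
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == a and out[i - 1][1] == z:
                out[i] = (word, b)
        return out

    # Toy usage, mirroring the sample rule "NN VB PREV-TAG MD" listed further below:
    sent = [("the", "AT"), ("journey", "NN"), ("might", "MD"), ("kill", "NN"), ("her", "PPO")]
    print(apply_prev_tag_rule(sent, "NN", "VB", "MD"))
    # [('the', 'AT'), ('journey', 'NN'), ('might', 'MD'), ('kill', 'VB'), ('her', 'PPO')]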
Training Algorithm (supervised):
- Initial training [training corpus, 90%]: for each word, learn its most frequent tag. [Note: This is the first place where lexically specific info comes into play.]
- Patch acquisition [development test corpus, 5%] (a sketch of the scoring step follows this list):
  - Collect a list of error triples of the form [Taga, Tagb, Number].
  - For each error triple and each patch template, find the instantiated patch that gives the best net error gain, where net error gain = errors removed - errors added, and add that patch to the patch list.
- Runtime [test corpus, 5%]:
  - Tag the test corpus using the initial-training tagger.
  - Revise: apply each patch in turn, changing a word w's tag from a to b only when w occurs somewhere in the training corpus with tag b. [Note: This is the second place where lexically specific info comes into play.]
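A sketch of the scoring step in patch acquisition (Python; the helper names are invented, and the candidate-patch representation is left abstract):

    def net_error_gain(patch, current, gold, apply_patch):
        """Score one candidate patch on the development corpus.
        current:     list of (word, tag) pairs as currently tagged
        gold:        list of correct tags, aligned with current
        apply_patch: assumed helper, function(patch, tagged) -> retagged list
        Returns (errors removed) - (errors added)."""
        retagged = apply_patch(patch, current)
        removed = added = 0
        for (_, old), (_, new), right in zip(current, retagged, gold):
            if old != right and new == right:
                removed += 1   # the patch fixed an error
            elif old == right and new != right:
                added += 1     # the patch introduced an error
        return removed - added

    def best_patch(candidates, current, gold, apply_patch):
        """One round of greedy learning: keep the candidate with the highest net gain."""
        return max(candidates, key=lambda p: net_error_gain(p, current, gold, apply_patch))

    # Toy usage with a trivial patch representation ("retag every a as b"):
    swap = lambda p, tagged: [(w, p[1] if t == p[0] else t) for w, t in tagged]
    print(net_error_gain(("NN", "VB"), [("kill", "NN"), ("her", "PPO")], ["VB", "PPO"], swap))  # -> 1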
Note: there are two simple but important refinements of the initial training having to do with the treatment of unknown words (a sketch follows this list):
- Capitalized unknown words are tagged as proper names.
- Other unknown words are assigned the tag most common for words ending in the same 3 letters: blahblahous gets tagged as an adjective.
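A sketch of the 3-letter-suffix heuristic plus the capitalization rule (Python; the names and the example training data are invented):

    from collections import Counter, defaultdict

    def learn_suffix_tags(tagged_corpus, suffix_len=3):
        """Learn the most common tag for each word-final letter sequence."""
        by_suffix = defaultdict(Counter)
        for word, tag in tagged_corpus:
            by_suffix[word[-suffix_len:]][tag] += 1
        return {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}

    def guess_unknown(word, suffix_tags, suffix_len=3, default_tag="NN"):
        """Unknown word: capitalized -> proper noun; otherwise use its final letters."""
        if word[0].isupper():
            return "NP"   # proper name (Brown-style tag, as in the key below)
        return suffix_tags.get(word[-suffix_len:], default_tag)

    # Toy usage: "ous" words are adjectives in the (invented) training data,
    # so the unseen word "blahblahous" is guessed to be an adjective.
    model = learn_suffix_tags([("famous", "JJ"), ("various", "JJ"), ("nervous", "JJ")])
    print(guess_unknown("blahblahous", model))   # -> JJ
    print(guess_unknown("Edinburgh", model))     # -> NP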
Some sample rules found by Brill's algorithm:
- TO IN NEXT-TAG AT
- VBN VBD PREV-WORD-IS-CAP YES
- VBD VBN PREV-1-OR-2-OR-3-TAG HVD
- TO IN NEXT-WORD-IS-CAP YES
- NN VB PREV-TAG MD
- PPS PPO NEXT-TAG .
- VBN VBD PREV-TAG PPS
- NP NN CURRENT-WORD-IS-CAP NO
Key:
TO Infinitival to
AT Article
IN Preposition
VBD past tense verb
VBN past participle Verb
NP Proper Noun
NN Common Noun
MD Modal
PPO Objective (Accusative) Personal Pronoun
PPS Subject (Nominative) Personal Pronoun
HVD Had
Summary:
- Competitive with statistical taggers.
- Portable: doesn't depend on any particular tagset/corpus properties.
- Simple, with low memory overhead.