Tagging

Task

The Penn Treebank Tagset
A sample sentence tagged under four tagsets:

Sentence | CLAWS C5 | Brown | Penn | ICE
she      | PNP      | PPS   | PRP  | PRON(pers,sing)
was      | VBD      | BEDZ  | VBD  | AUX(pass,past)
told     | VVN      | VBN   | VBN  | V(ditr,edp)
that     | CJT      | CS    | IN   | CONJUNC(subord)
the      | AT0      | AT    | DT   | ART(def)
journey  | NN1      | NN    | NN   | N(com,sing)
might    | VM0      | MD    | MD   | AUX(modal,past)
kill     | VVI      | VB    | VB   | V(montr,infin)
her      | PNP      | PPO   | PRP  | PRON(pers,sing)
.        | PUN      | .     | .    | PUNC(per)
Many of these distinctions are much finer grained than the standard set of linguistic categories; some are arguably uselessly so.
Example: the Brown tagset and some of its descendants have distinct tags for the different forms of the verbs BE and HAVE (e.g. HVD = had).
Still, none of these tagsets is as fine-grained as a unification grammar with features.

Uses of Tagging

Well, why would you want to do this?
Tagged corpora and corpus sites.
Question: How was the BNC tagged?
Difficulty of Problem

Ambiguity

The null-hypothesis tagger: tag each word with its most frequent tag. This gets about 90% of words right.
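A minimal sketch of such a baseline in Python (the corpus format, function names, and the NN1 default are invented for illustration):

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        """tagged_corpus: iterable of (word, tag) pairs (an assumed format)."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        # Keep only the single most frequent tag for each word.
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_baseline(words, most_frequent_tag, default_tag="NN1"):
        """Tag every word with its most frequent training tag; unseen words get default_tag."""
        return [(w, most_frequent_tag.get(w, default_tag)) for w in words]

    # Toy usage with invented training data:
    model = train_baseline([("the", "AT0"), ("journey", "NN1"), ("might", "VM0"),
                            ("kill", "VVI"), ("her", "PNP"), ("the", "AT0")])
    print(tag_baseline(["the", "journey", "might", "kill", "her"], model))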
One major source of difficulty is tag ambiguity. Many tag ambiguities are quite systematic in ways that are particular to English. For instance, here is some information about tag ambiguities in the BNC, taken from the BNC website (bnc2error.htm#table2):
(a) Tag | (b) Single-tag count (out of 50,000 words) | (c) Ambiguity-tag count (out of 50,000 words) | (d) Ambiguity rate (%) = c / (b + c) | (e) 1st tag of ambiguity tag correct (% of all ambiguity tags) | (f) Error count | (g) Error rate (%) = f / b
AJ0 | 3412 | all 338 (AJ0-AV0 48, AJ0-NN1 209, AJ0-VVD 21, AJ0-VVG 28, AJ0-VVN 32) | 9.01% | 282 (83.43%) | 46 | 1.35%
AJC | 142 | | 0.0% | | 4 | 2.82%
AJS | 26 | | 0.0% | | 2 | 7.69%
AT0 | 4351 | | 0.0% | | 2 | 0.05%
AV0 | 2450 | all 45 (AV0-AJ0 45) | 1.80% | 37 (82.22%) | 57 | 2.33%
AVP | 379 | all 44 (AVP-PRP 44) | 10.40% | 34 (77.27%) | 6 | 1.58%
AVQ | 157 | all 10 (AVQ-CJS 10) | 5.99% | 10 (100.00%) | 9 | 5.73%
CJC | 1915 | | 0.0% | | 3 | 0.16%
CJS | 692 | all 39 (CJS-AVQ 26, CJS-PRP 13) | 5.34% | 30 (76.92%) | 18 | 2.60%
CJT | 236 | all 28 (CJT-DT0 28) | 10.61% | | 3 | 1.27%
CRD | 940 | all 1 (CRD-PNI 1) | 0.11% | 0 (0.00%) | 0 | 0.00%
DPS | 787 | | 0.0% | | 0 | 0.00%
DT0 | 1180 | all 20 (DT0-CJT 20) | 1.67% | 16 (80.00%) | 19 | 1.61%
DTQ | 370 | | 0.0% | | 0 | 0.00%
EX0 | 131 | | 0.0% | | 1 | 0.76%
ITJ | 214 | | 0.0% | | 2 | 0.93%
NN0 | 270 | | 0.0% | | 10 | 3.70%
NN1 | 7198 | all 514 (NN1-AJ0 130, NN1-NP0 92*, NN1-VVB 243, NN1-VVG 49) | 6.66% | 395 (76.84%) | 86 | 1.19%
NN2 | 2718 | all 55 (NN2-VVZ 55) | 1.98% | 48 (87.27%) | 30 | 1.10%
NP0 | 1385 | all 264 (NP0-NN1 264*) | 16.01% | 224 (84.84%) | 31 | 2.24%
ORD | 136 | | 0.0% | | 0 | 0.00%
PNI | 159 | all 8 (PNI-CRD 8) | 4.79% | 3 (37.50%) | 5 | 3.14%
PNP | 2646 | | 0.0% | | 0 | 0.00%
PNQ | 112 | | 0.0% | | 0 | 0.00%
PNX | 84 | | 0.0% | | 0 | 0.00%
POS | 217 | | 0.0% | | 5 | 2.30%
PRF | 1615 | | 0.0% | | 0 | 0.00%
PRP | 4051 | all 166 (PRP-AVP 132, PRP-CJS 34) | 3.94% | 154 (92.77%) | 24 | 0.59%
TO0 | 819 | | 0.0% | | 6 | 0.73%
UNC | 158 | | 0.0% | | 4 | 2.53%
VBB | 328 | | 0.0% | | 1 | 0.30%
VBD | 663 | | 0.0% | | 0 | 0.00%
VBG | 37 | | 0.0% | | 0 | 0.00%
VBI | 374 | | 0.0% | | 0 | 0.00%
VBN | 133 | | 0.0% | | 0 | 0.00%
VBZ | 640 | | 0.0% | | 4 | 0.63%
VDB | 87 | | 0.0% | | 0 | 0.00%
VDD | 71 | | 0.0% | | 0 | 0.00%
VDG | 10 | | 0.0% | | 0 | 0.00%
VDI | 36 | | 0.0% | | 0 | 0.00%
VDN | 20 | | 0.0% | | 0 | 0.00%
VDZ | 22 | | 0.0% | | 0 | 0.00%
VHB | 150 | | 0.0% | | 1 | 0.67%
VHD | 258 | | 0.0% | | 0 | 0.00%
VHG | 16 | | 0.0% | | 0 | 0.00%
VHI | 119 | | 0.0% | | 0 | 0.00%
VHN | 9 | | 0.0% | | 0 | 0.00%
VHZ | 116 | | 0.0% | | 1 | 0.86%
VM0 | 782 | | 0.0% | | 3 | 0.38%
VVB | 560 | all 84 (VVB-NN1 84) | 13.04% | 56 (66.67%) | 84 | 15.00%
VVD | 970 | all 90 (VVD-AJ0 11, VVD-VVN 79*) | 8.49% | 62 (58.89%) | 50 | 5.15%
VVG | 597 | all 132 (VVG-AJ0 83, VVG-NN1 49) | 18.11% | 112 (84.84%) | 9 | 1.51%
VVI | 1211 | | 0.0% | | 7 | 0.58%
VVN | 1086 | all 158 (VVN-AJ0 50, VVN-VVD 108*) | 12.70% | 113 (71.52%) | 27 | 2.49%
VVZ | 295 | all 26 (VVZ-NN2 26) | 8.10% | 14 (53.85%) | 11 | 3.73%
XX0 | 363 | | 0.0% | | 0 | 0.00%
ZZ0 | 75 | | 0.0% | | 3 | 4.00%
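As a sanity check on the column formulas, here is the arithmetic for the AJ0 row (a small Python sketch using the figures from the table above):

    b, c, f = 3412, 338, 46        # AJ0 row: single-tag count, ambiguity-tag count, error count
    ambiguity_rate = c / (b + c)   # column (d)
    error_rate = f / b             # column (g)
    print(f"{ambiguity_rate:.2%}  {error_rate:.2%}")   # prints: 9.01%  1.35%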
Another major source of difficulty, and a major contributor to the error rate of all known taggers, is unknown words. What tag do you give an unknown word?
Default strategy: look at what tag most unknown words get in some development test data, and use that tag for all unknown words.
What tag is that? Guess....
A possible augmentation for any tagger: a dictionary. [Reduces but does not eliminate unknown words.]
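One way such a dictionary fallback might be wired in (a Python sketch; the lexicon format, function name, and the NN1 default are assumptions, not a prescription):

    def tag_unknown(word, lexicon, default_tag="NN1"):
        """Fallback for a word never seen in training.
        lexicon: assumed external dictionary mapping word -> list of possible tags."""
        tags = lexicon.get(word)
        if tags:
            return tags[0]     # crude: a real tagger would disambiguate among these in context
        return default_tag     # the "guess" tag for words the dictionary also misses

    print(tag_unknown("perambulate", {"perambulate": ["VVI", "VVB"]}))   # -> VVI
    print(tag_unknown("zxqj", {}))                                       # -> NN1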
Reported error rates for some taggers:

Tagger | Error Rate | Note
Church HMM tagger | 1-5% | Depending on definition of "correct"
Garside et al. | 3-4% | Probabilistic plus idiom rules
DeRose | 3-4%, 5.6% | WSJ, other
Brill (initial) | 7.9% | "Simple" algorithm
Brill | 5% | 71 "patches" (rules)

(See Brill's tagger, below.)
Approaches

Two standard approaches:
- Rule-based
- Statistically based

Within rule-based we can distinguish two further types:
- Handwritten
- Machine-learned
Tagset Variation

Some tagsets are harder than others.

Tag Set | Basic Size | Total Tags
Brown | 87 | 179
Penn | 45 |
CLAWS1 | 132 |
CLAWS2 | 166 |
CLAWS c5 (BNC) | 62 |
London-Lund | 197 |

Brill says: "There are 192 tags in the Brown corpus, 96 of which occur more than 100 times."
BNC Tagset ("CLAWS", C5)
One potential task is to define a tagset that maximizes utility for parsing.
Brill's Tagger

Properties:
- Rule-based
- Rules are automatically learned.

A rule space. Possible rule forms:
- If a word is tagged a and it is in context C, then change that tag to b.
- If a word is tagged a and it has lexical property P, then change that tag to b.
- If a word is tagged a and a word in region R has lexical property P, then change that tag to b.

Possible patch templates (rule templates); a sketch of how one template can be applied follows this list. Change tag a to tag b when:
- The preceding (following) word is tagged z.
- The word 2 after (before) is tagged z.
- One of the two following (preceding) words is tagged z.
- One of the three following (preceding) words is tagged z.
- The preceding word is tagged z and the following word is tagged w.
- The preceding (following) word is tagged z and the word 2 before (after) is tagged w.
- The current word is (is not) capitalized.
- The previous word is (is not) capitalized.
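Here is a sketch of how one instance of these templates could be represented and applied (Python; the encoding and names are invented for illustration and are not Brill's actual implementation):

    def apply_prev_tag_rule(tagged, a, b, z):
        """One template instance: change tag a to tag b when the preceding word is tagged z.
        tagged: list of (word, tag) pairs; returns a new list."""
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == a and out[i - 1][1] == z:
                out[i] = (word, b)
        return out

    # Toy usage, mirroring the sample rule "NN VB PREV-TAG MD" listed further below:
    sent = [("the", "AT"), ("journey", "NN"), ("might", "MD"), ("kill", "NN"), ("her", "PPO")]
    print(apply_prev_tag_rule(sent, "NN", "VB", "MD"))
    # [('the', 'AT'), ('journey', 'NN'), ('might', 'MD'), ('kill', 'VB'), ('her', 'PPO')]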
Training Algorithm (supervised):
- Initial training [training corpus, 90%]: for each word, learn its most frequent tag. [Note: This is the first place where lexically specific info comes into play.]
- Patch acquisition [development test corpus, 5%] (a sketch of the scoring step follows this list):
  - Collect a list of error triples of the form [Taga, Tagb, Number].
  - For each error triple and each patch template, find the instantiated patch that gives the best net error gain, where net error gain = errors removed - errors added, and add that patch to the patch list.
- Runtime [test corpus, 5%]:
  - Tag the test corpus using the initial-training tagger.
  - Revise: apply each patch in turn, changing a word w's tag from a to b only when w occurs somewhere in the training corpus with tag b. [Note: This is the second place where lexically specific info comes into play.]
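A sketch of the scoring step in patch acquisition (Python; the helper names are invented, and the candidate-patch representation is left abstract):

    def net_error_gain(patch, current, gold, apply_patch):
        """Score one candidate patch on the development corpus.
        current:     list of (word, tag) pairs as currently tagged
        gold:        list of correct tags, aligned with current
        apply_patch: assumed helper, function(patch, tagged) -> retagged list
        Returns (errors removed) - (errors added)."""
        retagged = apply_patch(patch, current)
        removed = added = 0
        for (_, old), (_, new), right in zip(current, retagged, gold):
            if old != right and new == right:
                removed += 1   # the patch fixed an error
            elif old == right and new != right:
                added += 1     # the patch introduced an error
        return removed - added

    def best_patch(candidates, current, gold, apply_patch):
        """One round of greedy learning: keep the candidate with the highest net gain."""
        return max(candidates, key=lambda p: net_error_gain(p, current, gold, apply_patch))

    # Toy usage with a trivial patch representation ("retag every a as b"):
    swap = lambda p, tagged: [(w, p[1] if t == p[0] else t) for w, t in tagged]
    print(net_error_gain(("NN", "VB"), [("kill", "NN"), ("her", "PPO")], ["VB", "PPO"], swap))  # -> 1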
Note: there are two simple but important refinements of the initial training having to do with the treatment of unknown words (a sketch follows this list):
- Capitalized unknown words are tagged as proper names.
- Other unknown words are assigned the tag most common for words ending in the same 3 letters: blahblahous gets tagged as an adjective.
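A sketch of the 3-letter-suffix heuristic plus the capitalization rule (Python; the names and the example training data are invented):

    from collections import Counter, defaultdict

    def learn_suffix_tags(tagged_corpus, suffix_len=3):
        """Learn the most common tag for each word-final letter sequence."""
        by_suffix = defaultdict(Counter)
        for word, tag in tagged_corpus:
            by_suffix[word[-suffix_len:]][tag] += 1
        return {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}

    def guess_unknown(word, suffix_tags, suffix_len=3, default_tag="NN"):
        """Unknown word: capitalized -> proper noun; otherwise use its final letters."""
        if word[0].isupper():
            return "NP"   # proper name (Brown-style tag, as in the key below)
        return suffix_tags.get(word[-suffix_len:], default_tag)

    # Toy usage: "ous" words are adjectives in the (invented) training data,
    # so the unseen word "blahblahous" is guessed to be an adjective.
    model = learn_suffix_tags([("famous", "JJ"), ("various", "JJ"), ("nervous", "JJ")])
    print(guess_unknown("blahblahous", model))   # -> JJ
    print(guess_unknown("Edinburgh", model))     # -> NP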
Some sample rules found by Brill's algorithm:
- TO IN NEXT-TAG AT
- VBN VBD PREV-WORD-IS-CAP YES
- VBD VBN PREV-1-OR-2-OR-3-TAG HVD
- TO IN NEXT-WORD-IS-CAP YES
- NN VB PREV-TAG MD
- PPS PPO NEXT-TAG .
- VBN VBD PREV-TAG PPS
- NP NN CURRENT-WORD-IS-CAP NO
Key:
TO Infinitival to
AT Article
IN Preposition
VBD past tense verb
VBN past participle Verb
NP Proper Noun
NN Common Noun
MD Modal
PPO Objective (Accusative) Personal Pronoun
PPS Subject (Nominative) Personal Pronoun
HVD Had
Summary:
- Competitive with statistical taggers.
- Portable: doesn't depend on any particular tagset/corpus properties.
- Simple, with low memory overhead.