Package fsa_recognizer :: Module read_fst_file

Module read_fst_file

This module contains code for turning different file representations of FSTs into Python FST reps: a list of dictionaries (with upper-lg keys) of dictionaries (with lower lgs keys). These can then be used by the transducer generation and analysis code in fsa_recognizer.fta_rec.

The toplevel function is read_fst_file.make_fst_from_file, which takes a compiler instance (particular to the format of the file being read) and a filename argument, returning an FST.

Implemented formats include ATT graphviz ("dot") files and xerox xfst ".net" files.

Classes

[hide private]

FstCompiler
Class for compiling FST representations in various file formats into equivalent raw Python dictionaries such that:

DotFstCompiler
Code for compiling an ATT graphviz ("dot") representation of an FST into a Python FST rep.

XfstCompiler
Code for compiling an Xerox xfst representation of an FST into a Python FST rep.

Functions

[hide private]

make_fst_from_file(file, Compiler)
Open source file file, call the appropriate compiler CompilerInst, and turn it into a Python dictionary based represntation of an FST, suitable for the code in in fsa_recognizer.fta_rec. source code

make_fst(feasible_pairs, raw_dictionary, Compiler)
We assume states are string reps of ints and the state numbered 0 is always initial state.

source code

a set (of pairs of chars)

get_feasible_pairs(raw_dic, Compiler)
Assumptions for FST:

source code

add_to_char_map_if_necessary(char_map, char)

source code

list of characters

expand_char_interval(low, high)
Expand a char interval rep like a..z into a list of characters.

source code

blank_line(line)

source code

Variables

[hide private]

alphabet = 'abcdefghijklmnopqrstuvwxyz'

graph_type_re = re.compile(r'digraph')

blank_line_re = re.compile(r'\s+$')

node_dec_re = re.compile(r'\s*node\s*\[(.*)\]\s*;')

node_name_re = re.compile(r'\s*(\w+)\s*\[(.*)\]\s*;')

node_connection_re = re.compile(r'\s*(\w+)\s*->\s*(\w+)\s*\[(....

label_re = re.compile(r'[ ,\{\}]')

char_interval_re = re.compile(r'(.)\.\.(.)')

att_val_re = re.compile(r'\s*,?(\w+)\s*=\s*(\d+)')

att_val_re2 = re.compile(r'\s*,?(\w+)\s*=\s*"(.+?)"')

flags_re = re.compile(r'Flags:(.*)$')

net_dec_re = re.compile(r'Net:')

sigma_dec_re = re.compile(r'Sigma:(.*)$')

size_dec_re = re.compile(r'Size:\s*(\d+),')

arity_dec_re = re.compile(r'Arity:\s*(\d+)')

state_desc_re = re.compile(r'(.?s\d+):(:?\s*(.+)\s*->\s*(.+)\....

Function Details

[hide private]

make_fst_from_file(file, Compiler)

source code

Open source file file, call the appropriate compiler CompilerInst, and turn it into a Python dictionary based represntation of an FST, suitable for the code in in fsa_recognizer.fta_rec.

Assumptions for the dot file:

node names are string reps of integers
The node numbered 0 is the start node
epsilon rep is as defined in upper/lower char_map.

Assumptions for FST:

upper language comes FIRST in state-transition fns (STFs): That is, an STF is a function from upper-language chars to functions from lower lang chars to state-sets (FSA transition functions on the lower language).

Parameters:

file - string (filename path) @rtype fst (pair of dictionary and tuple)

make_fst(feasible_pairs, raw_dictionary, Compiler)

source code

We assume states are string reps of ints and the state numbered 0 is always initial state.

We create an FST whose state dictionaries are all totally defined for the set of feasiable pairs, filling in the gaps in raw_dictionary with transitions to ('und',).

We sort the dictionary items according to the individual compiler's cmp fn, assuming that the start state will get ranked first.

We assume every state fn in raw_dic is a dictionary. Two keys are special:

'final' takes a boolean value specifying whether this is a final state.
'start' takes a boolean value specifying whether this is a start state. (not used)

All other keys are upper lg chars whose values are dictionaries whose keys are lower lg chars:

  raw_dic[state][upper][lower] = the set of states transitioned to
                                 in teh state C{state} when upper
                                 char C{char} is paired with lower lg
                                 char C{lower}

get_feasible_pairs(raw_dic, Compiler)

source code

Assumptions for FST:

The set of feasible pairs includes some initial set passed in as initial_freasible_pairs. Usually this is the identity mapping for all lower case alphabetic characters.
All other possible mappings are instantiated in the raw dictionary as pairings of chars in the upper and lower alphabet.

Search the raw_dic rep of the FST, adding any new pairs found to the initial mapping.

Parameters:

raw_dic - a dictionary

Returns: a set (of pairs of chars)

expand_char_interval(low, high)

source code

Expand a char interval rep like a..z into a list of characters.

Parameters:

low - character the lower bound character in the interval
high - character the upper bound character in the interval

Returns: list of characters

Variables Details

[hide private]

node_connection_re

Value:

re.compile(r'\s*(\w+)\s*->\s*(\w+)\s*\[(.*)\]\s*;')

state_desc_re

Value:

re.compile(r'(.?s\d+):(:?\s*(.+)\s*->\s*(.+)\.|\s*\(no arcs\))')