[parser] first pass at new parser feature extraction

- remove geodb phrases
- use Latin-ASCII-simple transliteration (no umlauts, etc.)
- no digit normalization for admin component phrases and postcodes
- tag = START + word, special feature for first word in the sequence
- add the new admin boundary categories
- for hyphenated non-phrase words, add each sub-word
- for rare and unknown words, add ngram features of 3-6 characters with
  underscores to indicate beginnings and endings (similar to language
  classifier features)
- define a notion of "rare words" (known words with a frequency <= n, where
  n > the unknown word threshold), so that known words can share
  statistical strength with artificial and real unknown words
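
A rough sketch of how the rare/unknown word features listed above could fit
together. It is an illustration only, not the code from this commit (the diff
is not shown here). All names (char_ngrams, frequency_class, word_features),
the feature string formats, and the two thresholds are hypothetical. The
padding scheme, with one underscore on each side of the word, is an assumed
reading of "underscores to indicate beginnings and endings".

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of length min_n..max_n over the padded word.

    Assumption: the word is wrapped in underscores so that n-grams touching
    the start or end of the word are distinguishable from interior ones.
    """
    padded = "_" + word + "_"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def frequency_class(word, counts, unknown_max=1, rare_max=5):
    """Classify a word by training-set frequency.

    Hypothetical thresholds: a count <= unknown_max is treated as unknown,
    a count <= rare_max (with rare_max > unknown_max) as rare, and anything
    above that as common.
    """
    count = counts.get(word, 0)
    if count <= unknown_max:
        return "unknown"
    if count <= rare_max:
        return "rare"
    return "common"


def word_features(word, counts):
    """Build string features for a single token."""
    feats = []
    klass = frequency_class(word, counts)

    # Known words (common or rare) keep their own word feature.
    if klass != "unknown":
        feats.append("word=" + word)

    # Rare and unknown words both get n-gram features, so rare known words
    # share weights with unknown words that look similar.
    if klass != "common":
        feats.extend("ngram=" + g for g in char_ngrams(word))

    # Hyphenated words also contribute each sub-word (phrase lookup, which
    # the commit uses to skip phrase words, is omitted from this sketch).
    if "-" in word:
        feats.extend("sub=" + part for part in word.split("-") if part)

    return feats
```

With counts = {"main": 100, "elm": 3}, "main" yields only its word feature,
"elm" yields its word feature plus n-grams such as "_el" and "lm_", and an
unseen word yields n-grams only.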
This commit is contained in:
Al
2016-12-29 02:17:05 -05:00
parent e62101b8bf
commit acd953ce51
2 changed files with 492 additions and 295 deletions

File diff suppressed because it is too large