Commit Graph

15 Commits

Author SHA1 Message Date
Al
db16e656ca [parser/cli] adding .print_features option in address_parser client for debugging 2016-12-31 00:20:35 -05:00
Al
acd953ce51 [parser] first pass at new parser feature extraction
- removing geodb phrases
- use Latin-ASCII-simple transliteration (no umlauts, etc.)
- no digit normalization for admin component phrases and postcodes
- tag = START + word, special feature for first word in the sequence
- add the new admin boundary categories
- for hyphenated non-phrase words, add each sub-word
- for rare and unknown words, add ngram features of 3-6 characters with
  underscores to indicate beginnings and endings (similar to language
  classifier features)
- defines notion of "rare words" (known words with a frequency <= n where
  n > the unknown word threshold), so known words can share
  statistical strength with artificial and real unknown words
2016-12-29 02:17:35 -05:00
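The n-gram and rare-word items in the message above can be sketched roughly as follows. This is a minimal illustration and not the parser's actual C code: the function names, the feature strings, the two threshold values, and the choice to mark word boundaries by padding the word with one underscore on each side are all assumptions made for the example.

```python
def char_ngrams(word, min_n=3, max_n=6, boundary="_"):
    """Return character n-grams of the word after padding it with a boundary marker.

    N-grams that touch the start or end of the word carry the underscore,
    so a prefix like "_ma" is distinct from an interior "mai".
    """
    padded = f"{boundary}{word}{boundary}"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams


def word_features(word, counts, unknown_threshold=1, rare_threshold=5):
    """Build features for one word (thresholds are hypothetical values).

    - frequency <= unknown_threshold: treated as unknown, n-gram features only
    - frequency <= rare_threshold: known but rare, word feature plus n-grams
    - otherwise: word feature only
    """
    freq = counts.get(word, 0)
    features = []
    if freq > unknown_threshold:
        features.append(f"word={word}")
    if freq <= rare_threshold:
        features.extend(f"ngram={g}" for g in char_ngrams(word))
    return features
```

Because rare known words emit the same n-gram features as unknown words, weights learned on the former can carry over to the latter, which is the "share statistical strength" point in the message.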
Al
22c4e99ea0 [parser] As part of reading/tokenizing the address parser data set,
several copies of the same training example will be generated.

1. with only lowercasing
2. with simple Latin-ASCII normalization (no umlauts, only things that
   are common to all languages)
3. with basic UTF-8 normalizations (accent stripping)
4. with language-specific Latin-ASCII transliteration (e.g. ü => ue in German)

This will apply both on the initial passes when building the phrase
gazetteers and during each iteration of training. In this way, only the
most basic normalizations like lowercasing need to be done at runtime.

May have a small effect on randomization as examples are created in a
deterministic order. However, this should not lead to cycles since the
base examples are shuffled, thus still satisfying the random permutation
requirement of an online/stochastic learning algorithm.
2016-12-02 13:09:03 -05:00
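A rough sketch of the scheme described in the message above: shuffle the base examples, then emit the normalized copies of each one in a fixed order. The two character tables are tiny stand-ins invented for this example (the real transliteration rule sets are much larger), the function names are hypothetical, and dropping identical copies is an assumption of the sketch that the message does not state.

```python
import random
import unicodedata

# Tiny stand-in tables, for illustration only.
SIMPLE_LATIN_ASCII = {"ß": "ss", "æ": "ae", "œ": "oe"}
GERMAN_LATIN_ASCII = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def replace_chars(text, table):
    """Replace each character that has an entry in the table."""
    return "".join(table.get(ch, ch) for ch in text)


def strip_accents(text):
    """Decompose, drop combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalized_copies(text, language=None):
    """Return the distinct normalized versions of one training example."""
    lowered = text.lower()
    candidates = [
        lowered,                                      # 1. lowercasing only
        replace_chars(lowered, SIMPLE_LATIN_ASCII),   # 2. simple Latin-ASCII
        strip_accents(lowered),                       # 3. accent stripping
    ]
    if language == "de":                              # 4. language-specific
        candidates.append(replace_chars(lowered, GERMAN_LATIN_ASCII))
    # Assumption of this sketch: identical copies are emitted only once.
    return list(dict.fromkeys(candidates))


def training_stream(examples, seed=None):
    """Shuffle the base examples, then yield each one's copies in fixed order."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    for text, language in shuffled:
        yield from normalized_copies(text, language)
```

For example, `normalized_copies("Münchner Straße", "de")` yields four distinct strings, while plain ASCII input such as `"Main St"` collapses to a single lowercased copy.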
Al
20aad99a38 [parser] enum just lists boundary types 2016-07-30 17:07:23 -04:00
Al
1b09b7f2e5 [fix] Adding country_region to address_parser_train 2016-07-28 16:18:32 -04:00
Al
c6af5cc071 [parser] Adding country_region label to parser as a boundary component 2016-07-28 15:19:48 -04:00
Al
44908ff95a [parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.). Phrases now include the new island type and house number phrases, if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, a version with the hyphens replaced by spaces is also added to the phrase dictionary 2016-07-21 17:04:57 -04:00
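The hyphen handling in the message above might look something like this. It is only a sketch: `phrase_variants` and `add_phrase` are hypothetical names, and a plain dict of label counts stands in for the parser's phrase dictionary.

```python
def phrase_variants(phrase):
    """Return the phrase plus, if hyphenated, a copy with hyphens as spaces."""
    variants = [phrase]
    if "-" in phrase:
        variants.append(" ".join(part for part in phrase.split("-") if part))
    return variants


def add_phrase(dictionary, phrase, label):
    """Count the label under every variant of the phrase."""
    for variant in phrase_variants(phrase):
        labels = dictionary.setdefault(variant, {})
        labels[label] = labels.get(label, 0) + 1
```

With this, adding `"Carmel-by-the-Sea"` as a city makes both the hyphenated form and `"Carmel by the Sea"` available for lookup.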
Al
c383f8af88 [parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections 2016-07-21 17:04:57 -04:00
Al
d4143c1685 [parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction. 2016-01-15 20:07:21 -05:00
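The prediction-time shortcut described in the message above can be sketched as below. The phrase table and its counts are made up for the example, `parse` and `tag_tokens` are hypothetical names, and `tag_tokens` stands in for whatever sequence model would otherwise label the input.

```python
from collections import Counter

# Hypothetical label counts per phrase, as if gathered from training data.
PHRASE_LABELS = {
    "new york": Counter({"city": 900, "state": 450}),
    "florida": Counter({"state": 800, "road": 40}),
}


def parse(text, tag_tokens):
    """Return (text, label) pairs for the input.

    If the whole input is a single known phrase, return its most frequent
    training label directly; otherwise fall back to the token tagger.
    """
    key = " ".join(text.lower().split())
    labels = PHRASE_LABELS.get(key)
    if labels:
        label, _ = labels.most_common(1)[0]
        return [(key, label)]
    return tag_tokens(key.split())
```

The lookup only fires when the entire input matches a phrase, so an input such as `"12 Main St"` still goes through the normal tagger.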
Al
b9bf5c629e [fix] Moving address_parser_response_destroy into libpostal so caller can free 2015-12-15 00:52:24 -05:00
Al
bce6ba2595 [fix] typedef 2015-12-12 11:58:41 -05:00
Al
a8d6cc4053 [api] Moving parse_address definition into libpostal.h 2015-12-12 03:55:31 -05:00
Al
3de59506ae [parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations) 2015-12-10 18:08:51 -05:00
Al
f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city 2015-12-05 14:21:07 -05:00
Al
89677d94a3 [parsing] Initial commit of the address parser, training/testing, feature function, I/O 2015-11-30 14:48:13 -05:00