Al
7a8f94330b
[parser] only adding ngrams in a hyphenated word if the subword is not rare
2017-01-09 02:53:33 -05:00
Al
db16e656ca
[parser/cli] adding .print_features option in address_parser client for debugging
2016-12-31 00:20:35 -05:00
Al
acd953ce51
[parser] first pass at new parser feature extraction
- removing geodb phrases
- use Latin-ASCII-simple transliteration (no umlauts, etc.)
- no digit normalization for admin component phrases and postcodes
- tag = START + word, special feature for first word in the sequence
- add the new admin boundary categories
- for hyphenated non-phrase words, add each sub-word
- for rare and unknown words, add ngram features of 3-6 characters with
  underscores to indicate beginnings and endings (similar to language
  classifier features)
- define the notion of "rare words" (known words with a frequency <= n, where
  n > the unknown word threshold), so known words can share
  statistical strength with artificial and real unknown words
2016-12-29 02:17:35 -05:00
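
The rare-word and n-gram items in the commit above can be illustrated with a small Python sketch. It is not the parser's C implementation. The function names, the default thresholds, and the exact underscore convention (a leading "_" marks the start of the word, a trailing "_" marks the end) are assumptions made for illustration only.

```python
def word_class(count, unknown_threshold=1, rare_threshold=5):
    """Classify a word by training frequency (threshold values are hypothetical)."""
    if count <= unknown_threshold:
        return "unknown"
    if count <= rare_threshold:  # known, but with frequency <= n, n > unknown threshold
        return "rare"
    return "common"


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of length min_n..max_n.

    Assumed convention: "_" is prepended when the n-gram starts the word
    and appended when it ends the word.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if i == 0:
                gram = "_" + gram
            if i + n == len(word):
                gram = gram + "_"
            grams.append(gram)
    return grams


def ngram_features(word, count):
    """Only rare and unknown words receive n-gram features."""
    if word_class(count) == "common":
        return []
    return char_ngrams(word)
```

Because rare known words and unknown words produce the same kind of n-gram features, weights learned on the former can carry over to the latter, which is the "shared statistical strength" the commit message describes.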
Al
6f37f9ae86
[merge] merging in master changes
2016-12-21 15:40:25 -05:00
Al
c6af5cc071
[parser] Adding country_region label to parser as a boundary component
2016-07-28 15:19:48 -04:00
Al
44908ff95a
[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.). Phrases now include the new island type, and house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces
2016-07-21 17:04:57 -04:00
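
A minimal Python sketch of the hyphen handling described in the commit above, assuming a phrase dictionary that simply stores every variant of a name. The helper name `phrase_variants` is hypothetical and is not part of the parser's API.

```python
def phrase_variants(phrase):
    """Return the phrase plus, if it is hyphenated, a copy with hyphens as spaces."""
    variants = [phrase]
    if "-" in phrase:
        # Replace hyphens with spaces and collapse any repeated whitespace.
        spaced = " ".join(phrase.replace("-", " ").split())
        if spaced and spaced != phrase:
            variants.append(spaced)
    return variants
```

With this, both "Carmel-by-the-Sea" and "Carmel by the Sea" would be added to the dictionary, so either spelling in the input can match the phrase.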
Al
0a8f46bdc3
[parser] Using new geonames designations in parser features
2016-07-21 17:04:57 -04:00
Al
e816b4f77e
[parser] Ignore language/country options explicitly in the parser. The purpose of these options is to be able to create language-specific/country-specific models at some point; they shouldn't be used in the global model
2016-07-06 14:56:46 -04:00
Al
1b94727871
[fix] Check that parser is loaded in parse_address, log and return NULL instead of segfaulting
2016-03-21 18:04:26 -04:00
Al
d4143c1685
[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.
2016-01-15 20:07:21 -05:00
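
The prediction-time shortcut in the commit above might look roughly like the following Python sketch. The table `PHRASE_LABEL_COUNTS`, its counts, the lowercase normalization, and the `tagger` callback are all made up for illustration; the real implementation lives in the C library and may differ.

```python
from collections import Counter

# Hypothetical label counts per known phrase, as if derived from training data.
PHRASE_LABEL_COUNTS = {
    "new york": Counter({"city": 120, "state": 80}),
    "florida": Counter({"state": 200, "house": 3}),
}


def parse(text, tagger):
    """Return (token, label) pairs; short-circuit when the whole input is one known phrase."""
    key = " ".join(text.lower().split())
    counts = PHRASE_LABEL_COUNTS.get(key)
    if counts:
        # Entire input is a single known phrase: use its most frequent label.
        label, _ = counts.most_common(1)[0]
        return [(key, label)]
    # Otherwise fall back to the sequence model (passed in as a callable here).
    return tagger(key.split())
```

Since the lookup happens only in `parse`, training is unaffected, which matches the note in the commit message.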
Al
b9bf5c629e
[fix] Moving address_parser_response_destroy into libpostal so caller can free
2015-12-15 00:52:24 -05:00
Al
fe4c528f26
[parser] Using different char_array for each of the potential phrases as token i
2015-12-12 03:23:26 -05:00
Al
e6303f70f3
[fix] removing printf
2015-12-11 02:53:22 -05:00
Al
88b8023ac8
[fix] Bug in address parser feature extraction, can hold onto the wrong pointer
2015-12-10 18:42:28 -05:00
Al
cfd0dc69f2
[parsing] Using the entire phrase as the ith word
2015-12-07 01:19:38 -05:00
Al
24208c209f
[parsing] Adding a training-data-derived index of complete phrases from suburb up to country. Only adding bias and word features for non-phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).
2015-12-05 14:34:19 -05:00
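
The UNKNOWN_WORD / UNKNOWN_NUMERIC substitution in the commit above can be sketched in Python as follows. The function name, the `min_count` default, and the rule that any digit makes a token "numeric" are assumptions; only the two placeholder names come from the commit message.

```python
def normalize_token(token, vocab_counts, min_count=5):
    """Map infrequent tokens to placeholder features (threshold is hypothetical)."""
    if vocab_counts.get(token, 0) >= min_count:
        return token  # frequent enough to keep as its own word feature
    if any(ch.isdigit() for ch in token):
        return "UNKNOWN_NUMERIC"
    return "UNKNOWN_WORD"
```

For example, with `vocab_counts = {"main": 50}`, the token "main" is kept, while "221b" becomes UNKNOWN_NUMERIC and "xyzzy" becomes UNKNOWN_WORD.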
Al
89677d94a3
[parsing] Initial commit of the address parser, training/testing, feature function, I/O
2015-11-30 14:48:13 -05:00