Al
6a079e86b3
[fix] using size_t instead of int in address_parser/address_parser_train
2017-02-20 19:22:13 -08:00
Al
8ea5405c20
[parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction)
2017-02-19 14:21:58 -08:00
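To illustrate the idea in this commit, here is a minimal sketch of keeping tag-history features in separate arrays so that the tagger, not the feature function, combines them with candidate previous tags. The function names and feature strings are hypothetical and are not taken from libpostal.

```python
def token_features(tokens, i):
    """Run once per token. Returns three lists: plain features, features to
    be combined with the previous tag, and features to be combined with the
    previous two tags. (Hypothetical feature names.)"""
    word = tokens[i].lower()
    features = ["bias", "word=" + word]
    prev_tag_features = ["word=" + word]
    prev2_tag_features = ["word=" + word]
    return features, prev_tag_features, prev2_tag_features


def with_history(extracted, prev, prev2):
    """Tagger-side step: attach a candidate tag history to the arrays
    produced by token_features, without re-running feature extraction."""
    features, prev_feats, prev2_feats = extracted
    out = list(features)
    out += ["prev=%s|%s" % (prev, f) for f in prev_feats]
    out += ["prev2=%s,prev=%s|%s" % (prev2, prev, f) for f in prev2_feats]
    return out
```

With this split, a best-sequence search could call `token_features` once per token and `with_history` once per candidate tag history.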
Al
da856ea5c3
[parser] adding phrase features for category, unit, level, entrance, staircase, and po_box phrases from the libpostal dictionaries, excluding phrases which match the toponyms dictionary (e.g. US states that can also be found in street/venue names, useful for expansion but not here). If the current token is part of both an address dictionary phrase and a component phrase derived from the training data, use the longer of the two, or both if they are the same length
2017-02-17 03:00:48 -05:00
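The "longer of the two, or both if equal" rule from this commit can be sketched as follows. Phrases are represented here as lists of tokens, and `choose_phrases` is a hypothetical helper, not a libpostal function.

```python
def choose_phrases(dictionary_phrase, component_phrase):
    """Pick which phrase(s) to use for the current token.

    Each argument is a list of tokens, or None if the token is not part
    of a phrase of that kind. Returns a list of the phrases to use."""
    candidates = [p for p in (dictionary_phrase, component_phrase) if p is not None]
    if len(candidates) < 2:
        return candidates
    if len(dictionary_phrase) > len(component_phrase):
        return [dictionary_phrase]
    if len(component_phrase) > len(dictionary_phrase):
        return [component_phrase]
    # Same length: keep both
    return [dictionary_phrase, component_phrase]
```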
Al
c380b3e91b
[parser] phrase search with address dictionaries should not use the language given at training time since it's not currently available at runtime (without pulling in the language classifier, which may be warranted at some point, especially if the model can be made smaller/sparser)
2017-02-15 22:32:30 -05:00
Al
8abfa766fd
[fix] paren
2017-02-15 02:26:18 -05:00
Al
8eafc5730b
[parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle"
2017-02-14 18:42:51 -05:00
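As a rough illustration of the long-context idea in this commit, the sketch below locates the first numeric token and the first street-level word, then emits coarse position features. The tiny `STREET_WORDS` set and the feature names are assumptions made for the example; libpostal uses its own dictionaries and feature strings.

```python
# Hypothetical stand-in for a street-type dictionary
STREET_WORDS = {"ave", "avenue", "calle", "street", "st", "rue"}


def first_index(tokens, predicate):
    """Index of the first token satisfying predicate, or None."""
    for i, token in enumerate(tokens):
        if predicate(token):
            return i
    return None


def position_label(index):
    if index is None:
        return "none"
    return "first" if index == 0 else "later"


def long_context_features(tokens):
    num_idx = first_index(tokens, lambda t: any(c.isdigit() for c in t))
    street_idx = first_index(tokens, lambda t: t.lower().strip(".") in STREET_WORDS)
    feats = [
        "first_numeric=" + position_label(num_idx),
        "first_street=" + position_label(street_idx),
    ]
    if num_idx is not None and street_idx is not None and num_idx != street_idx:
        order = "numeric_before_street" if num_idx < street_idx else "street_before_numeric"
        feats.append("order=" + order)
    return feats
```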
Al
b570855b78
[parser] adding postcode context features and associated data structures to the parser. Masking digits, which should hopefully help with generalization. Creating positive/negative features for postcode with and without context support. Note: even with known postcodes in known contexts, only use the masked digits to avoid creating too many features that are redundant with the index.
2017-02-10 03:41:14 -05:00
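Digit masking, as mentioned in this commit, can be sketched in a few lines. The mask character `D` and the function name are assumptions for illustration only.

```python
def mask_digits(token, mask="D"):
    """Replace every ASCII digit with a placeholder so that postcodes with
    the same shape (e.g. five digits) share one feature."""
    return "".join(mask if "0" <= ch <= "9" else ch for ch in token)
```

Under this sketch, `mask_digits("90210")` and `mask_digits("10001")` both give `DDDDD`, which is the kind of generalization the commit message is hoping for.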
Al
1aacb5bccc
Merge branch 'master' into parser-data
2017-02-09 15:09:28 -05:00
Al
38c6c26146
[fix] freeing normalized string in address_parser_parse
2017-02-09 14:33:13 -05:00
Al
0380f565d2
[parser] shorter first word feature
2017-01-29 22:10:28 -05:00
Al
b320aed9ac
[merge] merging master
2017-01-13 19:58:49 -05:00
Al
df89387b5c
[fix] calloc instead of malloc when performing initialization on structs that may fail halfway and need to clean up while partially initialized (calloc will set all the bytes to zero so the member pointers are NULL instead of garbage memory)
2017-01-13 18:30:04 -05:00
Al
7a8f94330b
[parser] only adding ngrams in a hyphenated word if the subword is not rare
2017-01-09 02:53:33 -05:00
Al
db16e656ca
[parser/cli] adding .print_features option in address_parser client for debugging
2016-12-31 00:20:35 -05:00
Al
acd953ce51
[parser] first pass at new parser feature extraction
- removing geodb phrases
- use Latin-ASCII-simple transliteration (no umlauts, etc.)
- no digit normalization for admin component phrases and postcodes
- tag = START + word, special feature for first word in the sequence
- add the new admin boundary categories
- for hyphenated non-phrase words, add each sub-word
- for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features)
- defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words
2016-12-29 02:17:35 -05:00
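The n-gram and rare-word items in the list above can be sketched like this. The padding convention (one underscore on each side of the word), the threshold values, and the function names are all assumptions chosen for the example; the exact convention used by libpostal may differ.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word, with '_' marking its beginning and end."""
    padded = "_" + word + "_"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def is_rare(count, min_count=2, rare_max=5):
    """A word is 'rare' if it is known (count >= min_count) but its
    frequency does not exceed rare_max, where rare_max > min_count.
    Both thresholds are hypothetical values."""
    return min_count <= count <= rare_max


def ngram_features(word, count, min_count=2, rare_max=5):
    """Only rare and unknown words receive n-gram features."""
    if count > rare_max:
        return []
    return ["ngram=" + g for g in char_ngrams(word)]
```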
Al
6f37f9ae86
[merge] merging in master changes
2016-12-21 15:40:25 -05:00
Al
c6af5cc071
[parser] Adding country_region label to parser as a boundary component
2016-07-28 15:19:48 -04:00
Al
44908ff95a
[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces
2016-07-21 17:04:57 -04:00
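A minimal sketch of the hyphen handling described for names like Carmel-by-the-Sea: alongside the original form, add a variant with hyphens replaced by spaces. `phrase_variants` is a hypothetical helper name.

```python
def phrase_variants(name):
    """Return the name plus, if it contains hyphens, a copy in which the
    hyphens are replaced with spaces."""
    variants = [name]
    if "-" in name:
        spaced = " ".join(part for part in name.split("-") if part)
        if spaced and spaced != name:
            variants.append(spaced)
    return variants
```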
Al
0a8f46bdc3
[parser] Using new geonames designations in parser features
2016-07-21 17:04:57 -04:00
Al
e816b4f77e
[parser] Ignore language/country options explicitly in the parser. The purpose of these options is to be able to create language-specific/country-specific models at some point, so they shouldn't be used in the global model
2016-07-06 14:56:46 -04:00
Al
1b94727871
[fix] Check that parser is loaded in parse_address, log and return NULL instead of segfaulting
2016-03-21 18:04:26 -04:00
Al
d4143c1685
[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.
2016-01-15 20:07:21 -05:00
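The single-phrase shortcut in this commit could look roughly like the sketch below. The in-memory dictionary of label counts and the function name are assumptions; the commit only says that the most likely label from the training data is returned.

```python
from collections import Counter


def most_likely_label(text, phrase_label_counts):
    """If the whole input is one known phrase, return its most frequent
    training label; otherwise return None so the caller falls back to
    the regular sequence model."""
    key = " ".join(text.lower().split())
    counts = phrase_label_counts.get(key)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical counts, for illustration only
example_counts = {
    "florida": Counter({"state": 900, "road": 12}),
    "new york": Counter({"city": 700, "state": 400}),
}
```

With these made-up counts, an input of just `Florida` would be labeled `state` without running the tagger.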
Al
b9bf5c629e
[fix] Moving address_parser_response_destroy into libpostal so caller can free
2015-12-15 00:52:24 -05:00
Al
fe4c528f26
[parser] Using different char_array for each of the potential phrases as token i
2015-12-12 03:23:26 -05:00
Al
e6303f70f3
[fix] removing printf
2015-12-11 02:53:22 -05:00
Al
88b8023ac8
[fix] Bug in address parser feature extraction, can hold onto the wrong pointer
2015-12-10 18:42:28 -05:00
Al
cfd0dc69f2
[parsing] Using the entire phrase as the ith word
2015-12-07 01:19:38 -05:00
Al
24208c209f
[parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).
2015-12-05 14:34:19 -05:00
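The handling of infrequent tokens in this commit can be sketched as a vocabulary lookup with a minimum count. The threshold value, the digit test used to decide what counts as numeric, and the function name are assumptions; only the `UNKNOWN_WORD` and `UNKNOWN_NUMERIC` placeholders come from the commit message.

```python
def normalize_token(token, vocab_counts, min_count=5):
    """Keep tokens that meet the minimum vocabulary count; map the rest
    to a shared placeholder depending on whether they contain a digit."""
    if vocab_counts.get(token, 0) >= min_count:
        return token
    if any(ch.isdigit() for ch in token):
        return "UNKNOWN_NUMERIC"
    return "UNKNOWN_WORD"
```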
Al
89677d94a3
[parsing] Initial commit of the address parser, training/testing, feature function, I/O
2015-11-30 14:48:13 -05:00