libpostal

Author	SHA1	Message	Date
Al	4677874610	[parser] Stripping phrases like CP (in Spanish) from postal codes before adding them to the gazetteers, whether the phrase is concatenated to the code or a separate token. Adding a command-line argument for the number of iterations	2016-11-30 15:58:03 -08:00
Al	1b09b7f2e5	[fix] Adding country_region to address_parser_train	2016-07-28 16:18:32 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	16501aba17	[fix] Need to load transliteration module for Latin-ASCII normalization	2016-07-21 17:04:57 -04:00
Al	6ef7c90278	[fix] using string_equals, handles NULLs	2016-01-05 14:08:10 -05:00
Al	24208c209f	[parsing] Adding a training-data-derived index of complete phrases from suburb up to country. Only adding bias and word features for non-phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (those not meeting the minimum vocab count threshold).	2015-12-05 14:34:19 -05:00
Al	116fe857db	[parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac	2015-12-01 11:24:44 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00

8 Commits