libpostal

Author	SHA1	Message	Date
Al	182d60b623	[fix] removing include	2017-02-23 22:45:03 -05:00
Al	6a079e86b3	[fix] using size_t instead of int in address_parser/address_parser_train	2017-02-20 19:22:13 -08:00
Al	8ea5405c20	[parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction)	2017-02-19 14:21:58 -08:00
Al	ba0ccc82a3	[fix] var name in address_parser_train	2017-02-15 22:22:33 -05:00
Al	ff245d74f8	[parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index.	2017-02-10 00:50:48 -05:00
Al	174529e8d0	[parser] remove geodb and fix small memory leak in address_parser_train	2016-12-29 02:12:06 -05:00
Al	4677874610	[parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations	2016-11-30 15:58:03 -08:00
Al	1b09b7f2e5	[fix] Adding country_region to address_parser_train	2016-07-28 16:18:32 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	16501aba17	[fix] Need to load transliteration module for Latin-ASCII normalization	2016-07-21 17:04:57 -04:00
Al	6ef7c90278	[fix] using string_equals, handles NULLs	2016-01-05 14:08:10 -05:00
Al	24208c209f	[parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).	2015-12-05 14:34:19 -05:00
Al	116fe857db	[parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac	2015-12-01 11:24:44 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00

14 Commits
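
The postal code index described in commit ff245d74f8 (and the phrase stripping from 4677874610) can be pictured with a minimal sketch. This is Python for illustration only and is not libpostal's C implementation: the names `strip_postal_phrase` and `build_postal_index`, the one-entry phrase list, and the `(postcode, admin_names)` input format are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical phrase list; the commit messages only mention "CP" in Spanish.
POSTAL_PHRASES = ("cp",)


def strip_postal_phrase(postcode):
    """Drop a leading marker such as "CP", whether it is a separate
    token ("CP 28001") or concatenated with the code ("CP28001")."""
    text = postcode.strip().lower()
    for phrase in POSTAL_PHRASES:
        # Marker, optional dots/whitespace, then a remainder starting with a digit.
        match = re.match(r"^" + re.escape(phrase) + r"[.\s]*(\d.*)$", text)
        if match:
            return match.group(1)
    return text


def build_postal_index(examples):
    """Map each normalized postal code to the sorted admin names seen with it."""
    index = defaultdict(set)
    for postcode, admin_names in examples:
        key = strip_postal_phrase(postcode)
        for name in admin_names:
            index[key].add(name.strip().lower())
    return {code: sorted(names) for code, names in index.items()}


# Made-up training examples: (postal code as written, admin components).
examples = [
    ("11216", ["Brooklyn", "NY"]),
    ("CP 28001", ["Madrid"]),
    ("CP28001", ["Madrid", "ES"]),
]
index = build_postal_index(examples)
```

With these made-up examples, both spellings of the Spanish code collapse to the single key `"28001"`, and `index["11216"]` holds the admin contexts from the commit message's own example.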