libpostal

Author	SHA1	Message	Date
Al	7a8f94330b	[parser] only adding ngrams in a hyphenated word if the subword is not rare	2017-01-09 02:53:33 -05:00
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction - removing geodb phrases - use Latin-ASCII-simple transliteration (no umlauts, etc.) - no digit normalization for admin component phrases and postcodes - tag = START + word, special feature for first word in the sequence - add the new admin boundary categories - for hyphenated non-phrase words, add each sub-word - for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features) - defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	6f37f9ae86	[merge] merging in master changes	2016-12-21 15:40:25 -05:00
Al	c6af5cc071	[parser] Adding country_region label to parser as a boundary component	2016-07-28 15:19:48 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	0a8f46bdc3	[parser] Using new geonames designations in parser features	2016-07-21 17:04:57 -04:00
Al	e816b4f77e	[parser] Ignore language/country options explicitly in the parser. The purpose of these options is to be able to create language-specific/country-specific models at some point; they shouldn't be used in the global model	2016-07-06 14:56:46 -04:00
Al	1b94727871	[fix] Check that parser is loaded in parse_address, log and return NULL instead of segfaulting	2016-03-21 18:04:26 -04:00
Al	d4143c1685	[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.	2016-01-15 20:07:21 -05:00
Al	b9bf5c629e	[fix] Moving address_parser_response_destroy into libpostal so caller can free	2015-12-15 00:52:24 -05:00
Al	fe4c528f26	[parser] Using different char_array for each of the potential phrases as token i	2015-12-12 03:23:26 -05:00
Al	e6303f70f3	[fix] removing printf	2015-12-11 02:53:22 -05:00
Al	88b8023ac8	[fix] Bug in address parser feature extraction, can hold onto the wrong pointer	2015-12-10 18:42:28 -05:00
Al	cfd0dc69f2	[parsing] Using the entire phrase as the ith word	2015-12-07 01:19:38 -05:00
Al	24208c209f	[parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).	2015-12-05 14:34:19 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00

17 Commits
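
Commit acd953ce51 above describes two ideas that are easier to follow with a small example: a "rare word" category that sits between unknown and common words, and character n-gram features of 3-6 characters for rare and unknown words. The sketch below is illustrative only and is not libpostal's C implementation. The function names, the threshold values, and the exact underscore convention (a leading "_" for an n-gram at the start of the word, a trailing "_" for one at the end) are assumptions made for this example.

```python
def word_category(count, unknown_threshold=2, rare_threshold=10):
    """Classify a word by its training-data frequency.

    Both threshold values are hypothetical. The commit only states that the
    rare-word cutoff n is greater than the unknown-word threshold.
    """
    if count < unknown_threshold:
        return "unknown"   # does not meet the minimum vocabulary count
    if count <= rare_threshold:
        return "rare"      # known, but infrequent
    return "common"


def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of length min_n..max_n for a word.

    Assumed marking convention: a leading "_" means the n-gram starts at
    the beginning of the word, a trailing "_" means it ends at the end.
    """
    features = []
    for n in range(min_n, max_n + 1):
        if n > len(word):
            break
        for start in range(len(word) - n + 1):
            gram = word[start:start + n]
            prefix = "_" if start == 0 else ""
            suffix = "_" if start + n == len(word) else ""
            features.append(prefix + gram + suffix)
    return features


def ngram_features(word, count):
    """Emit n-gram features only for rare and unknown words."""
    if word_category(count) == "common":
        return []
    return char_ngrams(word)
```

Under these assumptions, a rare word and an unseen word with similar spelling share many n-gram features, which is how known words can share statistical strength with unknown ones.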