libpostal

Author	SHA1	Message	Date
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction - removing geodb phrases - use Latin-ASCII-simple transliteration (no umlauts, etc.) - no digit normalization for admin component phrases and postcodes - tag = START + word, special feature for first word in the sequence - add the new admin boundary categories - for hyphenated non-phrase words, add each sub-word - for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features) - defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	22c4e99ea0	[parser] As part of reading/tokenizing the address parser data set, several copies of the same training example will be generated: 1. with only lowercasing 2. with simple Latin-ASCII normalization (no umlauts, only things that are common to all languages) 3. basic UTF-8 normalizations (accent stripping) 4. language-specific Latin-ASCII transliteration (e.g. ü => ue in German). This applies both on the initial passes when building the phrase gazetteers and during each iteration of training. In this way, only the most basic normalizations, like lowercasing, need to be done at runtime. May have a small effect on randomization, as examples are created in a deterministic order. However, this should not lead to cycles since the base examples are shuffled, thus still satisfying the random permutation requirement of an online/stochastic learning algorithm.	2016-12-02 13:09:03 -05:00
Al	20aad99a38	[parser] enum just lists boundary types	2016-07-30 17:07:23 -04:00
Al	1b09b7f2e5	[fix] Adding country_region to address_parser_train	2016-07-28 16:18:32 -04:00
Al	c6af5cc071	[parser] Adding country_region label to parser as a boundary component	2016-07-28 15:19:48 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	c383f8af88	[parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections	2016-07-21 17:04:57 -04:00
Al	d4143c1685	[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.	2016-01-15 20:07:21 -05:00
Al	b9bf5c629e	[fix] Moving address_parser_response_destroy into libpostal so caller can free	2015-12-15 00:52:24 -05:00
Al	bce6ba2595	[fix] typedef	2015-12-12 11:58:41 -05:00
Al	a8d6cc4053	[api] Moving parse_address definition into libpostal.h	2015-12-12 03:55:31 -05:00
Al	3de59506ae	[parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations)	2015-12-10 18:08:51 -05:00
Al	f41158b8b3	[osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city	2015-12-05 14:21:07 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00

15 Commits
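
Commit acd953ce51 mentions character n-gram features of 3-6 characters for rare and unknown words, with underscores marking word beginnings and endings. The sketch below illustrates that idea only; it is not taken from the libpostal source. The function name `char_ngrams`, its parameters, and the choice to pad the word with "_" on both sides (so the marker counts toward the n-gram length) are all assumptions made for illustration, and the actual feature format in the parser may differ.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word` with "_" marking word boundaries.

    Hypothetical helper: the padding scheme is an assumption, not
    libpostal's actual feature format.
    """
    # Pad so that n-grams touching the start or end of the word carry a marker.
    padded = "_" + word + "_"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the padded word.
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


if __name__ == "__main__":
    for gram in char_ngrams("carmel"):
        print(gram)
```

Because such sub-word features are shared between rare known words and unseen words, a model can generalize from spelling patterns even when the whole word never appeared in training.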