libpostal

Author	SHA1	Message	Date
Al	8742574257	[parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time	2017-04-04 20:40:55 -04:00
Al	6d4c7984df	[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions	2017-03-31 03:35:51 -04:00
Al	c67678087f	[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead	2017-03-18 06:05:28 -04:00
Al	8deb1716cb	[parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix.	2017-03-10 19:28:21 -05:00
Al	39fa8ff1a5	[parser] counting num classes in address parser init for models where it is needed a priori	2017-03-06 15:17:52 -05:00
Al	8ea5405c20	[parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction)	2017-02-19 14:21:58 -08:00
Al	8eafc5730b	[parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle"	2017-02-14 18:42:51 -05:00
Al	a6844c8ec1	[parser] structural changes for postal codes index	2017-02-08 18:52:45 -05:00
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction: removing geodb phrases; use Latin-ASCII-simple transliteration (no umlauts, etc.); no digit normalization for admin component phrases and postcodes; tag = START + word, special feature for first word in the sequence; add the new admin boundary categories; for hyphenated non-phrase words, add each sub-word; for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features); defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	22c4e99ea0	[parser] As part of reading/tokenizing the address parser data set, several copies of the same training example will be generated: (1) with only lowercasing; (2) with simple Latin-ASCII normalization (no umlauts, only things that are common to all languages); (3) basic UTF-8 normalizations (accent stripping); (4) language-specific Latin-ASCII transliteration (e.g. ü => ue in German). This will apply both on the initial passes when building the phrase gazetteers and during each iteration of training. In this way, only the most basic normalizations like lowercasing need to be done at runtime and it's possible to use only minimal normalizations like lowercasing. May have a small effect on randomization as examples are created in a deterministic order. However, this should not lead to cycles since the base examples are shuffled, thus still satisfying the random permutation requirement of an online/stochastic learning algorithm.	2016-12-02 13:09:03 -05:00
Al	20aad99a38	[parser] enum just lists boundary types	2016-07-30 17:07:23 -04:00
Al	1b09b7f2e5	[fix] Adding country_region to address_parser_train	2016-07-28 16:18:32 -04:00
Al	c6af5cc071	[parser] Adding country_region label to parser as a boundary component	2016-07-28 15:19:48 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	c383f8af88	[parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections	2016-07-21 17:04:57 -04:00
Al	d4143c1685	[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.	2016-01-15 20:07:21 -05:00
Al	b9bf5c629e	[fix] Moving address_parser_response_destroy into libpostal so caller can free	2015-12-15 00:52:24 -05:00
Al	bce6ba2595	[fix] typedef	2015-12-12 11:58:41 -05:00
Al	a8d6cc4053	[api] Moving parse_address definition into libpostal.h	2015-12-12 03:55:31 -05:00
Al	3de59506ae	[parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations)	2015-12-10 18:08:51 -05:00
Al	f41158b8b3	[osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city	2015-12-05 14:21:07 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00

23 Commits
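
Commit c67678087f describes storing postal code<=>admin relationships as a bipartite graph with an indptr array and an indices array. As a rough illustration of that layout only, here is a minimal sketch in C. The struct, the function names, and the toy data are all hypothetical and are not taken from libpostal's source. It assumes a CSR-style layout in which indptr holds one offset per postal code plus a final end offset, so postal code p's admin ids occupy a contiguous slice of indices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CSR-style bipartite graph (names are illustrative, not libpostal's).
 * The admin neighbors of postal code p are
 * indices[indptr[p]] .. indices[indptr[p + 1] - 1]. */
typedef struct {
    uint32_t num_postal_codes; /* |V(postal codes)| */
    const uint32_t *indptr;    /* num_postal_codes + 1 offsets into indices */
    const uint32_t *indices;   /* |E| admin ids, grouped by postal code */
} postal_admin_graph_t;

/* Number of admin regions linked to a postal code (0 if the id is out of range). */
uint32_t postal_admin_graph_degree(const postal_admin_graph_t *g, uint32_t postal_id) {
    assert(g != NULL);
    if (postal_id >= g->num_postal_codes) return 0;
    return g->indptr[postal_id + 1] - g->indptr[postal_id];
}

/* True if postal_id and admin_id are connected.
 * Only the slice belonging to postal_id is scanned. */
bool postal_admin_graph_has_edge(const postal_admin_graph_t *g,
                                 uint32_t postal_id, uint32_t admin_id) {
    assert(g != NULL);
    if (postal_id >= g->num_postal_codes) return false;
    for (uint32_t k = g->indptr[postal_id]; k < g->indptr[postal_id + 1]; k++) {
        if (g->indices[k] == admin_id) return true;
    }
    return false;
}

/* Toy data (made up for this sketch):
 * postal code 0 <-> admins {0, 2}, postal code 1 <-> admin {1}, postal code 2 <-> none. */
static const uint32_t EXAMPLE_INDPTR[] = {0, 2, 3, 3};
static const uint32_t EXAMPLE_INDICES[] = {0, 2, 1};

postal_admin_graph_t example_postal_admin_graph(void) {
    postal_admin_graph_t g = {3, EXAMPLE_INDPTR, EXAMPLE_INDICES};
    return g;
}
```

Under these assumptions, a membership check such as `postal_admin_graph_has_edge(&g, 0, 2)` touches only the two entries stored for postal code 0. The layout costs roughly |V(postal codes)| + |E| 32-bit values, so it is smaller than |E| 64-bit edge keys only when there are more edges than postal codes, which the toy data above is too small to show.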