libpostal

Author	SHA1	Message	Date
Gregory Oschwald	18cc0e37e6	Only create parser response when it is used. Previously, an unused response would not be freed, causing a leak.	2018-01-02 11:56:02 -08:00
Gregory Oschwald	1bb6278446	Fix leak of normalized value in early return	2017-12-25 19:33:07 -08:00
AeroXuk	9090811826	Modified the libpostal API to add an extra function libpostal_parser_print_features to toggle debugging info. Updated address_parser app to use the new function.	2017-11-27 19:20:37 +00:00
Al	cea3ced533	[fix] open files in binary format for #69	2017-05-03 17:34:38 -04:00
Al	8742574257	[parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time	2017-04-04 20:40:55 -04:00
Al	6d4c7984df	[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions	2017-03-31 03:35:51 -04:00
Al	3b9b43f1b5	[fix] handle multiple separators (like parens used in https://www.openstreetmap.org/node/244081449 ). Creates bad trie entries otherwise, which affect more than just that toponym	2017-03-18 06:09:52 -04:00
Al	c67678087f	[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32-bit ints instead of |E| 64-bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead	2017-03-18 06:05:28 -04:00
Al	0b27eb3f74	[parser] thought numeric boundary names had already been removed in the source data, but somehow they've made it into one of the data sets. Doing a final check in context_fill for valid boundary names (currently valid if there's at least one non-digit token)	2017-03-15 13:07:21 -04:00
Al	1a1f0a44d2	[parser] parser only inserts spaces in the output if there were spaces (or other ignorable tokens) in the normalized input	2017-03-15 03:35:03 -04:00
Al	8deb1716cb	[parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix.	2017-03-10 19:28:21 -05:00
Al	f9e60b13f5	[parser] size the postcode context set appropriately when reading the parser, makes loading a large model much faster	2017-03-09 14:31:12 -05:00
Al	a63c182e96	[parser] right context affixes need to use pre-normalized words as well	2017-03-08 13:51:36 -05:00
Al	ce9153d94d	[parser] fixing some issues in address_parser_features. Prefix/suffix phrases use the word before token-level normalization (but after string-level normalization like lowercasing), needed to use the same string in the feature function as in address_parser_context_fill. Affects some German suffixes like "str." where the final "." would be deleted in token normalization, but the suffix length would include it. Also, three of the new arrays used in address_parser_context (suffix_phrases, prefix_phrases, and sub_tokens) weren't being cleared per call, which means computing the wrong features at best and a segfault at worst	2017-03-07 17:30:53 -05:00
Al	754f22c79a	[parser] moving feature printing to averaged perceptron tagger, taking advantage of trie prefix-sharing in feature incorporating previous tags	2017-03-06 20:32:50 -05:00
Al	839a13577d	[parser] fixing affix-related valgrind errors in address parser features	2017-03-06 20:28:42 -05:00
Al	dd4f3eb84c	[parser] simpler feature names for the state-transition features	2017-03-06 15:25:10 -05:00
Al	6a079e86b3	[fix] using size_t instead of int in address_parser/address_parser_train	2017-02-20 19:22:13 -08:00
Al	8ea5405c20	[parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction)	2017-02-19 14:21:58 -08:00
Al	da856ea5c3	[parser] adding phrase features for category, unit, level, entrance, staircase, and po_box phrases from the libpostal dictionaries, excluding phrases which match the toponyms dictionary (e.g. US states that can also be found in street/venue names, useful for expansion but not here), if the current token is part of both an address dictionary phrase and a component phrase derived from the training data, use the longer of the two, or both if they are the same length	2017-02-17 03:00:48 -05:00
Al	c380b3e91b	[parser] phrase search with address dictionaries should not use the language given at training time since it's not currently available at runtime (without pulling in the language classifier, which may be warranted at some point, especially if the model can be made smaller/sparser)	2017-02-15 22:32:30 -05:00
Al	8abfa766fd	[fix] paren	2017-02-15 02:26:18 -05:00
Al	8eafc5730b	[parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle"	2017-02-14 18:42:51 -05:00
Al	b570855b78	[parser] adding postcode context features and associated data structures to the parser. Masking digits, which should hopefully help with generalization. Creating positive/negative features for postcode with and without context support. Note: even with known postcodes in known contexts, only use the masked digits to avoid creating too many features that are redundant with the index.	2017-02-10 03:41:14 -05:00
Al	1aacb5bccc	Merge branch 'master' into parser-data	2017-02-09 15:09:28 -05:00
Al	38c6c26146	[fix] freeing normalized string in address_parser_parse	2017-02-09 14:33:13 -05:00
Al	0380f565d2	[parser] shorter first word feature	2017-01-29 22:10:28 -05:00
Al	b320aed9ac	[merge] merging master	2017-01-13 19:58:49 -05:00
Al	df89387b5c	[fix] calloc instead of malloc when performing initialization on structs that may fail halfway and need to clean up while partially initialized (calloc will set all the bytes to zero so the member pointers are NULL instead of garbage memory)	2017-01-13 18:30:04 -05:00
Al	7a8f94330b	[parser] only adding ngrams in a hyphenated word if the subword is not rare	2017-01-09 02:53:33 -05:00
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction - removing geodb phrases - use Latin-ASCII-simple transliteration (no umlauts, etc.) - no digit normalization for admin component phrases and postcodes - tag = START + word, special feature for first word in the sequence - add the new admin boundary categories - for hyphenated non-phrase words, add each sub-word - for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features) - defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	6f37f9ae86	[merge] merging in master changes	2016-12-21 15:40:25 -05:00
Al	c6af5cc071	[parser] Adding country_region label to parser as a boundary component	2016-07-28 15:19:48 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	0a8f46bdc3	[parser] Using new geonames designations in parser features	2016-07-21 17:04:57 -04:00
Al	e816b4f77e	[parser] Ignore language/country options explicitly in the parser. The purpose of these options is to be able to create language-specific/country-specific models at some point; they shouldn't be used in the global model	2016-07-06 14:56:46 -04:00
Al	1b94727871	[fix] Check that parser is loaded in parse_address, log and return NULL instead of segfaulting	2016-03-21 18:04:26 -04:00
Al	d4143c1685	[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.	2016-01-15 20:07:21 -05:00
Al	b9bf5c629e	[fix] Moving address_parser_response_destroy into libpostal so caller can free	2015-12-15 00:52:24 -05:00
Al	fe4c528f26	[parser] Using different char_array for each of the potential phrases as token i	2015-12-12 03:23:26 -05:00
Al	e6303f70f3	[fix] removing printf	2015-12-11 02:53:22 -05:00
Al	88b8023ac8	[fix] Bug in address parser feature extraction, can hold onto the wrong pointer	2015-12-10 18:42:28 -05:00
Al	cfd0dc69f2	[parsing] Using the entire phrase as the ith word	2015-12-07 01:19:38 -05:00
Al	24208c209f	[parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).	2015-12-05 14:34:19 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00
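
Commit c67678087f above replaces a set of 64-bit ints with a bipartite graph stored as indptr + indices. The following is a minimal sketch of that layout, not libpostal's actual code: the type postal_admin_graph_t and all function names are hypothetical, and ids are assumed to be dense 32-bit integers. The admin ids linked to postal code p sit in indices[indptr[p]] up to (but not including) indices[indptr[p + 1]], so storage is about |V(postal codes)| + |E| 32-bit ints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type for illustration; not libpostal's actual graph struct. */
typedef struct {
    uint32_t num_postal_codes;
    uint32_t *indptr;   /* num_postal_codes + 1 offsets into indices */
    uint32_t *indices;  /* admin ids, grouped by postal code id */
} postal_admin_graph_t;

void postal_admin_graph_destroy(postal_admin_graph_t *g) {
    if (g == NULL) return;
    free(g->indptr);    /* free(NULL) is a no-op, so partial builds are safe */
    free(g->indices);
    free(g);
}

/* Build from parallel edge arrays: edge e links postal_codes[e] to admins[e].
   Edges may arrive in any order. */
postal_admin_graph_t *postal_admin_graph_new(uint32_t num_postal_codes,
                                             const uint32_t *postal_codes,
                                             const uint32_t *admins,
                                             uint32_t num_edges) {
    postal_admin_graph_t *g = calloc(1, sizeof(*g));
    if (g == NULL) return NULL;
    g->num_postal_codes = num_postal_codes;
    g->indptr = calloc((size_t)num_postal_codes + 1, sizeof(uint32_t));
    g->indices = calloc(num_edges > 0 ? num_edges : 1, sizeof(uint32_t));
    /* Scratch array: how many edges have been placed so far per postal code. */
    uint32_t *placed = calloc(num_postal_codes > 0 ? num_postal_codes : 1,
                              sizeof(uint32_t));
    if (g->indptr == NULL || g->indices == NULL || placed == NULL) {
        free(placed);
        postal_admin_graph_destroy(g);
        return NULL;
    }
    /* Pass 1: count the edges of each postal code. */
    for (uint32_t e = 0; e < num_edges; e++) {
        assert(postal_codes[e] < num_postal_codes);
        g->indptr[postal_codes[e] + 1]++;
    }
    /* Pass 2: prefix sum turns counts into start offsets. */
    for (uint32_t i = 0; i < num_postal_codes; i++) {
        g->indptr[i + 1] += g->indptr[i];
    }
    /* Pass 3: scatter each admin id into its postal code's slice. */
    for (uint32_t e = 0; e < num_edges; e++) {
        uint32_t pc = postal_codes[e];
        g->indices[g->indptr[pc] + placed[pc]++] = admins[e];
    }
    free(placed);
    return g;
}

/* Linear scan over one postal code's slice; slices are typically short. */
bool postal_admin_graph_has_edge(const postal_admin_graph_t *g,
                                 uint32_t postal_code, uint32_t admin) {
    assert(g != NULL);
    if (postal_code >= g->num_postal_codes) return false;
    for (uint32_t i = g->indptr[postal_code]; i < g->indptr[postal_code + 1]; i++) {
        if (g->indices[i] == admin) return true;
    }
    return false;
}
```

Compared with hashing one 64-bit (postal code, admin) key per edge, this layout stores each edge as a single 32-bit admin id plus one offset per postal code, and it avoids the per-entry overhead of a hashtable.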

46 Commits
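
Commit df89387b5c in the log above switches from malloc to calloc for structs whose initialization may fail halfway. The sketch below illustrates that pattern under stated assumptions: model_t, model_new, and model_destroy are hypothetical names, not libpostal types or functions. Because calloc zeroes the struct, any member not yet allocated is a null pointer on mainstream platforms, so a single destroy function can clean up a partially initialized object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for illustration; not a libpostal type. */
typedef struct {
    char *name;
    double *weights;
    size_t num_weights;
} model_t;

/* Safe on partially initialized objects: free(NULL) is a no-op. */
void model_destroy(model_t *m) {
    if (m == NULL) return;
    free(m->name);
    free(m->weights);
    free(m);
}

model_t *model_new(const char *name, size_t num_weights) {
    assert(name != NULL);
    /* calloc zeroes every byte, so name and weights start out as NULL
       (on all mainstream platforms) rather than garbage as with malloc. */
    model_t *m = calloc(1, sizeof(*m));
    if (m == NULL) return NULL;

    size_t len = strlen(name) + 1;
    m->name = malloc(len);
    if (m->name == NULL) goto fail;   /* weights is still NULL here */
    memcpy(m->name, name, len);

    m->weights = calloc(num_weights > 0 ? num_weights : 1, sizeof(double));
    if (m->weights == NULL) goto fail;
    m->num_weights = num_weights;
    return m;

fail:
    /* With malloc instead of calloc, this call could free an
       uninitialized pointer, which is undefined behavior. */
    model_destroy(m);
    return NULL;
}
```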