1196 Commits

Author SHA1 Message Date
Al
f9a9dc2224 [parser/crf] adding the beginnings of a linear-chain Conditional Random Field
implementation for the address parser.

One of the main issues with the greedy averaged perceptron tagger currently used
in libpostal is that it predicts left-to-right and commits to its
answers, i.e. it doesn't revise its previous predictions. The model can use
its own previous predictions to classify the current word, but
effectively it makes the best local decision it can and never looks back
(the YOLO approach to parsing).

This can be problematic in a multilingual setting like libpostal,
since the order of address components is language/country dependent.
It would be preferable to have a model that scores whole
_sequences_ instead of individual tagging decisions.

That's exactly what a Conditional Random Field (CRF) does. Instead of modeling
P(y_i | x_i, y_{i-1}), we model P(y | x), where y is the whole sequence of labels
and x is the whole sequence of features. CRFs achieve state-of-the-art results
on many tasks (or are a component of the state-of-the-art model; LSTM-CRFs
have been an interesting direction along these lines).
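
For concreteness, a plain linear-chain CRF (in generic notation, not libpostal's) factors the sequence score into per-position and transition terms:

    % generic linear-chain CRF; s(t, y_t) is the score of label y_t at position t,
    % a(y_{t-1}, y_t) is the transition score between consecutive labels
    P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} s(t, y_t) + \sum_{t=2}^{T} a(y_{t-1}, y_t) \Big)
    Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} s(t, y'_t) + \sum_{t=2}^{T} a(y'_{t-1}, y'_t) \Big)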

The crf_context module borrows heavily from the version in CRFSuite
(https://github.com/chokkan/crfsuite), though it uses libpostal's data structures and
allows for "state-transition features." CRFSuite has state features
like "word=the" and transition features like "prev tag=house", but
no notion of a feature which incorporates both local and transition
information, e.g. "word=the and prev tag=house". These types of features are useful
in our setting, where there are many languages and it may not make as
much sense to simply have a weight for "house_number => road", because that
depends heavily on the country. This implementation introduces a T x L^2 matrix for
those state-transition scores.
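
As a rough sketch of how the three score types could combine at each position (the names and layout here are illustrative, not the actual crf_context fields):

    #include <stddef.h>

    /* Illustrative only: score of moving from label i to label j at position t >= 1.
     * L = number of labels, T = sequence length.
     *   state[t*L + j]               - state score for label j at position t
     *   trans[i*L + j]               - position-independent transition score i -> j
     *   state_trans[t*L*L + i*L + j] - position-dependent transition score i -> j
     *                                  (the T x L^2 state-transition matrix) */
    static inline double edge_score(const double *state, const double *trans,
                                    const double *state_trans, size_t L,
                                    size_t t, size_t i, size_t j) {
        return state[t * L + j] + trans[i * L + j] + state_trans[t * L * L + i * L + j];
    }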

For linear-chain CRFs, the Viterbi algorithm is used to compute the
most probable sequence. There are versions of Viterbi for computing the
N most probable sequences as well, which may come in handy later. The module
can also compute marginal probabilities for a sequence (though that would
need to wait until a gradient-based learning method that produces
well-calibrated probabilities is implemented).
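
A minimal Viterbi sketch over those three score matrices (hypothetical names and layout; the real module uses libpostal's own matrix/vector types):

    #include <stddef.h>
    #include <stdlib.h>
    #include <float.h>

    /* Hypothetical sketch: fill best_path[0..T-1] with the highest-scoring label
     * sequence under the state / transition / state-transition scores (laid out
     * as in the snippet above) and return that sequence's score. */
    double viterbi_decode(const double *state, const double *trans,
                          const double *state_trans, size_t T, size_t L,
                          size_t *best_path) {
        double *dp = malloc(T * L * sizeof(double));   /* best score ending in label j at t */
        size_t *bp = malloc(T * L * sizeof(size_t));   /* backpointers */

        for (size_t j = 0; j < L; j++) {
            dp[j] = state[j];   /* position 0: state score only, no transition */
            bp[j] = 0;
        }

        for (size_t t = 1; t < T; t++) {
            for (size_t j = 0; j < L; j++) {
                double best = -DBL_MAX;
                size_t arg = 0;
                for (size_t i = 0; i < L; i++) {
                    double score = dp[(t - 1) * L + i]
                                 + trans[i * L + j]
                                 + state_trans[t * L * L + i * L + j]
                                 + state[t * L + j];
                    if (score > best) { best = score; arg = i; }
                }
                dp[t * L + j] = best;
                bp[t * L + j] = arg;
            }
        }

        /* pick the best final label, then follow backpointers */
        double best = -DBL_MAX;
        size_t last = 0;
        for (size_t j = 0; j < L; j++) {
            if (dp[(T - 1) * L + j] > best) { best = dp[(T - 1) * L + j]; last = j; }
        }
        best_path[T - 1] = last;
        for (size_t t = T - 1; t > 0; t--) {
            last = bp[t * L + last];
            best_path[t - 1] = last;
        }

        free(dp);
        free(bp);
        return best;
    }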

The cool thing architecturally about crf_context as a separate module is that the
weights can be learned through any method we want. As long as the state
scores, state-transition scores, and transition scores are populated on
the context struct, we have everything we need to run Viterbi inference,
etc. without really caring about which training algorithm was used to optimize
the weights, what the features are, how they're stored, etc.

So far the results have been very encouraging. While a linear-chain CRF is
slower to train, and will likely add several days to the
training process, it's still reasonably fast at runtime and not
prohibitively slow to train. In unscientific tests on a busy MacBook Pro,
training has so far been chunking through ~3k addresses/sec, only
about half the speed of the greedy tagger (I haven't benchmarked the runtime
difference, but anecdotally it's hardly noticeable). Libpostal training
runs considerably faster on Linux with gcc, so 3k may be a low estimate.
I'd also guess that re-computing features every iteration puts a ceiling
on the greedy tagger's throughput, so the difference might
be more pronounced if features were pre-computed (a possible optimization).
2017-03-10 01:10:22 -05:00
Al
f9e60b13f5 [parser] size the postcode context set appropriately when reading the parser, makes loading a large model much faster 2017-03-09 14:31:12 -05:00
Al
2400122162 [fix] fixing up hash str to id template 2017-03-09 00:54:31 -05:00
Al
4c03e563e0 [parser] for the min-updates method to work, features that have not yet reached the min_updates threshold also need to be ignored when scoring; that way the model has to perform without those features and should make more updates to them if they're relevant 2017-03-08 15:40:12 -05:00
Al
a63c182e96 [parser] right context affixes need to use pre-normalized words as well 2017-03-08 13:51:36 -05:00
Al
ce9153d94d [parser] fixing some issues in address_parser_features. Prefix/suffix phrases use the word before token-level normalization (but after string-level normalization like lowercasing), so the feature function needed to use the same string as address_parser_context_fill. This affects some German suffixes like "str." where the final "." would be deleted in token normalization but the suffix length would still include it. Also, three of the new arrays used in address_parser_context (suffix_phrases, prefix_phrases, and sub_tokens) weren't being cleared per call, which meant computing the wrong features at best and a segfault at worst 2017-03-07 17:30:53 -05:00
Al
b6bf8da383 [utils] adding aligned malloc/free/realloc in vector.h and matrix.h, fixing bug in matrix_copy 2017-03-07 16:25:34 -05:00
Al
242b1364ae [parser] using new API in address_parser_test 2017-03-07 16:24:34 -05:00
Al
95015990ab [parser] learning a sparser averaged perceptron model for the parser using the following method:
- store a vector of update counts for each feature in the model
- when the model updates after making a mistake, increment the update
  counters for the observed features in that example
- after the model is finished training, keep only the features that
  participated in a minimum number of updates

This method is described in greater detail in this paper from Yoav
Goldberg: https://www.cs.bgu.ac.il/~yoavg/publications/acl2011sparse.pdf

The paper reports a 4x size reduction at only a trivial cost in
accuracy. So far the trials on libpostal indicate roughly the
same, though at smaller training set sizes the accuracy cost is greater.

This method is more effective than simple feature pruning as feature
pruning methods are usually based on the frequency of the feature
in the training set, and infrequent features can still be important.
However, the perceptron's early iterations make many updates on
irrelevant features simply because the weights for the more relevant
features aren't tuned yet. The number of updates a feature participates
in can be seen as a measure of its relevance to classifying examples.
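
A sketch of the bookkeeping this implies (hypothetical names, not the actual address_parser_train code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: one update counter per feature. Counters are bumped
     * whenever a feature participates in a perceptron update, and features
     * below the threshold are dropped when the finished model is saved. */
    typedef struct {
        uint32_t *update_counts;
        size_t num_features;
    } update_stats_t;

    static void record_updates(update_stats_t *stats,
                               const uint32_t *feature_ids, size_t n) {
        for (size_t i = 0; i < n; i++) {
            stats->update_counts[feature_ids[i]]++;
        }
    }

    static bool keep_feature(const update_stats_t *stats,
                             uint32_t feature_id, uint32_t min_updates) {
        return stats->update_counts[feature_id] >= min_updates;
    }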

This commit introduces a --min-features option to address_parser_train
(default=5); the pruning can effectively be turned off by passing
"--min-features 0" or "--min-features 1".
2017-03-06 22:28:33 -05:00
Al
5c1c1ae0f2 [parser] moving tagger function pointer definition to a separate header so it can be used for other models 2017-03-06 21:42:06 -05:00
Al
cc58ec9db2 [parser] fix another valgrind error in parser training (adding to a cstring_array can trigger a realloc that moves its memory, invalidating string pointers obtained before the add); no longer using the dummy START tags since the feature function can choose to add features for those cases 2017-03-06 21:39:14 -05:00
Al
754f22c79a [parser] moving feature printing to averaged perceptron tagger, taking advantage of trie prefix-sharing in features incorporating previous tags 2017-03-06 20:32:50 -05:00
Al
839a13577d [parser] fixing affix-related valgrind errors in address parser features 2017-03-06 20:28:42 -05:00
Al
c3581557a1 [parser] counting classes instead of keeping a set 2017-03-06 20:05:01 -05:00
Al
a5283cb313 [fix] trie_new_from_hash 2017-03-06 15:57:42 -05:00
Al
5113a1bc32 [utils] tracking keys added in trie construction from hash 2017-03-06 15:28:26 -05:00
Al
dd4f3eb84c [parser] simpler feature names for the state-transition features 2017-03-06 15:25:10 -05:00
Al
39fa8ff1a5 [parser] counting num classes in address parser init for models where it is needed a priori 2017-03-06 15:17:52 -05:00
Al
5f19e63cbe [parser] more logging in init 2017-03-06 15:11:39 -05:00
Al
bb922e4ce4 [parser] adding log message 2017-03-06 12:25:22 -05:00
Al
b97de96ab4 [parser] fixing chunked shuffle, making awk splitting work on Mac 2017-03-05 15:06:02 -05:00
Al
0e49fc580a [parser] uint64_t chunk size, no warning if gshuf is available 2017-03-05 14:50:47 -05:00
Al
b76b7b8527 [parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac, since the only problem there seems to have been not having enough memory when testing on a Mac laptop; the new limited-memory version should be fast enough. 2017-03-05 02:15:11 -05:00
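
A sketch of the partition step of that chunked shuffle (hypothetical names, not the actual functions in the repo):

    #define _GNU_SOURCE   /* for getline */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Hypothetical sketch of the first phase: every input line is appended to one
     * of num_chunks temporary files chosen at random. Each chunk is then small
     * enough to shuffle on its own (e.g. with shuf/gshuf), and the shuffled
     * chunks are concatenated to produce the final shuffled file. */
    static void partition_lines(FILE *in, FILE **chunks, size_t num_chunks) {
        char *line = NULL;
        size_t len = 0;
        ssize_t nread;
        while ((nread = getline(&line, &len, in)) != -1) {
            size_t chunk = (size_t)rand() % num_chunks;
            fwrite(line, 1, (size_t)nread, chunks[chunk]);
        }
        free(line);
    }
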
Al
e39d4d2f00 [parser] check for non-null prev/prev2 before creating tag-based features 2017-02-24 02:57:16 -05:00
Al
182d60b623 [fix] removing include 2017-02-23 22:45:03 -05:00
Al
6a079e86b3 [fix] using size_t instead of int in address_parser/address_parser_train 2017-02-20 19:22:13 -08:00
Al
8ea5405c20 [parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction) 2017-02-19 14:21:58 -08:00
Al
715520f681 [parser] using new zeros API in averaged_perceptron.c 2017-02-19 14:02:54 -08:00
Al
b88487f633 [utils] string_replace_char does single byte/character replacement, new string_replace to do full string replacement, again using char_array for safety, string_replace_with_array function for memory reuse 2017-02-17 13:58:51 -05:00
Al
da856ea5c3 [parser] adding phrase features for category, unit, level, entrance, staircase, and po_box phrases from the libpostal dictionaries, excluding phrases which match the toponyms dictionary (e.g. US states that can also be found in street/venue names, useful for expansion but not here). If the current token is part of both an address dictionary phrase and a component phrase derived from the training data, use the longer of the two, or both if they are the same length 2017-02-17 03:00:48 -05:00
Al
c380b3e91b [parser] phrase search with address dictionaries should not use the language given at training time since it's not currently available at runtime (without pulling in the language classifier, which may be warranted at some point, especially if the model can be made smaller/sparser) 2017-02-15 22:32:30 -05:00
Al
a3e51db32d [api] include some of the new components in default address_components for the libpostal expansion API 2017-02-15 22:29:22 -05:00
Al
32fb483e96 [gazetteers] adding ADDRESS_PO_BOX component 2017-02-15 22:23:28 -05:00
Al
ba0ccc82a3 [fix] var name in address_parser_train 2017-02-15 22:22:33 -05:00
Al
0196fe8736 [utils] fixing key_type in hash_get, adding int64_double map 2017-02-15 22:20:36 -05:00
Al
8abfa766fd [fix] paren 2017-02-15 02:26:18 -05:00
Al
8eafc5730b [parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle" 2017-02-14 18:42:51 -05:00
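
A sketch of how those relative positions could be found (hypothetical; assumes per-token numeric and street-phrase flags are already computed elsewhere):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical sketch: scan the token flags once, recording the index of the
     * first numeric token and the first street-level phrase token (-1 if absent).
     * Features for token 0 can then be derived from these relative positions. */
    static void first_positions(const bool *is_numeric, const bool *is_street_phrase,
                                size_t num_tokens, long *first_numeric, long *first_street) {
        *first_numeric = -1;
        *first_street = -1;
        for (size_t i = 0; i < num_tokens; i++) {
            if (*first_numeric < 0 && is_numeric[i]) *first_numeric = (long)i;
            if (*first_street < 0 && is_street_phrase[i]) *first_street = (long)i;
            if (*first_numeric >= 0 && *first_street >= 0) break;
        }
    }
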
Al
56f68e4399 [phrases] fixing trie suffix search 2017-02-14 03:36:29 -05:00
Al
2f4bcaeec2 [parser] address_parser_test memory cleanup, add print-errors option to print individual parser errors on held-out data 2017-02-12 16:05:11 -05:00
Al
b1e178b7b2 [fix] is_numeric_token includes IDEOGRAPHIC_NUMBER 2017-02-12 15:11:56 -05:00
Al
b570855b78 [parser] adding postcode context features and associated data structures to the parser. Masking digits, which should hopefully help with generalization. Creating positive/negative features for postcode with and without context support. Note: even with known postcodes in known contexts, only use the masked digits to avoid creating too many features that are redundant with the index. 2017-02-10 03:41:14 -05:00
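
A sketch of the digit masking (hypothetical helper; the real feature code operates on libpostal's normalized tokens):

    #include <ctype.h>

    /* Hypothetical sketch: replace every ASCII digit with 'D', so "11216" becomes
     * "DDDDD" and postcode-shape features generalize across specific numbers. */
    static void mask_digits(char *token) {
        for (char *p = token; *p != '\0'; p++) {
            if (isdigit((unsigned char)*p)) *p = 'D';
        }
    }
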
Al
9a93e95938 [api] removing geodb from setup functions 2017-02-10 01:02:52 -05:00
Al
ff245d74f8 [parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index. 2017-02-10 00:50:48 -05:00
Al
1aacb5bccc Merge branch 'master' into parser-data 2017-02-09 15:09:28 -05:00
Al
ea168279bd [fix] free json-encoded string in parser client output 2017-02-09 14:34:15 -05:00
Al
38c6c26146 [fix] freeing normalized string in address_parser_parse 2017-02-09 14:33:13 -05:00
Al
8aa3749cfb [utils] some convenience functions for generic hashtables (incr, get, etc) 2017-02-08 19:01:13 -05:00
Al
a6844c8ec1 [parser] structural changes for postal codes index 2017-02-08 18:52:45 -05:00
Al
6e4f641743 [phrases] adding token_phrase_memberships to trie_search for reuse 2017-02-08 01:59:39 -05:00
Al
ae35da8d17 [fix] uninitialized var 2017-02-08 01:58:53 -05:00