Commit Graph

4994 Commits

Author SHA1 Message Date
Al
2bda741fa9 [openaddresses] adding Sicily statewide 2017-03-20 21:22:49 -04:00
Al
67805047f4 [openaddresses] adding Novosibirsk Oblast, Russia 2017-03-20 19:05:45 -04:00
Al
b8a12e0517 [test] adding parser test cases in 22 countries. These may change, and I'm generally against putting every obscure test case in the world in here. It's better to measure accuracy in aggregate statistics instead of individual test cases (i.e. if a particular change to the parser improves overall performance but fails one test case, should we accept the improvement?). The thought here is: these represent parses that are used in documentation/examples, as well as most of those that have been brought up in Github issues from the initial release, and we want these specific tests to work from build to build. If a model fails one of these test cases, it shouldn't get pushed to our users. 2017-03-20 00:58:52 -04:00
Al
7218ca1316 [openaddresses] adding Chesterfield, SC 2017-03-19 16:10:29 -04:00
Al
3b9b43f1b5 [fix] handle multiple separators (like parens used in https://www.openstreetmap.org/node/244081449). Creates bad trie entries otherwise, which affect more than just that toponym 2017-03-18 06:09:52 -04:00
Al
c67678087f [parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead 2017-03-18 06:05:28 -04:00
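
For reference, the indptr/indices layout described above is the standard CSR-style adjacency for a bipartite graph: the admin ids for postal code i live in indices[indptr[i] .. indptr[i + 1] - 1]. A minimal sketch (struct and function names are illustrative, not libpostal's actual ones):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t *indptr;        /* length = num_postal_codes + 1 */
    uint32_t *indices;       /* length = num_edges; admin ids */
    size_t num_postal_codes;
} postal_admin_graph_sketch_t;

/* Membership check: scan the row of admin ids for one postal code. */
static bool postal_code_has_admin(const postal_admin_graph_sketch_t *g,
                                  uint32_t postal_id, uint32_t admin_id) {
    for (uint32_t j = g->indptr[postal_id]; j < g->indptr[postal_id + 1]; j++) {
        if (g->indices[j] == admin_id) return true;
    }
    return false;
}
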
Al
cb112f0ea7 [transliteration] regenerate transliteration data 2017-03-17 18:28:41 -04:00
Al
579425049b [fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators 2017-03-17 18:28:15 -04:00
Al
8e3c9d0269 [test] adding test of new latin-ascii-simple transliterator which only handles things like HTML entities 2017-03-17 18:27:18 -04:00
Al
be07bfe35d [test] adding printfs on expansion test failure so it's more clear what's going on 2017-03-17 17:46:22 -04:00
Al
dfabd25e5d [phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched 2017-03-17 03:40:29 -04:00
Al
f4a9e9d673 [fix] don't compare a double to 0 2017-03-15 14:59:33 -04:00
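
The usual fix for that kind of comparison is an epsilon test rather than exact equality; a minimal sketch, assuming that is what this commit does:

#include <math.h>
#include <stdbool.h>

#define NEAR_ZERO_EPSILON 1e-9   /* tolerance is an illustrative choice */

/* Instead of (x == 0.0), compare the magnitude against a small tolerance. */
static inline bool double_is_near_zero(double x) {
    return fabs(x) < NEAR_ZERO_EPSILON;
}
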
Al
266065f22f [fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true 2017-03-15 14:11:59 -04:00
Al
0b27eb3f74 [parser] thought numeric boundary names had already been removed in the source data, but somehow they've made it into one of the data sets. Doing a final check in context_fill for valid boundary names (currently valid if there's at least one non-digit token) 2017-03-15 13:07:21 -04:00
Al
1b2696b3b5 [utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category) 2017-03-15 13:04:39 -04:00
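
A rough sketch of that kind of check, assuming utf8proc (which libpostal uses) for decoding and category lookup; illustrative, not necessarily the actual implementation:

#include <stdbool.h>
#include <stddef.h>
#include <utf8proc.h>

/* True if every code point is in the Nd (decimal number) category,
   analogous to Python's str.isdigit on a non-empty string. */
static bool string_is_digit_sketch(const char *str, size_t len) {
    size_t idx = 0;
    while (idx < len) {
        utf8proc_int32_t ch;
        utf8proc_ssize_t char_len = utf8proc_iterate(
            (const utf8proc_uint8_t *)(str + idx), len - idx, &ch);
        if (char_len <= 0) return false;
        if (utf8proc_category(ch) != UTF8PROC_CATEGORY_ND) return false;
        idx += (size_t)char_len;
    }
    return len > 0;
}
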
Al
1a1f0a44d2 [parser] parser only inserts spaces in the output if there were spaces (or other ignorable tokens) in the normalized input 2017-03-15 03:35:03 -04:00
Al
d43989cf1c [fix] log_sum_exp in SSE mode shouldn't modify the original array 2017-03-15 00:22:17 -04:00
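
For reference, a scalar (non-SSE) log-sum-exp that reads its input without writing back into it looks roughly like this; a sketch only, not the SSE code path this commit fixes:

#include <math.h>
#include <stddef.h>

/* log(sum_i exp(x[i])) computed stably via the max trick; x stays const. */
static double log_sum_exp_sketch(const double *x, size_t n) {
    if (n == 0) return -INFINITY;
    double max_val = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] > max_val) max_val = x[i];
    }
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += exp(x[i] - max_val);
    }
    return max_val + log(total);
}
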
Al
c201939f3a [openaddresses] adding some of the new counties in GA. Adding the simple unit regex to DeKalb county's ignore list as there are a few in there 2017-03-13 16:08:37 -04:00
Al
e0a9171c09 [openaddresses] adding language-delineated files for South Tyrol 2017-03-13 01:27:23 -04:00
Al
6cf113b1df [fix] handle case of T = 0 in Viterbi decoding 2017-03-12 22:55:52 -04:00
Al
35ccb3ee62 [fix] move 2017-03-12 20:22:35 -04:00
Al
d40a355d8b [fix] heap issues when cleaning up CRF 2017-03-12 20:20:51 -04:00
Al
1277f82f52 [logging] some small logging changes to track vocab pre/post pruning 2017-03-12 00:24:52 -05:00
Al
7afba832e5 [test] adding the new tests to the Makefile 2017-03-11 14:34:27 -05:00
Al
7562cf866b [crf] in averaged perceptron training for the CRF, need to update transition features when either guess != truth or prev_guess != prev_truth 2017-03-11 05:58:23 -05:00
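
The condition amounts to rewarding the true label bigram and penalizing the predicted one whenever either end of the bigram is wrong; a sketch with a flat L x L transition weight array (names and layout are assumptions, not the actual trainer code):

#include <stdint.h>

static void transition_update_sketch(double *transition_weights, uint32_t num_labels,
                                     uint32_t prev_truth, uint32_t truth,
                                     uint32_t prev_guess, uint32_t guess) {
    /* Update whenever guess != truth OR prev_guess != prev_truth. */
    if (guess != truth || prev_guess != prev_truth) {
        transition_weights[prev_truth * num_labels + truth] += 1.0;
        transition_weights[prev_guess * num_labels + guess] -= 1.0;
    }
}
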
Al
a6eaf5ebc5 [fix] had taken out a previous optimization while debugging. Don't need to repeatedly update the backpointer array in viterbi to store an argmax when a stack variable will work. Because that's in the quadratic (only in L, the number of labels, which is small) section of the algorithm, even this small change can make a pretty sizeable difference. CRF training speed is now roughly on par with the greedy model 2017-03-11 02:31:52 -05:00
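
The inner loop in question is the standard Viterbi recurrence over previous labels; keeping the running argmax in stack variables and writing the backpointer once per (t, j) looks roughly like this (illustrative array layout, not the exact libpostal loop):

#include <float.h>
#include <stdint.h>

/* One step of the recurrence for label j at time t: find the best previous
   label locally, then write alpha and the backpointer a single time. */
static void viterbi_step_sketch(double *alpha, const double *prev_alpha,
                                uint32_t *backpointers_t, const double *trans,
                                const double *state_t, uint32_t num_labels, uint32_t j) {
    double best_score = -DBL_MAX;
    uint32_t best_prev = 0;
    for (uint32_t i = 0; i < num_labels; i++) {
        double score = prev_alpha[i] + trans[i * num_labels + j];
        if (score > best_score) {
            best_score = score;   /* stack variables, no repeated array writes */
            best_prev = i;
        }
    }
    alpha[j] = best_score + state_t[j];
    backpointers_t[j] = best_prev;
}
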
Al
647ddf171d [fix] formatting for print features in CRF model 2017-03-11 01:18:35 -05:00
Al
735fd7a6b7 [openaddresses] adding Douglas County, OR 2017-03-10 23:36:11 -05:00
Al
d876beb386 [fix] add CRF files to the main lib 2017-03-10 19:40:15 -05:00
Al
0ec590916b [build] adding necessary sources to address_parser client, address_parser_train and address_parser_test 2017-03-10 19:33:31 -05:00
Al
25649f2122 [utils] new_fixed and resize_fixed in vector.h 2017-03-10 19:31:34 -05:00
Al
4e02a54a79 [utils] adding file_exists to header 2017-03-10 19:30:46 -05:00
Al
5775e3d806 [parser/cli] removing geodb loading from parser client 2017-03-10 19:30:18 -05:00
Al
8deb1716cb [parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix. 2017-03-10 19:28:21 -05:00
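
The usual C idiom for that kind of polymorphism is a tagged union keyed on a model-type enum; a hypothetical sketch of the shape of it (not the actual address_parser structs):

/* Only one of the union members is populated, selected by type; the
   parser dispatches on it for predict/save/load. Names are illustrative. */
typedef enum {
    PARSER_MODEL_GREEDY_AVERAGED_PERCEPTRON,
    PARSER_MODEL_CRF
} parser_model_type_sketch_t;

typedef struct {
    parser_model_type_sketch_t type;
    union {
        struct averaged_perceptron *greedy;
        struct crf *crf;
    } model;
} address_parser_model_sketch_t;
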
Al
1bd4689c5f [openaddresses] add Gillespie County, TX 2017-03-10 15:53:49 -05:00
Al
171aa77ea3 [openaddresses] adding Fisher County, TX 2017-03-10 15:46:46 -05:00
Al
8e3bcbfc95 [openaddresses] adding Coffey County, KS 2017-03-10 15:44:59 -05:00
Al
b85ed70674 [utils] adding a function for checking if a file exists (yay C), or at least the closest agreed-upon method for it (may return false if the user doesn't have permissions, but that's ok for our purposes here) 2017-03-10 13:39:52 -05:00
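
The "closest agreed-upon method" is usually a stat() call (or access()); a minimal sketch, assuming that is what this helper does:

#include <stdbool.h>
#include <sys/stat.h>

/* Returns true if stat succeeds; may return false on permission errors,
   which is acceptable for this use case. */
static bool file_exists_sketch(const char *path) {
    struct stat st;
    return stat(path, &st) == 0;
}
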
Al
3b33325c1a [cli] no longer need geodb setup in address parser client 2017-03-10 13:11:32 -05:00
Al
ef8768281b [parser/crf] adding runtime CRF tagger, which can be loaded/used once trained. Currently only does Viterbi inference, can add top-N and/or sequence probabilities later 2017-03-10 02:06:45 -05:00
Al
9afff5c9ed [parser/crf] adding an initial training algorithm for CRFs, the averaged
perceptron (FTW!)

Though it does not generate scores suitable for use as probabilities, and
might achieve slightly lower accuracy on some tasks than its
gradient-based counterparts like SGD (a possibility for libpostal)
or LBFGS (prohibitive on this much data), the averaged perceptron is
appealing for two reasons: speed and low memory usage i.e. we can still use
all the same tricks as in the greedy model like sparse construction of
the weight matrix. In this case we can go even sparser than in the
original because the state-transition features are separate from the
state features, and we need to be able to iterate over all of them
instead of simply creating new string keys in the feature space. The
solution to this is quite simple: we simply treat the weights for each
state-transition feature as if they have L * L output labels instead of
simply L. So instead of:

{
    "prev|road|word|DD": {1: 1.0, 2: -1.0}
    ...
}

We'd have:

{
    "word|DD": {(0, 1): 1.0, (0, 2): -1.0}
    ...
}
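
Concretely, the (prev label, label) pair just flattens to a single column
index within that feature's row of weights, roughly (illustrative only):

#include <stdint.h>

/* A state-transition feature like "word|DD" owns L * L columns;
   (prev_label, label) maps to one of them. */
static inline uint32_t transition_class_index(uint32_t prev_label, uint32_t label,
                                              uint32_t num_labels) {
    return prev_label * num_labels + label;
}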

As usual we compress the features to a trie, and the weights to
compressed-sparse row (CSR) format sparse matrix after the weights have
been averaged. These representations are smaller, faster to load from
disk, and faster to use at runtime (contiguous arrays vs hashtables).

This also includes the min_updates variation from the greedy perceptron,
so features that participate in fewer than N updates are discarded at
the end (and also not used in scoring until they meet the threshold so
the model doesn't become dependent on features it doesn't really have).
This tends to discard irrelevant features, keeping the model small
without hurting accuracy much (within a tenth of a percent or so in my
tests on the greedy perceptron).
2017-03-10 01:28:31 -05:00
Al
5cac4a7585 [parser/crf] adding crf_trainer, which can be thought of as a "base class" as much as that's possible in C, for creating trainers for the CRF. It doesn't deal with the weights or their representation, just provides an interface for keeping track of string features and label names, and holds the crf_context 2017-03-10 01:25:20 -05:00
Al
dd0bead63a [test/utils] also a good thing to sanity check (in C especially): string handling code 2017-03-10 01:15:23 -05:00
Al
adab8ab51a [test/crf] test for crf_context, adapted from crf1dc_debug_context in CRFsuite. Always a good idea to sanity check numerical code 2017-03-10 01:13:40 -05:00
Al
f9a9dc2224 [parser/crf] adding the beginnings of a linear-chain Conditional Random Field
implementation for the address parser.

One of the main issues with the greedy averaged perceptron tagger used currently
in libpostal is that it predicts left-to-right and commits to its
answers i.e. doesn't revise its previous predictions. The model can use
its own previous predictions to classify the current word, but
effectively it makes the best local decision it can and never looks back
(the YOLO approach to parsing).

This can be problematic in a multilingual setting like libpostal,
since the order of address components is language/country dependent.
It would be preferable to have a model that scores whole
_sequences_ instead of individual tagging decisions.

That's exactly what a Conditional Random Field (CRF) does. Instead of modeling
P(y_i|x_i, y_i-1), we're modeling P(y|x) where y is the whole sequence of labels
and x is the whole sequence of features. They achieve state-of-the-art results
in many tasks (or are a component in the state-of-the-art model - LSTM-CRFs
have been an interesting direction along these lines).

The crf_context module is heavily borrowed from the version in CRFSuite
(https://github.com/chokkan/crfsuite) though using libpostal's data structures and
allowing for "state-transition features." CRFSuite has state features
like "word=the", and transition features i.e. "prev tag=house", but
no notion of a feature which incorporates both local and transition
information e.g. "word=the and prev tag=house". These types of features are useful
in our setting where there are many languages and it might not make as
much sense to simply have a weight for "house_number => road" because that
highly depends on the country. This implementation introduces a T x L^2 matrix for
those state-transition scores.
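
Putting those pieces together, the unnormalized score of a label sequence
is a sum of per-position state, transition, and state-transition terms;
roughly (flat row-major arrays, names and shapes are assumptions):

#include <stddef.h>
#include <stdint.h>

/* score(y) = sum_t state[t][y_t] + trans[y_{t-1}][y_t]
                    + state_trans[t][y_{t-1} * L + y_t] */
static double sequence_score_sketch(const double *state,       /* T x L   */
                                    const double *trans,       /* L x L   */
                                    const double *state_trans, /* T x L^2 */
                                    const uint32_t *y, size_t T, uint32_t L) {
    if (T == 0) return 0.0;
    double score = state[y[0]];
    for (size_t t = 1; t < T; t++) {
        uint32_t prev = y[t - 1], cur = y[t];
        score += state[t * L + cur];
        score += trans[(size_t)prev * L + cur];
        score += state_trans[t * (size_t)L * L + (size_t)prev * L + cur];
    }
    return score;
}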

For linear-chain CRFs, the Viterbi algorithm is used for computing the
most probable sequence. There are versions of Viterbi for computing the
N most probable sequences as well, which may come in handy later. This
can also compute marginal probabilities of a sequence (though it would
need to wait until a gradient-based learning method that produces
well-calibrated probabilities is implemented).

The cool thing architecturally about crf_context as a separate module is that the
weights can be learned through any method we want. As long as the state
scores, state-transition scores, and transition scores are populated on
the context struct, we have everything we need to run Viterbi inference,
etc. without really caring about which training algorithm was used to optimize
the weights, what the features are, how they're stored, etc.

So far the results have been very encouraging. While it is slower to
train a linear-chain CRF, and it will likely add several days to the
training process, it's still reasonably fast at runtime and not all that
slow at training time. In unscientific tests on a busy MacBook Pro, so far
training has been chunking through ~3k addresses / sec, which is only
about half the speed of the greedy tagger (haven't benchmarked the runtime
difference but anecdotally it's hardly noticeable). Libpostal training
runs considerably faster on Linux with gcc, so 3k might be a little low.
I'd also guess that re-computing features every iteration means there's
a limit on the performance of the greedy tagger. The differences might
be more pronounced if features were pre-computed (a possible optimization).
2017-03-10 01:10:22 -05:00
Al
f9e60b13f5 [parser] size the postcode context set appropriately when reading the parser, makes loading a large model much faster 2017-03-09 14:31:12 -05:00
Al
2400122162 [fix] fixing up hash str to id template 2017-03-09 00:54:31 -05:00
Al
4c03e563e0 [parser] for the min updates method to work, the features that have not yet reached the min_updates threshold also need to be ignored when scoring; that way the model has to perform without those features, and should make more updates if they're relevant 2017-03-08 15:40:12 -05:00
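
In other words, the scoring loop treats a feature as absent until its update count crosses the threshold; something like this sketch (dense weights and names are assumptions for illustration):

#include <stddef.h>
#include <stdint.h>

/* Add each qualifying feature's weights (num_features x num_labels, dense
   here for simplicity) into the per-label scores. */
static void score_features_sketch(double *label_scores, const double *weights,
                                  const uint32_t *active_features, size_t num_active,
                                  const uint64_t *update_counts, uint64_t min_updates,
                                  uint32_t num_labels) {
    for (size_t f = 0; f < num_active; f++) {
        uint32_t feature_id = active_features[f];
        if (update_counts[feature_id] < min_updates) continue;
        for (uint32_t y = 0; y < num_labels; y++) {
            label_scores[y] += weights[(size_t)feature_id * num_labels + y];
        }
    }
}
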
Al
a63c182e96 [parser] right context affixes need to use pre-normalized words as well 2017-03-08 13:51:36 -05:00
Al
ce9153d94d [parser] fixing some issues in address_parser_features. Prefix/suffix phrases use the word before token-level normalization (but after string-level normalization like lowercasing), so the feature function needed to use the same string as address_parser_context_fill. Affects some German suffixes like "str." where the final "." would be deleted in token normalization, but the suffix length would include it. Also, three of the new arrays used in address_parser_context (suffix_phrases, prefix_phrases, and sub_tokens) weren't being cleared per call, which means computing the wrong features at best and a segfault at worst 2017-03-07 17:30:53 -05:00