libpostal

Author	SHA1	Message	Date
Iestyn Pryce	87cf7b5bca	Add portable way of formatting khint_t type (from klib)	2017-05-21 11:58:37 +01:00
Iestyn Pryce	73d27caeb9	Fix log_* formats which expect long long uint but receive uint64_t.	2017-05-21 10:57:20 +01:00
Austin Chu	f9b57dbd42	[fix] don't use unnamed fields in initializers. GCC did not support assigning to unnamed fields from designated initializers until 4.6 [1]. Unfortunately, CentOS 6 ships with GCC 4.4, so avoiding this C99 feature is necessary to fix building in CentOS 6 environments. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676	2017-04-13 14:44:20 -04:00
Al	8742574257	[parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time	2017-04-04 20:40:55 -04:00
Al	95e39ad91c	[utils] removing default chunk size from address_parser_train	2017-04-02 23:26:51 -04:00
Al	6d4c7984df	[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions	2017-03-31 03:35:51 -04:00
Al	c67678087f	[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead	2017-03-18 06:05:28 -04:00
Al	266065f22f	[fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true	2017-03-15 14:11:59 -04:00
Al	1277f82f52	[logging] some small logging changes to track vocab pre/post pruning	2017-03-12 00:24:52 -05:00
Al	8deb1716cb	[parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix.	2017-03-10 19:28:21 -05:00
Al	95015990ab	[parser] learning a sparser averaged perceptron model for the parser using the following method: (1) store a vector of update counts for each feature in the model; (2) when the model updates after making a mistake, increment the update counters for the observed features in that example; (3) after the model is finished training, keep only the features that participated in a minimum number of updates. This method is described in greater detail in this paper from Yoav Goldberg: https://www.cs.bgu.ac.il/~yoavg/publications/acl2011sparse.pdf The authors there report a 4x size reduction at only a trivial cost in terms of accuracy. So far the trials on libpostal indicate roughly the same, though at lower training set sizes the accuracy cost is greater. This method is more effective than simple feature pruning as feature pruning methods are usually based on the frequency of the feature in the training set, and infrequent features can still be important. However, the perceptron's early iterations make many updates on irrelevant features simply because the weights for the more relevant features aren't tuned yet. The number of updates a feature participates in can be seen as a measure of its relevance to classifying examples. This commit introduces a --min-features option to address_parser_train (default=5); it can effectively be turned off by using "--min-features 0" or "--min-features 1".	2017-03-06 22:28:33 -05:00
Al	c3581557a1	[parser] counting classes instead of keeping a set	2017-03-06 20:05:01 -05:00
Al	39fa8ff1a5	[parser] counting num classes in address parser init for models where it is needed a priori	2017-03-06 15:17:52 -05:00
Al	5f19e63cbe	[parser] more logging in init	2017-03-06 15:11:39 -05:00
Al	bb922e4ce4	[parser] adding log message	2017-03-06 12:25:22 -05:00
Al	0e49fc580a	[parser] uint64_t chunk size, no warning if gshuf is available	2017-03-05 14:50:47 -05:00
Al	b76b7b8527	[parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a specific chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac as it seems the only problem there was not having enough memory when testing on a Mac laptop. The new limited-memory version should be fast enough.	2017-03-05 02:15:11 -05:00
Al	182d60b623	[fix] removing include	2017-02-23 22:45:03 -05:00
Al	6a079e86b3	[fix] using size_t instead of int in address_parser/address_parser_train	2017-02-20 19:22:13 -08:00
Al	8ea5405c20	[parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction)	2017-02-19 14:21:58 -08:00
Al	ba0ccc82a3	[fix] var name in address_parser_train	2017-02-15 22:22:33 -05:00
Al	ff245d74f8	[parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index.	2017-02-10 00:50:48 -05:00
Al	174529e8d0	[parser] remove geodb and fix small memory leak in address_parser_train	2016-12-29 02:12:06 -05:00
Al	4677874610	[parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations	2016-11-30 15:58:03 -08:00
Al	1b09b7f2e5	[fix] Adding country_region to address_parser_train	2016-07-28 16:18:32 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	16501aba17	[fix] Need to load transliteration module for Latin-ASCII normalization	2016-07-21 17:04:57 -04:00
Al	6ef7c90278	[fix] using string_equals, handles NULLs	2016-01-05 14:08:10 -05:00
Al	24208c209f	[parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).	2015-12-05 14:34:19 -05:00
Al	116fe857db	[parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac	2015-12-01 11:24:44 -05:00
Al	89677d94a3	[parsing] Initial commit of the address parser, training/testing, feature function, I/O	2015-11-30 14:48:13 -05:00
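
Commit c67678087f describes storing postal code<=>admin relationships as a bipartite graph with an indptr array and an indices array. The following is a minimal Python sketch of that layout, not the C code from the commit. The function names, the integer ids, and the dict-of-lists input are all assumptions made for illustration. Note that this sketch uses one extra indptr entry (number of postal codes + 1) so every slice has an end offset.

```python
from typing import Dict, List, Tuple


def build_bipartite_index(
    edges: Dict[int, List[int]], num_postal_codes: int
) -> Tuple[List[int], List[int]]:
    """Pack postal_code_id -> [admin_id, ...] into two flat arrays.

    indptr[i]..indptr[i + 1] delimits the slice of `indices` holding the
    admin ids connected to postal code i (hypothetical integer ids).
    """
    indptr = [0]
    indices: List[int] = []
    for postal_code_id in range(num_postal_codes):
        # Sorting each neighbor list keeps lookups predictable.
        indices.extend(sorted(edges.get(postal_code_id, [])))
        indptr.append(len(indices))
    return indptr, indices


def admin_ids_for(
    indptr: List[int], indices: List[int], postal_code_id: int
) -> List[int]:
    """Return the admin ids that are valid contexts for one postal code."""
    return indices[indptr[postal_code_id]:indptr[postal_code_id + 1]]


def has_edge(
    indptr: List[int], indices: List[int], postal_code_id: int, admin_id: int
) -> bool:
    """Check whether a postal code and an admin component are connected."""
    return admin_id in admin_ids_for(indptr, indices, postal_code_id)
```

Storage is one offset per postal code plus one entry per edge, which matches the |V(postal codes)| + |E| figure in the commit message up to the single extra offset, and no hashtable is needed for lookups.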

31 Commits
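
The update-count pruning described in commit 95015990ab can be illustrated with a small Python sketch. This is not libpostal's implementation: it uses a plain (non-averaged) multiclass perceptron to stay short, and the function name, the `min_updates` parameter, and the data layout are hypothetical. The commit exposes the threshold as `--min-features`.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[str], str]  # (observed features, true class)


def train_sparse_perceptron(
    examples: List[Example],
    classes: List[str],
    iterations: int = 5,
    min_updates: int = 5,
) -> Dict[str, Dict[str, float]]:
    """Train a perceptron, then keep only features updated often enough."""
    weights: Dict[str, Dict[str, float]] = {}  # feature -> class -> weight
    update_counts: Dict[str, int] = {}         # feature -> number of updates

    for _ in range(iterations):
        for features, truth in examples:
            scores = {c: 0.0 for c in classes}
            for f in features:
                for c, w in weights.get(f, {}).items():
                    scores[c] += w
            # Ties resolve to the earliest class in `classes`.
            guess = max(classes, key=lambda c: scores[c])
            if guess == truth:
                continue
            # Mistake: update weights and count the update for each
            # feature observed in this example.
            for f in features:
                w = weights.setdefault(f, {})
                w[truth] = w.get(truth, 0.0) + 1.0
                w[guess] = w.get(guess, 0.0) - 1.0
                update_counts[f] = update_counts.get(f, 0) + 1

    # Prune features that participated in fewer than min_updates updates.
    return {
        f: dict(w)
        for f, w in weights.items()
        if update_counts.get(f, 0) >= min_updates
    }
```

With `min_updates` set to 0 or 1, every feature that was ever updated is kept, which mirrors the commit's note that those values effectively turn pruning off.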