31 Commits

Author SHA1 Message Date
Iestyn Pryce
87cf7b5bca Add portable way of formatting khint_t type (from klib) 2017-05-21 11:58:37 +01:00
Iestyn Pryce
73d27caeb9 Fix log_* formats which expect long long uint but receive uint64_t. 2017-05-21 10:57:20 +01:00
Austin Chu
f9b57dbd42 [fix] don't use unnamed fields in initializers
GCC did not support assigning to unnamed fields from designated
initializers until 4.6 [1]. Unfortunately, CentOS 6 ships with GCC 4.4,
so avoiding this C99 feature is necessary to fix building in CentOS 6
environments.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
2017-04-13 14:44:20 -04:00
Al
8742574257 [parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time 2017-04-04 20:40:55 -04:00
Al
95e39ad91c [utils] removing default chunk size from address_parser_train 2017-04-02 23:26:51 -04:00
Al
6d4c7984df [api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions 2017-03-31 03:35:51 -04:00
Al
c67678087f [parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead 2017-03-18 06:05:28 -04:00
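
The indptr + indices layout mentioned in the commit above can be illustrated with a minimal sketch. This is not the libpostal implementation (which is in C); the function names `build_bipartite_graph` and `admin_ids_for` and the use of integer ids are assumptions made for illustration only.

```python
from array import array


def build_bipartite_graph(edges, num_postal_codes):
    """Build (indptr, indices) from (postal_code_id, admin_id) pairs.

    Hypothetical helper: ids are assumed to be small non-negative integers.
    """
    # Group the admin ids by postal code id.
    neighbors = [[] for _ in range(num_postal_codes)]
    for postal_id, admin_id in edges:
        neighbors[postal_id].append(admin_id)

    # "I" is an unsigned int (typically 32 bits), so storage is roughly
    # |V(postal codes)| + |E| small ints rather than one 64-bit key per edge.
    indptr = array("I", [0])
    indices = array("I")
    for admin_ids in neighbors:
        indices.extend(sorted(set(admin_ids)))  # drop duplicate edges
        indptr.append(len(indices))             # end offset for this postal code
    return indptr, indices


def admin_ids_for(indptr, indices, postal_id):
    """Return the admin ids adjacent to one postal code."""
    return list(indices[indptr[postal_id]:indptr[postal_id + 1]])
```

Looking up the admin contexts of a postal code is then a slice of `indices` between two consecutive offsets in `indptr`, with no hashtable involved.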
Al
266065f22f [fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true 2017-03-15 14:11:59 -04:00
Al
1277f82f52 [logging] some small logging changes to track vocab pre/post pruning 2017-03-12 00:24:52 -05:00
Al
8deb1716cb [parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix. 2017-03-10 19:28:21 -05:00
Al
95015990ab [parser] learning a sparser averaged perceptron model for the parser using the following method:
- store a vector of update counts for each feature in the model
- when the model updates after making a mistake, increment the update
  counters for the observed features in that example
- after the model is finished training, keep only the features that
  participated in a minimum number of updates

This method is described in greater detail in this paper from Yoav
Goldberg: https://www.cs.bgu.ac.il/~yoavg/publications/acl2011sparse.pdf

The authors there report a 4x size reduction at only a trivial cost in
terms of accuracy. So far the trials on libpostal indicate roughly the
same, though at lower training set sizes the accuracy cost is greater.

This method is more effective than simple feature pruning as feature
pruning methods are usually based on the frequency of the feature
in the training set, and infrequent features can still be important.
However, the perceptron's early iterations make many updates on
irrelevant features simply because the weights for the more relevant
features aren't tuned yet. The number of updates a feature participates
in can be seen as a measure of its relevance to classifying examples.

This commit introduces a --min-features option to address_parser_train
(default=5); pruning can effectively be turned off by using
"--min-features 0" or "--min-features 1".
2017-03-06 22:28:33 -05:00
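
The update-count pruning described in the commit above can be sketched roughly as follows. This is a simplified illustration, not the libpostal code: it uses a plain (non-averaged) multiclass perceptron to stay short, and the names `train_sparse_perceptron` and `min_updates` are hypothetical.

```python
from collections import defaultdict


def train_sparse_perceptron(examples, classes, n_iter=5, min_updates=5):
    """Train a plain perceptron and keep only frequently updated features.

    examples: list of (features, true_class) pairs.
    Simplification: no weight averaging is done here.
    """
    weights = defaultdict(lambda: defaultdict(float))  # feature -> class -> weight
    update_counts = defaultdict(int)                   # feature -> number of updates

    def predict(features):
        scores = {c: 0.0 for c in classes}
        for f in features:
            if f in weights:
                for c, w in weights[f].items():
                    scores[c] += w
        # Ties go to the earliest class in `classes`.
        return max(classes, key=lambda c: scores[c])

    for _ in range(n_iter):
        for features, truth in examples:
            guess = predict(features)
            if guess != truth:
                # On a mistake, update weights and count the update per feature.
                for f in features:
                    weights[f][truth] += 1.0
                    weights[f][guess] -= 1.0
                    update_counts[f] += 1

    # After training, keep only features with enough updates.
    pruned = {f: dict(ws) for f, ws in weights.items()
              if update_counts[f] >= min_updates}
    return pruned, dict(update_counts)
```

With `min_updates` set to 0 or 1, every feature that ever received an update is kept, which matches the commit's note that pruning can effectively be turned off.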
Al
c3581557a1 [parser] counting classes instead of keeping a set 2017-03-06 20:05:01 -05:00
Al
39fa8ff1a5 [parser] counting num classes in address parser init for models where it is needed a priori 2017-03-06 15:17:52 -05:00
Al
5f19e63cbe [parser] more logging in init 2017-03-06 15:11:39 -05:00
Al
bb922e4ce4 [parser] adding log message 2017-03-06 12:25:22 -05:00
Al
0e49fc580a [parser] uint64_t chunk size, no warning if gshuf is available 2017-03-05 14:50:47 -05:00
Al
b76b7b8527 [parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a specific chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac as it seems the only problem there was not having enough memory when testing on a Mac laptop. The new limited-memory version should be fast enough. 2017-03-05 02:15:11 -05:00
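
A rough sketch of the chunked shuffle idea from the commit above, under stated assumptions: the real version is a C function that runs shuf on each chunk file, whereas this illustration shuffles each chunk in memory with Python's `random` module. The function name `chunked_shuffle` and its parameters are hypothetical.

```python
import os
import random
import tempfile


def chunked_shuffle(input_path, output_path, num_chunks, seed=None):
    """Shuffle a large file while holding only one chunk in memory at a time."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"chunk_{i}") for i in range(num_chunks)]
        chunks = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: write each line to one of n randomly chosen chunk files.
            with open(input_path, encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # keep the last line separable
                    rng.choice(chunks).write(line)
        finally:
            for f in chunks:
                f.close()

        # Pass 2: shuffle each chunk on its own and concatenate the results.
        with open(output_path, "w", encoding="utf-8") as out:
            for p in paths:
                with open(p, encoding="utf-8") as f:
                    lines = f.readlines()
                rng.shuffle(lines)
                out.writelines(lines)
```

Peak memory is bounded by the largest chunk rather than the whole file, so choosing `num_chunks` from a target chunk size (the commit mentions a 2GB limit) keeps each in-memory shuffle manageable.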
Al
182d60b623 [fix] removing include 2017-02-23 22:45:03 -05:00
Al
6a079e86b3 [fix] using size_t instead of int in address_parser/address_parser_train 2017-02-20 19:22:13 -08:00
Al
8ea5405c20 [parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction) 2017-02-19 14:21:58 -08:00
Al
ba0ccc82a3 [fix] var name in address_parser_train 2017-02-15 22:22:33 -05:00
Al
ff245d74f8 [parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index. 2017-02-10 00:50:48 -05:00
Al
174529e8d0 [parser] remove geodb and fix small memory leak in address_parser_train 2016-12-29 02:12:06 -05:00
Al
4677874610 [parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations 2016-11-30 15:58:03 -08:00
Al
1b09b7f2e5 [fix] Adding country_region to address_parser_train 2016-07-28 16:18:32 -04:00
Al
44908ff95a [parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces 2016-07-21 17:04:57 -04:00
Al
16501aba17 [fix] Need to load transliteration module for Latin-ASCII normalization 2016-07-21 17:04:57 -04:00
Al
6ef7c90278 [fix] using string_equals, handles NULLs 2016-01-05 14:08:10 -05:00
Al
24208c209f [parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold). 2015-12-05 14:34:19 -05:00
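
The handling of infrequent tokens in the commit above could look something like the sketch below. Only the names UNKNOWN_WORD and UNKNOWN_NUMERIC come from the commit message; the helper functions and the "contains a digit" test for numeric tokens are assumptions for illustration.

```python
from collections import Counter

UNKNOWN_WORD = "UNKNOWN_WORD"
UNKNOWN_NUMERIC = "UNKNOWN_NUMERIC"


def build_vocab(token_lists, min_count):
    """Keep tokens that meet the minimum vocab count threshold."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, n in counts.items() if n >= min_count}


def word_or_unknown(token, vocab):
    """Map an out-of-vocabulary token to a placeholder feature value."""
    if token in vocab:
        return token
    # Assumption: any token containing a digit counts as numeric.
    if any(ch.isdigit() for ch in token):
        return UNKNOWN_NUMERIC
    return UNKNOWN_WORD
```

For example, with a vocabulary built at `min_count=2`, a token seen only once would be replaced by one of the two placeholders before word features are generated.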
Al
116fe857db [parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac 2015-12-01 11:24:44 -05:00
Al
89677d94a3 [parsing] Initial commit of the address parser, training/testing, feature function, I/O 2015-11-30 14:48:13 -05:00