- store a vector of update counts, one per feature in the model
- when the model updates after making a mistake, increment the update
counters for the features observed in that example
- after training is finished, keep only the features that participated
in a minimum number of updates (see the sketch below)
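
As a rough sketch of the bookkeeping (in C; the names and the dense
feature layout here are illustrative, not libpostal's actual data
structures):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        size_t num_features;
        double *weights;         /* one weight per feature (binary case) */
        uint32_t *update_counts; /* updates each feature participated in */
    } perceptron_t;

    /* Called only when the model makes a mistake on an example: apply
     * the usual perceptron update and bump the counter for each feature
     * observed in that example. */
    static void perceptron_update(perceptron_t *model,
                                  const uint32_t *feature_ids,
                                  size_t n, double direction) {
        for (size_t i = 0; i < n; i++) {
            uint32_t f = feature_ids[i];
            model->weights[f] += direction;
            model->update_counts[f]++;
        }
    }

    /* After training: drop any feature that participated in fewer than
     * min_updates updates. Surviving features can then be remapped to a
     * smaller contiguous id space before the model is serialized. */
    static size_t prune_features(perceptron_t *model, uint32_t min_updates) {
        size_t kept = 0;
        for (size_t f = 0; f < model->num_features; f++) {
            if (model->update_counts[f] < min_updates) {
                model->weights[f] = 0.0;
            } else {
                kept++;
            }
        }
        return kept;
    }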
This method is described in greater detail in this paper by Yoav
Goldberg: https://www.cs.bgu.ac.il/~yoavg/publications/acl2011sparse.pdf
The authors report a 4x reduction in model size at only a trivial cost
in accuracy. Trials on libpostal so far indicate roughly the same,
though the accuracy cost is greater at smaller training set sizes.
This method is more effective than simple feature pruning, since
pruning is usually based on a feature's frequency in the training set,
and infrequent features can still be important. One caveat is that the
perceptron's early iterations make many updates to irrelevant features
simply because the weights for the more relevant features haven't been
tuned yet. Even so, the number of updates a feature participates in can
be seen as a measure of its relevance to classifying examples.
This commit introduces a --min-features option to address_parser_train
(default=5). Pruning can effectively be turned off by passing
"--min-features 0" or "--min-features 1".