libpostal

Author	SHA1	Message	Date
Al	eff7a7a27a	[optimization] moving regularization methods to their own module	2017-04-03 00:16:30 -04:00
Al	957aa0c0c9	[utils] cartesian product iterator for grid search during model selection	2017-04-03 00:15:31 -04:00
Al	4a72afc712	[build] Makefile changes for new language_classifier_train	2017-04-02 23:55:31 -04:00
Al	378a11c88f	[fix] expansion array destroy API in libpostal expand program	2017-04-02 23:55:04 -04:00
Al	c5e2f89ee9	[fix] declaring is_common_script function as static	2017-04-02 23:53:21 -04:00
Al	5dfdd4b7eb	[language_classification] Runtime language classifier can now use dense or sparse weights, with a different header signature for the sparse version (using old signature for the dense version, so backward-compatible)	2017-04-02 23:51:54 -04:00
Al	835d851310	[log] log the offending line if token count does not match in language_classifier_io	2017-04-02 23:47:07 -04:00
Al	964ac15e51	[language_classification] adding options to language_classifier_train for using SGD with {L2, L1} regularization or FTRL-Proximal using both. 1. Creates sparse matrix for L1 SGD and FTRL. 2. Uses the one standard-error rule during cross-validation. Parameters within one standard error of the lowest-cost solution are preferred if they are better regularized. 3. Pulls weights matrix for only the features that occurred in a given batch. In the case of FTRL, this needs to be computed on each batch, so the sparsity helps here.	2017-04-02 23:46:14 -04:00
Al	58661c9f27	[languages] adding replace_hyphens and split_alpha_from_numeric in language classifier input normalization	2017-04-02 23:32:24 -04:00
Al	e4ed759f0d	[math] using new matrix methods in softmax	2017-04-02 23:29:52 -04:00
Al	3aab15a0a0	[math] adding mean, variance and standard deviation to generic vector functions	2017-04-02 23:29:15 -04:00
Al	3cb513a8f2	[utils] hash_get is no longer a string-only function, can be used for generic hashtables	2017-04-02 23:28:17 -04:00
Al	95e39ad91c	[utils] removing default chunk size from address_parser_train	2017-04-02 23:26:51 -04:00
Al	a4431dbb27	[classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer	2017-04-02 14:32:14 -04:00
Al	64c049730a	[classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers	2017-04-02 14:30:14 -04:00
Al	cf88bc7f65	[optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1 and L2 regularized, and should both encourage sparsity with the L1 penalty while being robust to collinearity of features due to the L2 penalty. Ref: https://research.google.com/pubs/archive/41159.pdf	2017-04-02 14:28:25 -04:00
Al	ed05aaabb1	[utils] adding default chunk size to shuffle.h	2017-04-02 13:51:45 -04:00
Al	96e1ca5e89	[utils] sparse_matrix_add_unique_columns_alias, adds the actual column indices to hashtable/array and aliases those in the table from 1 to N (where N is the number of unique columns in this batch). This way it's compatible with smaller matrices of batch weights.	2017-04-02 13:48:46 -04:00
Al	a2563a4dcd	[optimization] new sgd_trainer struct to manage weights in stochastic gradient descent, allows L1 or L2 regularization, cumulative penalties instead of exponential decay. SGD using L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one	2017-04-02 13:44:59 -04:00
Al	19fe084974	[utils] adding non-branching sign functions	2017-04-02 13:41:57 -04:00
Al	267be6c05c	[data] 12 worker pool in data download instead of 10 to download the new parser in one shot	2017-03-31 15:52:17 -04:00
Al	7f8c2f0ad3	[fix] remove bloom.c from libpostal sources	2017-03-31 15:22:48 -04:00
Al	a64c81b45b	[data/models] updating libpostal download script to download new models. The simple data files are stored by libpostal major version, whereas the models are stored by the version of the training data they used. A file called "latest" is stored in S3 to indicate the latest version of the model and checked on make	2017-03-31 13:35:07 -04:00
Al	6d4c7984df	[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions	2017-03-31 03:35:51 -04:00
Al	f7889bf138	[fix] removing WIP	2017-03-29 20:46:56 -04:00
Al	d757baaf56	[fix] HAVE_CBLAS in matrix.h, memcpy needs to use sizeof(type)	2017-03-29 19:01:13 -04:00
Al	a7c9b919e9	[build] trying a CBLAS-specific macro that doesn't rope in Fortran	2017-03-29 18:57:32 -04:00
Al	7fe84e6247	[matrix/utils] adding resize_fill_zeros	2017-03-21 01:37:08 -04:00
Al	7218ca1316	[openaddresses] adding Chesterfield, SC	2017-03-19 16:10:29 -04:00
Al	3b9b43f1b5	[fix] handle multiple separators (like parens used in https://www.openstreetmap.org/node/244081449 ). Creates bad trie entries otherwise, which affect more than just that toponym	2017-03-18 06:09:52 -04:00
Al	c67678087f	[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead	2017-03-18 06:05:28 -04:00
Al	cb112f0ea7	[transliteration] regenerate transliteration data	2017-03-17 18:28:41 -04:00
Al	dfabd25e5d	[phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched	2017-03-17 03:40:29 -04:00
Al	f4a9e9d673	[fix] don't compare a double to 0	2017-03-15 14:59:33 -04:00
Al	266065f22f	[fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true	2017-03-15 14:11:59 -04:00
Al	0b27eb3f74	[parser] thought numeric boundary names had already been removed in the source data, but somehow they've made it into one of the data sets. Doing a final check in context_fill for valid boundary names (currently valid if there's at least one non-digit token)	2017-03-15 13:07:21 -04:00
Al	1b2696b3b5	[utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category)	2017-03-15 13:04:39 -04:00
Al	1a1f0a44d2	[parser] parser only inserts spaces in the output if there were spaces (or other ignorable tokens) in the normalized input	2017-03-15 03:35:03 -04:00
Al	d43989cf1c	[fix] log_sum_exp in SSE mode shouldn't modify the original array	2017-03-15 00:22:17 -04:00
Al	6cf113b1df	[fix] handle case of T = 0 in Viterbi decoding	2017-03-12 22:55:52 -04:00
Al	35ccb3ee62	[fix] move	2017-03-12 20:22:35 -04:00
Al	d40a355d8b	[fix] heap issues when cleaning up CRF	2017-03-12 20:20:51 -04:00
Al	1277f82f52	[logging] some small logging changes to track vocab pre/post pruning	2017-03-12 00:24:52 -05:00
Al	7562cf866b	[crf] in averaged perceptron training for the CRF, need to update transition features when either guess != truth or prev_guess != prev_truth	2017-03-11 05:58:23 -05:00
Al	a6eaf5ebc5	[fix] had taken out a previous optimization while debugging. Don't need to repeatedly update the backpointer array in viterbi to store an argmax when a stack variable will work. Because that's in the quadratic (only in L, the number of labels, which is small) section of the algorithm, even this small change can make a pretty sizeable difference. CRF training speed is now roughly on par with the greedy model	2017-03-11 02:31:52 -05:00
Al	647ddf171d	[fix] formatting for print features in CRF model	2017-03-11 01:18:35 -05:00
Al	d876beb386	[fix] add CRF files to the main lib	2017-03-10 19:40:15 -05:00
Al	0ec590916b	[build] adding necessary sources to address_parser client, address_parser_train and address_parser_test	2017-03-10 19:33:31 -05:00
Al	25649f2122	[utils] new_fixed and resize_fixed in vector.h	2017-03-10 19:31:34 -05:00
Al	4e02a54a79	[utils] adding file_exists to header	2017-03-10 19:30:46 -05:00
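
Commit c67678087f above describes storing postal code<=>admin relationships as a bipartite graph in "indptr + indices" form. The following is a minimal Python sketch of that layout, not libpostal's C implementation: the function names (build_adjacency, admins_for, has_edge), the integer ids, and the sample edges are all hypothetical. This sketch also keeps one extra sentinel entry at the end of indptr for convenience.

```python
# Hypothetical sketch of an indptr + indices adjacency layout.
# Names and data are illustrative assumptions, not libpostal's API.
from typing import List, Tuple


def build_adjacency(num_postal_codes: int,
                    edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Pack (postal_code_id, admin_id) edges into two flat integer arrays."""
    # Count how many admin ids each postal code is linked to.
    counts = [0] * num_postal_codes
    for postal_id, _ in edges:
        counts[postal_id] += 1

    # indptr[i]:indptr[i + 1] delimits the slice of indices for postal code i.
    indptr = [0] * (num_postal_codes + 1)
    for i, count in enumerate(counts):
        indptr[i + 1] = indptr[i] + count

    # Fill each postal code's slice in the order its edges appear.
    indices = [0] * len(edges)
    next_slot = indptr[:-1].copy()
    for postal_id, admin_id in edges:
        indices[next_slot[postal_id]] = admin_id
        next_slot[postal_id] += 1
    return indptr, indices


def admins_for(indptr: List[int], indices: List[int], postal_id: int) -> List[int]:
    """Return the admin ids linked to one postal code."""
    return indices[indptr[postal_id]:indptr[postal_id + 1]]


def has_edge(indptr: List[int], indices: List[int],
             postal_id: int, admin_id: int) -> bool:
    """Check a single postal code<=>admin relationship with a linear scan."""
    return admin_id in admins_for(indptr, indices, postal_id)
```

The storage is one integer per postal code plus one per edge, which is the |V| + |E| trade-off the commit message refers to; a lookup scans only the short slice belonging to one postal code rather than probing a hashtable of packed pairs.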

1003 Commits