Commit Graph

5025 Commits

Author SHA1 Message Date
Al
9c7eac61eb [docs] merge README from master, move bindings below examples 2017-04-05 02:02:59 -04:00
Al
8ec6e546f5 [test] adding more tests from the demo 2017-04-04 20:52:28 -04:00
Al
22443e31cc [parser] removing special commands other than .exit from address_parser_cli 2017-04-04 20:49:37 -04:00
Al
8742574257 [parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time 2017-04-04 20:40:55 -04:00
Al
67157fbd98 [docs] moving blog post to first paragraph 2017-04-03 21:04:37 -04:00
Al
b8f65d0a06 [docs] aesthetic README changes 2017-04-03 18:18:02 -04:00
Al
f746c6eec6 [openaddresses] Sampson and Yadkin counties, NC, and Union County, SC 2017-04-03 18:08:55 -04:00
Al
bca449e653 [openaddresses] Rowan County, NC 2017-04-03 17:57:03 -04:00
Al
6102fd3459 [openaddresses] Carteret County, NC 2017-04-03 16:55:21 -04:00
Al
342740c3a6 [openaddresses] Bladen County, NC 2017-04-03 16:53:43 -04:00
Al
7c67ca6edb [openaddresses] Beaufort County, NC 2017-04-03 16:52:15 -04:00
Al
680a2e6357 [openaddresses] city of Ruidoso, NM 2017-04-03 16:50:27 -04:00
Al
921e635b7a [openaddresses] add Caddo Parish, LA 2017-04-03 16:48:30 -04:00
Al
e0dc0c9b86 [openaddresses] add Desoto County, FL 2017-04-03 16:45:56 -04:00
Al
20adc591a8 [openaddresses] adding OSM boundaries to Clear Creek County, CO as new data set doesn't list city 2017-04-03 16:38:53 -04:00
Al
4b16b5bccd [docs] README fixes 2017-04-03 16:35:48 -04:00
Al
97ffdbaee0 [openaddresses] removing Lawrence County, SD. Covered by new statewide and has some weird addresses 2017-04-03 16:16:52 -04:00
Al
e4290a489f [openaddresses] Fall River County, SD 2017-04-03 16:15:21 -04:00
Al
c3a6445290 [docs] README updates for 1.0 release, adding training data section 2017-04-03 15:59:01 -04:00
Al
65a0d82bda [openaddresses] moving Buenos Aires, adding Boulder County, CO 2017-04-03 13:08:34 -04:00
Al
eff7a7a27a [optimization] moving regularization methods to their own module 2017-04-03 00:16:30 -04:00
Al
957aa0c0c9 [utils] cartesian product iterator for grid search during model selection 2017-04-03 00:15:31 -04:00
Al
4a72afc712 [build] Makefile changes for new language_classifier_train 2017-04-02 23:55:31 -04:00
Al
378a11c88f [fix] expansion array destroy API in libpostal expand program 2017-04-02 23:55:04 -04:00
Al
c5e2f89ee9 [fix] declaring is_common_script function as static 2017-04-02 23:53:21 -04:00
Al
5dfdd4b7eb [language_classification] Runtime language classifier can now use dense or sparse weights, with a different header signature for the sparse version (using old signature for the dense version, so backward-compatible) 2017-04-02 23:51:54 -04:00
Al
835d851310 [log] log the offending line if token count does not match in language_classifier_io 2017-04-02 23:47:07 -04:00
Al
964ac15e51 [language_classification] adding options to language_classifier_train for using SGD with {L2, L1} regularization or FTRL-Proximal using both.
    1. Creates sparse matrix for L1 SGD and FTRL
    2. Uses the one standard-error rule during cross-validation.
    Parameters within one standard error of the lowest-cost solution
    are preferred if they are better regularized.
    3. Pulls weights matrix for only the features that occurred
    in a given batch. In the case of FTRL, this needs to be computed
    on each batch, so the sparsity helps here.
2017-04-02 23:46:14 -04:00
Al
58661c9f27 [languages] adding replace_hyphens and split_alpha_from_numeric in language classifier input normalization 2017-04-02 23:32:24 -04:00
Al
e4ed759f0d [math] using new matrix methods in softmax 2017-04-02 23:29:52 -04:00
Al
3aab15a0a0 [math] adding mean, variance and standard deviation to generic vector functions 2017-04-02 23:29:15 -04:00
Al
3cb513a8f2 [utils] hash_get is no longer a string-only function, can be used for generic hashtables 2017-04-02 23:28:17 -04:00
Al
95e39ad91c [utils] removing default chunk size from address_parser_train 2017-04-02 23:26:51 -04:00
Al
a4431dbb27 [classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer 2017-04-02 14:32:14 -04:00
Al
64c049730a [classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers 2017-04-02 14:30:14 -04:00
Al
cf88bc7f65 [optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1 and L2 regularized, and should encourage sparsity with the L1 penalty while being robust to collinearity of features due to the L2 penalty. Ref: https://research.google.com/pubs/archive/41159.pdf 2017-04-02 14:28:25 -04:00
Al
ed05aaabb1 [utils] adding default chunk size to shuffle.h 2017-04-02 13:51:45 -04:00
Al
96e1ca5e89 [utils] sparse_matrix_add_unique_columns_alias, adds the actual column indices to hashtable/array and aliases those in the table from 1 to N (where N is the number of unique columns in this batch). This way it's compatible with smaller matrices of batch weights. 2017-04-02 13:48:46 -04:00
Al
a2563a4dcd [optimization] new sgd_trainer struct to manage weights in stochastic gradient descent, allows L1 or L2 regularization, cumulative penalties instead of exponential decay. SGD using L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one 2017-04-02 13:44:59 -04:00
Al
19fe084974 [utils] adding non-branching sign functions 2017-04-02 13:41:57 -04:00
Al
74a281e332 [dictionaries] more abbreviations for MLK 2017-04-01 00:54:14 -04:00
Al
7f30fb8e38 [openaddresses] add OSM boundaries to King, NC 2017-03-31 21:13:32 -04:00
Al
b52f137b5d [openaddresses] adding units to Chelan County, WA, adding Island County, WA 2017-03-31 18:08:43 -04:00
Al
6ec4c1fdc9 [openaddresses] adding units to city of Columbia, MO 2017-03-31 17:44:04 -04:00
Al
f349607412 [openaddresses] adding units in Boone County, MO 2017-03-31 17:27:35 -04:00
Al
bd8de15886 [openaddresses] OSM boundaries no longer needed in Alamance County, NC. Ignore city when it's {ALAMANCECOUNTY, COUNTY} 2017-03-31 17:24:45 -04:00
Al
267be6c05c [data] 12 worker pool in data download instead of 10 to download the new parser in one shot 2017-03-31 15:52:17 -04:00
Al
7f8c2f0ad3 [fix] remove bloom.c from libpostal sources 2017-03-31 15:22:48 -04:00
Al
a64c81b45b [data/models] updating libpostal download script to download new models. The simple data files are stored by libpostal major version, whereas the models are stored by the version of the training data they used. A file called "latest" is stored in S3 to indicate the latest version of the model and checked on make 2017-03-31 13:35:07 -04:00
Al
6d4c7984df [api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions 2017-03-31 03:35:51 -04:00