Commit Graph

4983 Commits

Author SHA1 Message Date
Al
6219cc6378 [numex] add dehyphenated form when building numex table 2017-04-05 14:06:19 -04:00
Al
264866d719 [build/fix] autoconf syntax for Ubuntu (12.04) version of autoconf aka that used on Travis 2017-04-05 09:43:24 -04:00
Al
ef0d4c2ded [build] fixing checks in numex.py, run when the resources/numex directory changes 2017-04-05 08:53:48 -04:00
Al
0ec2e57afa [fix] adding yaml to requirements-simple.txt for CI 2017-04-05 08:33:39 -04:00
Al
64fae1e241 [fix] /AC_CONFIG_MACRO_DIRS/AC_CONFIG_MACRO_DIR/ 2017-04-05 08:27:44 -04:00
Al
2b3fb196a1 [build] add pkg-config to packages in Travis config, remove libsnappy-dev 2017-04-05 08:24:26 -04:00
Al
8cef3c4eb9 [docs] new parser GIF, featuring addresses relevant to current events 2017-04-05 07:21:48 -04:00
Al
aaae1e055e [docs] fix spacing 2017-04-05 02:03:39 -04:00
Al
9c7eac61eb [docs] merge README from master, move bindings below examples 2017-04-05 02:02:59 -04:00
Al
8ec6e546f5 [test] adding more tests from the demo 2017-04-04 20:52:28 -04:00
Al
22443e31cc [parser] removing special commands other than .exit from address_parser_cli 2017-04-04 20:49:37 -04:00
Al
8742574257 [parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time 2017-04-04 20:40:55 -04:00
Al
67157fbd98 [docs] moving blog post to first paragraph 2017-04-03 21:04:37 -04:00
Al
b8f65d0a06 [docs] aesthetic README changes 2017-04-03 18:18:02 -04:00
Al
f746c6eec6 [openaddresses] Sampson and Yadkin counties, NC, and Union County, SC 2017-04-03 18:08:55 -04:00
Al
bca449e653 [openaddresses] Rowan County, NC 2017-04-03 17:57:03 -04:00
Al
6102fd3459 [openaddresses] Carteret County, NC 2017-04-03 16:55:21 -04:00
Al
342740c3a6 [openaddresses] Bladen County, NC 2017-04-03 16:53:43 -04:00
Al
7c67ca6edb [openaddresses] Beaufort County, NC 2017-04-03 16:52:15 -04:00
Al
680a2e6357 [openaddresses] city of Ruidoso, NM 2017-04-03 16:50:27 -04:00
Al
921e635b7a [openaddresses] add Caddo Parish, LA 2017-04-03 16:48:30 -04:00
Al
e0dc0c9b86 [openaddresses] add Desoto County, FL 2017-04-03 16:45:56 -04:00
Al
20adc591a8 [openaddresses] adding OSM boundaries to Clear Creek County, CO as new data set doesn't list city 2017-04-03 16:38:53 -04:00
Al
4b16b5bccd [docs] README fixes 2017-04-03 16:35:48 -04:00
Al
97ffdbaee0 [openaddresses] removing Lawrence County, SD. Covered by new statewide and has some weird addresses 2017-04-03 16:16:52 -04:00
Al
e4290a489f [openaddresses] Fall River County, SD 2017-04-03 16:15:21 -04:00
Al
c3a6445290 [docs] README updates for 1.0 release, adding training data section 2017-04-03 15:59:01 -04:00
Al
65a0d82bda [openaddresses] moving Buenos Aires, adding Boulder County, CO 2017-04-03 13:08:34 -04:00
Al
eff7a7a27a [optimization] moving regularization methods to their own module 2017-04-03 00:16:30 -04:00
Al
957aa0c0c9 [utils] cartesian product iterator for grid search during model selection 2017-04-03 00:15:31 -04:00
Al
4a72afc712 [build] Makefile changes for new language_classifier_train 2017-04-02 23:55:31 -04:00
Al
378a11c88f [fix] expansion array destroy API in libpostal expand program 2017-04-02 23:55:04 -04:00
Al
c5e2f89ee9 [fix] declaring is_common_script function as static 2017-04-02 23:53:21 -04:00
Al
5dfdd4b7eb [language_classification] Runtime language classifier can now use dense or sparse weights, with a different header signature for the sparse version (using old signature for the dense version, so backward-compatible) 2017-04-02 23:51:54 -04:00
Al
835d851310 [log] log the offending line if token count does not match in language_classifier_io 2017-04-02 23:47:07 -04:00
Al
964ac15e51 [language_classification] adding options to language_classifier_train for using SGD with {L2, L1} regularization or FTRL-Proximal using both.
    1. Creates sparse matrix for L1 SGD and FTRL
    2. Uses the one standard-error rule during cross-validation.
    Parameters within one standard error of the lowest-cost solution
    are preferred if they are better regularized.
    3. Pulls weights matrix for only the features that occurred
    in a given batch. In the case of FTRL, this needs to be computed
    on each batch, so the sparsity helps here.
2017-04-02 23:46:14 -04:00
Al
58661c9f27 [languages] adding replace_hyphens and split_alpha_from_numeric in language classifier input normalization 2017-04-02 23:32:24 -04:00
Al
e4ed759f0d [math] using new matrix methods in softmax 2017-04-02 23:29:52 -04:00
Al
3aab15a0a0 [math] adding mean, variance and standard deviation to generic vector functions 2017-04-02 23:29:15 -04:00
Al
3cb513a8f2 [utils] hash_get is no longer a string-only function, can be used for generic hashtables 2017-04-02 23:28:17 -04:00
Al
95e39ad91c [utils] removing default chunk size from address_parser_train 2017-04-02 23:26:51 -04:00
Al
a4431dbb27 [classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer 2017-04-02 14:32:14 -04:00
Al
64c049730a [classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers 2017-04-02 14:30:14 -04:00
Al
cf88bc7f65 [optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1 and L2 regularized, and should encourage sparsity with the L1 penalty while remaining robust to collinearity of features due to the L2 penalty. Ref: https://research.google.com/pubs/archive/41159.pdf 2017-04-02 14:28:25 -04:00
Al
ed05aaabb1 [utils] adding default chunk size to shuffle.h 2017-04-02 13:51:45 -04:00
Al
96e1ca5e89 [utils] sparse_matrix_add_unique_columns_alias, adds the actual column indices to hashtable/array and aliases those in the table from 1 to N (where N is the number of unique columns in this batch). This way it's compatible with smaller matrices of batch weights. 2017-04-02 13:48:46 -04:00
Al
a2563a4dcd [optimization] new sgd_trainer struct to manage weights in stochastic gradient descent. Allows L1 or L2 regularization, using cumulative penalties instead of exponential decay. SGD using L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one 2017-04-02 13:44:59 -04:00
Al
19fe084974 [utils] adding non-branching sign functions 2017-04-02 13:41:57 -04:00
Al
74a281e332 [dictionaries] more abbreviations for MLK 2017-04-01 00:54:14 -04:00
Al
7f30fb8e38 [openaddresses] add OSM boundaries to King, NC 2017-03-31 21:13:32 -04:00