Al
8742574257
[parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time
2017-04-04 20:40:55 -04:00
Al
67157fbd98
[docs] moving blog post to first paragraph
2017-04-03 21:04:37 -04:00
Al
b8f65d0a06
[docs] aesthetic README changes
2017-04-03 18:18:02 -04:00
Al
f746c6eec6
[openaddresses] Sampson and Yadkin counties, NC, and Union County, SC
2017-04-03 18:08:55 -04:00
Al
bca449e653
[openaddresses] Rowan County, NC
2017-04-03 17:57:03 -04:00
Al
6102fd3459
[openaddresses] Carteret County, NC
2017-04-03 16:55:21 -04:00
Al
342740c3a6
[openaddresses] Bladen County, NC
2017-04-03 16:53:43 -04:00
Al
7c67ca6edb
[openaddresses] Beaufort County, NC
2017-04-03 16:52:15 -04:00
Al
680a2e6357
[openaddresses] city of Ruidoso, NM
2017-04-03 16:50:27 -04:00
Al
921e635b7a
[openaddresses] add Caddo Parish, LA
2017-04-03 16:48:30 -04:00
Al
e0dc0c9b86
[openaddresses] add Desoto County, FL
2017-04-03 16:45:56 -04:00
Al
20adc591a8
[openaddresses] adding OSM boundaries to Clear Creek County, CO as new data set doesn't list city
2017-04-03 16:38:53 -04:00
Al
4b16b5bccd
[docs] README fixes
2017-04-03 16:35:48 -04:00
Al
97ffdbaee0
[openaddresses] removing Lawrence County, SD. Covered by new statewide and has some weird addresses
2017-04-03 16:16:52 -04:00
Al
e4290a489f
[openaddresses] Fall River County, SD
2017-04-03 16:15:21 -04:00
Al
c3a6445290
[docs] README updates for 1.0 release, adding training data section
2017-04-03 15:59:01 -04:00
Al
65a0d82bda
[openaddresses] moving Buenos Aires, adding Boulder County, CO
2017-04-03 13:08:34 -04:00
Al
eff7a7a27a
[optimization] moving regularization methods to their own module
2017-04-03 00:16:30 -04:00
Al
957aa0c0c9
[utils] cartesian product iterator for grid search during model selection
2017-04-03 00:15:31 -04:00
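A minimal sketch of what a cartesian product over a hyperparameter grid might look like, assuming a dict of candidate values per parameter. This is illustrative Python, not the C iterator from the commit; the function name, parameter names, and values are hypothetical.

```python
from itertools import product


def grid_search_candidates(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grid: names and values are assumptions for illustration only.
example_grid = {"lambda1": [0.0, 1e-4], "lambda2": [1e-3, 1e-2], "alpha": [0.1]}
candidates = list(grid_search_candidates(example_grid))
```

Each candidate would then be trained and scored in cross-validation; an iterator avoids materializing the full grid when it is large.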
Al
4a72afc712
[build] Makefile changes for new language_classifier_train
2017-04-02 23:55:31 -04:00
Al
378a11c88f
[fix] expansion array destroy API in libpostal expand program
2017-04-02 23:55:04 -04:00
Al
c5e2f89ee9
[fix] declaring is_common_script function as static
2017-04-02 23:53:21 -04:00
Al
5dfdd4b7eb
[language_classification] Runtime language classifier can now use dense or sparse weights, with a different header signature for the sparse version (using old signature for the dense version, so backward-compatible)
2017-04-02 23:51:54 -04:00
Al
835d851310
[log] log the offending line if token count does not match in language_classifier_io
2017-04-02 23:47:07 -04:00
Al
964ac15e51
[language_classification] adding options to language_classifier_train for using SGD with {L2, L1} regularization or FTRL-Proximal using both.
1. Creates sparse matrix for L1 SGD and FTRL
2. Uses the one standard-error rule during cross-validation.
Parameters within one standard error of the lowest-cost solution
are preferred if they are better regularized.
3. Pulls weights matrix for only the features that occurred
in a given batch. In the case of FTRL, this needs to be computed
on each batch, so the sparsity helps here.
2017-04-02 23:46:14 -04:00
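The one-standard-error rule mentioned in point 2 can be sketched as follows. This is an illustrative Python version, not the code from the commit; it assumes "better regularized" means a larger regularization strength, and the function name and input layout are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_standard_error_choice(results):
    """results: list of (regularization_strength, [cost per fold]).

    Returns the largest strength whose mean cost lies within one standard
    error of the lowest mean cost (assumption: larger = better regularized).
    """
    summary = []
    for strength, costs in results:
        standard_error = stdev(costs) / sqrt(len(costs))
        summary.append((strength, mean(costs), standard_error))
    best = min(summary, key=lambda s: s[1])
    threshold = best[1] + best[2]
    eligible = [s for s in summary if s[1] <= threshold]
    return max(eligible, key=lambda s: s[0])[0]


# Made-up fold costs, for illustration only.
results = [
    (0.001, [0.28, 0.34, 0.31]),
    (0.01, [0.30, 0.34, 0.32]),
    (0.1, [0.40, 0.42, 0.41]),
]
chosen = one_standard_error_choice(results)
```

Here the lowest mean cost belongs to 0.001, but 0.01 is within one standard error of it and is more strongly regularized, so it is preferred.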
Al
58661c9f27
[languages] adding replace_hyphens and split_alpha_from_numeric in language classifier input normalization
2017-04-02 23:32:24 -04:00
Al
e4ed759f0d
[math] using new matrix methods in softmax
2017-04-02 23:29:52 -04:00
Al
3aab15a0a0
[math] adding mean, variance and standard deviation to generic vector functions
2017-04-02 23:29:15 -04:00
Al
3cb513a8f2
[utils] hash_get is no longer a string-only function, can be used for generic hashtables
2017-04-02 23:28:17 -04:00
Al
95e39ad91c
[utils] removing default chunk size from address_parser_train
2017-04-02 23:26:51 -04:00
Al
a4431dbb27
[classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer
2017-04-02 14:32:14 -04:00
Al
64c049730a
[classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers
2017-04-02 14:30:14 -04:00
Al
cf88bc7f65
[optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1 and L2 regularized, and should encourage sparsity with the L1 penalty while being robust to collinearity of features due to the L2 penalty. Ref: https://research.google.com/pubs/archive/41159.pdf
2017-04-02 14:28:25 -04:00
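For orientation, a minimal sketch of the per-coordinate FTRL-Proximal update as described in the referenced paper, written for a single weight vector. The commit's multiclass adaptation is not reproduced here; the class name and default hyperparameters are assumptions.

```python
from math import sqrt


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for one weight vector (illustrative sketch)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, lambda1=0.0, lambda2=0.0):
        self.alpha, self.beta = alpha, beta
        self.lambda1, self.lambda2 = lambda1, lambda2
        self.z = [0.0] * dim  # adjusted gradient sums
        self.n = [0.0] * dim  # sums of squared gradients

    def weight(self, i):
        """Weights are derived lazily from z and n; L1 clips small ones to zero."""
        z = self.z[i]
        if abs(z) <= self.lambda1:
            return 0.0
        sign = (z > 0) - (z < 0)
        scale = (self.beta + sqrt(self.n[i])) / self.alpha + self.lambda2
        return -(z - sign * self.lambda1) / scale

    def update(self, i, g):
        """Apply gradient g for coordinate i."""
        w = self.weight(i)
        sigma = (sqrt(self.n[i] + g * g) - sqrt(self.n[i])) / self.alpha
        self.z[i] += g - sigma * w
        self.n[i] += g * g
```

Because a weight is exactly zero whenever |z| does not exceed `lambda1`, the L1 term yields sparse solutions, while `lambda2` shrinks the remaining weights.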
Al
ed05aaabb1
[utils] adding default chunk size to shuffle.h
2017-04-02 13:51:45 -04:00
Al
96e1ca5e89
[utils] sparse_matrix_add_unique_columns_alias, adds the actual column indices to hashtable/array and aliases those in the table from 1 to N (where N is the number of unique columns in this batch). This way it's compatible with smaller matrices of batch weights.
2017-04-02 13:48:46 -04:00
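The aliasing idea can be sketched roughly as below: collect the distinct column indices seen in a batch and renumber them so that a small per-batch weight matrix can be indexed directly. This is an illustrative Python version with a hypothetical function name and input layout; it uses 0-based aliases, whereas the commit message describes 1 to N.

```python
def alias_unique_columns(batch_rows):
    """batch_rows: list of rows, each a list of (column, value) pairs.

    Returns (columns, aliased_rows), where columns[k] is the original
    column index behind alias k, in order of first appearance.
    """
    alias = {}
    columns = []
    aliased_rows = []
    for row in batch_rows:
        new_row = []
        for col, val in row:
            if col not in alias:
                alias[col] = len(columns)
                columns.append(col)
            new_row.append((alias[col], val))
        aliased_rows.append(new_row)
    return columns, aliased_rows


# Made-up sparse batch: two rows touching three distinct feature columns.
batch = [[(1007, 1.0), (42, 2.0)], [(42, 1.0), (99999, 3.0)]]
columns, aliased = alias_unique_columns(batch)
```

The batch weights then need only `len(columns)` rows instead of one per feature in the full model, and `columns` maps updates back to the full matrix.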
Al
a2563a4dcd
[optimization] new sgd_trainer struct to manage weights in stochastic gradient descent. Allows L1 or L2 regularization and uses cumulative penalties instead of exponential decay. SGD using L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one
2017-04-02 13:44:59 -04:00
Al
19fe084974
[utils] adding non-branching sign functions
2017-04-02 13:41:57 -04:00
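A common way to write a sign function without an if/else is to subtract two comparisons. The sketch below shows the idiom in Python for illustration only; whether the commit uses this exact form is an assumption, and in Python the benefit is brevity rather than avoiding CPU branches.

```python
def sign(x):
    """Return 1, -1 or 0 using comparison arithmetic instead of if/else."""
    return (x > 0) - (x < 0)
```

In C the same expression, `(x > 0) - (x < 0)`, evaluates to an `int` and typically compiles without a conditional jump.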
Al
74a281e332
[dictionaries] more abbreviations for MLK
2017-04-01 00:54:14 -04:00
Al
7f30fb8e38
[openaddresses] add OSM boundaries to King, NC
2017-03-31 21:13:32 -04:00
Al
b52f137b5d
[openaddresses] adding units to Chelan County, WA, adding Island County, WA
2017-03-31 18:08:43 -04:00
Al
6ec4c1fdc9
[openaddresses] adding units to city of Columbia, MO
2017-03-31 17:44:04 -04:00
Al
f349607412
[openaddresses] adding units in Boone County, MO
2017-03-31 17:27:35 -04:00
Al
bd8de15886
[openaddresses] OSM boundaries no longer needed in Alamance County, NC. Ignore city when it's {ALAMANCECOUNTY, COUNTY}
2017-03-31 17:24:45 -04:00
Al
267be6c05c
[data] 12 worker pool in data download instead of 10 to download the new parser in one shot
2017-03-31 15:52:17 -04:00
Al
7f8c2f0ad3
[fix] remove bloom.c from libpostal sources
2017-03-31 15:22:48 -04:00
Al
a64c81b45b
[data/models] updating libpostal download script to download new models. The simple data files are stored by libpostal major version, whereas the models are stored by the version of the training data they used. A file called "latest" is stored in S3 to indicate the latest version of the model and checked on make
2017-03-31 13:35:07 -04:00
Al
6d4c7984df
[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions
2017-03-31 03:35:51 -04:00
Al
f8d7bdf364
[build] defining libpostal .so version in configure.ac, removing dependency on mmap and sparkey
2017-03-31 03:24:19 -04:00
Al
f7b695c642
[build] add /usr/local/include as default include path for test Makefile as well
2017-03-30 15:57:17 -04:00
Al
ace40bf0aa
[rm] removing ax_blas.m4
2017-03-30 15:53:53 -04:00