Al
278679b7fb
[fix] in tokenized trie_search, in the case of a partial failed match, reset to the root node before rolling the pointer back to phrase start + 1
2017-04-21 13:51:07 -04:00
Travis
074b6ff802
[auto][ci skip] Adding data files from Travis build #231
2017-04-20 02:39:39 +00:00
Travis
4762ff2638
[auto][ci skip] Adding data files from Travis build #228
2017-04-20 00:51:42 +00:00
Al
f3adde746e
[numex] adding the ability to handle the degree symbol in numex parsing, since it's technically a separate token
2017-04-19 20:18:21 -04:00
Oliver Keyes
35821f975e
Remove unused variable
...
What it says on the tin!
2017-04-18 21:25:00 -07:00
Al
f3cf119e58
[build] Makefile changes to support moving numeric expression parsing to normalize.c
2017-04-18 21:41:24 -04:00
Al
cddc368533
[numex] adding one form of normalization which strips ordinal suffixes so {96th, Ninety-sixth} => 96. This is an additional form of normalization, so there's still one form where the suffixes are kept. One case that's still not handled is something like "IXe Arrondissement"
2017-04-18 21:39:54 -04:00
Al
92051863ba
[numex] adding ordinal suffixes themselves to the numex trie so they can be removed from strings
2017-04-18 17:20:02 -04:00
Al Barrentine
63ac3cf921
Merge pull request #183 from openvenues/cdn
...
Hosting model files and training data on CloudFront CDN
2017-04-17 14:39:35 -04:00
Al
d2732922c2
[data] deployed model files and training data to CloudFront for easier downloading around the world and in places like China where the Great Firewall may prevent large downloads from abroad. TTL is set to 0, so it still caches the files themselves but checks with origin for the If-Modified-Since headers, allowing the files to be updated dynamically
2017-04-17 14:11:44 -04:00
Al Barrentine
5699ef3da0
Merge pull request #181 from eefi/bug/various/initializer
...
[fix] don't use unnamed fields in initializers
2017-04-13 16:22:33 -04:00
Al
36dc41af8c
Merge branch 'master' of https://github.com/openvenues/libpostal
2017-04-13 16:02:06 -04:00
Al
413c584f08
[fix] need to set prev_state to the NULL state in numex parsing after a non-space/non-hyphen is encountered and the previous match, if any, is added to the result array
2017-04-13 16:01:46 -04:00
Austin Chu
f9b57dbd42
[fix] don't use unnamed fields in initializers
...
GCC did not support assigning to unnamed fields from designated
initializers until 4.6 [1]. Unfortunately, CentOS 6 ships with GCC 4.4,
so avoiding this C99 feature is necessary to fix building in CentOS 6
environments.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
2017-04-13 14:44:20 -04:00
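A minimal sketch of the portable pattern the commit above describes: give the union a name so a designated initializer can reach its members on older GCC releases. The `tagged_value_t` struct and its fields are hypothetical and are not taken from the libpostal sources.

```c
/* Hypothetical tagged value type, for illustration only. */
typedef struct {
    int type;
    union {
        int i;
        double d;
    } value; /* named member: initializable via .value = { ... } */
} tagged_value_t;

/* With an unnamed union this would be written { .type = 1, .i = 42 },
 * which is the form the commit reports as failing on GCC 4.4. */
static const tagged_value_t example = { .type = 1, .value = { .i = 42 } };
```

The cost is one extra level of member access (`example.value.i` rather than `example.i`).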
Austin Chu
a966712e18
[fix] add #include guard to tagger.h
2017-04-13 13:02:03 -04:00
Austin Chu
19a04511ba
[fix] typo in compiler warning when no CBLAS found
2017-04-12 20:40:08 -04:00
Al
b464eb6c07
[numex] fix numex parsing when the spelled-out number is followed by a comma or other punctuation
2017-04-11 16:28:33 -04:00
Al
7f7aada32a
[build] add another housekeeping file in the datadir for data_version. Blow away the existing files if that file either doesn't exist or doesn't contain a matching version string, to help with upgrades
2017-04-07 17:40:27 -04:00
Al
5a96be5d5c
[fix][ci skip] S3 upload paths in data upload/download script
2017-04-06 00:37:12 -04:00
Travis
d8409f1f38
[auto][ci skip] Adding data files from Travis build #210
2017-04-06 04:06:16 +00:00
Al
c01e67c1e4
[fix] removing one of the warnings about C90 since this is entirely C99.
2017-04-05 14:51:23 -04:00
Al
caebf4e2c9
[classification] correcting cost functions in SGD and FTRL for use in parameter sweeps
2017-04-05 14:18:13 -04:00
Al
6219cc6378
[numex] add dehyphenated form when building numex table
2017-04-05 14:06:19 -04:00
Al
22443e31cc
[parser] removing special commands other than .exit from address_parser_cli
2017-04-04 20:49:37 -04:00
Al
8742574257
[parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time
2017-04-04 20:40:55 -04:00
Al
eff7a7a27a
[optimization] moving regularization methods to their own module
2017-04-03 00:16:30 -04:00
Al
957aa0c0c9
[utils] cartesian product iterator for grid search during model selection
2017-04-03 00:15:31 -04:00
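A cartesian product iterator for grid search can be sketched as an odometer over per-dimension indices, as below. This is an illustrative sketch, not the libpostal implementation: the `cartesian_iter_t` type, the function names, and the fixed `CARTESIAN_MAX_DIMS` limit are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define CARTESIAN_MAX_DIMS 8 /* assumed fixed limit to avoid allocation */

typedef struct {
    size_t num_dims;
    size_t sizes[CARTESIAN_MAX_DIMS];   /* number of values per dimension */
    size_t indices[CARTESIAN_MAX_DIMS]; /* current combination */
    bool started;
    bool done;
} cartesian_iter_t;

/* Returns false if the dimensions are invalid (none, too many, or empty). */
static bool cartesian_iter_init(cartesian_iter_t *it, const size_t *sizes,
                                size_t num_dims) {
    if (num_dims == 0 || num_dims > CARTESIAN_MAX_DIMS) return false;
    it->num_dims = num_dims;
    it->started = false;
    it->done = false;
    for (size_t i = 0; i < num_dims; i++) {
        if (sizes[i] == 0) return false;
        it->sizes[i] = sizes[i];
        it->indices[i] = 0;
    }
    return true;
}

/* Moves to the next combination; returns false once all are exhausted.
 * The first call yields the all-zeros combination. */
static bool cartesian_iter_next(cartesian_iter_t *it) {
    if (it->done) return false;
    if (!it->started) {
        it->started = true;
        return true;
    }
    /* Increment the last dimension, carrying leftwards on overflow. */
    for (size_t i = it->num_dims; i-- > 0;) {
        if (++it->indices[i] < it->sizes[i]) return true;
        it->indices[i] = 0;
    }
    it->done = true;
    return false;
}
```

In a grid search, each entry of `indices` would select one candidate value for one hyperparameter, such as a learning rate or a regularization strength.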
Al
4a72afc712
[build] Makefile changes for new language_classifier_train
2017-04-02 23:55:31 -04:00
Al
378a11c88f
[fix] expansion array destroy API in libpostal expand program
2017-04-02 23:55:04 -04:00
Al
c5e2f89ee9
[fix] declaring is_common_script function as static
2017-04-02 23:53:21 -04:00
Al
5dfdd4b7eb
[language_classification] Runtime language classifier can now use dense or sparse weights, with a different header signature for the sparse version (using old signature for the dense version, so backward-compatible)
2017-04-02 23:51:54 -04:00
Al
835d851310
[log] log the offending line if token count does not match in language_classifier_io
2017-04-02 23:47:07 -04:00
Al
964ac15e51
[language_classification] adding options to language_classifier_train for using SGD with {L2, L1} regularization or FTRL-Proximal using both.
...
1. Creates sparse matrix for L1 SGD and FTRL
2. Uses the one standard-error rule during cross-validation.
Parameters within one standard error of the lowest-cost solution
are preferred if they are better regularized.
3. Pulls weights matrix for only the features that occurred
in a given batch. In the case of FTRL, this needs to be computed
on each batch, so the sparsity helps here.
2017-04-02 23:46:14 -04:00
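The one standard-error rule from point 2 above can be sketched as follows: find the lowest mean cross-validation cost, then prefer the most strongly regularized candidate whose cost is within one standard error of it. The `cv_result_t` struct and the function name are hypothetical, and treating a larger `reg_strength` as "better regularized" is an assumption about how candidates are ranked.

```c
#include <stddef.h>

/* Hypothetical summary of one hyperparameter setting after cross-validation. */
typedef struct {
    double reg_strength; /* assumed: larger means more regularization */
    double mean_cost;    /* mean cost across folds */
    double std_error;    /* standard error of that mean */
} cv_result_t;

/* Returns the index of the chosen candidate. Requires n > 0. */
static size_t one_standard_error_choice(const cv_result_t *results, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (results[i].mean_cost < results[best].mean_cost) best = i;
    }

    /* Anything at or below this cost is treated as equivalent to the best. */
    double threshold = results[best].mean_cost + results[best].std_error;

    size_t chosen = best;
    for (size_t i = 0; i < n; i++) {
        if (results[i].mean_cost <= threshold &&
            results[i].reg_strength > results[chosen].reg_strength) {
            chosen = i;
        }
    }
    return chosen;
}
```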
Al
58661c9f27
[languages] adding replace_hyphens and split_alpha_from_numeric in language classifier input normalization
2017-04-02 23:32:24 -04:00
Al
e4ed759f0d
[math] using new matrix methods in softmax
2017-04-02 23:29:52 -04:00
Al
3aab15a0a0
[math] adding mean, variance and standard deviation to generic vector functions
2017-04-02 23:29:15 -04:00
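For reference, the mean and variance of a vector can be computed with a simple two-pass approach like the sketch below. The function names are hypothetical, and the population form (dividing by n rather than n - 1) is an assumption; the commit does not say which form libpostal uses. The standard deviation is the square root of the variance.

```c
#include <stddef.h>

/* Arithmetic mean. Requires n > 0. */
static double vector_mean(const double *values, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum / (double)n;
}

/* Population variance (divides by n). Requires n > 0. */
static double vector_variance(const double *values, size_t n) {
    double mean = vector_mean(values, n);
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = values[i] - mean;
        sum_sq += diff * diff;
    }
    return sum_sq / (double)n;
}
```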
Al
3cb513a8f2
[utils] hash_get is no longer a string-only function, can be used for generic hashtables
2017-04-02 23:28:17 -04:00
Al
95e39ad91c
[utils] removing default chunk size from address_parser_train
2017-04-02 23:26:51 -04:00
Al
a4431dbb27
[classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer
2017-04-02 14:32:14 -04:00
Al
64c049730a
[classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers
2017-04-02 14:30:14 -04:00
Al
cf88bc7f65
[optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1 and L2 regularized, and should both encourage sparsity with the L1 penalty while being robust to collinearity of features due to the L2 penalty. Ref: https://research.google.com/pubs/archive/41159.pdf
2017-04-02 14:28:25 -04:00
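For context, the per-coordinate FTRL-Proximal update from the referenced paper can be written as below, for a single weight $w_i$ with accumulators $z_i$ and $n_i$, gradient $g_i$, learning-rate parameters $\alpha$ and $\beta$, and penalties $\lambda_1$ and $\lambda_2$. This is the paper's formulation for one coordinate; how the commit extends it to the multiclass case is not shown here.

```latex
w_i =
\begin{cases}
0 & \text{if } |z_i| \le \lambda_1 \\[4pt]
-\left( \dfrac{\beta + \sqrt{n_i}}{\alpha} + \lambda_2 \right)^{-1}
 \left( z_i - \operatorname{sgn}(z_i)\, \lambda_1 \right) & \text{otherwise}
\end{cases}

\sigma_i = \frac{1}{\alpha} \left( \sqrt{n_i + g_i^2} - \sqrt{n_i} \right),
\qquad
z_i \leftarrow z_i + g_i - \sigma_i w_i,
\qquad
n_i \leftarrow n_i + g_i^2
```

The first case is where the sparsity comes from: any coordinate whose accumulated $|z_i|$ stays at or below $\lambda_1$ has a weight of exactly zero.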
Al
ed05aaabb1
[utils] adding default chunk size to shuffle.h
2017-04-02 13:51:45 -04:00
Al
96e1ca5e89
[utils] sparse_matrix_add_unique_columns_alias, adds the actual column indices to hashtable/array and aliases those in the table from 1 to N (where N is the number of unique columns in this batch). This way it's compatible with smaller matrices of batch weights.
2017-04-02 13:48:46 -04:00
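The aliasing idea in the commit above can be sketched like this: collect the distinct column indices seen in a batch and map each one to a compact id from 1 to N. The function name and signature are hypothetical, and a linear scan stands in for the hashtable the commit mentions, purely to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills `unique` with the distinct column indices in first-seen order and
 * `aliases[i]` with the 1-based alias of columns[i]. Both output buffers
 * must have room for n entries. Returns N, the number of unique columns. */
static size_t alias_unique_columns(const uint32_t *columns, size_t n,
                                   uint32_t *unique, uint32_t *aliases) {
    size_t num_unique = 0;
    for (size_t i = 0; i < n; i++) {
        /* Linear search; a hashtable lookup would replace this in practice. */
        size_t j = 0;
        while (j < num_unique && unique[j] != columns[i]) j++;
        if (j == num_unique) {
            unique[num_unique++] = columns[i];
        }
        aliases[i] = (uint32_t)(j + 1);
    }
    return num_unique;
}
```

A batch weight matrix then needs only N columns, and `unique` maps each alias back to the original column when the batch update is written out.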
Al
a2563a4dcd
[optimization] new sgd_trainer struct to manage weights in stochastic gradient descent. Allows L1 or L2 regularization, using cumulative penalties instead of exponential decay. SGD using L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one
2017-04-02 13:44:59 -04:00
Al
19fe084974
[utils] adding non-branching sign functions
2017-04-02 13:41:57 -04:00
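One common way to write a sign function without an if/else chain is to subtract two comparison results, as in this sketch. The function names are hypothetical, and the commit may use a different formulation; whether the compiled code is truly branch-free also depends on the compiler and target.

```c
#include <stdint.h>

/* Each comparison evaluates to 0 or 1, so the difference is -1, 0 or 1. */
static inline int sign_int32(int32_t x) {
    return (x > 0) - (x < 0);
}

static inline int sign_double(double x) {
    return (x > 0.0) - (x < 0.0);
}
```

A function like this is useful in L1-regularized updates, where the penalty is applied in the direction of each weight's sign.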
Al
267be6c05c
[data] 12 worker pool in data download instead of 10 to download the new parser in one shot
2017-03-31 15:52:17 -04:00
Al
7f8c2f0ad3
[fix] remove bloom.c from libpostal sources
2017-03-31 15:22:48 -04:00
Al
a64c81b45b
[data/models] updating libpostal download script to download new models. The simple data files are stored by libpostal major version, whereas the models are stored by the version of the training data they used. A file called "latest" is stored in S3 to indicate the latest version of the model and checked on make
2017-03-31 13:35:07 -04:00
Al
6d4c7984df
[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions
2017-03-31 03:35:51 -04:00
Al
f7889bf138
[fix] removing WIP
2017-03-29 20:46:56 -04:00