993 Commits

Author SHA1 Message Date
Al
0c6af2b74c [fix] normalize canonical strings (after expanding abbreviations, concatenated suffixes, etc.) with Latin-ASCII, Latin-ASCII-Simple or simple UTF-8 normalization depending on the options 2017-08-03 14:08:05 -06:00
Al
97044f5a8b [fix] 32-bit safety in numex table loading 2017-07-20 17:55:43 -04:00
Iestyn Pryce
b96a687182 Merge https://github.com/openvenues/libpostal 2017-05-29 18:23:03 +01:00
Travis
8dd84b71ba [auto][ci skip] Adding data files from Travis build #250 2017-05-24 05:05:06 +00:00
Iestyn Pryce
87cf7b5bca Add portable way of formatting khint_t type (from klib) 2017-05-21 11:58:37 +01:00
Iestyn Pryce
d8239a9cc4 Revert format regression introduced in ecd07b18c1 2017-05-21 11:14:21 +01:00
Iestyn Pryce
73d27caeb9 Fix log_* formats which expect long long uint but receive uint64_t. 2017-05-21 10:57:20 +01:00
Iestyn Pryce
6aa3cb61fd Fix log_* formats which expect long long int but receive int64_t. 2017-05-21 10:29:34 +01:00
Iestyn Pryce
ecd07b18c1 Fix log_* formats which expect size_t but receive uint32_t. 2017-05-19 22:31:56 +01:00
Iestyn Pryce
87a76bf967 Fix log_{debug,info} formats which expect size_t but receive int. 2017-05-17 22:40:53 +01:00
Iestyn Pryce
f34fc56fec Fix log_debug formats which expect unsigned int but receive size_t 2017-05-14 17:48:26 +01:00
Al
a7e67c4967 [fix] adding maximum number of permutations for libpostal_expand_address to consider (n=100 for both the inner and outer loop, so max strings=10000), fixes #200 2017-05-13 14:11:08 -04:00
Al
5780a08b48 [fix] check that possible ordinal suffix also has non-zero digit length before normalizing 2017-05-12 15:48:20 -04:00
Al
cea3ced533 [fix] open files in binary format for #69 2017-05-03 17:34:38 -04:00
Al
6ea2273263 [fix] terminate the char_array if input token is zero-length in add_normalized_token 2017-04-28 11:25:07 -04:00
Al
278679b7fb [fix] in tokenized trie_search, in the case of a partial failed match, reset to the root node before rolling the pointer back to phrase start + 1 2017-04-21 13:51:07 -04:00
Travis
074b6ff802 [auto][ci skip] Adding data files from Travis build #231 2017-04-20 02:39:39 +00:00
Travis
4762ff2638 [auto][ci skip] Adding data files from Travis build #228 2017-04-20 00:51:42 +00:00
Al
f3adde746e [numex] adding ability to handle the degree symbol in numex parsing since it's technically a separate token 2017-04-19 20:18:21 -04:00
Oliver Keyes
35821f975e Remove unused variable
What it says on the tin!
2017-04-18 21:25:00 -07:00
Al
f3cf119e58 [build] Makefile changes to support moving numeric expression parsing to normalize.c 2017-04-18 21:41:24 -04:00
Al
cddc368533 [numex] adding one form of normalization which strips ordinal suffixes so {96th, Ninety-sixth} => 96. This is an additional form of normalization, so there's still one form where the suffixes are kept. One case that's still not handled is something like "IXe Arrondissement" 2017-04-18 21:39:54 -04:00
Al
92051863ba [numex] adding ordinal suffixes themselves to the numex trie so they can be removed from strings 2017-04-18 17:20:02 -04:00
Al Barrentine
63ac3cf921 Merge pull request #183 from openvenues/cdn
Hosting model files and training data on CloudFront CDN
2017-04-17 14:39:35 -04:00
Al
d2732922c2 [data] deployed model files and training data to CloudFront for easier downloading around the world and in places like China where the Great Firewall may prevent large downloads from abroad. TTL is set to 0 so it still caches the files themselves but checks with origin for the If-Modified-Since headers, allowing the files to be updated dynamically 2017-04-17 14:11:44 -04:00
Al Barrentine
5699ef3da0 Merge pull request #181 from eefi/bug/various/initializer
[fix] don't use unnamed fields in initializers
2017-04-13 16:22:33 -04:00
Al
36dc41af8c Merge branch 'master' of https://github.com/openvenues/libpostal 2017-04-13 16:02:06 -04:00
Al
413c584f08 [fix] need to set prev_state to the NULL state in numex parsing after a non-space/non-hyphen is encountered and the previous match, if any, is added to the result array 2017-04-13 16:01:46 -04:00
Austin Chu
f9b57dbd42 [fix] don't use unnamed fields in initializers
GCC did not support assigning to unnamed fields from designated
initializers until 4.6 [1]. Unfortunately, CentOS 6 ships with GCC 4.4,
so avoiding this C99 feature is necessary to fix building in CentOS 6
environments.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
2017-04-13 14:44:20 -04:00
Austin Chu
a966712e18 [fix] add #include guard to tagger.h 2017-04-13 13:02:03 -04:00
Austin Chu
19a04511ba [fix] typo in compiler warning when no CBLAS found 2017-04-12 20:40:08 -04:00
Al
b464eb6c07 [numex] fix numex parsing when the spelled-out number is followed by a comma or other punctuation 2017-04-11 16:28:33 -04:00
Al
7f7aada32a [build] add another housekeeping file in the datadir for data_version. Blow away the existing files if that file either doesn't exist or doesn't contain a matching version string to help with upgrades 2017-04-07 17:40:27 -04:00
Al
5a96be5d5c [fix][ci skip] S3 upload paths in data upload/download script 2017-04-06 00:37:12 -04:00
Travis
d8409f1f38 [auto][ci skip] Adding data files from Travis build #210 2017-04-06 04:06:16 +00:00
Al
c01e67c1e4 [fix] removing one of the warnings about C90 since this is entirely C99. 2017-04-05 14:51:23 -04:00
Al
caebf4e2c9 [classification] correcting cost functions in SGD and FTRL for use in parameter sweeps 2017-04-05 14:18:13 -04:00
Al
6219cc6378 [numex] add dehyphenated form when building numex table 2017-04-05 14:06:19 -04:00
Al
22443e31cc [parser] removing special commands other than .exit from address_parser_cli 2017-04-04 20:49:37 -04:00
Al
8742574257 [parser] storing address_parser_context on the parser struct itself so it doesn't have to be allocated every time 2017-04-04 20:40:55 -04:00
Al
eff7a7a27a [optimization] moving regularization methods to their own module 2017-04-03 00:16:30 -04:00
Al
957aa0c0c9 [utils] cartesian product iterator for grid search during model selection 2017-04-03 00:15:31 -04:00
Al
4a72afc712 [build] Makefile changes for new language_classifier_train 2017-04-02 23:55:31 -04:00
Al
378a11c88f [fix] expansion array destroy API in libpostal expand program 2017-04-02 23:55:04 -04:00
Al
c5e2f89ee9 [fix] declaring is_common_script function as static 2017-04-02 23:53:21 -04:00
Al
5dfdd4b7eb [language_classification] Runtime language classifier can now use dense or sparse weights, with a different header signature for the sparse version (using old signature for the dense version, so backward-compatible) 2017-04-02 23:51:54 -04:00
Al
835d851310 [log] log the offending line if token count does not match in language_classifier_io 2017-04-02 23:47:07 -04:00
Al
964ac15e51 [language_classification] adding options to language_classifier_train for using SGD with {L2, L1} regularization or FTRL-Proximal using both.
1. Creates sparse matrix for L1 SGD and FTRL
2. Uses the one standard-error rule during cross-validation.
   Parameters within one standard error of the lowest-cost solution
   are preferred if they are better regularized.
3. Pulls weights matrix for only the features that occurred
   in a given batch. In the case of FTRL, this needs to be computed
   on each batch, so the sparsity helps here.
2017-04-02 23:46:14 -04:00
Al
58661c9f27 [languages] adding replace_hyphens and split_alpha_from_numeric in language classifier input normalization 2017-04-02 23:32:24 -04:00
Al
e4ed759f0d [math] using new matrix methods in softmax 2017-04-02 23:29:52 -04:00