Commit Graph

664 Commits

Author SHA1 Message Date
Al
cf2a79bef1 [api] Default options accessible through getters, not static structs 2016-02-15 17:34:00 -05:00
Al
98c395d34c [numex] Concatenating a string of numeric expressions with no intervening tokens like Seventeen Eighty or Ten Oh Four 2016-02-10 09:21:31 -05:00
Al
59cf5bfc62 [numex] Fixing cases with stopwords not attached to a numeric expression 2016-02-10 08:30:01 -05:00
Al
c32ef9ccf8 [fix] freeing up iterator in normalize_string 2016-02-09 01:06:51 -05:00
Al
12c2477359 [phrases] Another fix to tail token search 2016-02-08 17:55:21 -05:00
Al
39f162b029 [phrases] fix in tokenized tail search when whitespace tokens are preserved 2016-02-08 16:37:52 -05:00
Al
84d5ba18f0 [api] Fixing multi-language expansions with overlapping expansions, whitespace, utf8 normalization of canonical strings 2016-02-08 02:50:34 -05:00
Al
0695738253 [fix] cleaning up memory in normalize_string_languages 2016-02-08 02:43:12 -05:00
Al
afd5844f21 [normalize] Permuting transliterators only once on the entire string rather than at each script break (so # permutations is bounded and can't get huge). Fixing some spacing issues. Adding method to check for an alpha+numeric token in normalization. 2016-02-08 01:16:47 -05:00
Al
aaad213a20 [cli] Adding printf while models are being loaded in address parser cli 2016-02-08 01:10:06 -05:00
Al
9ac0379a65 [phrases] Case where trie search finds a match, makes progress beyond the next token but has to fall back. Adding trie search test case 2016-02-08 01:07:56 -05:00
Al
3701d8380f [cli] Command-line expansion client now supports piping in stdin, Unix-style 2016-02-03 13:48:51 -05:00
Al Barrentine
7536fa4647 [fix] static inline 2016-02-02 00:53:13 -05:00
Al
c0b548833b [fix] create data dir if it doesn't exist 2016-01-30 13:40:10 -05:00
Al
1e65fafaaf [fix] char * 2016-01-30 13:39:36 -05:00
Al
f8de9d8e5a [fix] static methods in numex table loading, mallocs instead of stack variables 2016-01-30 13:25:48 -05:00
Al
085bfd6ada [fix] static methods for libpostal.c 2016-01-30 02:20:59 -05:00
Al
63d239eef0 [tokenization] Using the new re2c 0.16 generates a 75% smaller DFA for scanner, should speed up compile times on gcc 2016-01-30 02:20:01 -05:00
Al
9b3296914a [build] Defining LIBPOSTAL_DATA_DIR at compile time, not configure 2016-01-30 02:18:12 -05:00
Al
cd76c660d8 [fix] French numex 2016-01-28 16:40:50 -05:00
Al
95a7978131 [build] Adding relevant language_classifier sources to build 2016-01-27 03:34:35 -05:00
Al
93ed2bf15b [api] Making language optional in libpostal cli 2016-01-27 03:32:29 -05:00
Al
789db8f582 [build] Adding language classifier to data file download script. As the current file is rather large, added multipart downloads from S3 to speed things up 2016-01-27 03:31:45 -05:00
Al
42d169feee [api] Libpostal expand API will now detect language automatically using a high accuracy language classifier trained on OSM streets/addresses/toponyms. Hooray batch geocoding! 2016-01-27 03:23:51 -05:00
Al
71c51f2e45 [language_classification] Making directory optional on language_classifier client/test program 2016-01-27 03:18:53 -05:00
Al
c770468d03 [expansion] Regenerated address_expansion_data.c 2016-01-27 03:17:59 -05:00
Al
36f52d9707 [fix] Removing feature printing 2016-01-26 15:34:56 -05:00
Al
5077462754 [fix] temporary files for language classifier training 2016-01-26 01:42:21 -05:00
Al
426edccbf8 [language_classification] Simple accuracy-based test program for language classifier. 2016-01-26 01:29:56 -05:00
Al
9abbf42bf4 [language_classifier] Command-line client for language classification 2016-01-26 01:20:59 -05:00
Al
314b65e192 [build] Adding shuffle.c to language_classifier_train 2016-01-26 01:18:35 -05:00
Al
ababb8f2d0 [fix] sign comparison in regularized gradient computation for logistic regression 2016-01-26 01:16:16 -05:00
Al
ae2b839f17 [build] Adding language classifier train/test/cli programs to the build 2016-01-26 00:09:07 -05:00
Al
5d5d5713cc [transliteration] Regenerating transliterator scripts 2016-01-18 12:04:14 -05:00
Al
0dfd8d6439 [language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters) 2016-01-17 21:37:45 -05:00
Al
b9a3230f65 [language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now 2016-01-17 21:13:14 -05:00
Al
f808f74271 [language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set 2016-01-17 21:11:37 -05:00
Al
af5689ee52 [fix] removing unused var 2016-01-17 21:00:17 -05:00
Al
7d727fc8f0 [optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0) 2016-01-17 20:59:47 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
70dbfdd560 [unicode] Regenerating unicode_script_data.c 2016-01-17 20:53:44 -05:00
Al
de240d2b94 [fix] tokenize_add_tokens respects specified length 2016-01-17 20:51:47 -05:00
Al
10cadc67d7 [io] matrix_read using array I/O functions 2016-01-17 20:40:18 -05:00
Al
baba826d21 [io] Cutting down on system calls in trie_read 2016-01-17 20:39:19 -05:00
Al
cba2acc21f [io] Sparse matrix using array I/O methods 2016-01-17 20:38:16 -05:00
Al
46b35c5202 [utils] Adding functions to read numeric arrays from files 2016-01-17 20:36:57 -05:00
Al
d4143c1685 [parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction. 2016-01-15 20:07:21 -05:00
Al
622dc354e7 [optimization] Adding learning rate to lazy sparse update in stochastic gradient descent 2016-01-12 11:04:16 -05:00
Al
79f2b7c192 [build] Removing source from libpostal shared lib 2016-01-12 10:31:22 -05:00
Al
6a9c1e8c6d [build] Adding trie_utils.c to address parser train/test 2016-01-12 10:22:34 -05:00
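
Note on 98c395d34c ([numex] concatenating numeric expressions): the commit message describes joining adjacent numeric expressions that have no intervening tokens, such as "Seventeen Eighty" or "Ten Oh Four". A minimal sketch of that idea in C follows. It assumes each word group has already been parsed to an integer value (17 and 80, or 10, 0 and 4, treating "Oh" as 0) and that the intended results are 1780 and 1004, which the log does not state. The function name `concat_numeric_values` and its signature are hypothetical and are not part of libpostal's API.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not libpostal's actual implementation.
 * Writes the decimal digits of each value in `values` back to back into
 * `out`, so {17, 80} becomes "1780". Returns the number of characters
 * written (excluding the terminating NUL), or -1 if a value is negative
 * or `out` is too small to hold the result. */
int concat_numeric_values(const long *values, size_t n, char *out, size_t out_size) {
    size_t used = 0;

    if (out == NULL || out_size == 0) {
        return -1;
    }
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        if (values[i] < 0) {
            return -1;
        }
        /* snprintf reports how many characters it needed; a result that
         * does not fit in the remaining space means truncation. */
        int written = snprintf(out + used, out_size - used, "%ld", values[i]);
        if (written < 0 || (size_t)written >= out_size - used) {
            return -1;
        }
        used += (size_t)written;
    }
    return (int)used;
}
```

Under these assumptions, calling the helper with the values 17 and 80 and a 16-byte buffer fills the buffer with "1780", and the values 10, 0 and 4 give "1004". A real numex pass would also have to decide when concatenation applies at all (for example, "Twenty One" is a single expression, not 20 followed by 1), which this sketch does not attempt.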