704 Commits

Author SHA1 Message Date
Al
9ac0379a65 [phrases] Case where trie search finds a match, makes progress beyond the next token but has to fall back. Adding trie search test case 2016-02-08 01:07:56 -05:00
Al
3701d8380f [cli] Command-line expansion client now supports piping in stdin, Unix-style 2016-02-03 13:48:51 -05:00
Al Barrentine
7536fa4647 [fix] static inline 2016-02-02 00:53:13 -05:00
Al
c0b548833b [fix] create data dir if it doesn't exist 2016-01-30 13:40:10 -05:00
Al
1e65fafaaf [fix] char * 2016-01-30 13:39:36 -05:00
Al
f8de9d8e5a [fix] static methods in numex table loading, mallocs instead of stack variables 2016-01-30 13:25:48 -05:00
Al
085bfd6ada [fix] static methods for libpostal.c 2016-01-30 02:20:59 -05:00
Al
63d239eef0 [tokenization] Using the new re2c 0.16 generates a 75% smaller DFA for scanner, should speed up compile times on gcc 2016-01-30 02:20:01 -05:00
Al
9b3296914a [build] Defining LIBPOSTAL_DATA_DIR at compile time, not configure 2016-01-30 02:18:12 -05:00
Al
cd76c660d8 [fix] French numex 2016-01-28 16:40:50 -05:00
Al
95a7978131 [build] Adding relevant language_classifier sources to build 2016-01-27 03:34:35 -05:00
Al
93ed2bf15b [api] Making language optional in libpostal cli 2016-01-27 03:32:29 -05:00
Al
789db8f582 [build] Adding language classifier to data file download script. As the current file is rather large, added multipart downloads from S3 to speed things up 2016-01-27 03:31:45 -05:00
Al
42d169feee [api] Libpostal expand API will now detect language automatically using a high accuracy language classifier trained on OSM streets/addresses/toponyms. Hooray batch geocoding! 2016-01-27 03:23:51 -05:00
Al
71c51f2e45 [language_classification] Making directory optional on language_classifier client/test program 2016-01-27 03:18:53 -05:00
Al
c770468d03 [expansion] Regenerated address_expansion_data.c 2016-01-27 03:17:59 -05:00
Al
36f52d9707 [fix] Removing feature printing 2016-01-26 15:34:56 -05:00
Al
5077462754 [fix] temporary files for language classifier training 2016-01-26 01:42:21 -05:00
Al
426edccbf8 [language_classification] Simple accuracy-based test program for language classifier. 2016-01-26 01:29:56 -05:00
Al
9abbf42bf4 [language_classifier] Command-line client for language classification 2016-01-26 01:20:59 -05:00
Al
314b65e192 [build] Adding shuffle.c to language_classifier_train 2016-01-26 01:18:35 -05:00
Al
ababb8f2d0 [fix] sign comparison in regularized gradient computation for logistic regression 2016-01-26 01:16:16 -05:00
Al
ae2b839f17 [build] Adding language classifier train/test/cli programs to the build 2016-01-26 00:09:07 -05:00
Al
5d5d5713cc [transliteration] Regenerating transliterator scripts 2016-01-18 12:04:14 -05:00
Al
0dfd8d6439 [language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters) 2016-01-17 21:37:45 -05:00
Al
b9a3230f65 [language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now 2016-01-17 21:13:14 -05:00
Al
f808f74271 [language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set 2016-01-17 21:11:37 -05:00
Al
af5689ee52 [fix] removing unused var 2016-01-17 21:00:17 -05:00
Al
7d727fc8f0 [optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0) 2016-01-17 20:59:47 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
70dbfdd560 [unicode] Regenerating unicode_script_data.c 2016-01-17 20:53:44 -05:00
Al
de240d2b94 [fix] tokenize_add_tokens respects specified length 2016-01-17 20:51:47 -05:00
Al
10cadc67d7 [io] matrix_read using array I/O functions 2016-01-17 20:40:18 -05:00
Al
baba826d21 [io] Cutting down on system calls in trie_read 2016-01-17 20:39:19 -05:00
Al
cba2acc21f [io] Sparse matrix using array I/O methods 2016-01-17 20:38:16 -05:00
Al
46b35c5202 [utils] Adding functions to read numeric arrays from files 2016-01-17 20:36:57 -05:00
Al
d4143c1685 [parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction. 2016-01-15 20:07:21 -05:00
Al
622dc354e7 [optimization] Adding learning rate to lazy sparse update in stochastic gradient descent 2016-01-12 11:04:16 -05:00
Al
79f2b7c192 [build] Removing source from libpostal shared lib 2016-01-12 10:31:22 -05:00
Al
6a9c1e8c6d [build] Adding trie_utils.c to address parser train/test 2016-01-12 10:22:34 -05:00
Al
7cc201dec3 [optimization] Moving gamma_t calculation to the header in SGD 2016-01-11 16:40:50 -05:00
Al
25ae5bed33 [unicode] Adding SCRIPT_INHERITED as a common script so diacritics like COMBINING CEDILLA don't break the current script and produce false word breaks 2016-01-11 16:39:21 -05:00
Al
3260edcf18 [math] Adding sparse dot sparse given a dense output matrix (suitable for the minibatch use case), fixing sparse dot vector 2016-01-11 13:55:54 -05:00
Al
736bc7c70d [config] language_classifier data dir 2016-01-10 03:05:36 -05:00
Al
ebaedb6bcf [language_classifier] Language classifier training using L2-regularized logistic regression and stochastic gradient descent 2016-01-10 01:31:18 -05:00
Al
56710cce21 [language_classifier] Language classifier data set I/O 2016-01-10 01:22:29 -05:00
Al
0558475a50 [language_classifier] Language classifier structs, I/O and API 2016-01-10 01:20:17 -05:00
Al
b85e454a58 [fix] var 2016-01-09 03:43:53 -05:00
Al
b13462f8ef [language_classifier] Features for address language classification: quadgrams for most languages, unigrams for ideographic characters, script for single-script languages like Thai, Hebrew, etc. 2016-01-09 03:42:57 -05:00
Al
29930fa7b6 [fix] sort hash keys by value 2016-01-09 03:38:25 -05:00
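
Illustration for commits b13462f8ef and 0dfd8d6439: the feature scheme those messages describe (character quadgrams for most languages, unigrams for ideographic characters, and a script feature for non-Latin scripts) can be sketched roughly as follows. This is a Python approximation for readers of the log, not the C code in the repository. The function names, the `ngram=`/`uni=`/`script=` feature prefixes, the underscore padding, and the script lookup via Unicode character names are all assumptions made for the example.

```python
import unicodedata
from collections import Counter

# Hypothetical helper: guesses a script from the Unicode character name.
# The real project ships generated script tables (unicode_script_data.c);
# this name-prefix lookup is only a stand-in for the example.
def char_script(ch):
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    for script in ("LATIN", "THAI", "HEBREW", "CYRILLIC", "ARABIC", "GREEK"):
        if name.startswith(script):
            return script.capitalize()
    return None

# Hypothetical feature extractor following the commit messages:
#   - one "script=" feature per non-Latin script seen in a token
#   - character unigrams for tokens containing Han characters
#   - padded character n-grams (quadgrams by default) for everything else
def extract_features(text, n=4):
    features = Counter()
    for token in text.lower().split():
        scripts = {char_script(ch) for ch in token} - {None}
        for script in scripts:
            if script != "Latin":
                features["script=" + script] += 1
        if "Han" in scripts:
            for ch in token:
                if char_script(ch) == "Han":
                    features["uni=" + ch] += 1
        else:
            padded = "_" + token + "_"
            if len(padded) <= n:
                features["ngram=" + padded] += 1
            else:
                for i in range(len(padded) - n + 1):
                    features["ngram=" + padded[i:i + n]] += 1
    return features

if __name__ == "__main__":
    print(extract_features("rue de la paix"))
    print(extract_features("北京"))
```

The script feature is what 0dfd8d6439 calls a modified intercept: a Han token still contributes `script=Han` even when its individual characters were never seen in training.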