Al
|
9ac0379a65
|
[phrases] Case where trie search finds a match, makes progress beyond the next token but has to fall back. Adding trie search test case
|
2016-02-08 01:07:56 -05:00 |
|
Al
|
3701d8380f
|
[cli] Command-line expansion client now supports piping in stdin, Unix-style
|
2016-02-03 13:48:51 -05:00 |
|
Al Barrentine
|
7536fa4647
|
[fix] static inline
|
2016-02-02 00:53:13 -05:00 |
|
Al
|
c0b548833b
|
[fix] create data dir if it doesn't exist
|
2016-01-30 13:40:10 -05:00 |
|
Al
|
1e65fafaaf
|
[fix] char *
|
2016-01-30 13:39:36 -05:00 |
|
Al
|
f8de9d8e5a
|
[fix] static methods in numex table loading, mallocs instead of stack variables
|
2016-01-30 13:25:48 -05:00 |
|
Al
|
085bfd6ada
|
[fix] static methods for libpostal.c
|
2016-01-30 02:20:59 -05:00 |
|
Al
|
63d239eef0
|
[tokenization] Using the new re2c 0.16 generates a 75% smaller DFA for scanner, should speed up compile times on gcc
|
2016-01-30 02:20:01 -05:00 |
|
Al
|
9b3296914a
|
[build] Defining LIBPOSTAL_DATA_DIR at compile time, not configure
|
2016-01-30 02:18:12 -05:00 |
|
Al
|
cd76c660d8
|
[fix] French numex
|
2016-01-28 16:40:50 -05:00 |
|
Al
|
95a7978131
|
[build] Adding relevant language_classifier sources to build
|
2016-01-27 03:34:35 -05:00 |
|
Al
|
93ed2bf15b
|
[api] Making language optional in libpostal cli
|
2016-01-27 03:32:29 -05:00 |
|
Al
|
789db8f582
|
[build] Adding language classifier to data file download script. As the current file is rather large, added multipart downloads from S3 to speed things up
|
2016-01-27 03:31:45 -05:00 |
|
Al
|
42d169feee
|
[api] Libpostal expand API will now detect language automatically using a high accuracy language classifier trained on OSM streets/addresses/toponyms. Hooray batch geocoding!
|
2016-01-27 03:23:51 -05:00 |
|
Al
|
71c51f2e45
|
[language_classification] Making directory optional on language_classifier client/test program
|
2016-01-27 03:18:53 -05:00 |
|
Al
|
c770468d03
|
[expansion] Regenerated address_expansion_data.c
|
2016-01-27 03:17:59 -05:00 |
|
Al
|
36f52d9707
|
[fix] Removing feature printing
|
2016-01-26 15:34:56 -05:00 |
|
Al
|
5077462754
|
[fix] temporary files for language classifier training
|
2016-01-26 01:42:21 -05:00 |
|
Al
|
426edccbf8
|
[language_classification] Simple accuracy-based test program for language classifier.
|
2016-01-26 01:29:56 -05:00 |
|
Al
|
9abbf42bf4
|
[language_classifier] Command-line client for language classification
|
2016-01-26 01:20:59 -05:00 |
|
Al
|
314b65e192
|
[build] Adding shuffle.c to language_classifier_train
|
2016-01-26 01:18:35 -05:00 |
|
Al
|
ababb8f2d0
|
[fix] sign comparison in regularized gradient computation for logistic regression
|
2016-01-26 01:16:16 -05:00 |
|
Al
|
ae2b839f17
|
[build] Adding language classifier train/test/cli programs to the build
|
2016-01-26 00:09:07 -05:00 |
|
Al
|
5d5d5713cc
|
[transliteration] Regenerating transliterator scripts
|
2016-01-18 12:04:14 -05:00 |
|
Al
|
0dfd8d6439
|
[language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters)
|
2016-01-17 21:37:45 -05:00 |
|
Al
|
b9a3230f65
|
[language_classification] Removing the per-country classifier; the text-based classifier alone is reaching close to 99% accuracy now
|
2016-01-17 21:13:14 -05:00 |
|
Al
|
f808f74271
|
[language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set
|
2016-01-17 21:11:37 -05:00 |
|
Al
|
af5689ee52
|
[fix] removing unused var
|
2016-01-17 21:00:17 -05:00 |
|
Al
|
7d727fc8f0
|
[optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0)
|
2016-01-17 20:59:47 -05:00 |
|
Al
|
7b300639f1
|
[fix] Trie prefix search tail comparison
|
2016-01-17 20:56:37 -05:00 |
|
Al
|
70dbfdd560
|
[unicode] Regenerating unicode_script_data.c
|
2016-01-17 20:53:44 -05:00 |
|
Al
|
de240d2b94
|
[fix] tokenize_add_tokens respects specified length
|
2016-01-17 20:51:47 -05:00 |
|
Al
|
10cadc67d7
|
[io] matrix_read using array I/O functions
|
2016-01-17 20:40:18 -05:00 |
|
Al
|
baba826d21
|
[io] Cutting down on system calls in trie_read
|
2016-01-17 20:39:19 -05:00 |
|
Al
|
cba2acc21f
|
[io] Sparse matrix using array I/O methods
|
2016-01-17 20:38:16 -05:00 |
|
Al
|
46b35c5202
|
[utils] Adding functions to read numeric arrays from files
|
2016-01-17 20:36:57 -05:00 |
|
Al
|
d4143c1685
|
[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.
|
2016-01-15 20:07:21 -05:00 |
|
Al
|
622dc354e7
|
[optimization] Adding learning rate to lazy sparse update in stochastic gradient descent
|
2016-01-12 11:04:16 -05:00 |
|
Al
|
79f2b7c192
|
[build] Removing source from libpostal shared lib
|
2016-01-12 10:31:22 -05:00 |
|
Al
|
6a9c1e8c6d
|
[build] Adding trie_utils.c to address parser train/test
|
2016-01-12 10:22:34 -05:00 |
|
Al
|
7cc201dec3
|
[optimization] Moving gamma_t calculation to the header in SGD
|
2016-01-11 16:40:50 -05:00 |
|
Al
|
25ae5bed33
|
[unicode] Adding SCRIPT_INHERITED as a common script so diacritics like COMBINING CEDILLA don't break the current script and produce false word breaks
|
2016-01-11 16:39:21 -05:00 |
|
Al
|
3260edcf18
|
[math] Adding sparse dot sparse given a dense output matrix (suitable for the minibatch use case), fixing sparse dot vector
|
2016-01-11 13:55:54 -05:00 |
|
Al
|
736bc7c70d
|
[config] language_classifier data dir
|
2016-01-10 03:05:36 -05:00 |
|
Al
|
ebaedb6bcf
|
[language_classifier] Language classifier training using L2-regularized logistic regression and stochastic gradient descent
|
2016-01-10 01:31:18 -05:00 |
|
Al
|
56710cce21
|
[language_classifier] Language classifier data set I/O
|
2016-01-10 01:22:29 -05:00 |
|
Al
|
0558475a50
|
[language_classifier] Language classifier structs, I/O and API
|
2016-01-10 01:20:17 -05:00 |
|
Al
|
b85e454a58
|
[fix] var
|
2016-01-09 03:43:53 -05:00 |
|
Al
|
b13462f8ef
|
[language_classifier] Features for address language classification: quadgrams for most languages, unigrams for ideographic characters, script for single-script languages like Thai, Hebrew, etc.
|
2016-01-09 03:42:57 -05:00 |
|
Al
|
29930fa7b6
|
[fix] sort hash keys by value
|
2016-01-09 03:38:25 -05:00 |
|