Al
|
63d239eef0
|
[tokenization] Using the new re2c 0.16 generates a 75% smaller DFA for the scanner, which should speed up compile times on gcc
|
2016-01-30 02:20:01 -05:00 |
|
Al
|
9b3296914a
|
[build] Defining LIBPOSTAL_DATA_DIR at compile time, not configure
|
2016-01-30 02:18:12 -05:00 |
|
Al
|
cd76c660d8
|
[fix] French numex
|
2016-01-28 16:40:50 -05:00 |
|
Al
|
95a7978131
|
[build] Adding relevant language_classifier sources to build
|
2016-01-27 03:34:35 -05:00 |
|
Al
|
93ed2bf15b
|
[api] Making language optional in libpostal cli
|
2016-01-27 03:32:29 -05:00 |
|
Al
|
789db8f582
|
[build] Adding language classifier to data file download script. As the current file is rather large, added multipart downloads from S3 to speed things up
|
2016-01-27 03:31:45 -05:00 |
|
Al
|
42d169feee
|
[api] Libpostal expand API will now detect language automatically using a high accuracy language classifier trained on OSM streets/addresses/toponyms. Hooray batch geocoding!
|
2016-01-27 03:23:51 -05:00 |
|
Al
|
71c51f2e45
|
[language_classification] Making directory optional on language_classifier client/test program
|
2016-01-27 03:18:53 -05:00 |
|
Al
|
c770468d03
|
[expansion] Regenerated address_expansion_data.c
|
2016-01-27 03:17:59 -05:00 |
|
Al
|
36f52d9707
|
[fix] Removing feature printing
|
2016-01-26 15:34:56 -05:00 |
|
Al
|
5077462754
|
[fix] temporary files for language classifier training
|
2016-01-26 01:42:21 -05:00 |
|
Al
|
426edccbf8
|
[language_classification] Simple accuracy-based test program for language classifier.
|
2016-01-26 01:29:56 -05:00 |
|
Al
|
9abbf42bf4
|
[language_classifier] Command-line client for language classification
|
2016-01-26 01:20:59 -05:00 |
|
Al
|
314b65e192
|
[build] Adding shuffle.c to language_classifier_train
|
2016-01-26 01:18:35 -05:00 |
|
Al
|
ababb8f2d0
|
[fix] sign comparison in regularized gradient computation for logistic regression
|
2016-01-26 01:16:16 -05:00 |
|
Al
|
ae2b839f17
|
[build] Adding language classifier train/test/cli programs to the build
|
2016-01-26 00:09:07 -05:00 |
|
Al
|
5d5d5713cc
|
[transliteration] Regenerating transliterator scripts
|
2016-01-18 12:04:14 -05:00 |
|
Al
|
0dfd8d6439
|
[language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters)
|
2016-01-17 21:37:45 -05:00 |
|
Al
|
b9a3230f65
|
[language_classification] Removing the per-country classifier; the text-based classifier alone is reaching close to 99% accuracy now
|
2016-01-17 21:13:14 -05:00 |
|
Al
|
f808f74271
|
[language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set
|
2016-01-17 21:11:37 -05:00 |
|
Al
|
af5689ee52
|
[fix] removing unused var
|
2016-01-17 21:00:17 -05:00 |
|
Al
|
7d727fc8f0
|
[optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0)
|
2016-01-17 20:59:47 -05:00 |
|
Al
|
7b300639f1
|
[fix] Trie prefix search tail comparison
|
2016-01-17 20:56:37 -05:00 |
|
Al
|
70dbfdd560
|
[unicode] Regenerating unicode_script_data.c
|
2016-01-17 20:53:44 -05:00 |
|
Al
|
de240d2b94
|
[fix] tokenize_add_tokens respects specified length
|
2016-01-17 20:51:47 -05:00 |
|
Al
|
10cadc67d7
|
[io] matrix_read using array I/O functions
|
2016-01-17 20:40:18 -05:00 |
|
Al
|
baba826d21
|
[io] Cutting down on system calls in trie_read
|
2016-01-17 20:39:19 -05:00 |
|
Al
|
cba2acc21f
|
[io] Sparse matrix using array I/O methods
|
2016-01-17 20:38:16 -05:00 |
|
Al
|
46b35c5202
|
[utils] Adding functions to read numeric arrays from files
|
2016-01-17 20:36:57 -05:00 |
|
Al
|
d4143c1685
|
[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.
|
2016-01-15 20:07:21 -05:00 |
|
Al
|
622dc354e7
|
[optimization] Adding learning rate to lazy sparse update in stochastic gradient descent
|
2016-01-12 11:04:16 -05:00 |
|
Al
|
79f2b7c192
|
[build] Removing source from libpostal shared lib
|
2016-01-12 10:31:22 -05:00 |
|
Al
|
6a9c1e8c6d
|
[build] Adding trie_utils.c to address parser train/test
|
2016-01-12 10:22:34 -05:00 |
|
Al
|
7cc201dec3
|
[optimization] Moving gamma_t calculation to the header in SGD
|
2016-01-11 16:40:50 -05:00 |
|
Al
|
25ae5bed33
|
[unicode] Adding SCRIPT_INHERITED as a common script so diacritics like COMBINING CEDILLA don't break the current script and produce false word breaks
|
2016-01-11 16:39:21 -05:00 |
|
Al
|
3260edcf18
|
[math] Adding sparse dot sparse given a dense output matrix (suitable for the minibatch use case), fixing sparse dot vector
|
2016-01-11 13:55:54 -05:00 |
|
Al
|
736bc7c70d
|
[config] language_classifier data dir
|
2016-01-10 03:05:36 -05:00 |
|
Al
|
ebaedb6bcf
|
[language_classifier] Language classifier training using L2-regularized logistic regression and stochastic gradient descent
|
2016-01-10 01:31:18 -05:00 |
|
Al
|
56710cce21
|
[language_classifier] Language classifier data set I/O
|
2016-01-10 01:22:29 -05:00 |
|
Al
|
0558475a50
|
[language_classifier] Language classifier structs, I/O and API
|
2016-01-10 01:20:17 -05:00 |
|
Al
|
b85e454a58
|
[fix] var
|
2016-01-09 03:43:53 -05:00 |
|
Al
|
b13462f8ef
|
[language_classifier] Features for address language classification: quadgrams for most languages, unigrams for ideographic characters, script for single-script languages like Thai, Hebrew, etc.
|
2016-01-09 03:42:57 -05:00 |
|
Al
|
29930fa7b6
|
[fix] sort hash keys by value
|
2016-01-09 03:38:25 -05:00 |
|
Al
|
62017fd33d
|
[optimization] Using sparse updates in stochastic gradient descent. Decomposing the updates into the gradient of the loss function (zero for features not observed in the current batch) and the gradient of the regularization term. The derivative of the regularization term in L2-regularized models is equivalent to an exponential decay function. Before computing the gradient for the current batch, we bring the weights up to date only for the features observed in that batch, and update only those values
|
2016-01-09 03:37:31 -05:00 |
|
Al
|
aa22db11b2
|
[math] Matrix arithmetic
|
2016-01-09 01:45:10 -05:00 |
|
Al
|
197b18f3cf
|
[fix] NULL check
|
2016-01-09 01:43:25 -05:00 |
|
Al
|
9c4b5ccbb1
|
[math] Adding array_{op}_times_scalar methods
|
2016-01-09 01:42:54 -05:00 |
|
Al
|
2f1e2139ca
|
[math] Unique columns as array for CSR sparse matrix
|
2016-01-09 01:40:26 -05:00 |
|
Al
|
023c04d78f
|
[classification] Pre-allocating memory in logistic regression trainer, storing last updated timestamps for sparse stochastic gradient descent and using the new gradient API
|
2016-01-09 01:39:24 -05:00 |
|
Al
|
562cc06eaf
|
[classification] Sparse version of logistic regression gradient which, given an array of the features/columns used in the input batch, only updates the gradient for that batch, even for the operations which otherwise would apply to the entire matrix (scaling by -1/m, regularization)
|
2016-01-09 01:33:33 -05:00 |
|
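The sparse-update commit above (62017fd33d) describes splitting each SGD step into a loss gradient, which is zero for features absent from the current batch, and an L2 regularization term that only shrinks weights by a constant factor per step. The following is a minimal Python sketch of that idea, not the libpostal implementation (which is written in C). It assumes a constant learning rate `eta`, single-example batches, and features given as `{index: value}` dicts; the function names `dense_sgd` and `lazy_sgd` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_sgd(examples, n_features, eta, lam):
    """Reference version: decays every weight on every step."""
    w = [0.0] * n_features
    decay = 1.0 - eta * lam
    for x, y in examples:
        # Prediction uses the weights as they stand before this step.
        p = sigmoid(sum(w[j] * v for j, v in x.items()))
        # Regularization gradient: shrink all weights.
        w = [wj * decay for wj in w]
        # Loss gradient: nonzero only for observed features.
        for j, v in x.items():
            w[j] -= eta * (p - y) * v
    return w


def lazy_sgd(examples, n_features, eta, lam):
    """Sparse version: touches only the features seen in each example."""
    w = [0.0] * n_features
    last = [0] * n_features  # step at which each weight was last brought up to date
    decay = 1.0 - eta * lam
    for t, (x, y) in enumerate(examples):
        # Catch up on the decay skipped since the last update of each observed feature.
        for j in x:
            w[j] *= decay ** (t - last[j])
        p = sigmoid(sum(w[j] * v for j, v in x.items()))
        for j, v in x.items():
            w[j] = w[j] * decay - eta * (p - y) * v
            last[j] = t + 1
    # Apply any outstanding decay once training ends.
    total = len(examples)
    for j in range(n_features):
        w[j] *= decay ** (total - last[j])
    return w
```

Both functions should produce the same weights up to floating-point rounding. The lazy version does work proportional to the number of nonzero features per example rather than to the full feature count, which matters when the feature space is large and sparse, as with character n-grams.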