1332 Commits

Author SHA1 Message Date
Al
5d5d5713cc [transliteration] Regenerating transliterator scripts 2016-01-18 12:04:14 -05:00
Al
fe8f3158f6 [fix] missing file in geodata 2016-01-17 22:23:44 -05:00
Al
5fd9dc7e2b [scripts] relative dirs in setup.py for geodata 2016-01-17 22:22:50 -05:00
Al
da62ff309e [transliteration] Fixing Malayalam script 2016-01-17 22:15:56 -05:00
Al
5385cb71d6 [languages] Adding English dictionaries to Indonesia 2016-01-17 22:08:06 -05:00
Al
8030b235e6 [languages] Changing the definition in script languages so only languages that appear on street signs will be used 2016-01-17 22:03:41 -05:00
Al
0dfd8d6439 [language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters) 2016-01-17 21:37:45 -05:00
Al
b9a3230f65 [language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now 2016-01-17 21:13:14 -05:00
Al
f808f74271 [language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set 2016-01-17 21:11:37 -05:00
Al
af5689ee52 [fix] removing unused var 2016-01-17 21:00:17 -05:00
Al
7d727fc8f0 [optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0) 2016-01-17 20:59:47 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
70dbfdd560 [unicode] Regenerating unicode_script_data.c 2016-01-17 20:53:44 -05:00
Al
de240d2b94 [fix] tokenize_add_tokens respects specified length 2016-01-17 20:51:47 -05:00
Al
10cadc67d7 [io] matrix_read using array I/O functions 2016-01-17 20:40:18 -05:00
Al
baba826d21 [io] Cutting down on system calls in trie_read 2016-01-17 20:39:19 -05:00
Al
cba2acc21f [io] Sparse matrix using array I/O methods 2016-01-17 20:38:16 -05:00
Al
46b35c5202 [utils] Adding functions to read numeric arrays from files 2016-01-17 20:36:57 -05:00
Al
3d7dd8966e [languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer 2016-01-17 18:28:28 -05:00
Al
fa32eacdd1 [phrases] Adding Python phrase filter from address_normalizer until a Python wrapper around libpostal's trie_search is available 2016-01-17 15:45:02 -05:00
Al
f79a3c5bf4 [osm/polygons] Allowing polygons that GEOS claims are invalid in OSM polygon index (there were some glaring omissions from the index like the polygons for the UK or Berlin). For some reason .buffer(0) creates weird multipolygons that no longer contain their centroids, etc. and aren't useful in reverse geocoding 2016-01-17 15:43:21 -05:00
Al
04f251c1cc [polygons] Don't call fix_polygon (force polygon validity) by default 2016-01-16 21:21:27 -05:00
Al
19a5541a85 [polygons/osm] append polygon nodes by vertices that connect to each other 2016-01-16 21:20:49 -05:00
Al
d4143c1685 [parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction. 2016-01-15 20:07:21 -05:00
Al
24b4a680c3 [languages] Adding English dictionaries for Bangladesh 2016-01-14 13:36:07 -05:00
Al
edebdf73e0 [dictionaries] Using long forms as canonical for English degrees as new language models may do some auto-abbreviating 2016-01-14 13:35:41 -05:00
Al
58e53cab1c [scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo 2016-01-12 13:29:31 -05:00
Al
622dc354e7 [optimization] Adding learning rate to lazy sparse update in stochastic gradient descent 2016-01-12 11:04:16 -05:00
Al
79f2b7c192 [build] Removing source from libpostal shared lib 2016-01-12 10:31:22 -05:00
Al
6a9c1e8c6d [build] Adding trie_utils.c to address parser train/test 2016-01-12 10:22:34 -05:00
Al
7cc201dec3 [optimization] Moving gamma_t calculation to the header in SGD 2016-01-11 16:40:50 -05:00
Al
25ae5bed33 [unicode] Adding SCRIPT_INHERITED as a common script so diacritics like COMBINING CEDILLA don't break the current script and produce false word breaks 2016-01-11 16:39:21 -05:00
Al
3260edcf18 [math] Adding sparse dot sparse given a dense output matrix (suitable for the minibatch use case), fixing sparse dot vector 2016-01-11 13:55:54 -05:00
Al
736bc7c70d [config] language_classifier data dir 2016-01-10 03:05:36 -05:00
Al
ebaedb6bcf [language_classifier] Language classifier training using L2-regularized logistic regression and stochastic gradient descent 2016-01-10 01:31:18 -05:00
Al
56710cce21 [language_classifier] Language classifier data set I/O 2016-01-10 01:22:29 -05:00
Al
0558475a50 [language_classifier] Language classifier structs, I/O and API 2016-01-10 01:20:17 -05:00
Al
b85e454a58 [fix] var 2016-01-09 03:43:53 -05:00
Al
b13462f8ef [language_classifier] Features for address languages classification, quadgrams for most languages, unigrams for ideographic characters, script for single-script languages like Thai, Hebrew, etc. 2016-01-09 03:42:57 -05:00
Al
29930fa7b6 [fix] sort hash keys by value 2016-01-09 03:38:25 -05:00
Al
62017fd33d [optimization] Using sparse updates in stochastic gradient descent. Decomposing the updates into the gradient of the loss function (zero for features not observed in the current batch) and the gradient of the regularization term. The derivative of the regularization term in L2-regularized models is equivalent to an exponential decay function. Before computing the gradient for the current batch, we bring the weights up to date only for the features observed in that batch, and update only those values 2016-01-09 03:37:31 -05:00
Al
aa22db11b2 [math] Matrix arithmetic 2016-01-09 01:45:10 -05:00
Al
197b18f3cf [fix] NULL check 2016-01-09 01:43:25 -05:00
Al
9c4b5ccbb1 [math] Adding array_{op}_times_scalar methods 2016-01-09 01:42:54 -05:00
Al
2f1e2139ca [math] Unique columns as array for CSR sparse matrix 2016-01-09 01:40:26 -05:00
Al
023c04d78f [classification] Pre-allocating memory in logistic regression trainer, storing last updated timestamps for sparse stochastic gradient descent and using the new gradient API 2016-01-09 01:39:24 -05:00
Al
562cc06eaf [classification] Sparse version of logistic regression gradient which, given an array of the features/columns used in the input batch, only updates the gradient for that batch, even for the operations which otherwise would apply to the entire matrix (scaling by -1/m, regularization) 2016-01-09 01:33:33 -05:00
Al
5ca4bba1d5 [fix] Writing matrix dimension as 64-bit 2016-01-08 01:29:52 -05:00
Al
8f054eeeb1 [classification] Training structures for logistic regression and stochastic (minibatch) gradient descent update 2016-01-08 01:07:20 -05:00
Al
4acf10c3a4 [classification] Multinomial logistic regression, gradient and cost function 2016-01-08 01:03:09 -05:00
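
Note on 62017fd33d (sparse updates in SGD): the message describes splitting each update into the loss gradient, which is zero for features absent from the batch, and the L2 regularization term, which acts as an exponential decay on each weight. The following is a minimal Python sketch of that idea, not the code from the commit. It assumes a constant learning rate `lr`, a squared-error loss, and hypothetical names (`loss_gradient`, `dense_sgd`, `lazy_sgd`, `last`) chosen only for illustration. With a learning rate that changes over time, the catch-up factor would be a product of per-step factors rather than a single power.

```python
def loss_gradient(w, batch):
    """Squared-error loss gradient; only features present in the batch get an entry."""
    grad = {}
    for x, y in batch:  # x maps feature index -> value, y is the target
        err = sum(w[j] * v for j, v in x.items()) - y
        for j, v in x.items():
            grad[j] = grad.get(j, 0.0) + err * v / len(batch)
    return grad


def dense_sgd(w, batches, lr, lam):
    """Reference version: the L2 decay touches every weight on every step."""
    w = list(w)
    for batch in batches:
        grad = loss_gradient(w, batch)
        w = [(1.0 - lr * lam) * wj for wj in w]
        for j, g in grad.items():
            w[j] -= lr * g
    return w


def lazy_sgd(w, batches, lr, lam):
    """Sparse version: decay is applied lazily, only to features seen in a batch."""
    w = list(w)
    last = [0] * len(w)       # number of decay steps already applied per feature
    decay = 1.0 - lr * lam    # per-step L2 shrink factor (constant lr assumed)
    for t, batch in enumerate(batches):
        seen = {j for x, _ in batch for j in x}
        # Bring the observed weights up to date before computing the gradient.
        for j in seen:
            w[j] *= decay ** (t - last[j])
            last[j] = t
        grad = loss_gradient(w, batch)
        # Apply this step's decay and loss gradient to the observed weights only.
        for j in seen:
            w[j] = decay * w[j] - lr * grad[j]
            last[j] = t + 1
    # Final catch-up so every weight reflects all steps.
    total = len(batches)
    return [wj * decay ** (total - last[j]) for j, wj in enumerate(w)]
```

Under these assumptions both functions should produce the same weights up to floating-point rounding, while `lazy_sgd` only touches the features that appear in each batch. Storing a per-feature counter like `last` corresponds to the "last updated timestamps" mentioned in 023c04d78f.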