0bad3adf07[docs] Removing the coming soon label from language classification, cleaning up the README a bit
Al
2016-01-27 14:44:48 -05:00
95a7978131[build] Adding relevant language_classifier sources to build
Al
2016-01-27 03:34:35 -05:00
93ed2bf15b[api] Making language optional in libpostal cli
Al
2016-01-27 03:32:29 -05:00
789db8f582[build] Adding language classifier to data file download script. As the current file is rather large, added multipart downloads from S3 to speed things up
Al
2016-01-27 03:31:45 -05:00
42d169feee[api] Libpostal expand API will now detect language automatically using a high accuracy language classifier trained on OSM streets/addresses/toponyms. Hooray batch geocoding!
Al
2016-01-27 03:20:55 -05:00
71c51f2e45[language_classification] Making directory optional on language_classifier client/test program
Al
2016-01-27 03:18:53 -05:00
c770468d03[expansion] Regenerated address_expansion_data.c
Al
2016-01-27 03:17:59 -05:00
36f52d9707[fix] Removing feature printing
Al
2016-01-26 15:34:56 -05:00
239f8adec6[docs] README updates now that the Python repo is separate
Al
2016-01-26 02:40:07 -05:00
5077462754[fix] temporary files for language classifier training
Al
2016-01-26 01:42:21 -05:00
426edccbf8[language_classification] Simple accuracy-based test program for language classifier.
Al
2016-01-26 01:27:55 -05:00
9abbf42bf4[language_classifier] Command-line client for language classification
Al
2016-01-26 01:20:59 -05:00
314b65e192[build] Adding shuffle.c to language_classifier_train
Al
2016-01-26 01:18:35 -05:00
ababb8f2d0[fix] sign comparison in regularized gradient computation for logistic regression
Al
2016-01-26 01:16:11 -05:00
ae2b839f17[build] Adding language classifier train/test/cli programs to the build
Al
2016-01-26 00:09:02 -05:00
299998d8b5[languages] Making Basque the only default in the Basque region.
Al
2016-01-24 19:35:03 -05:00
b4dcb83e10[fix] sets of potential languages in case phrase matches multiple dictionaries
Al
2016-01-24 17:57:12 -05:00
b713d102d1[languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language
Al
2016-01-24 17:43:14 -05:00
b3e730d83f[languages] If there's a single default language, assume ambiguous abbreviations are the default
Al
2016-01-24 17:15:02 -05:00
fffaeecfc6[languages] Only count regional defaults when returning languages
Al
2016-01-24 16:35:14 -05:00
b735c79326[languages] Adding Spanish in as a secondary default in Spain to supplement regional language defaults so we're more careful in disambiguation
Al
2016-01-24 16:34:23 -05:00
f8a0463aa0[languages] Language disambiguation treats the national languages as non-default
Al
2016-01-24 15:09:51 -05:00
87aff60a7e[dictionaries] Gulch
Al
2016-01-24 03:23:40 -05:00
f04360732c[languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example)
Al
2016-01-24 03:17:18 -05:00
cb914ae85b[dictionaries] Adding a few terms to English dictionaries for automated disambiguation in the US/Canada
Al
2016-01-24 03:15:10 -05:00
00ce71223f[osm] Using the default probabilities for abbreviations in ways training data
Al
2016-01-24 00:53:41 -05:00
bab7a0f961[osm] splitting streets (way names) on semicolons
Al
2016-01-24 00:42:25 -05:00
3485738c2b[fix] regional languages in French Canada
Al
2016-01-24 00:20:34 -05:00
7646adfc0f[osm] Adding abbreviated street names in addition to the originals
Al
2016-01-23 23:23:58 -05:00
67130383ce[fix] converting semicolons to commas in OSM house numbers and picking one at random
Al
2016-01-23 23:16:19 -05:00
1bb797f783[fix] spacing in phrases
Al
2016-01-23 21:59:49 -05:00
3a8c3dfcf6[fix] spacing in phrases at end of string
Al
2016-01-23 21:51:40 -05:00
78450bfad9[fix] Spaces in abbreviation
Al
2016-01-23 21:36:20 -05:00
308ceb5a5f[fix] convert UTF8 slices back to unicode before using with the Python trie
Al
2016-01-23 20:20:23 -05:00
5eb6bb309b[fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string
Al
2016-01-23 20:09:45 -05:00
d61207e95a[fix] var name
Al
2016-01-23 18:01:02 -05:00
e44cba1d06[fix] geonames db not required in OSM training data
Al
2016-01-23 17:59:55 -05:00
4f03711e60[osm] Adding abbreviated training examples to ways language training data
Al
2016-01-23 14:10:47 -05:00
c9fb4ee69d[osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used
Al
2016-01-22 17:58:18 -05:00
ea9bb3f2d5[fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled
Al
2016-01-22 15:48:21 -05:00
f9f6558e06[fix] simple whitespace field splits for the limited format training data (used for language classification)
Al
2016-01-22 04:34:36 -05:00
cd1db7b288[fix] Making sure rare components are dropped first, adding state and country back in
Al
2016-01-22 04:17:19 -05:00
adc3a00264[fix] var name
Al
2016-01-22 04:10:16 -05:00
261beffa36[fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities
Al
2016-01-22 04:00:45 -05:00
a6cc3d0114[fix] Adding state to the more frequently dropped components
Al
2016-01-22 03:56:38 -05:00
bca3dae004[fix] state full name probabilities for limited vs. full formatted OSM training sets
Al
2016-01-22 03:54:20 -05:00
d1cf253092[osm/formatting] Higher probability of dropout for rare components like counties, etc.
Al
2016-01-22 03:39:35 -05:00
9dd965a6fa[fix] removing gazetteer configuration from disambiguation module
Al
2016-01-22 03:18:18 -05:00
b22646ee30[mv] Moving gazetteers into their own module
Al
2016-01-22 03:15:56 -05:00
5a68e7aeef[fix] import
Al
2016-01-22 03:00:43 -05:00
6ac72576bc[osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK
Al
2016-01-22 02:56:31 -05:00
f4995d4f0f[languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM
Al
2016-01-22 00:51:32 -05:00
89aa039692[dictionaries] Adding some Italian month abbreviations
Al
2016-01-21 15:12:46 -05:00
26cbb1eb8d[languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes
Al
2016-01-21 04:29:14 -05:00
0269d92e3d[languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms
Al
2016-01-21 02:30:02 -05:00
2e15db06dd[text] making normalize_string directly callable from Python geodata
Al
2016-01-21 02:07:46 -05:00
71e01e6133[fix] prefix/suffix phrase search in Python trie search
Al
2016-01-19 03:43:51 -05:00
39667b73a2[build] std=gnu99 in geodata build
Al
2016-01-19 03:23:56 -05:00
8b94a018e6[languages] encoding in language disambiguation
Al
2016-01-19 03:22:03 -05:00
3262d2ccd3[fix] arg count
Al
2016-01-19 03:16:14 -05:00
5d5d5713cc[transliteration] Regenerating transliterator scripts
Al
2016-01-18 12:04:14 -05:00
fe8f3158f6[fix] missing file in geodata
Al
2016-01-17 22:23:44 -05:00
5fd9dc7e2b[scripts] relative dirs in setup.py for geodata
Al
2016-01-17 22:22:50 -05:00
da62ff309e[transliteration] Fixing Malayalam script
Al
2016-01-17 22:15:56 -05:00
5385cb71d6[languages] Adding English dictionaries to Indonesia
Al
2016-01-17 22:08:06 -05:00
8030b235e6[languages] Changing the definition in script languages so only languages that appear on street signs will be used
Al
2016-01-17 22:03:41 -05:00
0dfd8d6439[language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters)
Al
2016-01-17 21:37:45 -05:00
b9a3230f65[language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now
Al
2016-01-17 21:13:14 -05:00
f808f74271[language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set
Al
2016-01-17 21:11:37 -05:00
af5689ee52[fix] removing unused var
Al
2016-01-17 21:00:12 -05:00
7d727fc8f0[optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0)
Al
2016-01-17 20:59:47 -05:00
7b300639f1[fix] Trie prefix search tail comparison
Al
2016-01-17 20:56:37 -05:00
70dbfdd560[unicode] Regenerating unicode_script_data.c
Al
2016-01-17 20:53:28 -05:00
de240d2b94[fix] tokenize_add_tokens respects specified length
Al
2016-01-17 20:51:43 -05:00
10cadc67d7[io] matrix_read using array I/O functions
Al
2016-01-17 20:40:18 -05:00
baba826d21[io] Cutting down on system calls in trie_read
Al
2016-01-17 20:39:19 -05:00
cba2acc21f[io] Sparse matrix using array I/O methods
Al
2016-01-17 20:38:16 -05:00
46b35c5202[utils] Adding functions to read numeric arrays from files
Al
2016-01-17 20:36:57 -05:00
3d7dd8966e[languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer
Al
2016-01-17 18:28:19 -05:00
fa32eacdd1[phrases] Adding Python phrase filter from address_normalizer until a Python wrapper around libpostal's trie_search is available
Al
2016-01-17 15:45:02 -05:00
f79a3c5bf4[osm/polygons] Allowing polygons that GEOS claims are invalid in OSM polygon index (there were some glaring omissions from the index like the polygons for the UK or Berlin). For some reason .buffer(0) creates weird multipolygons that no longer contain their centroids, etc. and aren't useful in reverse geocoding
Al
2016-01-17 15:43:21 -05:00
04f251c1cc[polygons] Don't call fix_polygon (force polygon validity) by default
Al
2016-01-16 21:21:27 -05:00
19a5541a85[polygons/osm] append polygon nodes by vertices that connect to each other
Al
2016-01-16 21:20:49 -05:00
d4143c1685[parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction.
Al
2016-01-15 20:07:21 -05:00
24b4a680c3[languages] Adding English dictionaries for Bangladesh
Al
2016-01-14 13:36:07 -05:00
edebdf73e0[dictionaries] Using long forms as canonical for English degrees as new language models may do some auto-abbreviating
Al
2016-01-14 13:35:41 -05:00
58e53cab1c[scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo
Al
2016-01-12 13:26:55 -05:00
622dc354e7[optimization] Adding learning rate to lazy sparse update in stochastic gradient descent
Al
2016-01-12 11:02:12 -05:00
79f2b7c192[build] Removing source from libpostal shared lib
Al
2016-01-12 10:31:19 -05:00
6a9c1e8c6d[build] Adding trie_utils.c to address parser train/test
Al
2016-01-12 10:22:30 -05:00
7cc201dec3[optimization] Moving gamma_t calculation to the header in SGD
Al
2016-01-11 16:40:50 -05:00
25ae5bed33[unicode] Adding SCRIPT_INHERITED as a common script so diacritics like COMBINING CEDILLA don't break the current script and produce false word breaks
Al
2016-01-11 16:39:15 -05:00
3260edcf18[math] Adding sparse dot sparse given a dense output matrix (suitable for the minibatch use case), fixing sparse dot vector
Al
2016-01-11 13:55:54 -05:00
736bc7c70d[config] language_classifier data dir
Al
2016-01-10 03:05:36 -05:00
ebaedb6bcf[language_classifier] Language classifier training using L2-regularized logistic regression and stochastic gradient descent
Al
2016-01-10 01:31:18 -05:00
56710cce21[language_classifier] Language classifier data set I/O
Al
2016-01-10 01:22:29 -05:00