1360 Commits

Author SHA1 Message Date
Al
78450bfad9 [fix] Spaces in abbreviation 2016-01-23 21:36:20 -05:00
Al
308ceb5a5f [fix] convert UTF8 slices back to unicode before using with the Python trie 2016-01-23 20:20:23 -05:00
Al
5eb6bb309b [fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string 2016-01-23 20:09:45 -05:00
Al
d61207e95a [fix] var name 2016-01-23 18:01:02 -05:00
Al
e44cba1d06 [fix] geonames db not required in OSM training data 2016-01-23 17:59:55 -05:00
Al
4f03711e60 [osm] Adding abbreviated training examples to ways language training data 2016-01-23 14:10:47 -05:00
Al
c9fb4ee69d [osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used 2016-01-22 17:58:24 -05:00
Al
ea9bb3f2d5 [fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled 2016-01-22 15:48:21 -05:00
Al
f9f6558e06 [fix] simple whitespace field splits for the limited format training data (used for language classification) 2016-01-22 04:34:42 -05:00
Al
cd1db7b288 [fix] Making sure rare components are dropped first, adding state and country back in 2016-01-22 04:17:19 -05:00
Al
adc3a00264 [fix] var name 2016-01-22 04:10:16 -05:00
Al
261beffa36 [fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities 2016-01-22 04:00:45 -05:00
Al
a6cc3d0114 [fix] Adding state to the more frequently dropped components 2016-01-22 03:56:38 -05:00
Al
bca3dae004 [fix] state full name probabilities for limited vs. full formatted OSM training sets 2016-01-22 03:54:20 -05:00
Al
d1cf253092 [osm/formatting] Higher probability of dropout for rare components like counties, etc. 2016-01-22 03:39:35 -05:00
Al
9dd965a6fa [fix] removing gazetteer configuration from disambiguation module 2016-01-22 03:18:18 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
5a68e7aeef [fix] import 2016-01-22 03:00:43 -05:00
Al
6ac72576bc [osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK 2016-01-22 02:56:39 -05:00
Al
f4995d4f0f [languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM 2016-01-22 00:51:32 -05:00
Al
89aa039692 [dictionaries] Adding some Italian month abbreviations 2016-01-21 15:12:46 -05:00
Al
26cbb1eb8d [languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes 2016-01-21 04:29:14 -05:00
Al
0269d92e3d [languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms 2016-01-21 02:30:59 -05:00
Al
2e15db06dd [text] making normalize_string directly callable from Python geodata 2016-01-21 02:07:46 -05:00
Al
71e01e6133 [fix] prefix/suffix phrase search in Python trie search 2016-01-19 03:43:54 -05:00
Al
39667b73a2 [build] std=gnu99 in geodata build 2016-01-19 03:23:56 -05:00
Al
8b94a018e6 [languages] encoding in language disambiguation 2016-01-19 03:22:03 -05:00
Al
3262d2ccd3 [fix] arg count 2016-01-19 03:16:14 -05:00
Al
5d5d5713cc [transliteration] Regenerating transliterator scripts 2016-01-18 12:04:14 -05:00
Al
fe8f3158f6 [fix] missing file in geodata 2016-01-17 22:23:44 -05:00
Al
5fd9dc7e2b [scripts] relative dirs in setup.py for geodata 2016-01-17 22:22:50 -05:00
Al
da62ff309e [transliteration] Fixing Malayalam script 2016-01-17 22:15:56 -05:00
Al
5385cb71d6 [languages] Adding English dictionaries to Indonesia 2016-01-17 22:08:06 -05:00
Al
8030b235e6 [languages] Changing the definition in script languages so only languages that appear on street signs will be used 2016-01-17 22:03:41 -05:00
Al
0dfd8d6439 [language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters) 2016-01-17 21:37:45 -05:00
Al
b9a3230f65 [language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now 2016-01-17 21:13:14 -05:00
Al
f808f74271 [language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set 2016-01-17 21:11:37 -05:00
Al
af5689ee52 [fix] removing unused var 2016-01-17 21:00:17 -05:00
Al
7d727fc8f0 [optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0) 2016-01-17 20:59:47 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
70dbfdd560 [unicode] Regenerating unicode_script_data.c 2016-01-17 20:53:44 -05:00
Al
de240d2b94 [fix] tokenize_add_tokens respects specified length 2016-01-17 20:51:47 -05:00
Al
10cadc67d7 [io] matrix_read using array I/O functions 2016-01-17 20:40:18 -05:00
Al
baba826d21 [io] Cutting down on system calls in trie_read 2016-01-17 20:39:19 -05:00
Al
cba2acc21f [io] Sparse matrix using array I/O methods 2016-01-17 20:38:16 -05:00
Al
46b35c5202 [utils] Adding functions to read numeric arrays from files 2016-01-17 20:36:57 -05:00
Al
3d7dd8966e [languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer 2016-01-17 18:28:28 -05:00
Al
fa32eacdd1 [phrases] Adding Python phrase filter from address_normalizer until a Python wrapper around libpostal's trie_search is available 2016-01-17 15:45:02 -05:00
Al
f79a3c5bf4 [osm/polygons] Allowing polygons that GEOS claims are invalid in OSM polygon index (there were some glaring omissions from the index like the polygons for the UK or Berlin). For some reason .buffer(0) creates weird multipolygons that no longer contain their centroids, etc. and aren't useful in reverse geocoding 2016-01-17 15:43:21 -05:00
Al
04f251c1cc [polygons] Don't call fix_polygon (force polygon validity) by default 2016-01-16 21:21:27 -05:00