566 Commits

Author SHA1 Message Date
Al 99f452c7b1 [geo] Validate lat/lon in latlon_to_decimal 2016-03-11 16:18:31 -05:00
Al a2f186a0ee [geo] Adding lat/lon validation functions for the training scripts 2016-03-11 14:09:10 -05:00
Al f7d6943994 [fix] no comma in download_quattroshapes filenames 2016-03-10 23:40:54 -05:00
Al a71fa7bd8d [osm] tourism= keys should only be included in some cases. Listing everything on taginfo with >= 100 uses 2016-03-10 14:17:38 -05:00
Al d43fe201ff [osm] No longer requiring street name in OSM planet addresses. Adding leisure and tourism keys to capture things like parks, squares, etc. Adding place=locality for neighborhoods. 2016-03-09 18:19:33 -05:00
Al 1003832b9c [fix] README should not be included in building address dictionaries 2016-03-09 11:18:19 -05:00
Al 08085ee08b [languages][ci skip] Checking in script to extract address phrases in various languages using frequent itemsets 2016-03-08 14:35:20 -05:00
Al a483fd5d42 [fix][ci skip] pip installing some light requirements when the dictionaries/numex files change. Only building transliteration if the data file changed (the CLDR files are not in-repo so will be built offline) 2016-03-04 16:17:05 -05:00
Al 52ebc9fc46 [fix] Paths relative to the current file in address_dictionaries.py so it can be run from anywhere 2016-02-24 13:10:44 -05:00
Al 393fd7e0f3 [build] Using env var for data dir in geodata build script 2016-02-08 01:11:42 -05:00
Al b4dcb83e10 [fix] sets of potential languages in case phrase matches multiple dictionaries 2016-01-24 17:57:12 -05:00
Al b713d102d1 [languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language 2016-01-24 17:43:14 -05:00
Al b3e730d83f [languages] If there's a single default language, assume ambiguous abbreviations are the default 2016-01-24 17:15:02 -05:00
Al fffaeecfc6 [languages] Only count regional defaults when returning languages 2016-01-24 16:35:14 -05:00
Al f8a0463aa0 [languages] Language disambiguation treats the national languages as non-default 2016-01-24 15:10:04 -05:00
Al f04360732c [languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example) 2016-01-24 03:17:21 -05:00
Al 00ce71223f [osm] Using the default probabilities for abbreviations in ways training data 2016-01-24 00:53:41 -05:00
Al bab7a0f961 [osm] splitting streets (way names) on semicolons 2016-01-24 00:42:25 -05:00
Al 3485738c2b [fix] regional languages in French Canada 2016-01-24 00:20:34 -05:00
Al 7646adfc0f [osm] Adding abbreviated street names in addition to the originals 2016-01-23 23:23:58 -05:00
Al 67130383ce [fix] converting semicolons to commas in OSM house numbers and picking one at random 2016-01-23 23:16:19 -05:00
Al 1bb797f783 [fix] spacing in phrases 2016-01-23 21:59:49 -05:00
Al 3a8c3dfcf6 [fix] spacing in phrases at end of string 2016-01-23 21:51:40 -05:00
Al 78450bfad9 [fix] Spaces in abbreviation 2016-01-23 21:36:20 -05:00
Al 308ceb5a5f [fix] convert UTF8 slices back to unicode before using with the Python trie 2016-01-23 20:20:23 -05:00
Al 5eb6bb309b [fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string 2016-01-23 20:09:45 -05:00
Al d61207e95a [fix] var name 2016-01-23 18:01:02 -05:00
Al e44cba1d06 [fix] geonames db not required in OSM training data 2016-01-23 17:59:55 -05:00
Al 4f03711e60 [osm] Adding abbreviated training examples to ways language training data 2016-01-23 14:10:47 -05:00
Al c9fb4ee69d [osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used 2016-01-22 17:58:24 -05:00
Al ea9bb3f2d5 [fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled 2016-01-22 15:48:21 -05:00
Al f9f6558e06 [fix] simple whitespace field splits for the limited format training data (used for language classification) 2016-01-22 04:34:42 -05:00
Al cd1db7b288 [fix] Making sure rare components are dropped first, adding state and country back in 2016-01-22 04:17:19 -05:00
Al adc3a00264 [fix] var name 2016-01-22 04:10:16 -05:00
Al 261beffa36 [fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities 2016-01-22 04:00:45 -05:00
Al a6cc3d0114 [fix] Adding state to the more frequently dropped components 2016-01-22 03:56:38 -05:00
Al bca3dae004 [fix] state full name probabilities for limited vs. full formatted OSM training sets 2016-01-22 03:54:20 -05:00
Al d1cf253092 [osm/formatting] Higher probability of dropout for rare components like counties, etc. 2016-01-22 03:39:35 -05:00
Al 9dd965a6fa [fix] removing gazetteer configuration from disambiguation module 2016-01-22 03:18:18 -05:00
Al b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al 5a68e7aeef [fix] import 2016-01-22 03:00:43 -05:00
Al 6ac72576bc [osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK 2016-01-22 02:56:39 -05:00
Al f4995d4f0f [languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM 2016-01-22 00:51:32 -05:00
Al 26cbb1eb8d [languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes 2016-01-21 04:29:14 -05:00
Al 0269d92e3d [languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms 2016-01-21 02:30:59 -05:00
Al 2e15db06dd [text] making normalize_string directly callable from Python geodata 2016-01-21 02:07:46 -05:00
Al 71e01e6133 [fix] prefix/suffix phrase search in Python trie search 2016-01-19 03:43:54 -05:00
Al 39667b73a2 [build] std=gnu99 in geodata build 2016-01-19 03:23:56 -05:00
Al 8b94a018e6 [languages] encoding in language disambiguation 2016-01-19 03:22:03 -05:00
Al 3262d2ccd3 [fix] arg count 2016-01-19 03:16:14 -05:00