825 Commits

Author SHA1 Message Date
Al
5e2d9f371e [numex] Moving numex script to a different subpackage, adding function for creating ordinals 2016-07-21 17:04:57 -04:00
Al
e6b59980e7 [categories] Scraper for Nominatim Special Phrases, translated into a number of languages 2016-07-21 17:04:57 -04:00
Al
1bc92d6995 [fix] output path in numex.py 2016-03-29 11:25:36 -04:00
Al
2a2d1738a3 [fix] path for running numex.py 2016-03-29 11:15:24 -04:00
Al
7696179843 [osm] Removing generic amenities like ATMs, parking, restrooms, etc. from addresses but keeping them in venues to support generic queries 2016-03-14 01:07:03 -04:00
Al
18e2c7519e [fix] Absolute dir check in generating expansion data files 2016-03-13 23:23:46 -04:00
Al
c5498c6c0c [osm] Incorporating airports, and only including certain values for tourism= and leisure= since not all are physical place types, adding building= to addresses 2016-03-12 15:02:31 -05:00
Al Barrentine
942e5df1b9 Merge pull request #40 from thatdatabaseguy/master: Including landmarks + more venues in OSM training data 2016-03-11 16:47:11 -05:00
Al
7a24ced43c [fix] longitude validation 2016-03-11 16:35:33 -05:00
Al
99f452c7b1 [geo] Validate lat/lon in latlon_to_decimal 2016-03-11 16:18:31 -05:00
Al
a2f186a0ee [geo] Adding lat/lon validation functions for the training scripts 2016-03-11 14:09:10 -05:00
Al
f7d6943994 [fix] no comma in download_quattroshapes filenames 2016-03-10 23:40:54 -05:00
Al
a71fa7bd8d [osm] tourism= keys should only be included in some cases. Listing everything on taginfo with >= 100 uses 2016-03-10 14:17:38 -05:00
Al
d43fe201ff [osm] No longer requiring street name in OSM planet addresses. Adding leisure and tourism keys to capture things like parks, squares, etc. Adding place=locality for neighborhoods. 2016-03-09 18:19:33 -05:00
Al
1003832b9c [fix] README should not be included in building address dictionaries 2016-03-09 11:18:19 -05:00
Al
08085ee08b [languages][ci skip] Checking in script to extract address phrases in various languages using frequent itemsets 2016-03-08 14:35:20 -05:00
Al
a483fd5d42 [fix][ci skip] pip installing some light requirements when the dictionaries/numex files change. Only building transliteration if the data file changed (the CLDR files are not in-repo so will be built offline) 2016-03-04 16:17:05 -05:00
Al
52ebc9fc46 [fix] Paths relative to the current file in address_dictionaries.py so it can be run from anywhere 2016-02-24 13:10:44 -05:00
Al
393fd7e0f3 [build] Using env var for data dir in geodata build script 2016-02-08 01:11:42 -05:00
Al
b4dcb83e10 [fix] sets of potential languages in case phrase matches multiple dictionaries 2016-01-24 17:57:12 -05:00
Al
b713d102d1 [languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language 2016-01-24 17:43:14 -05:00
Al
b3e730d83f [languages] If there's a single default language, assume ambiguous abbreviations are the default 2016-01-24 17:15:02 -05:00
Al
fffaeecfc6 [languages] Only count regional defaults when returning languages 2016-01-24 16:35:14 -05:00
Al
f8a0463aa0 [languages] Language disambiguation treats the national languages as non-default 2016-01-24 15:10:04 -05:00
Al
f04360732c [languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example) 2016-01-24 03:17:21 -05:00
Al
00ce71223f [osm] Using the default probabilities for abbreviations in ways training data 2016-01-24 00:53:41 -05:00
Al
bab7a0f961 [osm] splitting streets (way names) on semicolons 2016-01-24 00:42:25 -05:00
Al
3485738c2b [fix] regional languages in French Canada 2016-01-24 00:20:34 -05:00
Al
7646adfc0f [osm] Adding abbreviated street names in addition to the originals 2016-01-23 23:23:58 -05:00
Al
67130383ce [fix] converting semicolons to commas in OSM house numbers and picking one at random 2016-01-23 23:16:19 -05:00
Al
1bb797f783 [fix] spacing in phrases 2016-01-23 21:59:49 -05:00
Al
3a8c3dfcf6 [fix] spacing in phrases at end of string 2016-01-23 21:51:40 -05:00
Al
78450bfad9 [fix] Spaces in abbreviation 2016-01-23 21:36:20 -05:00
Al
308ceb5a5f [fix] convert UTF8 slices back to unicode before using with the Python trie 2016-01-23 20:20:23 -05:00
Al
5eb6bb309b [fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string 2016-01-23 20:09:45 -05:00
Al
d61207e95a [fix] var name 2016-01-23 18:01:02 -05:00
Al
e44cba1d06 [fix] geonames db not required in OSM training data 2016-01-23 17:59:55 -05:00
Al
4f03711e60 [osm] Adding abbreviated training examples to ways language training data 2016-01-23 14:10:47 -05:00
Al
c9fb4ee69d [osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used 2016-01-22 17:58:24 -05:00
Al
ea9bb3f2d5 [fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled 2016-01-22 15:48:21 -05:00
Al
f9f6558e06 [fix] simple whitespace field splits for the limited format training data (used for language classification) 2016-01-22 04:34:42 -05:00
Al
cd1db7b288 [fix] Making sure rare components are dropped first, adding state and country back in 2016-01-22 04:17:19 -05:00
Al
adc3a00264 [fix] var name 2016-01-22 04:10:16 -05:00
Al
261beffa36 [fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities 2016-01-22 04:00:45 -05:00
Al
a6cc3d0114 [fix] Adding state to the more frequently dropped components 2016-01-22 03:56:38 -05:00
Al
bca3dae004 [fix] state full name probabilities for limited vs. full formatted OSM training sets 2016-01-22 03:54:20 -05:00
Al
d1cf253092 [osm/formatting] Higher probability of dropout for rare components like counties, etc. 2016-01-22 03:39:35 -05:00
Al
9dd965a6fa [fix] removing gazetteer configuration from disambiguation module 2016-01-22 03:18:18 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
5a68e7aeef [fix] import 2016-01-22 03:00:43 -05:00