217 Commits

Author SHA1 Message Date
Al
72ee2e00ae [osm] Moving OSM boundaries to YAML files instead of JSON for consistency 2016-07-21 17:04:57 -04:00
Al
1f52f8ddcc [osm/polygons] Same check for closed ways as for relations in OSM polygon readers 2016-07-21 17:04:57 -04:00
Al
2f862ca0ec [osm] Adding place=plot to subdivisions data set 2016-07-21 17:04:57 -04:00
Al
8db7f139ba [osm] Adding building polygon reader, including closed ways for admin polys 2016-07-21 17:04:57 -04:00
Al
12a688df36 [osm] Splitting out generic amenities like ATM, fuel, restrooms, etc. so they can be used in category queries. Adding subdivision polygons, postcode polygons, building polygons, adding a few types of place keys to venues data set 2016-07-21 17:04:57 -04:00
Al
fc689222da [osm] adding civil boundaries (e.g. postal areas in Dublin), fixing output files 2016-07-21 17:04:57 -04:00
Al
2b4a9f0962 [osm] Splitting category queries data into several files (amenities, buildings, natural features, waterways) 2016-07-21 17:04:57 -04:00
Al
b25682e761 [polygons/zones] Adding a polygon reader for OSM zones (named residential/commercial/industrial/military areas) which are closed ways and can be used in addresses e.g. in office parks, larger housing complexes, etc. 2016-07-21 17:04:57 -04:00
Al
ac18e383bd [osm] Building OSM file for deriving category queries, zone data for including the names of residential, commercial and industrial areas in the parser. Named landuse and historic features are considered valid places/venues. 2016-07-21 17:04:57 -04:00
Al
af73bb300d [fix] Adding islands to admin borders 2016-07-21 17:04:57 -04:00
Al
7696179843 [osm] Removing generic amenities like ATMs, parking, restrooms, etc. from addresses but keeping them in venues to support generic queries 2016-03-14 01:07:03 -04:00
Al
c5498c6c0c [osm] Incorporating airports, and only including certain values for tourism= and leisure= since not all are physical place types, adding building= to addresses 2016-03-12 15:02:31 -05:00
Al
a71fa7bd8d [osm] tourism= keys should only be included in some cases. Listing everything on taginfo with >= 100 uses 2016-03-10 14:17:38 -05:00
Al
d43fe201ff [osm] No longer requiring street name in OSM planet addresses. Adding leisure and tourism keys to capture things like parks, squares, etc. Adding place=locality for neighborhoods. 2016-03-09 18:19:33 -05:00
Al
00ce71223f [osm] Using the default probabilities for abbreviations in ways training data 2016-01-24 00:53:41 -05:00
Al
bab7a0f961 [osm] splitting streets (way names) on semicolons 2016-01-24 00:42:25 -05:00
Al
7646adfc0f [osm] Adding abbreviated street names in addition to the originals 2016-01-23 23:23:58 -05:00
Al
67130383ce [fix] converting semicolons to commas in OSM house numbers and picking one at random 2016-01-23 23:16:19 -05:00
Al
1bb797f783 [fix] spacing in phrases 2016-01-23 21:59:49 -05:00
Al
3a8c3dfcf6 [fix] spacing in phrases at end of string 2016-01-23 21:51:40 -05:00
Al
78450bfad9 [fix] Spaces in abbreviation 2016-01-23 21:36:20 -05:00
Al
308ceb5a5f [fix] convert UTF8 slices back to unicode before using with the Python trie 2016-01-23 20:20:23 -05:00
Al
5eb6bb309b [fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string 2016-01-23 20:09:45 -05:00
Al
d61207e95a [fix] var name 2016-01-23 18:01:02 -05:00
Al
e44cba1d06 [fix] geonames db not required in OSM training data 2016-01-23 17:59:55 -05:00
Al
4f03711e60 [osm] Adding abbreviated training examples to ways language training data 2016-01-23 14:10:47 -05:00
Al
c9fb4ee69d [osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used 2016-01-22 17:58:24 -05:00
Al
ea9bb3f2d5 [fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled 2016-01-22 15:48:21 -05:00
Al
f9f6558e06 [fix] simple whitespace field splits for the limited format training data (used for language classification) 2016-01-22 04:34:42 -05:00
Al
cd1db7b288 [fix] Making sure rare components are dropped first, adding state and country back in 2016-01-22 04:17:19 -05:00
Al
adc3a00264 [fix] var name 2016-01-22 04:10:16 -05:00
Al
261beffa36 [fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities 2016-01-22 04:00:45 -05:00
Al
a6cc3d0114 [fix] Adding state to the more frequently dropped components 2016-01-22 03:56:38 -05:00
Al
bca3dae004 [fix] state full name probabilities for limited vs. full formatted OSM training sets 2016-01-22 03:54:20 -05:00
Al
d1cf253092 [osm/formatting] Higher probability of dropout for rare components like counties, etc. 2016-01-22 03:39:35 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
6ac72576bc [osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK 2016-01-22 02:56:39 -05:00
Al
3262d2ccd3 [fix] arg count 2016-01-19 03:16:14 -05:00
Al
19a5541a85 [polygons/osm] append polygon nodes by vertices that connect to each other 2016-01-16 21:20:49 -05:00
Al
1d288954d7 [osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma-separated list of house numbers is entered. 2015-12-10 18:46:28 -05:00
Al
779298360c [osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons 2015-12-09 01:00:59 -05:00
Al
aeb72d7d26 [osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random 2015-12-09 00:20:20 -05:00
Al
69a469d9d3 [osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples 2015-12-08 20:38:32 -05:00
Al
f8a3081d0f [fix] city name in OSM formatting 2015-12-07 02:33:12 -05:00
Al
b25a738000 [osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name 2015-12-06 16:14:02 -05:00
Al
5fcb6d2c30 [fix] typo 2015-12-05 16:23:58 -05:00
Al
3a7ba0288f [fix] .get 2015-12-05 16:13:15 -05:00
Al
c92a6de477 [fix] name 2015-12-05 15:49:50 -05:00
Al
2a4210f93f [osm] Stripping standard city prefixes/suffixes e.g. Township of 2015-12-05 15:42:22 -05:00
Al
f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city 2015-12-05 14:21:07 -05:00