184 Commits

Author SHA1 Message Date
Al
bca3dae004 [fix] state full name probabilities for limited vs. full formatted OSM training sets 2016-01-22 03:54:20 -05:00
Al
d1cf253092 [osm/formatting] Higher probability of dropout for rare components like counties, etc. 2016-01-22 03:39:35 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
6ac72576bc [osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK 2016-01-22 02:56:39 -05:00
Al
3262d2ccd3 [fix] arg count 2016-01-19 03:16:14 -05:00
Al
19a5541a85 [polygons/osm] append polygon nodes by vertices that connect to each other 2016-01-16 21:20:49 -05:00
Al
1d288954d7 [osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma separated list of house numbers is entered. 2015-12-10 18:46:28 -05:00
Al
779298360c [osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons 2015-12-09 01:00:59 -05:00
Al
aeb72d7d26 [osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random 2015-12-09 00:20:20 -05:00
Al
69a469d9d3 [osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples 2015-12-08 20:38:32 -05:00
Al
f8a3081d0f [fix] city name in OSM formatting 2015-12-07 02:33:12 -05:00
Al
b25a738000 [osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name 2015-12-06 16:14:02 -05:00
Al
5fcb6d2c30 [fix] typo 2015-12-05 16:23:58 -05:00
Al
3a7ba0288f [fix] .get 2015-12-05 16:13:15 -05:00
Al
c92a6de477 [fix] name 2015-12-05 15:49:50 -05:00
Al
2a4210f93f [osm] Stripping standard city prefixes/suffixes e.g. Township of 2015-12-05 15:42:22 -05:00
Al
f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city 2015-12-05 14:21:07 -05:00
Al
7c26317903 [fix] osm components 2015-12-03 19:30:15 -05:00
Al
42a8890652 [osm] Only removing local language city if there are prior components from OSM 2015-12-03 19:11:03 -05:00
Al
5af95ee613 [osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data 2015-12-03 18:00:05 -05:00
Al
218361f43f [osm] Removing multilinestring boundaries from OSM polygon index (often partial boundaries e.g. France-Germany) 2015-12-03 00:51:09 -05:00
Al
8484d4fffd [fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included 2015-11-30 23:28:12 -05:00
Al
6ef40c1769 [fix] dupe checking 2015-11-30 18:43:11 -05:00
Al
af170de019 [fix] Smaller probabilities on adding neighborhoods and admin polygons, eliminating duplicates on the row level 2015-11-30 18:35:31 -05:00
Al
621fd79002 [fix] var 2015-11-30 18:20:26 -05:00
Al
b430fb7657 [osm/formatting] Adding pick random name logic to neighborhoods as well, getting rid of drop probabilities as they're covered elsewhere, adding several forms of venue names to the training data 2015-11-30 18:10:18 -05:00
Al
839a12b212 [osm/formatting] Changing drop probabilities and doing it in random order 2015-11-30 15:27:35 -05:00
Al
89677d94a3 [parsing] Initial commit of the address parser, training/testing, feature function, I/O 2015-11-30 14:48:13 -05:00
Al
9a8ba14887 [osm/formatting] Adding per-field drop probabilities to OSM training data to make some fields more likely to be dropped, although it might create more training data 2015-11-30 11:10:12 -05:00
Al
15d9e00121 [osm/formatting] Adding in more ISO alpha-3 codes for countries in the training data 2015-11-28 14:08:07 -05:00
Al
66778737ff [fix] non-local language states 2015-11-28 13:48:59 -05:00
Al
69ba631dc9 [docs] updating params in OSM training data docs 2015-11-28 01:09:14 -05:00
Al
3cd1fee89d [fix] KeyError 2015-11-27 14:40:11 -05:00
Al
a77bc03977 [fix] language 2015-11-27 14:24:32 -05:00
Al
38d4e2d67a [fix] cities 2015-11-27 14:05:53 -05:00
Al
3cf98770e3 [fix] var name 2015-11-27 13:54:38 -05:00
Al
2e0f35b13a [fix] key checks for Quattroshapes cities, removing city in non-local language case 2015-11-27 13:45:51 -05:00
Al
105ba313c5 [fix] var name 2015-11-27 12:00:11 -05:00
Al
3eea355352 [fix] argument order 2015-11-27 11:47:39 -05:00
Al
51f6a82727 [fix] import again 2015-11-27 11:38:40 -05:00
Al
644eeb74c6 [fix] import 2015-11-27 11:17:53 -05:00
Al
2830986073 [osm/formatting] Adding in cities from Quattroshapes/GeoNames in the case of non-local languages or in general with a small random probability 2015-11-27 11:09:12 -05:00
Al
a50c971732 [polygons/osm] Omitting last node in every way of a connected component since that node is equal to the start node of its neighbor 2015-11-25 17:09:19 -05:00
Al
3217fa39cd [fix] add country randomly in the formatted language training data in cases where country is not present 2015-11-25 14:54:41 -05:00
Al
5781813cbd [fix] For countries like Denmark, removing country with a smaller probability 2015-11-25 00:39:52 -05:00
Al
e4b8349d98 [fix] sparsity of country tags should be enough for language address training data 2015-11-25 00:32:01 -05:00
Al
824c779107 [fix] Cutting down training repeatedly on country names 2015-11-24 23:22:57 -05:00
Al
88529d28e2 [fix] country formatting in language address training data 2015-11-24 23:20:31 -05:00
Al
cd74fcda3c [fix] not requiring minimal keys in format language data 2015-11-24 23:13:28 -05:00
Al
e560e53308 [fix] formatter 2015-11-24 22:27:57 -05:00