libpostal

Author	SHA1	Message	Date
Al	11d1acc3bc	[parser] Sample chain store alternate names from the cross-language dictionary	2016-07-21 17:04:57 -04:00
Al	5ea570835e	[fix] args again	2016-07-21 17:04:57 -04:00
Al	7c41d84d8f	[fix] args	2016-07-21 17:04:57 -04:00
Al	2e4ba6e6cc	[subdivisions/buildings] Adding subdivisions and buildings rtree to training data for getting building height, zone	2016-07-21 17:04:57 -04:00
Al	91db1ec371	[fix] removing unnecessary vars	2016-07-21 17:04:57 -04:00
Al	bce7004ed7	[fix] import	2016-07-21 17:04:57 -04:00
Al	e57783ff5f	[fix] constructor	2016-07-21 17:04:57 -04:00
Al	677a86224e	[fix] cli arg name	2016-07-21 17:04:57 -04:00
Al	d04a026528	[fix] no need to init language, etc. in new script	2016-07-21 17:04:57 -04:00
Al	611002ea7a	[fix] cleaning up imports	2016-07-21 17:04:57 -04:00
Al	a96e5760a9	[osm] Same great training script, only shorter	2016-07-21 17:04:57 -04:00
Al	00ce71223f	[osm] Using the default probabilities for abbreviations in ways training data	2016-01-24 00:53:41 -05:00
Al	bab7a0f961	[osm] splitting streets (way names) on semicolons	2016-01-24 00:42:25 -05:00
Al	7646adfc0f	[osm] Adding abbreviated street names in addition to the originals	2016-01-23 23:23:58 -05:00
Al	67130383ce	[fix] converting semicolons to commas in OSM house numbers and picking one at random	2016-01-23 23:16:19 -05:00
Al	1bb797f783	[fix] spacing in phrases	2016-01-23 21:59:49 -05:00
Al	3a8c3dfcf6	[fix] spacing in phrases at end of string	2016-01-23 21:51:40 -05:00
Al	78450bfad9	[fix] Spaces in abbreviation	2016-01-23 21:36:20 -05:00
Al	308ceb5a5f	[fix] convert UTF8 slices back to unicode before using with the Python trie	2016-01-23 20:20:23 -05:00
Al	5eb6bb309b	[fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string	2016-01-23 20:09:45 -05:00
Al	d61207e95a	[fix] var name	2016-01-23 18:01:02 -05:00
Al	e44cba1d06	[fix] geonames db not required in OSM training data	2016-01-23 17:59:55 -05:00
Al	4f03711e60	[osm] Adding abbreviated training examples to ways language training data	2016-01-23 14:10:47 -05:00
Al	c9fb4ee69d	[osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used	2016-01-22 17:58:24 -05:00
Al	ea9bb3f2d5	[fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled	2016-01-22 15:48:21 -05:00
Al	f9f6558e06	[fix] simple whitespace field splits for the limited format training data (used for language classification)	2016-01-22 04:34:42 -05:00
Al	cd1db7b288	[fix] Making sure rare components are dropped first, adding state and country back in	2016-01-22 04:17:19 -05:00
Al	adc3a00264	[fix] var name	2016-01-22 04:10:16 -05:00
Al	261beffa36	[fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities	2016-01-22 04:00:45 -05:00
Al	a6cc3d0114	[fix] Adding state to the more frequently dropped components	2016-01-22 03:56:38 -05:00
Al	bca3dae004	[fix] state full name probabilities for limited vs. full formatted OSM training sets	2016-01-22 03:54:20 -05:00
Al	d1cf253092	[osm/formatting] Higher probability of dropout for rare components like counties, etc.	2016-01-22 03:39:35 -05:00
Al	b22646ee30	[mv] Moving gazetteers into their own module	2016-01-22 03:15:56 -05:00
Al	6ac72576bc	[osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK	2016-01-22 02:56:39 -05:00
Al	1d288954d7	[osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma-separated list of house numbers is entered	2015-12-10 18:46:28 -05:00
Al	779298360c	[osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons	2015-12-09 01:00:59 -05:00
Al	aeb72d7d26	[osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random	2015-12-09 00:20:20 -05:00
Al	69a469d9d3	[osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples	2015-12-08 20:38:32 -05:00
Al	f8a3081d0f	[fix] city name in OSM formatting	2015-12-07 02:33:12 -05:00
Al	b25a738000	[osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name	2015-12-06 16:14:02 -05:00
Al	5fcb6d2c30	[fix] typo	2015-12-05 16:23:58 -05:00
Al	3a7ba0288f	[fix] .get	2015-12-05 16:13:15 -05:00
Al	c92a6de477	[fix] name	2015-12-05 15:49:50 -05:00
Al	2a4210f93f	[osm] Stripping standard city prefixes/suffixes e.g. Township of	2015-12-05 15:42:22 -05:00
Al	f41158b8b3	[osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city	2015-12-05 14:21:07 -05:00
Al	7c26317903	[fix] osm components	2015-12-03 19:30:15 -05:00
Al	42a8890652	[osm] Only removing local language city if there are prior components from OSM	2015-12-03 19:11:03 -05:00
Al	5af95ee613	[osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data	2015-12-03 18:00:05 -05:00
Al	8484d4fffd	[fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included	2015-11-30 23:28:12 -05:00
Al	6ef40c1769	[fix] dupe checking	2015-11-30 18:43:11 -05:00

184 Commits