libpostal

Author	SHA1	Message	Date
Al	5e2d9f371e	[numex] Moving numex script to a different subpackage, adding function for creating ordinals	2016-07-21 17:04:57 -04:00
Al	e6b59980e7	[categories] Scraper for Nominatim Special Phrases, translated into a number of languages	2016-07-21 17:04:57 -04:00
Al	1bc92d6995	[fix] output path in numex.py	2016-03-29 11:25:36 -04:00
Al	2a2d1738a3	[fix] path for running numex.py	2016-03-29 11:15:24 -04:00
Al	7696179843	[osm] Removing generic amenities like ATMs, parking, restrooms, etc. from addresses but keeping them in venues to support generic queries	2016-03-14 01:07:03 -04:00
Al	18e2c7519e	[fix] Absolute dir check in generating expansion data files	2016-03-13 23:23:46 -04:00
Al	c5498c6c0c	[osm] Incorporating airports, and only including certain values for tourism= and leisure= since not all are physical place types, adding building= to addresses	2016-03-12 15:02:31 -05:00
Al Barrentine	942e5df1b9	Merge pull request #40 from thatdatabaseguy/master Including landmarks + more venues in OSM training data	2016-03-11 16:47:11 -05:00
Al	7a24ced43c	[fix] longitude validation	2016-03-11 16:35:33 -05:00
Al	99f452c7b1	[geo] Validate lat/lon in latlon_to_decimal	2016-03-11 16:18:31 -05:00
Al	a2f186a0ee	[geo] Adding lat/lon validation functions for the training scripts	2016-03-11 14:09:10 -05:00
Al	f7d6943994	[fix] no comma in download_quattroshapes filenames	2016-03-10 23:40:54 -05:00
Al	a71fa7bd8d	[osm] tourism= keys should only be included in some cases. Listing everything on taginfo with >= 100 uses	2016-03-10 14:17:38 -05:00
Al	d43fe201ff	[osm] No longer requiring street name in OSM planet addresses. Adding leisure and tourism keys to capture things like parks, squares, etc. Adding place=locality for neighborhoods.	2016-03-09 18:19:33 -05:00
Al	1003832b9c	[fix] README should not be included in building address dictionaries	2016-03-09 11:18:19 -05:00
Al	08085ee08b	[languages][ci skip] Checking in script to extract address phrases in various languages using frequent itemsets	2016-03-08 14:35:20 -05:00
Al	a483fd5d42	[fix][ci skip] pip installing some light requirements when the dictionaries/numex files change. Only building transliteration if the data file changed (the CLDR files are not in-repo so will be built offline)	2016-03-04 16:17:05 -05:00
Al	52ebc9fc46	[fix] Paths relative to the current file in address_dictionaries.py so it can be run from anywhere	2016-02-24 13:10:44 -05:00
Al	393fd7e0f3	[build] Using env var for data dir in geodata build script	2016-02-08 01:11:42 -05:00
Al	b4dcb83e10	[fix] sets of potential languages in case phrase matches multiple dictionaries	2016-01-24 17:57:12 -05:00
Al	b713d102d1	[languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language	2016-01-24 17:43:14 -05:00
Al	b3e730d83f	[languages] If there's a single default language, assume ambiguous abbreviations are the default	2016-01-24 17:15:02 -05:00
Al	fffaeecfc6	[languages] Only count regional defaults when returning languages	2016-01-24 16:35:14 -05:00
Al	f8a0463aa0	[languages] Language disambiguation treats the national languages as non-default	2016-01-24 15:10:04 -05:00
Al	f04360732c	[languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example)	2016-01-24 03:17:21 -05:00
Al	00ce71223f	[osm] Using the default probabilities for abbreviations in ways training data	2016-01-24 00:53:41 -05:00
Al	bab7a0f961	[osm] splitting streets (way names) on semicolons	2016-01-24 00:42:25 -05:00
Al	3485738c2b	[fix] regional languages in French Canada	2016-01-24 00:20:34 -05:00
Al	7646adfc0f	[osm] Adding abbreviated street names in addition to the originals	2016-01-23 23:23:58 -05:00
Al	67130383ce	[fix] converting semicolons to commas in OSM house numbers and picking one at random	2016-01-23 23:16:19 -05:00
Al	1bb797f783	[fix] spacing in phrases	2016-01-23 21:59:49 -05:00
Al	3a8c3dfcf6	[fix] spacing in phrases at end of string	2016-01-23 21:51:40 -05:00
Al	78450bfad9	[fix] Spaces in abbreviation	2016-01-23 21:36:20 -05:00
Al	308ceb5a5f	[fix] convert UTF8 slices back to unicode before using with the Python trie	2016-01-23 20:20:23 -05:00
Al	5eb6bb309b	[fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string	2016-01-23 20:09:45 -05:00
Al	d61207e95a	[fix] var name	2016-01-23 18:01:02 -05:00
Al	e44cba1d06	[fix] geonames db not required in OSM training data	2016-01-23 17:59:55 -05:00
Al	4f03711e60	[osm] Adding abbreviated training examples to ways language training data	2016-01-23 14:10:47 -05:00
Al	c9fb4ee69d	[osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used	2016-01-22 17:58:24 -05:00
Al	ea9bb3f2d5	[fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled	2016-01-22 15:48:21 -05:00
Al	f9f6558e06	[fix] simple whitespace field splits for the limited format training data (used for language classification)	2016-01-22 04:34:42 -05:00
Al	cd1db7b288	[fix] Making sure rare components are dropped first, adding state and country back in	2016-01-22 04:17:19 -05:00
Al	adc3a00264	[fix] var name	2016-01-22 04:10:16 -05:00
Al	261beffa36	[fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities	2016-01-22 04:00:45 -05:00
Al	a6cc3d0114	[fix] Adding state to the more frequently dropped components	2016-01-22 03:56:38 -05:00
Al	bca3dae004	[fix] state full name probabilities for limited vs. full formatted OSM training sets	2016-01-22 03:54:20 -05:00
Al	d1cf253092	[osm/formatting] Higher probability of dropout for rare components like counties, etc.	2016-01-22 03:39:35 -05:00
Al	9dd965a6fa	[fix] removing gazetteer configuration from disambiguation module	2016-01-22 03:18:18 -05:00
Al	b22646ee30	[mv] Moving gazetteers into their own module	2016-01-22 03:15:56 -05:00
Al	5a68e7aeef	[fix] import	2016-01-22 03:00:43 -05:00

825 Commits