libpostal

Author	SHA1	Message	Date
Al	64db63e3eb	[osm] Removing house tag	2015-09-04 12:23:47 -04:00
Al	6a20ce5e85	[language_id] Adding formatted addresses and toponyms to language training data	2015-09-04 01:46:49 -04:00
Al	4ebdca0ea7	[fix] var	2015-09-03 21:01:20 -04:00
Al	8345afbcd0	[fix] exclude country toponyms where the default language is well represented	2015-09-03 20:56:58 -04:00
Al	20bb191624	[fix] chaining	2015-09-03 20:52:00 -04:00
Al	e7cf5000fe	[fix] Exclude polygons with > 1 regional language	2015-09-03 20:48:04 -04:00
Al	9a9530c1b9	[fix] unqualified names	2015-09-03 20:37:22 -04:00
Al	a5fdd911d8	[fix] only use name key for default names	2015-09-03 20:35:08 -04:00
Al	d8e1432533	[osm] Adding unqualified names in single-language countries	2015-09-03 20:31:49 -04:00
Al	b15d2d70aa	[fix] top language	2015-09-03 20:09:46 -04:00
Al	44bf94a158	[osm] Better borders training data set (only need the metadata, not the polygons)	2015-09-03 20:09:03 -04:00
Al	55af9b0a0c	[fix] OSM address tagged training data formatting	2015-09-03 18:35:19 -04:00
Al	c6bfc0e021	[osm] Postponing punctuation stripping until after address template rendering	2015-09-03 18:13:41 -04:00
Al	d54fb25e45	[osm] don't bother with the R-tree check if there are no name:* tags in border data set	2015-09-03 17:54:40 -04:00
Al	33af61095b	[fix] var	2015-09-03 17:49:52 -04:00
Al	294101ad80	[osm] Treating components that are all punctuation as blank in address parsing (e.g. a single comma)	2015-09-03 17:46:57 -04:00
Al	e1e5c16637	[osm] Not adding unqualified name tags to toponym data set, throwing out a few cases of language ambiguity	2015-09-03 16:50:30 -04:00
Al	040a26a6f2	[fix] import	2015-09-03 13:54:23 -04:00
Al	7787427c58	[fix] typo	2015-09-03 13:53:18 -04:00
Al	23633e95dd	[osm] Only adding country default language toponyms to training data	2015-09-03 13:44:41 -04:00
Al	11c01f64d2	[osm] OrderedDict of attrs in OSM training data	2015-09-03 11:11:18 -04:00
Al	27eb4e4aed	[osm] Adding a toponym language training set using planet-borders.osm (all admin borders)	2015-09-03 10:19:11 -04:00
Al	db57855c95	[osm] Switching formatter repo to the OpenVenues fork, with fixes and several dozen new countries added	2015-09-03 10:06:54 -04:00
Al	a916668f28	[i18n] Local file for ISO 15924	2015-09-01 23:58:36 -04:00
Al	a2ec8001b0	[osm] Removing postal code keys in formatted language training data	2015-08-24 14:08:36 -04:00
Al	8bbcb60aee	[languages] Moving search_suffix and search_prefix into methods	2015-08-24 14:04:36 -04:00
Al	c68f56e61d	[fix] paths	2015-08-24 12:58:27 -04:00
Al	d620cb6fc3	[fix] Calculating splits in Python rather than bash	2015-08-24 12:47:51 -04:00
Al	c754d275af	[fix] str	2015-08-24 12:24:55 -04:00
Al	96cb289b79	[languages] Script to create language training/cross-validation/test data splits	2015-08-24 12:18:23 -04:00
Al	fa7b855ecb	[languages] Earlier exit on finding ambiguous script spans	2015-08-24 03:07:57 -04:00
Al	e1d336716c	[languages] Non-default language canonicals, more test cases	2015-08-24 02:21:53 -04:00
Al	c1ce91abbf	[languages] Better handling of non-default language canonicals in default language text	2015-08-24 01:26:17 -04:00
Al	96d7b990b5	[fix] .items()	2015-08-23 23:39:30 -04:00
Al	84e0982cbc	[languages] Allow stopwords to help disambiguate if they can, otherwise ignore them	2015-08-23 23:04:17 -04:00
Al	7053c6b60b	[fix] language disambiguation	2015-08-23 22:50:27 -04:00
Al	f6d84531bc	[languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling	2015-08-23 16:34:26 -04:00
Al	b8e4c19146	[mv] Moving the get regional/country languages logic out of language polygons	2015-08-23 14:25:33 -04:00
Al	43178747f8	[languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation	2015-08-23 04:28:44 -04:00
Al	d8763e9d6c	[languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity	2015-08-23 03:42:24 -04:00
Al	122a81b610	[languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib	2015-08-23 02:26:06 -04:00
Al	a419dad630	[languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries	2015-08-23 00:43:37 -04:00
Al	a7d9cc1782	[fix] No longer using abbreviations for default languages, can be stopwords, etc.	2015-08-22 23:34:15 -04:00
Al	0701bb6f08	[fix] import	2015-08-22 23:19:43 -04:00
Al	723058886a	[languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals	2015-08-22 23:18:09 -04:00
Al	6231e17f2b	[languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages	2015-08-22 20:26:39 -04:00
Al	bf829f7cb6	[polygons] Adding a main to generate language polygons	2015-08-22 17:45:04 -04:00
Al	e70c2453ee	[fix] import	2015-08-22 15:04:30 -04:00
Al	3902715258	[osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases	2015-08-22 14:11:49 -04:00
Al	f6e521e3f3	[geonames] Adding covering index to geonames DB	2015-08-22 13:54:25 -04:00

214 Commits