libpostal

Author	SHA1	Message	Date
Al	49ac3dc553	[disambiguation] Adding best_country_and_language	2016-07-21 17:04:57 -04:00
Al	7b42e52c6a	[fix] token_types.PHRASE	2016-07-21 17:04:57 -04:00
Al	b4dcb83e10	[fix] sets of potential languages in case phrase matches multiple dictionaries	2016-01-24 17:57:12 -05:00
Al	b713d102d1	[languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language	2016-01-24 17:43:14 -05:00
Al	b3e730d83f	[languages] If there's a single default language, assume ambiguous abbreviations are the default	2016-01-24 17:15:02 -05:00
Al	fffaeecfc6	[languages] Only count regional defaults when returning languages	2016-01-24 16:35:14 -05:00
Al	f8a0463aa0	[languages] Language disambiguation treats the national languages as non-default	2016-01-24 15:10:04 -05:00
Al	f04360732c	[languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example)	2016-01-24 03:17:21 -05:00
Al	3485738c2b	[fix] regional languages in French Canada	2016-01-24 00:20:34 -05:00
Al	9dd965a6fa	[fix] removing gazetteer configuration from disambiguation module	2016-01-22 03:18:18 -05:00
Al	b22646ee30	[mv] Moving gazetteers into their own module	2016-01-22 03:15:56 -05:00
Al	5a68e7aeef	[fix] import	2016-01-22 03:00:43 -05:00
Al	f4995d4f0f	[languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM	2016-01-22 00:51:32 -05:00
Al	26cbb1eb8d	[languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes	2016-01-21 04:29:14 -05:00
Al	0269d92e3d	[languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms	2016-01-21 02:30:59 -05:00
Al	71e01e6133	[fix] prefix/suffix phrase search in Python trie search	2016-01-19 03:43:54 -05:00
Al	8b94a018e6	[languages] encoding in language disambiguation	2016-01-19 03:22:03 -05:00
Al	3d7dd8966e	[languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer	2016-01-17 18:28:28 -05:00
Al	58e53cab1c	[scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo	2016-01-12 13:29:31 -05:00
Al	7ee8045a0f	[fix] comparison	2015-11-22 18:27:05 -05:00
Al	efa0e38e45	[fix] another issue with tokenize API	2015-11-22 18:08:45 -05:00
Al	ce065bb9ec	[fix] using new pypostal tokenize API	2015-11-22 18:01:07 -05:00
Al	ff3a3c2201	[fix] disambiguation tokenizer to pypostal	2015-10-21 16:35:55 -04:00
Al	7eb18f3538	[languages] Function to sample a random language from a discrete distribution (e.g. languages on the Internet, languages in a country, etc.)	2015-10-03 13:20:23 -04:00
Al	3ce1669c30	[fix] import	2015-09-24 01:25:00 -04:00
Al	8562c7a5cb	[unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren.	2015-09-23 00:37:59 -04:00
Al	25917cfb17	[fix] scripts	2015-09-22 15:15:30 -04:00
Al	b405a53fe1	[fix] chars out of range in get_string_script Python version	2015-09-22 08:14:27 -04:00
Al	747de1944b	[fix] Accounting for unknown scripts in disambiguation	2015-09-21 18:05:28 -04:00
Al	df20e2cbc0	[osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language	2015-09-04 14:13:33 -04:00
Al	6a20ce5e85	[language_id] Adding formatted addresses and toponyms to language training data	2015-09-04 01:46:49 -04:00
Al	8bbcb60aee	[languages] Moving search_suffix and search_prefix into methods	2015-08-24 14:04:36 -04:00
Al	c68f56e61d	[fix] paths	2015-08-24 12:58:27 -04:00
Al	d620cb6fc3	[fix] Calculating splits in Python rather than bash	2015-08-24 12:47:51 -04:00
Al	c754d275af	[fix] str	2015-08-24 12:24:55 -04:00
Al	96cb289b79	[languages] Script to create language training/cross-validation/test data splits	2015-08-24 12:18:23 -04:00
Al	fa7b855ecb	[languages] Earlier exit on finding ambiguous script spans	2015-08-24 03:07:57 -04:00
Al	e1d336716c	[languages] Non-default language canonicals, more test cases	2015-08-24 02:21:53 -04:00
Al	c1ce91abbf	[languages] Better handling of non-default language canonicals in default language text	2015-08-24 01:26:17 -04:00
Al	84e0982cbc	[languages] Allow stopwords to help disambiguate if they can, otherwise ignore them	2015-08-23 23:04:17 -04:00
Al	7053c6b60b	[fix] language disambiguation	2015-08-23 22:50:27 -04:00
Al	f6d84531bc	[languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling	2015-08-23 16:34:26 -04:00
Al	43178747f8	[languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation	2015-08-23 04:28:44 -04:00
Al	d8763e9d6c	[languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity	2015-08-23 03:42:24 -04:00
Al	122a81b610	[languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib	2015-08-23 02:26:06 -04:00
Al	a419dad630	[languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries	2015-08-23 00:43:37 -04:00
Al	a7d9cc1782	[fix] No longer using abbreviations for default languages, can be stopwords, etc.	2015-08-22 23:34:15 -04:00
Al	723058886a	[languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals	2015-08-22 23:18:09 -04:00
Al	6231e17f2b	[languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages	2015-08-22 20:26:39 -04:00
Al	3902715258	[osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases	2015-08-22 14:11:49 -04:00

53 Commits