59 Commits

Author SHA1 Message Date
Al
3dc2a922fb [addresses/languages] if there's only one default language and we don't have a road name or a unicode script to disambiguate, assume the default (e.g. English in the US unless there's a Spanish/French road name). Can affect things like state abbreviations 2016-11-22 18:27:54 -05:00
Al
66e35d517d [fix] language disambiguation 2016-07-21 17:04:57 -04:00
Al
2a4f8c5634 [fix] set 2016-07-21 17:04:57 -04:00
Al
4c71cab6a0 [languages] Adding script-only disambiguation 2016-07-21 17:04:57 -04:00
Al
a0e6a828c9 [languages] Adding country_and_languages to the language rtree itself 2016-07-21 17:04:57 -04:00
Al
6703da8fc3 [fix] languages and disambiguation do initialization by default 2016-07-21 17:04:57 -04:00
Al
49ac3dc553 [disambiguation] Adding best_country_and_language 2016-07-21 17:04:57 -04:00
Al
7b42e52c6a [fix] token_types.PHRASE 2016-07-21 17:04:57 -04:00
Al
b4dcb83e10 [fix] sets of potential languages in case phrase matches multiple dictionaries 2016-01-24 17:57:12 -05:00
Al
b713d102d1 [languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language 2016-01-24 17:43:14 -05:00
Al
b3e730d83f [languages] If there's a single default language, assume ambiguous abbreviations are the default 2016-01-24 17:15:02 -05:00
Al
fffaeecfc6 [languages] Only count regional defaults when returning languages 2016-01-24 16:35:14 -05:00
Al
f8a0463aa0 [languages] Language disambiguation treats the national languages as non-default 2016-01-24 15:10:04 -05:00
Al
f04360732c [languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example) 2016-01-24 03:17:21 -05:00
Al
3485738c2b [fix] regional languages in French Canada 2016-01-24 00:20:34 -05:00
Al
9dd965a6fa [fix] removing gazetteer configuration from disambiguation module 2016-01-22 03:18:18 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
5a68e7aeef [fix] import 2016-01-22 03:00:43 -05:00
Al
f4995d4f0f [languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM 2016-01-22 00:51:32 -05:00
Al
26cbb1eb8d [languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes 2016-01-21 04:29:14 -05:00
Al
0269d92e3d [languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms 2016-01-21 02:30:59 -05:00
Al
71e01e6133 [fix] prefix/suffix phrase search in Python trie search 2016-01-19 03:43:54 -05:00
Al
8b94a018e6 [languages] encoding in language disambiguation 2016-01-19 03:22:03 -05:00
Al
3d7dd8966e [languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer 2016-01-17 18:28:28 -05:00
Al
58e53cab1c [scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo 2016-01-12 13:29:31 -05:00
Al
7ee8045a0f [fix] comparison 2015-11-22 18:27:05 -05:00
Al
efa0e38e45 [fix] another issue with tokenize API 2015-11-22 18:08:45 -05:00
Al
ce065bb9ec [fix] using new pypostal tokenize API 2015-11-22 18:01:07 -05:00
Al
ff3a3c2201 [fix] disambiguation tokenizer to pypostal 2015-10-21 16:35:55 -04:00
Al
7eb18f3538 [languages] Function to sample a random language from a discrete distribution (e.g. languages on the Internet, languages in a country, etc.) 2015-10-03 13:20:23 -04:00
Al
3ce1669c30 [fix] import 2015-09-24 01:25:00 -04:00
Al
8562c7a5cb [unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren. 2015-09-23 00:37:59 -04:00
Al
25917cfb17 [fix] scripts 2015-09-22 15:15:30 -04:00
Al
b405a53fe1 [fix] chars out of range in get_string_script Python version 2015-09-22 08:14:27 -04:00
Al
747de1944b [fix] Accounting for unknown scripts in disambiguation 2015-09-21 18:05:28 -04:00
Al
df20e2cbc0 [osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language 2015-09-04 14:13:33 -04:00
Al
6a20ce5e85 [language_id] Adding formatted addresses and toponyms to language training data 2015-09-04 01:46:49 -04:00
Al
8bbcb60aee [languages] Moving search_suffix and search_prefix into methods 2015-08-24 14:04:36 -04:00
Al
c68f56e61d [fix] paths 2015-08-24 12:58:27 -04:00
Al
d620cb6fc3 [fix] Calculating splits in Python rather than bash 2015-08-24 12:47:51 -04:00
Al
c754d275af [fix] str 2015-08-24 12:24:55 -04:00
Al
96cb289b79 [languages] Script to create language training/cross-validation/test data splits 2015-08-24 12:18:23 -04:00
Al
fa7b855ecb [languages] Earlier exit on finding ambiguous script spans 2015-08-24 03:07:57 -04:00
Al
e1d336716c [languages] Non-default language canonicals, more test cases 2015-08-24 02:21:53 -04:00
Al
c1ce91abbf [languages] Better handling of non-default language canonicals in default language text 2015-08-24 01:26:17 -04:00
Al
84e0982cbc [languages] Allow stopwords to help disambiguate if they can, otherwise ignore them 2015-08-23 23:04:17 -04:00
Al
7053c6b60b [fix] language disambiguation 2015-08-23 22:50:27 -04:00
Al
f6d84531bc [languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling 2015-08-23 16:34:26 -04:00
Al
43178747f8 [languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation 2015-08-23 04:28:44 -04:00
Al
d8763e9d6c [languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity 2015-08-23 03:42:24 -04:00