53 Commits

Author SHA1 Message Date
Al
49ac3dc553 [disambiguation] Adding best_country_and_language 2016-07-21 17:04:57 -04:00
Al
7b42e52c6a [fix] token_types.PHRASE 2016-07-21 17:04:57 -04:00
Al
b4dcb83e10 [fix] sets of potential languages in case phrase matches multiple dictionaries 2016-01-24 17:57:12 -05:00
Al
b713d102d1 [languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language 2016-01-24 17:43:14 -05:00
Al
b3e730d83f [languages] If there's a single default language, assume ambiguous abbreviations are the default 2016-01-24 17:15:02 -05:00
Al
fffaeecfc6 [languages] Only count regional defaults when returning languages 2016-01-24 16:35:14 -05:00
Al
f8a0463aa0 [languages] Language disambiguation treats the national languages as non-default 2016-01-24 15:10:04 -05:00
Al
f04360732c [languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example) 2016-01-24 03:17:21 -05:00
Al
3485738c2b [fix] regional languages in French Canada 2016-01-24 00:20:34 -05:00
Al
9dd965a6fa [fix] removing gazetteer configuration from disambiguation module 2016-01-22 03:18:18 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
5a68e7aeef [fix] import 2016-01-22 03:00:43 -05:00
Al
f4995d4f0f [languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM 2016-01-22 00:51:32 -05:00
Al
26cbb1eb8d [languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes 2016-01-21 04:29:14 -05:00
Al
0269d92e3d [languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms 2016-01-21 02:30:59 -05:00
Al
71e01e6133 [fix] prefix/suffix phrase search in Python trie search 2016-01-19 03:43:54 -05:00
Al
8b94a018e6 [languages] encoding in language disambiguation 2016-01-19 03:22:03 -05:00
Al
3d7dd8966e [languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer 2016-01-17 18:28:28 -05:00
Al
58e53cab1c [scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo 2016-01-12 13:29:31 -05:00
Al
7ee8045a0f [fix] comparison 2015-11-22 18:27:05 -05:00
Al
efa0e38e45 [fix] another issue with tokenize API 2015-11-22 18:08:45 -05:00
Al
ce065bb9ec [fix] using new pypostal tokenize API 2015-11-22 18:01:07 -05:00
Al
ff3a3c2201 [fix] disambiguation tokenizer to pypostal 2015-10-21 16:35:55 -04:00
Al
7eb18f3538 [languages] Function to sample a random language from a discrete distribution (e.g. languages on the Internet, languages in a country, etc.) 2015-10-03 13:20:23 -04:00
Al
3ce1669c30 [fix] import 2015-09-24 01:25:00 -04:00
Al
8562c7a5cb [unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren. 2015-09-23 00:37:59 -04:00
Al
25917cfb17 [fix] scripts 2015-09-22 15:15:30 -04:00
Al
b405a53fe1 [fix] chars out of range in get_string_script Python version 2015-09-22 08:14:27 -04:00
Al
747de1944b [fix] Accounting for unknown scripts in disambiguation 2015-09-21 18:05:28 -04:00
Al
df20e2cbc0 [osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language 2015-09-04 14:13:33 -04:00
Al
6a20ce5e85 [language_id] Adding formatted addresses and toponyms to language training data 2015-09-04 01:46:49 -04:00
Al
8bbcb60aee [languages] Moving search_suffix and search_prefix into methods 2015-08-24 14:04:36 -04:00
Al
c68f56e61d [fix] paths 2015-08-24 12:58:27 -04:00
Al
d620cb6fc3 [fix] Calculating splits in Python rather than bash 2015-08-24 12:47:51 -04:00
Al
c754d275af [fix] str 2015-08-24 12:24:55 -04:00
Al
96cb289b79 [languages] Script to create language training/cross-validation/test data splits 2015-08-24 12:18:23 -04:00
Al
fa7b855ecb [languages] Earlier exit on finding ambiguous script spans 2015-08-24 03:07:57 -04:00
Al
e1d336716c [languages] Non-default language canonicals, more test cases 2015-08-24 02:21:53 -04:00
Al
c1ce91abbf [languages] Better handling of non-default language canonicals in default language text 2015-08-24 01:26:17 -04:00
Al
84e0982cbc [languages] Allow stopwords to help disambiguate if they can, otherwise ignore them 2015-08-23 23:04:17 -04:00
Al
7053c6b60b [fix] language disambiguation 2015-08-23 22:50:27 -04:00
Al
f6d84531bc [languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling 2015-08-23 16:34:26 -04:00
Al
43178747f8 [languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation 2015-08-23 04:28:44 -04:00
Al
d8763e9d6c [languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity 2015-08-23 03:42:24 -04:00
Al
122a81b610 [languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib 2015-08-23 02:26:06 -04:00
Al
a419dad630 [languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries 2015-08-23 00:43:37 -04:00
Al
a7d9cc1782 [fix] No longer using abbreviations for default languages, can be stopwords, etc. 2015-08-22 23:34:15 -04:00
Al
723058886a [languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals 2015-08-22 23:18:09 -04:00
Al
6231e17f2b [languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages 2015-08-22 20:26:39 -04:00
Al
3902715258 [osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases 2015-08-22 14:11:49 -04:00
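
Commit 7eb18f3538 above mentions a function to sample a random language from a discrete distribution. A minimal sketch of that idea follows. It assumes a plain dict mapping language code to weight. The name `sample_language`, its signature, and the example weights are hypothetical and are not taken from the repository.

```python
import bisect
import itertools
import random


def sample_language(weights, rng=random):
    """Draw one language code with probability proportional to its weight.

    `weights` maps language code -> non-negative weight. This input format
    is an assumption for illustration; the weights need not sum to 1.
    """
    # Drop zero-weight entries so they can never be returned.
    languages = sorted(lang for lang, w in weights.items() if w > 0)
    if not languages:
        raise ValueError('at least one language needs a positive weight')

    # Running totals: cumulative[i] is the upper edge of language i's bucket.
    cumulative = list(itertools.accumulate(weights[lang] for lang in languages))
    total = cumulative[-1]

    # Pick a point in [0, total) and find the bucket that contains it.
    point = rng.random() * total
    index = bisect.bisect_right(cumulative, point)
    # Guard against floating-point rounding pushing the point onto `total`.
    return languages[min(index, len(languages) - 1)]


if __name__ == '__main__':
    # Made-up weights, e.g. a rough share of languages in one country.
    example = {'fr': 0.6, 'nl': 0.35, 'de': 0.05}
    print(sample_language(example, random.Random(42)))
```

Passing a seeded `random.Random` instance as `rng` keeps the draws reproducible, which may be useful when generating training data.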