Al
|
df20e2cbc0
|
[osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language
|
2015-09-04 14:13:33 -04:00 |
|
Al
|
6a20ce5e85
|
[language_id] Adding formatted addresses and toponyms to language training data
|
2015-09-04 01:46:49 -04:00 |
|
Al
|
8bbcb60aee
|
[languages] Moving search_suffix and search_prefix into methods
|
2015-08-24 14:04:36 -04:00 |
|
Al
|
c68f56e61d
|
[fix] paths
|
2015-08-24 12:58:27 -04:00 |
|
Al
|
d620cb6fc3
|
[fix] Calculating splits in Python rather than bash
|
2015-08-24 12:47:51 -04:00 |
|
Al
|
c754d275af
|
[fix] str
|
2015-08-24 12:24:55 -04:00 |
|
Al
|
96cb289b79
|
[languages] Script to create language training/cross-validation/test data splits
|
2015-08-24 12:18:23 -04:00 |
|
Al
|
fa7b855ecb
|
[languages] Earlier exit on finding ambiguous script spans
|
2015-08-24 03:07:57 -04:00 |
|
Al
|
e1d336716c
|
[languages] Non-default language canonicals, more test cases
|
2015-08-24 02:21:53 -04:00 |
|
Al
|
c1ce91abbf
|
[languages] Better handling of non-default language canonicals in default language text
|
2015-08-24 01:26:17 -04:00 |
|
Al
|
84e0982cbc
|
[languages] Allow stopwords to help disambiguate if they can, otherwise ignore them
|
2015-08-23 23:04:17 -04:00 |
|
Al
|
7053c6b60b
|
[fix] language disambiguation
|
2015-08-23 22:50:27 -04:00 |
|
Al
|
f6d84531bc
|
[languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling
|
2015-08-23 16:34:26 -04:00 |
|
Al
|
43178747f8
|
[languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation
|
2015-08-23 04:28:44 -04:00 |
|
Al
|
d8763e9d6c
|
[languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity
|
2015-08-23 03:42:24 -04:00 |
|
Al
|
122a81b610
|
[languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib
|
2015-08-23 02:26:06 -04:00 |
|
Al
|
a419dad630
|
[languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries
|
2015-08-23 00:43:37 -04:00 |
|
Al
|
a7d9cc1782
|
[fix] No longer using abbreviations for default languages, can be stopwords, etc.
|
2015-08-22 23:34:15 -04:00 |
|
Al
|
723058886a
|
[languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals
|
2015-08-22 23:18:09 -04:00 |
|
Al
|
6231e17f2b
|
[languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages
|
2015-08-22 20:26:39 -04:00 |
|
Al
|
3902715258
|
[osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases
|
2015-08-22 14:11:49 -04:00 |
|
Al
|
c5a9c392d4
|
[languages] Refactoring street_types_gazetteer a bit so dictionaries are configurable
|
2015-08-21 09:23:05 -04:00 |
|
Al
|
baa60aab65
|
[fix] language disambiguation module
|
2015-08-21 08:03:20 -04:00 |
|
Al
|
ca6d802a43
|
[languages] Moving language id methods into a separate package
|
2015-08-21 08:00:56 -04:00 |
|