[osm] Some countries in OSM, such as Lebanon, list the same address in two languages (French/English), which is an unreasonable task for a linear classifier, so run disambiguation in those cases

Al
2015-08-22 14:11:44 -04:00
parent f6e521e3f3
commit 3902715258
2 changed files with 28 additions and 12 deletions
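
Going by the first hunk below, `disambiguate_language` appears to stop unpacking a list of dicts and instead takes `languages` as (language, default) pairs directly. A minimal sketch of the two input shapes, assuming hypothetical sample values (the language codes and default flags here are illustrative, not taken from the repository):

```python
from collections import OrderedDict

# Hypothetical sample data; codes and default flags are illustrative only.
old_style = [{'lang': 'fr', 'default': 1}, {'lang': 'en', 'default': 0}]
new_style = [('fr', 1), ('en', 0)]

# Old form: pull (lang, default) out of each dict before building the mapping.
old_valid = OrderedDict([(l['lang'], l['default']) for l in old_style])

# New form: the caller already supplies (lang, default) pairs.
new_valid = OrderedDict(new_style)

# Both produce the same ordered mapping of language code to default flag.
assert old_valid == new_valid
```

Under this reading, callers are responsible for building the pairs before calling the function.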

@@ -113,7 +113,7 @@ AMBIGUOUS_LANGUAGE = 'xxx'
 def disambiguate_language(text, languages):
-    valid_languages = OrderedDict([(l['lang'], l['default']) for l in languages])
+    valid_languages = OrderedDict(languages)
     tokens = tokenize(safe_decode(text).replace(u'-', u' ').lower())
     current_lang = None