690 Commits

Author SHA1 Message Date
Al
d8763e9d6c [languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity 2015-08-23 03:42:24 -04:00
Al
9c176961ff [dictionaries] Norwegian street types from the suffix dictionary 2015-08-23 02:32:44 -04:00
Al
122a81b610 [languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib 2015-08-23 02:26:06 -04:00
Al
a419dad630 [languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries 2015-08-23 00:43:37 -04:00
Al
a7d9cc1782 [fix] No longer using abbreviations for default languages, can be stopwords, etc. 2015-08-22 23:34:15 -04:00
Al
0701bb6f08 [fix] import 2015-08-22 23:19:43 -04:00
Al
723058886a [languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals 2015-08-22 23:18:09 -04:00
Al
6231e17f2b [languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages 2015-08-22 20:26:39 -04:00
Al
bf829f7cb6 [polygons] Adding a main to generate language polygons 2015-08-22 17:45:04 -04:00
Al
5c15c4a99f [languages] Adding non-default Spanish and French gazetteers to the US, and giving the country of Jersey shared English/French defaults instead of just English 2015-08-22 15:21:04 -04:00
Al
e70c2453ee [fix] import 2015-08-22 15:04:30 -04:00
Al
3902715258 [osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases 2015-08-22 14:11:49 -04:00
Al
f6e521e3f3 [geonames] Adding covering index to geonames DB 2015-08-22 13:54:25 -04:00
Al
bd31dc99f2 [mv] csv_utils 2015-08-22 13:53:44 -04:00
Al
cc43409b72 [languages] Adding English gazetteers to many countries where the default language is Arabic but the road signs may be in English 2015-08-22 13:42:31 -04:00
Al
c5a9c392d4 [languages] Refactoring street_types_gazetteer a bit so dictionaries are configurable 2015-08-21 09:23:05 -04:00
Al
baa60aab65 [fix] language disambiguation module 2015-08-21 08:03:20 -04:00
Al
4976be64e5 [fix] var name 2015-08-21 08:02:26 -04:00
Al
8e56568cab [fix] typo 2015-08-21 08:01:49 -04:00
Al
ca6d802a43 [languages] Moving language id methods into a separate package 2015-08-21 08:00:56 -04:00
Al
9d2f7e4bd1 [fix] var name 2015-08-18 16:20:12 -04:00
Al
0528d1b578 [osm] OSM untagged formatted addresses try to use language namespaced tags 2015-08-18 16:18:27 -04:00
Al
330002197a [fix] via in English is a stopword, not a street type 2015-08-18 16:00:48 -04:00
Al
c09cb4dd82 [osm] OSM untagged formatted addresses now use the new language labeling scheme 2015-08-18 15:13:10 -04:00
Al
3daba2ddcd [fix] removing debug print 2015-08-18 13:22:48 -04:00
Al
089a197155 [dictionaries] Updates to Galician and Catalan where they overlap with Spanish 2015-08-18 13:14:21 -04:00
Al
faf3435ffc [fix] English dictionaries 2015-08-18 12:40:09 -04:00
Al
9183ba4e01 [dictionaries] Accented Gran Via for Catalan 2015-08-18 12:39:40 -04:00
Al
07b43e524e [dictionaries] A few more Catalan terms that are the same as in Spanish 2015-08-18 12:23:11 -04:00
Al
ffe76f0403 [languages/osm] Checking for existence of separable prefix/suffix in the given dictionaries 2015-08-18 12:10:06 -04:00
Al
3b55b51ef1 [fix] English dictionary 2015-08-18 11:34:18 -04:00
Al
0e00625dbd [languages/osm] Adding a primitive phrase dictionary to the OSM training data construction script and a few heuristics to help disambiguate in the case of small local language groups that may not be specified with name:lang tags e.g. Occitan, Catalan, Basque, Galician, etc. Also throwing away ambiguous multilanguage names 2015-08-18 11:12:27 -04:00
Al
fb7f2999e5 [dictionaries] Moving a few terms in German dictionaries 2015-08-18 11:06:53 -04:00
Al
c5d14e9c4d [dictionaries] A few new terms in Dutch dictionaries to help distinguish from German 2015-08-18 11:06:10 -04:00
Al
4d115fdd88 [dictionaries] Better categorization of French dictionaries 2015-08-18 11:05:39 -04:00
Al
0f883a8872 [dictionaries] A few English dictionary terms that came up in language detection tests 2015-08-18 11:04:53 -04:00
Al
db7ffa7cab [dictionaries] Updating Catalan dictionaries with place types to help distinguish from Spanish 2015-08-18 11:03:44 -04:00
Al
a1d8d3bf5f [dictionaries] Fixes to Spanish dictionaries 2015-08-18 11:03:01 -04:00
Al
b72d9af7dc [fix] items 2015-08-18 04:17:34 -04:00
Al
f3bb3c8356 [fix] getter 2015-08-18 04:13:19 -04:00
Al
ebd5e96bd7 [fix] name 2015-08-18 04:05:04 -04:00
Al
b5be1e8df5 [fix] var name 2015-08-18 03:56:23 -04:00
Al
e84f932042 [fix] language polys 2015-08-18 03:51:30 -04:00
Al
bada7fd13b [polygons] Changes to languages polygons to support new regional language handling 2015-08-18 03:27:11 -04:00
Al
d97c725bbc [languages] Allowing specification of multiple regional languages 2015-08-18 03:18:52 -04:00
Al
b8fbbb1917 [languages] Removing the Belarusian override as Russian appears to be used often in street signs and there are generally good name:ru/name:be tags 2015-08-17 04:20:39 -04:00
Al
453aa7c633 [dictionaries] Adding French as an equally likely language for Guernsey, which will effectively exclude it from the language training data (doesn't matter since there are already enough English/French addresses). 2015-08-17 02:04:29 -04:00
Al
89071ea21a [osm] Omitting country in limited address data set (often abbreviated, doesn't convey language as well) 2015-08-15 03:25:45 -04:00
Al
c505260912 [fix] var name 2015-08-15 02:47:31 -04:00
Al
548ce79b99 [fix] street addresses by language 2015-08-15 02:44:04 -04:00