Commit Graph

  • d14be57e73 [dictionaries] Adding exit as an English street type Al 2015-08-23 22:51:22 -04:00
  • 7053c6b60b [fix] language disambiguation Al 2015-08-23 22:50:24 -04:00
  • e26776a5e9 [dictionaries] Occitan stopwords for disambiguating from French Al 2015-08-23 16:35:46 -04:00
  • f6d84531bc [languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling Al 2015-08-23 16:34:10 -04:00
  • b8e4c19146 [mv] Moving the get regional/country languages logic out of language polygons Al 2015-08-23 14:25:33 -04:00
  • 43178747f8 [languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation Al 2015-08-23 04:28:19 -04:00
  • d8763e9d6c [languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity Al 2015-08-23 03:42:13 -04:00
  • 9c176961ff [dictionaries] Norwegian street types from the suffix dictionary Al 2015-08-23 02:32:31 -04:00
  • 122a81b610 [languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib Al 2015-08-23 02:24:32 -04:00
  • a419dad630 [languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries Al 2015-08-23 00:43:37 -04:00
  • a7d9cc1782 [fix] No longer using abbreviations for default languages, can be stopwords, etc. Al 2015-08-22 23:34:15 -04:00
  • 0701bb6f08 [fix] import Al 2015-08-22 23:19:43 -04:00
  • 723058886a [languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals Al 2015-08-22 21:13:07 -04:00
  • 6231e17f2b [languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages Al 2015-08-22 20:26:39 -04:00
  • bf829f7cb6 [polygons] Adding a main to generate language polygons Al 2015-08-22 17:45:00 -04:00
  • 5c15c4a99f [languages] Adding non-default Spanish and French gazetteers to the US, and giving the country of Jersey shared English/French defaults instead of just English Al 2015-08-22 15:21:04 -04:00
  • e70c2453ee [fix] import Al 2015-08-22 15:04:30 -04:00
  • 3902715258 [osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases Al 2015-08-22 14:11:44 -04:00
  • f6e521e3f3 [geonames] Adding covering index to geonames DB Al 2015-08-22 13:54:25 -04:00
  • bd31dc99f2 [mv] csv_utils Al 2015-08-22 13:53:44 -04:00
  • cc43409b72 [languages] Adding English gazetteers to many countries where the default language is Arabic but the road signs may be in English Al 2015-08-22 13:42:31 -04:00
  • c5a9c392d4 [languages] Refactoring street_types_gazetteer a bit so dictionaries are configurable Al 2015-08-21 08:26:47 -04:00
  • baa60aab65 [fix] language disambiguation module Al 2015-08-21 08:03:20 -04:00
  • 4976be64e5 [fix] var name Al 2015-08-21 08:02:26 -04:00
  • 8e56568cab [fix] typo Al 2015-08-21 08:01:49 -04:00
  • ca6d802a43 [languages] Moving language id methods into a separate package Al 2015-08-21 08:00:56 -04:00
  • 9d2f7e4bd1 [fix] var name Al 2015-08-18 16:20:12 -04:00
  • 0528d1b578 [osm] OSM untagged formatted addresses try to use language namespaced tags Al 2015-08-18 16:18:27 -04:00
  • 330002197a [fix] via in English is a stopword, not a street type Al 2015-08-18 16:00:48 -04:00
  • c09cb4dd82 [osm] OSM untagged formatted addresses now use the new language labeling scheme Al 2015-08-18 15:12:54 -04:00
  • 3daba2ddcd [fix] removing debug print Al 2015-08-18 13:22:44 -04:00
  • 089a197155 [dictionaries] Updates to Galician and Catalan where they overlap with Spanish Al 2015-08-18 13:09:05 -04:00
  • faf3435ffc [fix] English dictionaries Al 2015-08-18 12:40:09 -04:00
  • 9183ba4e01 [dictionaries] Accented Gran Via for Catalan Al 2015-08-18 12:39:40 -04:00
  • 07b43e524e [dictionaries] A few more Catalan terms that are the same as in Spanish Al 2015-08-18 12:23:11 -04:00
  • ffe76f0403 [languages/osm] Checking for existence of separable prefix/suffix in the given dictionaries Al 2015-08-18 12:10:06 -04:00
  • 3b55b51ef1 [fix] English dictionary Al 2015-08-18 11:34:18 -04:00
  • 0e00625dbd [languages/osm] Adding a primitive phrase dictionary to the OSM training data construction script and a few heuristics to help disambiguate in the case of small local language groups that may not be specified with name:lang tags e.g. Occitan, Catalan, Basque, Galician, etc. Also throwing away ambiguous multilanguage names Al 2015-08-18 11:12:27 -04:00
  • fb7f2999e5 [dictionaries] Moving a few terms in German dictionaries Al 2015-08-18 11:06:53 -04:00
  • c5d14e9c4d [dictionaries] A few new terms in Dutch dictionaries to help distinguish from German Al 2015-08-18 11:06:10 -04:00
  • 4d115fdd88 [dictionaries] Better categorization of French dictionaries Al 2015-08-18 11:05:39 -04:00
  • 0f883a8872 [dictionaries] A few English dictionary terms that came up in language detection tests Al 2015-08-18 11:04:50 -04:00
  • db7ffa7cab [dictionaries] Updating Catalan dictionaries with place types to help distinguish from Spanish Al 2015-08-18 11:03:44 -04:00
  • a1d8d3bf5f [dictionaries] Fixes to Spanish dictionaries Al 2015-08-18 11:03:01 -04:00
  • b72d9af7dc [fix] items Al 2015-08-18 04:17:34 -04:00
  • f3bb3c8356 [fix] getter Al 2015-08-18 04:13:19 -04:00
  • ebd5e96bd7 [fix] name Al 2015-08-18 04:05:04 -04:00
  • b5be1e8df5 [fix] var name Al 2015-08-18 03:56:23 -04:00
  • e84f932042 [fix] language polys Al 2015-08-18 03:51:30 -04:00
  • bada7fd13b [polygons] Changes to languages polygons to support new regional language handling Al 2015-08-18 03:27:11 -04:00
  • d97c725bbc [languages] Allowing specification of multiple regional languages Al 2015-08-18 03:18:52 -04:00
  • b8fbbb1917 [languages] Removing the Belarusian override as Russian appears to be used often in street signs and there are generally good name:ru/name:be tags Al 2015-08-17 04:20:39 -04:00
  • 453aa7c633 [dictionaries] Adding French as equally likely language for Guernesey, which will effectively exclude it from the language training data (doesn't matter since there's already enough English/French addresses). Al 2015-08-17 02:04:25 -04:00
  • 89071ea21a [osm] Omitting country in limited address data set (often abbreviated, doesn't convey language as well) Al 2015-08-15 03:25:45 -04:00
  • c505260912 [fix] var name Al 2015-08-15 02:47:31 -04:00
  • 548ce79b99 [fix] street addresses by language Al 2015-08-15 02:44:04 -04:00
  • 74a751ce0a [osm] Adding a new OSM training data option for writing out full formatted addresses without place names Al 2015-08-15 02:39:49 -04:00
  • 133ce9e5b1 [languages] Bonaire admin1 as well as country code Al 2015-08-14 21:42:13 -04:00
  • 05b8f555d5 [fix] language polygon index Al 2015-08-14 21:22:15 -04:00
  • 0e92abd53e [osm] Adding building tag to venues training set construction Al 2015-08-14 21:07:07 -04:00
  • 191c0e3ce5 [languages] Changing Bonaire's default road sign language to Papiamento to help distinguish from Dutch Al 2015-08-14 21:06:16 -04:00
  • cad1f95bbb [osm] Making minimal_only the default in formatted addresses, expanding list of acceptable combinations of address fields Al 2015-08-14 10:21:12 -04:00
  • 1e936ac9dc [fix] road+house_number as minimal keys for formatting addresses Al 2015-08-14 04:09:51 -04:00
  • 83bbd67c9c [fix] param Al 2015-08-14 00:57:17 -04:00
  • e993ddcb51 [fix] splitter Al 2015-08-14 00:54:06 -04:00
  • dc2766ae5d [fix] __init__ Al 2015-08-14 00:49:06 -04:00
  • 62c67aa970 [osm] Using pipe splitter for address components Al 2015-08-14 00:45:49 -04:00
  • 2bd763be03 [osm] Prefer amenity tag, skip if the building tag is simply building=yes Al 2015-08-13 21:16:34 -04:00
  • c844d0484a [fix] carriage returns Al 2015-08-13 21:07:12 -04:00
  • ef14aa2b7e [osm] Replacing escape chars at write time as there's no quoting, adding building key to venue training data Al 2015-08-13 19:30:39 -04:00
  • 9125f07af0 [polygons] Separating out simplify polygon into a method in RTree index Al 2015-08-13 18:43:32 -04:00
  • 46f2c68a69 [osm] Using tsv_no_quote writers in all OSM training data files Al 2015-08-13 18:40:41 -04:00
  • 9464670174 [scripts] Regenerating unicode_scripts_data file Al 2015-08-13 18:27:23 -04:00
  • 88d63c85d2 [utils] no-quote CSV dialect Al 2015-08-13 18:26:51 -04:00
  • 03febc7e20 [scripts] Better script code aliasing Al 2015-08-13 18:25:55 -04:00
  • b54ff95ecc [mv] csv_utils Al 2015-08-13 18:18:56 -04:00
  • 66a71ab70d [normalize] Need to do a Latin-ASCII transliteration even if the string is entirely ASCII since it may contain HTML escapes Al 2015-08-11 23:36:08 -04:00
  • 87b275fcab [transliteration] Regenerating transliteration data file Al 2015-08-11 23:11:17 -04:00
  • cf70615850 [transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps Al 2015-08-11 23:10:55 -04:00
  • 9712e0fa87 [fix] phrase start in transliteration Al 2015-08-11 23:09:49 -04:00
  • 562a7c243d [phrases] Fixing tail searches in trie_get_prefix* Al 2015-08-11 23:08:21 -04:00
  • 51addec5f2 [fix] check for local CLDR in unicode properties Al 2015-08-11 20:23:48 -04:00
  • 882e4c2ab8 [fix] ensure CLDR dir Al 2015-08-11 20:04:42 -04:00
  • 48566bf097 [fix] cldr languages dir Al 2015-08-11 20:04:25 -04:00
  • e98a822661 [build] Order-only dependencies for downloading data files, rm-ing the tarball when done extracting Al 2015-08-11 12:59:37 -04:00
  • 0028c2bc53 [build] Fixing tarball uploading Al 2015-08-11 03:18:35 -04:00
  • f21b767696 [build] Adding tarball back to pkgdata Al 2015-08-10 18:44:40 -04:00
  • c29cf5ac9a [api] Better handling of strings with multiple scripts and strings that use more than one transliterator. Reducing complexity/allocations Al 2015-08-10 17:51:41 -04:00
  • 4bc6adf669 [normalize] Adding the original script as an alternative in transliteration mode as well Al 2015-08-10 17:48:48 -04:00
  • a13e5117b5 [utils] string_tree_num_strings method Al 2015-08-10 17:46:37 -04:00
  • 219947722d [cli] delete_word_hyphens as a default option Al 2015-08-10 16:19:54 -04:00
  • 78a80dd86e [api] Add separable or inseparable non-canonical string affixes (e.g. foobg. => fooburg, foostrasse => foostraße|foo straße, l'ensemble => l' ensemble, etc.) in expand_address Al 2015-08-10 16:19:03 -04:00
  • de5d6945b5 [expansion] Adding search_address_dictionaries_prefix/suffix for concatenated prefixes/suffixes e.g. in Germanic languages. Adding a flag to the address_expansion struct and trie value to denote separability, adding prefix/suffix keys during dictionary creation Al 2015-08-10 16:15:01 -04:00
  • 0f77ca1213 [normalize] Adding a char_array version of normalize token Al 2015-08-10 16:11:31 -04:00
  • 064b6b5898 [utils] char_array_append_reversed for adding reversed strings without a malloc Al 2015-08-10 16:10:05 -04:00
  • dab181a4d7 [fix] Only the exact TRIE_PREFIX_CHAR/TRIE_SUFFIX_CHAR characters are disallowed as keys Al 2015-08-10 16:09:10 -04:00
  • e511eede74 [phrases] Prefix/suffix trie search using the new characters, fixing length of matched prefixes/suffixes and exiting early on falling off the trie Al 2015-08-10 16:02:38 -04:00
  • 51572d6575 [phrases] Changing prefix/suffix chars so both are control characters and neither is the NUL-byte. Modifying transliteration special characters accordingly Al 2015-08-10 16:01:22 -04:00
  • 11a9881988 [phrases] adding _from_index_get_prefix_char/_from_index_get_suffix_char methods Al 2015-08-09 03:38:28 -04:00
  • 2eb67ad850 [phrases] trie_search_prefixes/trie_search_suffixes now take a length param Al 2015-08-09 02:01:37 -04:00