d14be57e73[dictionaries] Adding exit as an English street type
Al
2015-08-23 22:51:22 -04:00
7053c6b60b[fix] language disambiguation
Al
2015-08-23 22:50:24 -04:00
e26776a5e9[dictionaries] Occitan stopwords for disambiguating from French
Al
2015-08-23 16:35:46 -04:00
f6d84531bc[languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling
Al
2015-08-23 16:34:10 -04:00
b8e4c19146[mv] Moving the get regional/country languages logic out of language polygons
Al
2015-08-23 14:25:33 -04:00
43178747f8[languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation
Al
2015-08-23 04:28:19 -04:00
d8763e9d6c[languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity
Al
2015-08-23 03:42:13 -04:00
9c176961ff[dictionaries] Norwegian street types from the suffix dictionary
Al
2015-08-23 02:32:31 -04:00
122a81b610[languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib
Al
2015-08-23 02:24:32 -04:00
a419dad630[languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries
Al
2015-08-23 00:43:37 -04:00
a7d9cc1782[fix] No longer using abbreviations for default languages, can be stopwords, etc.
Al
2015-08-22 23:34:15 -04:00
0701bb6f08[fix] import
Al
2015-08-22 23:19:43 -04:00
723058886a[languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals
Al
2015-08-22 21:13:07 -04:00
6231e17f2b[languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages
Al
2015-08-22 20:26:39 -04:00
bf829f7cb6[polygons] Adding a main to generate language polygons
Al
2015-08-22 17:45:00 -04:00
5c15c4a99f[languages] Adding non-default Spanish and French gazetteers to the US, and giving the country of Jersey shared English/French defaults instead of just English
Al
2015-08-22 15:21:04 -04:00
e70c2453ee[fix] import
Al
2015-08-22 15:04:30 -04:00
3902715258[osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases
Al
2015-08-22 14:11:44 -04:00
f6e521e3f3[geonames] Adding covering index to geonames DB
Al
2015-08-22 13:54:25 -04:00
bd31dc99f2[mv] csv_utils
Al
2015-08-22 13:53:44 -04:00
cc43409b72[languages] Adding English gazetteers to many countries where the default language is Arabic but the road signs may be in English
Al
2015-08-22 13:42:31 -04:00
c5a9c392d4[languages] Refactoring street_types_gazetteer a bit so dictionaries are configurable
Al
2015-08-21 08:26:47 -04:00
baa60aab65[fix] language disambiguation module
Al
2015-08-21 08:03:20 -04:00
4976be64e5[fix] var name
Al
2015-08-21 08:02:26 -04:00
8e56568cab[fix] typo
Al
2015-08-21 08:01:49 -04:00
ca6d802a43[languages] Moving language id methods into a separate package
Al
2015-08-21 08:00:56 -04:00
9d2f7e4bd1[fix] var name
Al
2015-08-18 16:20:12 -04:00
0528d1b578[osm] OSM untagged formatted addresses try to use language namespaced tags
Al
2015-08-18 16:18:27 -04:00
330002197a[fix] via in English is a stopword, not a street type
Al
2015-08-18 16:00:48 -04:00
c09cb4dd82[osm] OSM untagged formatted addresses now use the new language labeling scheme
Al
2015-08-18 15:12:54 -04:00
3daba2ddcd[fix] removing debug print
Al
2015-08-18 13:22:44 -04:00
089a197155[dictionaries] Updates to Galician and Catalan where they overlap with Spanish
Al
2015-08-18 13:09:05 -04:00
faf3435ffc[fix] English dictionaries
Al
2015-08-18 12:40:09 -04:00
9183ba4e01[dictionaries] Accented Gran Via for Catalan
Al
2015-08-18 12:39:40 -04:00
07b43e524e[dictionaries] A few more Catalan terms that are the same as in Spanish
Al
2015-08-18 12:23:11 -04:00
ffe76f0403[languages/osm] Checking for existence of separable prefix/suffix in the given dictionaries
Al
2015-08-18 12:10:06 -04:00
3b55b51ef1[fix] English dictionary
Al
2015-08-18 11:34:18 -04:00
0e00625dbd[languages/osm] Adding a primitive phrase dictionary to the OSM training data construction script and a few heuristics to help disambiguate in the case of small local language groups that may not be specified with name:lang tags e.g. Occitan, Catalan, Basque, Galician, etc. Also throwing away ambiguous multilanguage names
Al
2015-08-18 11:12:27 -04:00
fb7f2999e5[dictionaries] Moving a few terms in German dictionaries
Al
2015-08-18 11:06:53 -04:00
c5d14e9c4d[dictionaries] A few new terms in Dutch dictionaries to help distinguish from German
Al
2015-08-18 11:06:10 -04:00
4d115fdd88[dictionaries] Better categorization of French dictionaries
Al
2015-08-18 11:05:39 -04:00
0f883a8872[dictionaries] A few English dictionary terms that came up in language detection tests
Al
2015-08-18 11:04:50 -04:00
db7ffa7cab[dictionaries] Updating Catalan dictionaries with place types to help distinguish from Spanish
Al
2015-08-18 11:03:44 -04:00
a1d8d3bf5f[dictionaries] Fixes to Spanish dictionaries
Al
2015-08-18 11:03:01 -04:00
b72d9af7dc[fix] items
Al
2015-08-18 04:17:34 -04:00
f3bb3c8356[fix] getter
Al
2015-08-18 04:13:19 -04:00
ebd5e96bd7[fix] name
Al
2015-08-18 04:05:04 -04:00
b5be1e8df5[fix] var name
Al
2015-08-18 03:56:23 -04:00
e84f932042[fix] language polys
Al
2015-08-18 03:51:30 -04:00
bada7fd13b[polygons] Changes to languages polygons to support new regional language handling
Al
2015-08-18 03:27:11 -04:00
d97c725bbc[languages] Allowing specification of multiple regional languages
Al
2015-08-18 03:18:52 -04:00
b8fbbb1917[languages] Removing the Belarusian override as Russian appears to be used often in street signs and there are generally good name:ru/name:be tags
Al
2015-08-17 04:20:39 -04:00
453aa7c633[dictionaries] Adding French as an equally likely language for Guernsey, which will effectively exclude it from the language training data (doesn't matter since there are already enough English/French addresses).
Al
2015-08-17 02:04:25 -04:00
89071ea21a[osm] Omitting country in limited address data set (often abbreviated, doesn't convey language as well)
Al
2015-08-15 03:25:45 -04:00
c505260912[fix] var name
Al
2015-08-15 02:47:31 -04:00
548ce79b99[fix] street addresses by language
Al
2015-08-15 02:44:04 -04:00
74a751ce0a[osm] Adding a new OSM training data option for writing out full formatted addresses without place names
Al
2015-08-15 02:39:49 -04:00
133ce9e5b1[languages] Bonaire admin1 as well as country code
Al
2015-08-14 21:42:13 -04:00
05b8f555d5[fix] language polygon index
Al
2015-08-14 21:22:15 -04:00
0e92abd53e[osm] Adding building tag to venues training set construction
Al
2015-08-14 21:07:07 -04:00
191c0e3ce5[languages] Changing Bonaire's default road sign language to Papiamento to help distinguish from Dutch
Al
2015-08-14 21:06:16 -04:00
cad1f95bbb[osm] Making minimal_only the default in formatted addresses, expanding list of acceptable combinations of address fields
Al
2015-08-14 10:21:12 -04:00
1e936ac9dc[fix] road+house_number as minimal keys for formatting addresses
Al
2015-08-14 04:09:51 -04:00
83bbd67c9c[fix] param
Al
2015-08-14 00:57:17 -04:00
e993ddcb51[fix] splitter
Al
2015-08-14 00:54:06 -04:00
dc2766ae5d[fix] __init__
Al
2015-08-14 00:49:06 -04:00
62c67aa970[osm] Using pipe splitter for address components
Al
2015-08-14 00:45:49 -04:00
2bd763be03[osm] Prefer amenity tag, skip if the building tag is simply building=yes
Al
2015-08-13 21:16:34 -04:00
c844d0484a[fix] carriage returns
Al
2015-08-13 21:07:12 -04:00
ef14aa2b7e[osm] Replacing escape chars at write time as there's no quoting, adding building key to venue training data
Al
2015-08-13 19:30:39 -04:00
9125f07af0[polygons] Separating out simplify polygon into a method in RTree index
Al
2015-08-13 18:43:32 -04:00
46f2c68a69[osm] Using tsv_no_quote writers in all OSM training data files
Al
2015-08-13 18:40:41 -04:00
9464670174[scripts] Regenerating unicode_scripts_data file
Al
2015-08-13 18:27:23 -04:00
88d63c85d2[utils] no-quote CSV dialect
Al
2015-08-13 18:26:51 -04:00
03febc7e20[scripts] Better script code aliasing
Al
2015-08-13 18:25:55 -04:00
b54ff95ecc[mv] csv_utils
Al
2015-08-13 18:18:56 -04:00
66a71ab70d[normalize] Need to do a Latin-ASCII transliteration even if the string is entirely ASCII since it may contain HTML escapes
Al
2015-08-11 23:36:08 -04:00
87b275fcab[transliteration] Regenerating transliteration data file
Al
2015-08-11 23:11:17 -04:00
cf70615850[transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps
Al
2015-08-11 23:10:55 -04:00
9712e0fa87[fix] phrase start in transliteration
Al
2015-08-11 23:09:49 -04:00
562a7c243d[phrases] Fixing tail searches in trie_get_prefix*
Al
2015-08-11 23:08:21 -04:00
51addec5f2[fix] check for local CLDR in unicode properties
Al
2015-08-11 20:23:48 -04:00
882e4c2ab8[fix] ensure CLDR dir
Al
2015-08-11 20:04:42 -04:00
48566bf097[fix] cldr languages dir
Al
2015-08-11 20:04:25 -04:00
e98a822661[build] Order-only dependencies for downloading data files, rm-ing the tarball when done extracting
Al
2015-08-11 12:59:37 -04:00
0028c2bc53[build] Fixing tarball uploading
Al
2015-08-11 03:18:35 -04:00
f21b767696[build] Adding tarball back to pkgdata
Al
2015-08-10 18:44:40 -04:00
c29cf5ac9a[api] Better handling of strings with multiple scripts and strings that use more than one transliterator. Reducing complexity/allocations
Al
2015-08-10 17:51:41 -04:00
4bc6adf669[normalize] Adding the original script as an alternative in transliteration mode as well
Al
2015-08-10 17:48:48 -04:00
a13e5117b5[utils] string_tree_num_strings method
Al
2015-08-10 17:46:37 -04:00
219947722d[cli] delete_word_hyphens as a default option
Al
2015-08-10 16:19:54 -04:00
78a80dd86e[api] Add separable or inseparable non-canonical string affixes (e.g. foobg. => fooburg, foostrasse => foostraße|foo straße, l'ensemble => l' ensemble, etc.) in expand_address
Al
2015-08-10 16:19:03 -04:00
de5d6945b5[expansion] Adding search_address_dictionaries_prefix/suffix for concatenated prefixes/suffixes e.g. in Germanic languages. Adding a flag to the address_expansion struct and trie value to denote separability, adding prefix/suffix keys during dictionary creation
Al
2015-08-10 16:15:01 -04:00
0f77ca1213[normalize] Adding a char_array version of normalize token
Al
2015-08-10 16:11:31 -04:00
064b6b5898[utils] char_array_append_reversed for adding reversed strings without a malloc
Al
2015-08-10 16:10:05 -04:00
dab181a4d7[fix] Only the exact TRIE_PREFIX_CHAR/TRIE_SUFFIX_CHAR characters are disallowed as keys
Al
2015-08-10 16:09:10 -04:00
e511eede74[phrases] Prefix/suffix trie search using the new characters, fixing length of matched prefixes/suffixes and exiting early on falling off the trie
Al
2015-08-10 16:02:38 -04:00
51572d6575[phrases] Changing prefix/suffix chars so both are control characters and neither is the NUL-byte. Modifying transliteration special characters accordingly
Al
2015-08-10 16:01:22 -04:00
11a9881988[phrases] adding _from_index_get_prefix_char/_from_index_get_suffix_char methods
Al
2015-08-09 03:38:28 -04:00
2eb67ad850[phrases] trie_search_prefixes/trie_search_suffixes now take a length param
Al
2015-08-09 02:01:37 -04:00