Commit Graph

1853 Commits

Author SHA1 Message Date
Al
22e8178a97 [countries] Adding module for getting official country names in every language from CLDR + a dictionary of local language names 2015-09-29 21:10:38 -04:00
Al
daad1a1313 [geonames] Removing alternate names from geonames data set which are digits-only (most are not legitimate) 2015-09-28 17:46:53 -04:00
Al
f29f2f091b [fix] PEBCAK 2015-09-27 22:49:27 -04:00
Al
93b3110a49 [fix] only commas and hyphens need to be eliminated at the end of phrases in untagged address formatting 2015-09-27 19:25:34 -04:00
Al
d3bfaf6b43 [osm/formatting] Fixing formatting tagged addresses with comma separated fields 2015-09-27 03:19:23 -04:00
Al
d512201e2c [fix] removing space from tokens in address formatting 2015-09-27 02:18:34 -04:00
Al
5b829cd5a7 [fix] blank values containing punctuation in formatting 2015-09-26 21:49:28 -04:00
Al
dac0440be8 [fix] rsplit 2015-09-26 21:07:54 -04:00
Al
ae93552455 [osm/formatting] Moving back to openvenues repo pending resolution of the Turkish address issue 2015-09-26 03:56:52 -04:00
Al
0c792a2cc3 [osm/formatting] Changing the way the formatter eliminates inter-component separators, changing repo back to OpenCageData after pull request merge 2015-09-26 03:21:26 -04:00
Al
5417b4e602 [unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories 2015-09-25 23:59:38 -04:00
Al
8fe791a14a [fix] ensure_dir in file downloads 2015-09-25 17:05:22 -04:00
Al
646b9f7248 [osm/formatting] Continuing to use openvenues formatter for the India fix 2015-09-25 13:36:24 -04:00
Al
9901dd2aac [fix] Switching address formatter back to OpenCageData repo 2015-09-24 18:42:17 -04:00
Al
3ce1669c30 [fix] import 2015-09-24 01:25:00 -04:00
Al
c85ce0b11d [osm/formatting] Tagging separators as well in tagged output of the address formatter 2015-09-24 01:22:49 -04:00
Al
abfb1d4a60 [transliteration] Wide char support in transliteration data generator 2015-09-23 03:56:12 -04:00
Al
7e057b0fb8 [utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration) 2015-09-23 00:42:54 -04:00
Al
8562c7a5cb [unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren. 2015-09-23 00:37:59 -04:00
Al
13bcc35523 [unicode] Allowing wide chars in unicode properties 2015-09-23 00:34:07 -04:00
Al
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner 2015-09-23 00:33:14 -04:00
Al
a76831df7a [unicode] Wide version of word breaks 2015-09-22 18:55:33 -04:00
Al
25917cfb17 [fix] scripts 2015-09-22 15:15:30 -04:00
Al
b405a53fe1 [fix] chars out of range in get_string_script Python version 2015-09-22 08:14:27 -04:00
Al
ca25b48687 [fix] Not writing empty fields in formatted addresses 2015-09-22 08:13:55 -04:00
Al
747de1944b [fix] Accounting for unknown scripts in disambiguation 2015-09-21 18:05:28 -04:00
Al
134cf616d6 [osm] Using street for language disambiguation in training data 2015-09-21 04:09:15 -04:00
Al
84cf21df88 [osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples 2015-09-20 20:05:46 -04:00
Al
6731395ca0 [osm] Separating tagged from untagged output 2015-09-19 14:11:47 -04:00
Al
35f1c02caf [polygons] Reducing simplify tolerance for language polys now that regional languages are handled separately 2015-09-10 12:44:13 -07:00
Al
440a8158b6 [polygons] Adding in country languages for regional polygons without a default language 2015-09-10 12:34:26 -07:00
Al
fca7f21b1d [polygons] Making simplify_tolerance and preserve_topology for polygon simplification configurable per class 2015-09-10 11:06:18 -07:00
Al
b85fe50fad [osm] Training data for toponyms only cares about valid languages for name field 2015-09-08 16:38:05 -07:00
Al
e566063343 [osm] Doing an all-to-nodes conversion and an additional filter on the borders data set 2015-09-08 09:18:08 -07:00
Al
8525529968 [osm] Not requiring qualified name tags to process OSM toponyms 2015-09-06 21:03:01 -07:00
Al
df20e2cbc0 [osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language 2015-09-04 14:13:33 -04:00
Al
17fcfa8b59 [fix] adding house to ignore keys rather than aliasing it 2015-09-04 12:40:08 -04:00
Al
d64a27bc57 [osm] Converting relations to nodes in borders training data 2015-09-04 12:32:25 -04:00
Al
168b7f59da [fix] default indices in strip_component 2015-09-04 12:29:47 -04:00
Al
64db63e3eb [osm] Removing house tag 2015-09-04 12:23:47 -04:00
Al
6a20ce5e85 [language_id] Adding formatted addresses and toponyms to language training data 2015-09-04 01:46:49 -04:00
Al
4ebdca0ea7 [fix] var 2015-09-03 21:01:20 -04:00
Al
8345afbcd0 [fix] exclude country toponyms where the default language is well represented 2015-09-03 20:56:58 -04:00
Al
20bb191624 [fix] chaining 2015-09-03 20:52:00 -04:00
Al
e7cf5000fe [fix] Exclude polygons with > 1 regional language 2015-09-03 20:48:04 -04:00
Al
9a9530c1b9 [fix] unqualified names 2015-09-03 20:37:22 -04:00
Al
a5fdd911d8 [fix] only use name key for default names 2015-09-03 20:35:08 -04:00
Al
d8e1432533 [osm] Adding unqualified names in single-language countries 2015-09-03 20:31:49 -04:00
Al
b15d2d70aa [fix] top language 2015-09-03 20:09:46 -04:00
Al
44bf94a158 [osm] Better borders training data set (only need the metadata, not the polygons) 2015-09-03 20:09:03 -04:00