Commit Graph

2033 Commits

Author SHA1 Message Date
Al
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner 2015-09-23 00:33:14 -04:00
Al
a76831df7a [unicode] Wide version of word breaks 2015-09-22 18:55:33 -04:00
Al
25917cfb17 [fix] scripts 2015-09-22 15:15:30 -04:00
Al
b405a53fe1 [fix] chars out of range in get_string_script Python version 2015-09-22 08:14:27 -04:00
Al
ca25b48687 [fix] Not writing empty fields in formatted addresses 2015-09-22 08:13:55 -04:00
Al
747de1944b [fix] Accounting for unknown scripts in disambiguation 2015-09-21 18:05:28 -04:00
Al
134cf616d6 [osm] Using street for language disambiguation in training data 2015-09-21 04:09:15 -04:00
Al
84cf21df88 [osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples 2015-09-20 20:05:46 -04:00
Al
6731395ca0 [osm] Separating tagged from untagged output 2015-09-19 14:11:47 -04:00
Al
35f1c02caf [polygons] Reducing simplify tolerance for language polys now that regional languages are handled separately 2015-09-10 12:44:13 -07:00
Al
440a8158b6 [polygons] Adding in country languages for regional polygons without a default language 2015-09-10 12:34:26 -07:00
Al
fca7f21b1d [polygons] Making simplify_tolerance and preserve_topology for polygon simplification configurable per class 2015-09-10 11:06:18 -07:00
Al
b85fe50fad [osm] Training data for toponyms only cares about valid languages for name field 2015-09-08 16:38:05 -07:00
Al
e566063343 [osm] Doing an all-to-nodes conversion and an additional filter on the borders data set 2015-09-08 09:18:08 -07:00
Al
8525529968 [osm] Not requiring qualified name tags to process OSM toponyms 2015-09-06 21:03:01 -07:00
Al
df20e2cbc0 [osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language 2015-09-04 14:13:33 -04:00
Al
17fcfa8b59 [fix] adding house to ignore keys rather than aliasing it 2015-09-04 12:40:08 -04:00
Al
d64a27bc57 [osm] Converting relations to nodes in borders training data 2015-09-04 12:32:25 -04:00
Al
168b7f59da [fix] default indices in strip_component 2015-09-04 12:29:47 -04:00
Al
64db63e3eb [osm] Removing house tag 2015-09-04 12:23:47 -04:00
Al
6a20ce5e85 [language_id] Adding formatted addresses and toponyms to language training data 2015-09-04 01:46:49 -04:00
Al
4ebdca0ea7 [fix] var 2015-09-03 21:01:20 -04:00
Al
8345afbcd0 [fix] exclude country toponyms where the default languages is well represented 2015-09-03 20:56:58 -04:00
Al
20bb191624 [fix] chaining 2015-09-03 20:52:00 -04:00
Al
e7cf5000fe [fix] Exclude polygons with > 1 regional language 2015-09-03 20:48:04 -04:00
Al
9a9530c1b9 [fix] unqualified names 2015-09-03 20:37:22 -04:00
Al
a5fdd911d8 [fix] only use name key for default names 2015-09-03 20:35:08 -04:00
Al
d8e1432533 [osm] Adding unqualified names in single-language countries 2015-09-03 20:31:49 -04:00
Al
b15d2d70aa [fix] top language 2015-09-03 20:09:46 -04:00
Al
44bf94a158 [osm] Better borders training data set (only need the metadata, not the polygons) 2015-09-03 20:09:03 -04:00
Al
55af9b0a0c [fix] OSM address tagged training data formatting 2015-09-03 18:35:19 -04:00
Al
c6bfc0e021 [osm] Postponing punctuation stripping until after address template rendering 2015-09-03 18:13:41 -04:00
Al
d54fb25e45 [osm] don't bother with the R-tree check if there are no name:* tags in border data set 2015-09-03 17:54:40 -04:00
Al
33af61095b [fix] var 2015-09-03 17:49:52 -04:00
Al
294101ad80 [osm] Treating components that are all punctuation as blank in address parsing (e.g. a single comma) 2015-09-03 17:46:57 -04:00
Al
e1e5c16637 [osm] Not adding unqualified name tags to toponym data set, throwing out a few cases of language ambiguity 2015-09-03 16:50:30 -04:00
Al
040a26a6f2 [fix] import 2015-09-03 13:54:23 -04:00
Al
7787427c58 [fix] typo 2015-09-03 13:53:18 -04:00
Al
23633e95dd [osm] Only adding country default language toponyms to training data 2015-09-03 13:44:41 -04:00
Al
11c01f64d2 [osm] OrderedDict of attrs in OSM training data 2015-09-03 11:11:18 -04:00
Al
27eb4e4aed [osm] Adding a toponym language training set using planet-borders.osm (all admin borders) 2015-09-03 10:19:11 -04:00
Al
db57855c95 [osm] Switching formatter repo to the OpenVenues fork, with fixes and several dozen new countries added 2015-09-03 10:06:54 -04:00
Al
a916668f28 [i18n] Local file for ISO 15924 2015-09-01 23:58:36 -04:00
Al
a2ec8001b0 [osm] Removing postal code keys in formatted language training data 2015-08-24 14:08:36 -04:00
Al
8bbcb60aee [languages] Moving search_suffix and search_prefix into methods 2015-08-24 14:04:36 -04:00
Al
c68f56e61d [fix] paths 2015-08-24 12:58:27 -04:00
Al
d620cb6fc3 [fix] Calculating splits in Python rather than bash 2015-08-24 12:47:51 -04:00
Al
c754d275af [fix] str 2015-08-24 12:24:55 -04:00
Al
96cb289b79 [languages] Script to create language training/cross-validation/test data splits 2015-08-24 12:18:23 -04:00
Al
fa7b855ecb [languages] Earlier exit on finding ambiguous script spans 2015-08-24 03:07:57 -04:00