Commit Graph

80 Commits

Author | SHA1 | Message | Date
Al | 1948aa87ea | [fix] typo | 2015-10-03 14:33:45 -04:00
Al | 22efce7337 | [osm/parsing] Randomly replacing country codes with local and foreign language expansions as well as randomly expanding state abbreviations to make parser more robust to different input | 2015-10-03 14:31:51 -04:00
Al | db71b65412 | [fix] checking validity of component combination | 2015-10-02 20:28:45 -04:00
Al | a2fd6e25f8 | [fix] import | 2015-10-02 20:25:48 -04:00
Al | 49abb70b59 | [fix] dictionary | 2015-10-02 20:24:21 -04:00
Al | 521f33d892 | [fix] bitset for address components, only looking at valid component keys | 2015-10-02 20:21:59 -04:00
Al | 528285f735 | [fix] only OSM tagged addresses need extra logic | 2015-10-02 20:18:30 -04:00
Al | 83aecb9f2c | [osm/parsing] Making tagged training data for address parser more robust to the types of partial input we see in geocoding by randomly eliminating components subject to some constraints (e.g. house number cannot be used without a street name) | 2015-10-02 19:54:28 -04:00
Al | ca25b48687 | [fix] Not writing empty fields in formatted addresses | 2015-09-22 08:13:55 -04:00
Al | 134cf616d6 | [osm] Using street for language disambiguation in training data | 2015-09-21 04:09:15 -04:00
Al | 84cf21df88 | [osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples | 2015-09-20 20:05:46 -04:00
Al | 6731395ca0 | [osm] Separating tagged from untagged output | 2015-09-19 14:11:47 -04:00
Al | b85fe50fad | [osm] Training data for toponyms only cares about valid languages for name field | 2015-09-08 16:38:05 -07:00
Al | 8525529968 | [osm] Not requiring qualified name tags to process OSM toponyms | 2015-09-06 21:03:01 -07:00
Al | df20e2cbc0 | [osm] Including toponyms in the training data for countries where the unqualified place names can be assumed to be examples of a given language | 2015-09-04 14:13:33 -04:00
Al | 17fcfa8b59 | [fix] adding house to ignore keys rather than aliasing it | 2015-09-04 12:40:08 -04:00
Al | 168b7f59da | [fix] default indices in strip_component | 2015-09-04 12:29:47 -04:00
Al | 64db63e3eb | [osm] Removing house tag | 2015-09-04 12:23:47 -04:00
Al | 4ebdca0ea7 | [fix] var | 2015-09-03 21:01:20 -04:00
Al | 8345afbcd0 | [fix] exclude country toponyms where the default languages is well represented | 2015-09-03 20:56:58 -04:00
Al | 20bb191624 | [fix] chaining | 2015-09-03 20:52:00 -04:00
Al | e7cf5000fe | [fix] Exclude polygons with > 1 regional language | 2015-09-03 20:48:04 -04:00
Al | 9a9530c1b9 | [fix] unqualified names | 2015-09-03 20:37:22 -04:00
Al | a5fdd911d8 | [fix] only use name key for default names | 2015-09-03 20:35:08 -04:00
Al | d8e1432533 | [osm] Adding unqualified names in single-language countries | 2015-09-03 20:31:49 -04:00
Al | b15d2d70aa | [fix] top language | 2015-09-03 20:09:46 -04:00
Al | 55af9b0a0c | [fix] OSM address tagged training data formatting | 2015-09-03 18:35:19 -04:00
Al | c6bfc0e021 | [osm] Postponing punctuation stripping until after address template rendering | 2015-09-03 18:13:41 -04:00
Al | d54fb25e45 | [osm] don't bother with the R-tree check if there are no name:* tags in border data set | 2015-09-03 17:54:40 -04:00
Al | 33af61095b | [fix] var | 2015-09-03 17:49:52 -04:00
Al | 294101ad80 | [osm] Treating components that are all punctuation as blank in address parsing (e.g. a single comma) | 2015-09-03 17:46:57 -04:00
Al | e1e5c16637 | [osm] Not adding unqualified name tags to toponym data set, throwing out a few cases of language ambiguity | 2015-09-03 16:50:30 -04:00
Al | 040a26a6f2 | [fix] import | 2015-09-03 13:54:23 -04:00
Al | 7787427c58 | [fix] typo | 2015-09-03 13:53:18 -04:00
Al | 23633e95dd | [osm] Only adding country default language toponyms to training data | 2015-09-03 13:44:41 -04:00
Al | 11c01f64d2 | [osm] OrderedDict of attrs in OSM training data | 2015-09-03 11:11:18 -04:00
Al | 27eb4e4aed | [osm] Adding a toponym language training set using planet-borders.osm (all admin borders) | 2015-09-03 10:19:11 -04:00
Al | db57855c95 | [osm] Switching formatter repo to the OpenVenues fork, with fixes and several dozen new countries added | 2015-09-03 10:06:54 -04:00
Al | a2ec8001b0 | [osm] Removing postal code keys in formatted language training data | 2015-08-24 14:08:36 -04:00
Al | e70c2453ee | [fix] import | 2015-08-22 15:04:30 -04:00
Al | 3902715258 | [osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases | 2015-08-22 14:11:49 -04:00
Al | 4976be64e5 | [fix] var name | 2015-08-21 08:02:26 -04:00
Al | 8e56568cab | [fix] typo | 2015-08-21 08:01:49 -04:00
Al | ca6d802a43 | [languages] Moving language id methods into a separate package | 2015-08-21 08:00:56 -04:00
Al | 9d2f7e4bd1 | [fix] var name | 2015-08-18 16:20:12 -04:00
Al | 0528d1b578 | [osm] OSM untagged formatted addresses try to use language namespaced tags | 2015-08-18 16:18:27 -04:00
Al | c09cb4dd82 | [osm] OSM untagged formatted addresses now use the new language labeling scheme | 2015-08-18 15:13:10 -04:00
Al | 3daba2ddcd | [fix] removing debug print | 2015-08-18 13:22:48 -04:00
Al | ffe76f0403 | [languages/osm] Checking for existence of separable prefix/suffix in the given dictionaries | 2015-08-18 12:10:06 -04:00
Al | 0e00625dbd | [languages/osm] Adding a primitive phrase dictionary to the OSM training data construction script and a few heuristics to help disambiguate in the case of small local language groups that may not be specified with name:lang tags e.g. Occitan, Catalan, Basque, Galician, etc. Also throwing away ambiguous multilanguage names | 2015-08-18 11:12:27 -04:00
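
Commits 83aecb9f2c, 521f33d892 and db71b65412 describe randomly eliminating address components subject to constraints, using a bitset over valid component keys and a validity check on the resulting combination. The log does not show the code itself, so the following is only a minimal sketch of that idea. The component names, bit assignments, the `REQUIRES_ANY` table, the drop probability and all function names are assumptions made for illustration; only the house number/street constraint comes from the commit message.

```python
import random

# Hypothetical component names and bit assignments, chosen for illustration.
COMPONENT_BITS = {
    'house_number': 1 << 0,
    'road': 1 << 1,
    'city': 1 << 2,
    'state': 1 << 3,
    'postcode': 1 << 4,
    'country': 1 << 5,
}

# Hypothetical constraint table: a component (key) may only appear if at
# least one of the components in the mask (value) is also present.
REQUIRES_ANY = {
    'house_number': COMPONENT_BITS['road'],
}


def component_bitset(components):
    """Pack the known component keys of an address into an integer bitset."""
    bits = 0
    for key in components:
        bits |= COMPONENT_BITS.get(key, 0)  # keys outside the table are ignored
    return bits


def is_valid_combination(bits):
    """Return True if bits is non-empty and satisfies every constraint."""
    if bits == 0:
        return False
    for key, required in REQUIRES_ANY.items():
        if bits & COMPONENT_BITS[key] and not bits & required:
            return False
    return True


def drop_components(components, drop_prob=0.3, rng=random, max_tries=10):
    """Randomly drop components, retrying until the kept subset is valid.

    Returns a copy of the input unchanged if no valid subset is drawn
    within max_tries attempts.
    """
    for _ in range(max_tries):
        kept = {k: v for k, v in components.items()
                if rng.random() >= drop_prob}
        if is_valid_combination(component_bitset(kept)):
            return kept
    return dict(components)
```

Under these assumptions, calling `drop_components({'house_number': '123', 'road': 'Main St', 'city': 'Brooklyn'})` returns a random non-empty subset of the input that never contains a house number without its road. Checking the constraint on an integer bitset rather than on the dictionary keeps the validity test to a few bitwise operations per candidate subset.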