4417 Commits

Author SHA1 Message Date
Al
05732f6718 [build] Makefile changes for new parser feature extraction 2016-12-29 02:39:29 -05:00
Al
091167ed3c [api] remove geodb from libpostal.c 2016-12-29 02:35:43 -05:00
Al
acd953ce51 [parser] first pass at new parser feature extraction
- removing geodb phrases
- use Latin-ASCII-simple transliteration (no umlauts, etc.)
- no digit normalization for admin component phrases and postcodes
- tag = START + word, special feature for first word in the sequence
- add the new admin boundary categories
- for hyphenated non-phrase words, add each sub-word
- for rare and unknown words, add ngram features of 3-6 characters with
  underscores to indicate beginnings and endings (similar to language
  classifier features)
- defines notion of "rare words" (known words with a frequency <= n where
  n > the unknown word threshold), so known words can share
  statistical strength with artificial and real unknown words
2016-12-29 02:17:35 -05:00
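The n-gram and rare-word items in the commit message above can be illustrated with a minimal sketch. This is not the libpostal implementation: the function names, the exact underscore convention (a leading underscore on an n-gram that starts the word, a trailing underscore on one that ends it), and the use of a strict "less than" comparison for the unknown-word threshold are all assumptions made for illustration.

```python
def ngram_features(word, min_n=3, max_n=6):
    """Character n-grams of length min_n..max_n for a single word.

    Assumed convention (hypothetical): "_" is prepended to an n-gram that
    starts at the beginning of the word and appended to one that ends at
    the end of the word.
    """
    features = []
    for n in range(min_n, max_n + 1):
        if len(word) < n:
            break
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if i == 0:
                gram = "_" + gram
            if i + n == len(word):
                gram = gram + "_"
            features.append(gram)
    return features


def word_class(count, unknown_threshold, rare_threshold):
    """Classify a word by training-set frequency.

    Assumes rare_threshold > unknown_threshold, as the commit message
    states. Words below the unknown threshold are treated as unknown;
    known words with count <= rare_threshold are "rare" and would also
    receive the n-gram features.
    """
    if count < unknown_threshold:
        return "unknown"
    if count <= rare_threshold:
        return "rare"
    return "common"
```

Under these assumptions, both "unknown" and "rare" words would get `ngram_features(word)` added to their feature set, which is one way known words could share statistical strength with unknown ones.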
Al
e62101b8bf [parser] remove geodb from address_parser_test, sort confusion matrix 2016-12-29 02:14:40 -05:00
Al
174529e8d0 [parser] remove geodb and fix small memory leak in address_parser_train 2016-12-29 02:12:06 -05:00
Al
bde5fdfaad [merge] merging in master 2016-12-29 02:00:31 -05:00
Al
646d96e13e Merge remote-tracking branch 'origin/master' into parser-data 2016-12-29 01:58:38 -05:00
Al
a26a01ece3 [openaddresses] adding SEMCOG counties, MI 2016-12-28 19:37:44 -05:00
Al
22b4a215f4 [places] additional form for West Indies 2016-12-28 17:58:32 -05:00
Al
f58ebbdf7f [fix] var name 2016-12-28 14:37:00 -05:00
Al
7ee44a584b [fix] genitive case for Russian/Ukrainian toponyms, not locative (#125) 2016-12-28 14:34:28 -05:00
Al
e6e4b28e43 [addresses] making the город/г. prefix apply to the Russian language rather than the country 2016-12-28 13:26:19 -05:00
Al
f995fdf9d2 [fix] default None 2016-12-28 05:09:15 -05:00
Al
3dc6a69bf5 [openaddresses] adding locative names in OpenAddresses as well, which contains some Ukraine data sets 2016-12-28 04:59:55 -05:00
Al
91013fe296 [fix] moving checks inside the add_locatives function, fixing float cast 2016-12-28 04:59:27 -05:00
Al
6f009fb8a6 [addresses] adding pymorphy2 for converting Russian and Ukrainian place names (sticking with state and state_district for the moment) to the locative case as mentioned in #125 2016-12-28 04:48:32 -05:00
Al
e91907a21b [boundaries] actually, the urban okrugs/districts seem to function more like neighborhoods in St Petersburg and Moscow, calling the raions city_district and the okrugs suburb 2016-12-28 01:36:11 -05:00
Travis
6c35eb9e65 [auto][ci skip] Adding data files from Travis build #186 2016-12-28 06:29:35 +00:00
Al
a86d6d5528 [merge] merging in master 2016-12-28 01:11:04 -05:00
Al Barrentine
47c3b0091b Merge pull request #147 from Komzpa/patch-1
Remove place names that are not place names (RU, BE)
2016-12-28 01:08:48 -05:00
Al
e23951a90f [dictionaries] new Ukrainian place names dictionary from http://wiki.openstreetmap.org/wiki/Nominatim/Special_Phrases/UK 2016-12-28 01:08:01 -05:00
Al
0bcaf816c4 [dictionaries] new Russian place names dictionary from http://wiki.openstreetmap.org/wiki/Nominatim/Special_Phrases/RU 2016-12-28 01:07:35 -05:00
Al
561d195be4 [fix] add global_overrides_last=True for federal cities in Russia 2016-12-28 00:49:13 -05:00
Al
4ce8f414ef [boundaries] adding Moscow and St Petersburg as cities despite technically having "state" boundaries 2016-12-28 00:25:20 -05:00
Al
12c7bed275 [fix] /exceptions/overrides/ 2016-12-28 00:16:22 -05:00
Al
1afe97b508 [fix] /containing/contained_by/ 2016-12-28 00:04:18 -05:00
Al
66eda96b75 [boundaries] admin_level=8 is city_district in Moscow and St Petersburg 2016-12-27 23:59:14 -05:00
Al
4344c5fdf3 [formatting] adding non-zero invert probabilities to all the former Soviet states. Other template insertions can still apply afterward for #125 2016-12-27 23:25:49 -05:00
Al
25e966411d [formatting] adding the ability to invert the address template (line by line, preserving order within each line) with certain probabilities 2016-12-27 23:25:49 -05:00
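The template inversion described in the commit above (reverse the order of lines, keep the order within each line, apply with some probability) might look roughly like the following sketch. The function name, the `rng` parameter, and the placeholder syntax in the example template are hypothetical and are not taken from the repository.

```python
import random


def maybe_invert_template(template, invert_probability, rng=random):
    """Reverse the line order of an address template with a given probability.

    The contents of each line are left untouched, so the order of
    components within a line is preserved.
    """
    lines = template.split("\n")
    if rng.random() < invert_probability:
        lines = lines[::-1]
    return "\n".join(lines)


# Hypothetical template with illustrative placeholder names
template = "{house_number} {road}\n{postcode} {city}\n{country}"
inverted = maybe_invert_template(template, invert_probability=1.0)
```

With `invert_probability=1.0` the country line comes first and the street line last; with `0.0` the template is returned unchanged.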
Al
1c17f1f2e2 [names/ru] adding г. (город) prefix to Russian city names 50% of the time in various forms per #125 2016-12-27 23:25:41 -05:00
Al
165056ccd8 [names] adding configurable prefix/suffix additions for boundary names 2016-12-27 20:32:23 -05:00
Travis
dc528affd5 [auto][ci skip] Adding data files from Travis build #184 2016-12-27 23:45:40 +00:00
Al Barrentine
2a42ea016b Merge pull request #148 from Komzpa/patch-2
Ukrainian place names that are actually whatever
2016-12-27 18:35:48 -05:00
Darafei Praliaskouski
e514778645 Ukrainian place names that are actually whatever 2016-12-27 15:21:05 +03:00
Darafei Praliaskouski
dba8c28e6a Remove Russian place names that are actually street names 2016-12-27 13:28:23 +03:00
Darafei Praliaskouski
38a6618e40 Remove Belarusian place names that are not place names
These all are parts of streets.
2016-12-27 13:26:30 +03:00
Al
80a9c1b308 [addresses] move country-specific cleanups to before reverse geocoding as those deal with the user-specified components 2016-12-27 04:19:57 -05:00
Al
d9c28ec160 [names] adding regional council and regional municipality to suffixes 2016-12-27 03:45:09 -05:00
Al
6163dbae39 [osm/places] adding option to only format place tags for city and smaller admins, using for polygons as larger polys should be included elsewhere anyway 2016-12-27 03:37:15 -05:00
Al
6eee689685 [fix] only applying separator tag to commas 2016-12-27 03:16:04 -05:00
Al
6192ac985a [names] one more for South Africa: District Municipality 2016-12-27 03:04:31 -05:00
Al
2cdf30a79e [names] same with Metropolitan Municipality 2016-12-27 02:48:55 -05:00
Al
2e3c1dee67 [names] add Local Municipality to English ignorable suffixes (seen in South Africa) 2016-12-27 02:45:58 -05:00
Al
76d8fc1d37 [fix] combined components 2016-12-26 21:35:27 -05:00
Al
c3bf63bc18 [fix] remove reference to ftfy in the formatter 2016-12-26 21:25:28 -05:00
Al
8abbb273b2 [osm] adding the excellent ftfy (https://github.com/LuminosoInsight/python-ftfy) to fix Mojibake, etc. in address components 2016-12-26 21:18:14 -05:00
Al
7ec368542b [formatting] giving single hyphens the separator tag 2016-12-26 21:00:25 -05:00
Al
d208397ecb [addresses] checking if component is generated in combining fields 2016-12-26 16:58:10 -05:00
Al
654fc2c463 [fix] memory cleanup in address_parser_data_set, logging any bad input lines 2016-12-26 16:18:15 -05:00
Al
e6d7b09e08 [expansions] adding generated expansion data 2016-12-26 16:16:59 -05:00