Al
34d3ae7e9e
[addresses] fixing normalized_place_name so it deals with things like Washington DC where Washington DC may actually be one of the OSM names
2016-12-10 17:52:38 -05:00
Al
80ee34cc3a
[text] adding normalization with whitespace
2016-12-10 17:50:53 -05:00
Al
4550f00f03
[fix] var name
2016-12-10 15:18:09 -05:00
Al
72771741c3
[fix] order
2016-12-10 15:16:35 -05:00
Al
8595d8da05
[addresses] don't add components to the trie that have the same normalized name as the given component
2016-12-10 15:12:40 -05:00
Al
bb12d0940e
[fix] options/docs in osm address training
2016-12-10 13:45:37 -05:00
Al
ffc584f679
[states] adding all forms of the state abbreviation to the trie when doing place name normalization to handle the D.C./DC case
2016-12-10 13:45:22 -05:00
Al
5098599ed6
[addresses] remove Quattroshapes/GeoNames cities as they may have problematic names, and in any case we have point-based cities from OSM now
2016-12-10 02:08:40 -05:00
Al
18c5fd0855
[fix] check for non-None city
2016-12-10 01:23:06 -05:00
Al
dc022f8652
[osm] adding normalized_place_name to Quattroshapes city
2016-12-10 01:20:40 -05:00
Al
7edb983566
[openaddresses] adding D.C. with periods as the state for the DC data set
2016-12-09 19:58:57 -05:00
Al
c7b1818695
[fix] imports
2016-12-09 19:53:17 -05:00
Al
973466bb13
[states] adding multiple state abbreviations for states that can have periods in the name like D.C., D.F. in Mexico and Brasil, etc.
2016-12-09 19:48:59 -05:00
Al
d575caba8a
[data] using UTC for libpostal data files on the Mac version of the download script as well
2016-12-09 19:43:05 -05:00
Al
c3f3896b48
[fix] update test for date function in data download script
2016-12-09 19:29:00 -05:00
Al
675552d254
[addresses] using normalized tokens when stripping off compound place names for things like D.C.
2016-12-09 17:52:57 -05:00
Al
c0a468d7e8
[normalization] adding a normalize_token function and some token options for deleting periods
2016-12-09 17:46:26 -05:00
Al
318773ffe7
[parser] header changes for the data set struct
2016-12-09 13:37:45 -05:00
Al
69ca4a85ce
[openaddresses] adding units to Olympia training data
2016-12-09 03:45:15 -05:00
Al
8f30987bdf
[fix] checking if building is a rail station
2016-12-09 02:57:47 -05:00
Al
e92963de50
[openaddresses] adding new counties from OpenAddresses, strip commas option for thousands separators
2016-12-09 01:57:21 -05:00
Al
b60b7c9009
[geoplanet] adding an index of state_districts, states, etc. that contain a city with an identical name. Alias to the city if it's the only contained place, otherwise don't allow the admin name without the city.
2016-12-08 17:00:29 -05:00
Al
640f70c05d
[geoplanet] all_places table, specified dirs
2016-12-08 02:50:08 -05:00
Al
f9945103ba
[addresses] if suburb/city_district is already listed, and we're finding the closest city by point rather than by boundary, use the closest actual city, not something smaller like a village/hamlet
2016-12-08 02:39:27 -05:00
Al
28d9ef12c0
[geoplanet] fixing geoplanet aliases insert warning
2016-12-08 02:31:10 -05:00
Al
763c86dcd4
[geoplanet] add County to the names of US counties outside of Louisiana and Alaska, add Parish in Louisiana
2016-12-08 02:30:37 -05:00
Al
7d0c402a31
[openaddresses] adding Douglas County and Paulding County in GA. Jackson County and Rankin County in MS
2016-12-08 02:26:39 -05:00
Al
c2c2822936
[openaddresses] adding today's changes from OpenAddresses
2016-12-07 17:51:24 -05:00
Al
55c2f18896
[dictionaries] adding US highway and US route expansions
2016-12-07 14:39:27 -05:00
Al
42861aa38c
[names] adding New Zealand to places that normalize City as a suffix (not Australia though as it has some cities that actually do end in City)
2016-12-07 06:19:08 -05:00
Al
7436d9693a
[names] adding new name_affixes call to replace both prefixes/suffixes in one call, using in GeoPlanet training and the generic AddressComponents normalizations
2016-12-07 05:49:16 -05:00
Al
9386a999f6
[names] adding country-specific affixes and only normalizing the word City as a suffix in UK/Ireland
2016-12-07 05:37:25 -05:00
Al
a9209fae37
[openaddresses] adding Kenton County, KY
2016-12-06 23:04:21 -05:00
Al
b69914ff18
[openaddresses] adding Kansas City, MO
2016-12-06 22:56:31 -05:00
Al
3ff472c8cf
[openaddresses] fixing house numbers with multiple consecutive hyphens
2016-12-06 22:50:14 -05:00
Al
ae527ef5b1
[fix] indentation
2016-12-06 19:03:13 -05:00
Al
78615bf29c
[places] higher probability of state_district for non-city Ireland
2016-12-06 18:15:38 -05:00
Al
fddf21d1c1
[boundaries] moving Ireland counties back to state_district, regions to state (as they're typically used as admin1 in ISO, etc.)
2016-12-06 17:05:29 -05:00
Al
aae8a8acf0
[boundaries] adding a few more common prefixes (looks like in Ireland it's common enough to remove the County prefix)
2016-12-06 17:04:09 -05:00
Al
fadf0ca66b
[openaddresses] filename for Ward County, ND
2016-12-06 15:55:33 -05:00
Al
29590be406
[openaddresses] adding Kalmar, Sweden and Fribourg, Switzerland
2016-12-06 15:51:10 -05:00
Al
e13787a6f6
[fix] var name again
2016-12-05 18:49:23 -05:00
Al
e1c6eff5e2
[fix] var
2016-12-05 18:46:49 -05:00
Al
da36b71829
[addresses] adding new places index in OSM and OpenAddresses training data
2016-12-05 18:36:17 -05:00
Al
628fecea59
[addresses] adding point-based city/equivalent reverse geocoding for places that don't have as many defined polygons in OSM
2016-12-05 18:30:46 -05:00
Al
8509fe3ac0
[dictionaries] English dictionary fix
2016-12-05 18:24:27 -05:00
Al
f87f0df717
[places] adding generic place index for reverse geocoding to points
2016-12-05 02:05:54 -05:00
Al
e32c232c67
[localities] /planet-neighborhoods/planet-localities/
2016-12-04 23:05:11 -05:00
Al
cca80b046c
[abbreviation] fixing abbreviations within hyphenated phrases, particularly for prefix/suffix matches
2016-12-03 17:55:11 -05:00
Al
22c4e99ea0
[parser] As part of reading/tokenizing the address parser data set,
several copies of the same training example will be generated.
1. with only lowercasing
2. with simple Latin-ASCII normalization (no umlauts, only things that
are common to all languages)
3. basic UTF-8 normalizations (accent stripping)
4. language-specific Latin-ASCII transliteration (e.g. ü => ue in German)
This will apply both on the initial passes when building the phrase
gazetteers and during each iteration of training. In this way, only
minimal normalizations like lowercasing need to be done at runtime.
May have a small effect on randomization as examples are created in a
deterministic order. However, this should not lead to cycles since the
base examples are shuffled, thus still satisfying the random permutation
requirement of an online/stochastic learning algorithm.
2016-12-02 13:09:03 -05:00
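
The four copies described in the [parser] commit above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual implementation: the function names (normalized_variants, training_pass), the two tiny replacement tables, and the choice to drop duplicate variants are all assumptions made for the example.

```python
import random
import unicodedata

# Hypothetical, deliberately tiny tables for illustration only.
# "Simple" replacements are meant to be safe in any language.
SIMPLE_LATIN_ASCII = {"æ": "ae", "œ": "oe", "ß": "ss", "ø": "o"}
# Language-specific transliterations, keyed by language code.
LANGUAGE_SPECIFIC = {"de": {"ä": "ae", "ö": "oe", "ü": "ue"}}


def replace_chars(text, table):
    """Replace each character that appears in the table, keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


def strip_accents(text):
    """Decompose characters and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalized_variants(text, language=None):
    """Return the copies of one example, in a fixed order:
    1. lowercased, 2. simple Latin-ASCII, 3. accent-stripped,
    4. language-specific transliteration (if a table exists).
    Dropping duplicate variants is an assumption of this sketch."""
    lowered = unicodedata.normalize("NFC", text).lower()
    candidates = [
        lowered,
        replace_chars(lowered, SIMPLE_LATIN_ASCII),
        strip_accents(lowered),
    ]
    if language in LANGUAGE_SPECIFIC:
        candidates.append(replace_chars(lowered, LANGUAGE_SPECIFIC[language]))

    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


def training_pass(examples, language=None, rng=random):
    """Shuffle the base examples, then emit each example's variants
    in their deterministic order."""
    order = list(examples)
    rng.shuffle(order)
    for text in order:
        for variant in normalized_variants(text, language):
            yield variant
```

For instance, normalized_variants("Müllerstraße", "de") yields four distinct strings, while a plain ASCII input such as "Main Street" collapses to a single lowercased copy. Because only the base examples are shuffled, the variants of one example always appear next to each other, which is the small effect on randomization the commit message mentions.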