Al
|
00ce71223f
|
[osm] Using the default probabilities for abbreviations in ways training data
|
2016-01-24 00:53:41 -05:00 |
|
Al
|
bab7a0f961
|
[osm] splitting streets (way names) on semicolons
|
2016-01-24 00:42:25 -05:00 |
|
Al
|
7646adfc0f
|
[osm] Adding abbreviated street names in addition to the originals
|
2016-01-23 23:23:58 -05:00 |
|
Al
|
67130383ce
|
[fix] converting semicolons to commas in OSM house numbers and picking one at random
|
2016-01-23 23:16:19 -05:00 |
|
Al
|
1bb797f783
|
[fix] spacing in phrases
|
2016-01-23 21:59:49 -05:00 |
|
Al
|
3a8c3dfcf6
|
[fix] spacing in phrases at end of string
|
2016-01-23 21:51:40 -05:00 |
|
Al
|
78450bfad9
|
[fix] Spaces in abbreviation
|
2016-01-23 21:36:20 -05:00 |
|
Al
|
308ceb5a5f
|
[fix] convert UTF8 slices back to unicode before using with the Python trie
|
2016-01-23 20:20:23 -05:00 |
|
Al
|
5eb6bb309b
|
[fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string
|
2016-01-23 20:09:45 -05:00 |
|
Al
|
d61207e95a
|
[fix] var name
|
2016-01-23 18:01:02 -05:00 |
|
Al
|
e44cba1d06
|
[fix] geonames db not required in OSM training data
|
2016-01-23 17:59:55 -05:00 |
|
Al
|
4f03711e60
|
[osm] Adding abbreviated training examples to ways language training data
|
2016-01-23 14:10:47 -05:00 |
|
Al
|
c9fb4ee69d
|
[osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used
|
2016-01-22 17:58:24 -05:00 |
|
Al
|
ea9bb3f2d5
|
[fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled
|
2016-01-22 15:48:21 -05:00 |
|
Al
|
f9f6558e06
|
[fix] simple whitespace field splits for the limited format training data (used for language classification)
|
2016-01-22 04:34:42 -05:00 |
|
Al
|
cd1db7b288
|
[fix] Making sure rare components are dropped first, adding state and country back in
|
2016-01-22 04:17:19 -05:00 |
|
Al
|
adc3a00264
|
[fix] var name
|
2016-01-22 04:10:16 -05:00 |
|
Al
|
261beffa36
|
[fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities
|
2016-01-22 04:00:45 -05:00 |
|
Al
|
a6cc3d0114
|
[fix] Adding state to the more frequently dropped components
|
2016-01-22 03:56:38 -05:00 |
|
Al
|
bca3dae004
|
[fix] state full name probabilities for limited vs. full formatted OSM training sets
|
2016-01-22 03:54:20 -05:00 |
|
Al
|
d1cf253092
|
[osm/formatting] Higher probability of dropout for rare components like counties, etc.
|
2016-01-22 03:39:35 -05:00 |
|
Al
|
b22646ee30
|
[mv] Moving gazetteers into their own module
|
2016-01-22 03:15:56 -05:00 |
|
Al
|
6ac72576bc
|
[osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK
|
2016-01-22 02:56:39 -05:00 |
|
Al
|
3262d2ccd3
|
[fix] arg count
|
2016-01-19 03:16:14 -05:00 |
|
Al
|
19a5541a85
|
[polygons/osm] append polygon nodes by vertices that connect to each other
|
2016-01-16 21:20:49 -05:00 |
|
Al
|
1d288954d7
|
[osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma-separated list of house numbers is entered
|
2015-12-10 18:46:28 -05:00 |
|
Al
|
779298360c
|
[osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons
|
2015-12-09 01:00:59 -05:00 |
|
Al
|
aeb72d7d26
|
[osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random
|
2015-12-09 00:20:20 -05:00 |
|
Al
|
69a469d9d3
|
[osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples
|
2015-12-08 20:38:32 -05:00 |
|
Al
|
f8a3081d0f
|
[fix] city name in OSM formatting
|
2015-12-07 02:33:12 -05:00 |
|
Al
|
b25a738000
|
[osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name
|
2015-12-06 16:14:02 -05:00 |
|
Al
|
5fcb6d2c30
|
[fix] typo
|
2015-12-05 16:23:58 -05:00 |
|
Al
|
3a7ba0288f
|
[fix] .get
|
2015-12-05 16:13:15 -05:00 |
|
Al
|
c92a6de477
|
[fix] name
|
2015-12-05 15:49:50 -05:00 |
|
Al
|
2a4210f93f
|
[osm] Stripping standard city prefixes/suffixes e.g. Township of
|
2015-12-05 15:42:22 -05:00 |
|
Al
|
f41158b8b3
|
[osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city
|
2015-12-05 14:21:07 -05:00 |
|
Al
|
7c26317903
|
[fix] osm components
|
2015-12-03 19:30:15 -05:00 |
|
Al
|
42a8890652
|
[osm] Only removing local language city if there are prior components from OSM
|
2015-12-03 19:11:03 -05:00 |
|
Al
|
5af95ee613
|
[osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data
|
2015-12-03 18:00:05 -05:00 |
|
Al
|
218361f43f
|
[osm] Removing multilinestring boundaries from OSM polygon index (often partial boundaries e.g. France-Germany)
|
2015-12-03 00:51:09 -05:00 |
|
Al
|
8484d4fffd
|
[fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included
|
2015-11-30 23:28:12 -05:00 |
|
Al
|
6ef40c1769
|
[fix] dupe checking
|
2015-11-30 18:43:11 -05:00 |
|
Al
|
af170de019
|
[fix] Smaller probabilities on adding neighborhoods and admin polygons, eliminating duplicates on the row level
|
2015-11-30 18:35:31 -05:00 |
|
Al
|
621fd79002
|
[fix] var
|
2015-11-30 18:20:26 -05:00 |
|
Al
|
b430fb7657
|
[osm/formatting] Adding pick random name logic to neighborhoods as well, getting rid of drop probabilities as they're covered elsewhere, adding several forms of venue names to the training data
|
2015-11-30 18:10:18 -05:00 |
|
Al
|
839a12b212
|
[osm/formatting] Changing drop probabilities and doing it in random order
|
2015-11-30 15:27:35 -05:00 |
|
Al
|
89677d94a3
|
[parsing] Initial commit of the address parser, training/testing, feature function, I/O
|
2015-11-30 14:48:13 -05:00 |
|
Al
|
9a8ba14887
|
[osm/formatting] Adding per-field drop probabilities to OSM training data to make some fields more likely to be dropped, although it might create more training data
|
2015-11-30 11:10:12 -05:00 |
|
Al
|
15d9e00121
|
[osm/formatting] Adding in more ISO alpha-3 codes for countries in the training data
|
2015-11-28 14:08:07 -05:00 |
|
Al
|
66778737ff
|
[fix] non-local language states
|
2015-11-28 13:48:59 -05:00 |
|