Commit Graph

1154 Commits

Author SHA1 Message Date
Al
2fcc72ae07 [fix] multitoken canonical strings 2015-12-08 15:38:04 -05:00
Al
a857138d95 [api] Adding place name expansions by default 2015-12-08 15:31:36 -05:00
Al
beec43fe15 [expansion] regenerating expansion data 2015-12-08 15:28:54 -05:00
Al
35db855819 [fix] canonical index in address expansion data, should be -1 for all canonical phrases 2015-12-08 15:09:51 -05:00
Al
e1ea2ac704 [expansion] Toponym dictionaries can apply to street names and place names 2015-12-08 02:10:22 -05:00
Al
bfc517ae42 [fix] Belgium districts 2015-12-07 22:11:11 -05:00
Al
cbe5cd7429 [expansion] The ambiguous expansions dictionary shouldn't add to the component bitset 2015-12-07 20:36:56 -05:00
Al
d35f519629 [expansion] Fixing case where non-ideographic tokens like # can potentially be concatenated with surrounding tokens and should be normalized with whitespace in between 2015-12-07 19:18:46 -05:00
Al
f5739dd42b [math] Signatures for array_exp and array_log 2015-12-07 18:10:04 -05:00
Al
0d8d396108 [expansion] Fixing cases like ML King where a global (all languages) expansion subsumes the specific language expansion (like English) 2015-12-07 18:09:25 -05:00
Al
9bab70909d [numex] Always adding a version of the string without Roman numeral expansion since many times those tokens can be ambiguous 2015-12-07 14:29:18 -05:00
Al
f8a3081d0f [fix] city name in OSM formatting 2015-12-07 02:33:12 -05:00
Al
a066ee9aad [math] Only reallocate on matrix_resize if needed 2015-12-07 01:20:16 -05:00
Al
cfd0dc69f2 [parsing] Using the entire phrase as the ith word 2015-12-07 01:19:38 -05:00
Al
8186e2606e [dictionaries] Regenerating address expansion data file 2015-12-06 16:56:27 -05:00
Al
4dba0c54e4 [dictionaries] Adding state abbreviations for US, CA and AU into dictionaries 2015-12-06 16:47:36 -05:00
Al
b25a738000 [osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name 2015-12-06 16:14:02 -05:00
Al
44f7fd0844 [math] Matrix resize 2015-12-06 03:20:03 -05:00
Al
dd8f8b4d7b [fix] prefix/suffix regexes 2015-12-05 18:41:22 -05:00
Al
5fcb6d2c30 [fix] typo 2015-12-05 16:23:58 -05:00
Al
3a7ba0288f [fix] .get 2015-12-05 16:13:15 -05:00
Al
c92a6de477 [fix] name 2015-12-05 15:49:50 -05:00
Al
2a4210f93f [osm] Stripping standard city prefixes/suffixes, e.g. Township of 2015-12-05 15:42:22 -05:00
Al
596c5ffdd3 [fix] Tokenized trie search 2015-12-05 15:21:52 -05:00
Al
24208c209f [parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold). 2015-12-05 14:34:19 -05:00
Al
f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city 2015-12-05 14:21:07 -05:00
Al
7c26317903 [fix] osm components 2015-12-03 19:30:15 -05:00
Al
42a8890652 [osm] Only removing local language city if there are prior components from OSM 2015-12-03 19:11:03 -05:00
Al
ab0a4e622d [formatting] Switching back over to OpenCageData 2015-12-03 18:03:21 -05:00
Al
5af95ee613 [osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data 2015-12-03 18:00:05 -05:00
Al
25e89bcc41 [fix] tokenized trie search edge case where tail is stored on the space node 2015-12-03 12:25:21 -05:00
Al
218361f43f [osm] Removing multilinestring boundaries from OSM polygon index (often partial boundaries e.g. France-Germany) 2015-12-03 00:51:09 -05:00
Al
43287db90a [normalization/phrases] Fixing a bug which occurs with an already-separated elision 2015-12-02 16:04:39 -05:00
Al
87c04b4d37 [fix] path in setup.py 2015-12-02 14:22:11 -05:00
Al
09a3e2ab64 [fix] pip install command 2015-12-02 13:43:57 -05:00
Al
746b5d0f34 [fix] transliterate using string_equals 2015-12-02 13:09:43 -05:00
Al
d0aaff1482 [utils] string_equals with NULL check 2015-12-01 13:12:08 -05:00
Al
f322ae0a1c [build] adding shuffle.c to Makefile rule 2015-12-01 11:28:33 -05:00
Al
b94264b745 [parser] Forgot to add shuffle.h/.c 2015-12-01 11:25:28 -05:00
Al
116fe857db [parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac 2015-12-01 11:24:44 -05:00
Al
8484d4fffd [fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included 2015-11-30 23:28:12 -05:00
Al
6ef40c1769 [fix] dupe checking 2015-11-30 18:43:11 -05:00
Al
af170de019 [fix] Smaller probabilities on adding neighborhoods and admin polygons, eliminating duplicates on the row level 2015-11-30 18:35:31 -05:00
Al
621fd79002 [fix] var 2015-11-30 18:20:26 -05:00
Al
b430fb7657 [osm/formatting] Adding pick random name logic to neighborhoods as well, getting rid of drop probabilities as they're covered elsewhere, adding several forms of venue names to the training data 2015-11-30 18:10:18 -05:00
Al
d4b6450f19 [formatting] Not applying template replacements from address formatting by default 2015-11-30 16:11:13 -05:00
Al
839a12b212 [osm/formatting] Changing drop probabilities and doing it in random order 2015-11-30 15:27:35 -05:00
Al
5f13041140 [parsing/build] Makefile changes for address parser 2015-11-30 14:51:43 -05:00
Al
4ca911baf8 [parsing] Adding a command-line client (with history) to test address parsing 2015-11-30 14:51:01 -05:00
Al
89677d94a3 [parsing] Initial commit of the address parser, training/testing, feature function, I/O 2015-11-30 14:48:13 -05:00