1176 Commits

Author SHA1 Message Date
Al
bd1e8ecaf8 [fix] default address parser dir 2015-12-12 12:55:37 -05:00
Al
2950358697 [build] address_parser client now links to libpostal, adding address_parser to download script with an "all" option 2015-12-12 12:49:50 -05:00
Al
88836e56e1 [api] Adding parse_address implementation to the libpostal API. GeoDB and address parser are now required. Stripping punctuation from the normalized output 2015-12-12 12:47:44 -05:00
Al
bce6ba2595 [fix] typedef 2015-12-12 11:58:41 -05:00
Al
a8d6cc4053 [api] Moving parse_address definition into libpostal.h 2015-12-12 03:55:31 -05:00
Al
fe4c528f26 [parser] Using different char_array for each of the potential phrases as token i 2015-12-12 03:23:26 -05:00
Al
e6303f70f3 [fix] removing printf 2015-12-11 02:53:22 -05:00
Al
671dd4a5d2 [parser] Fixing possible invalid writes in training for values beginning with a separator 2015-12-11 02:05:05 -05:00
Al
743b74aea5 [parser] Simplifying args in address_parser_data_set_tokenize_line 2015-12-10 18:48:23 -05:00
Al
1d288954d7 [osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma-separated list of house numbers is entered. 2015-12-10 18:46:28 -05:00
Al
88b8023ac8 [fix] Bug in address parser feature extraction, can hold onto the wrong pointer 2015-12-10 18:42:28 -05:00
Al
3de59506ae [parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations) 2015-12-10 18:08:51 -05:00
Al
71d6d3c5e1 [utils] Removing kvec and using similar implementation with pointers that can be passed around 2015-12-10 17:52:23 -05:00
Al
ab205eff96 [utils] Adding a default small size to all arrays based on a look at malloc/realloc usage 2015-12-09 19:46:09 -05:00
Al
779298360c [osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons 2015-12-09 01:00:59 -05:00
Al
aeb72d7d26 [osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random 2015-12-09 00:20:20 -05:00
Al
2c254ebc5e [fix] Belgium cities again 2015-12-08 23:09:28 -05:00
Al
f252869671 [dictionaries] adding ste to English dictionaries 2015-12-08 22:29:52 -05:00
Al
69a469d9d3 [osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples 2015-12-08 20:38:32 -05:00
Al
fe37286bcf [fix] Fixes to matrix methods 2015-12-08 17:33:38 -05:00
Al
d9d53ce17e [math] Matrix method updates 2015-12-08 15:39:52 -05:00
Al
48ee665e71 [scripts] Benchmark script using default options 2015-12-08 15:38:44 -05:00
Al
2fcc72ae07 [fix] multitoken canonical strings 2015-12-08 15:38:04 -05:00
Al
a857138d95 [api] Adding place name expansions by default 2015-12-08 15:31:36 -05:00
Al
beec43fe15 [expansion] regenerating expansion data 2015-12-08 15:28:54 -05:00
Al
35db855819 [fix] canonical index in address expansion data, should be -1 for all canonical phrases 2015-12-08 15:09:51 -05:00
Al
e1ea2ac704 [expansion] Toponym dictionaries can apply to street names and place names 2015-12-08 02:10:22 -05:00
Al
bfc517ae42 [fix] Belgium districts 2015-12-07 22:11:11 -05:00
Al
cbe5cd7429 [expansion] The ambiguous expansions dictionary shouldn't add to the component bitset 2015-12-07 20:36:56 -05:00
Al
d35f519629 [expansion] Fixing case where non-ideographic tokens like # can potentially be concatenated with surrounding tokens and should be normalized with whitespace in between 2015-12-07 19:18:46 -05:00
Al
f5739dd42b [math] Signatures for array_exp and array_log 2015-12-07 18:10:04 -05:00
Al
0d8d396108 [expansion] Fixing cases like ML King where a global (all languages) expansion subsumes the specific language expansion (like English) 2015-12-07 18:09:25 -05:00
Al
9bab70909d [numex] Always adding a version of the string without Roman numeral expansion since many times those tokens can be ambiguous 2015-12-07 14:29:18 -05:00
Al
f8a3081d0f [fix] city name in OSM formatting 2015-12-07 02:33:12 -05:00
Al
a066ee9aad [math] Only reallocate on matrix_resize if needed 2015-12-07 01:20:16 -05:00
Al
cfd0dc69f2 [parsing] Using the entire phrase as the ith word 2015-12-07 01:19:38 -05:00
Al
8186e2606e [dictionaries] Regenerating address expansion data file 2015-12-06 16:56:27 -05:00
Al
4dba0c54e4 [dictionaries] Adding state abbreviations for US, CA and AU into dictionaries 2015-12-06 16:47:36 -05:00
Al
b25a738000 [osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name 2015-12-06 16:14:02 -05:00
Al
44f7fd0844 [math] Matrix resize 2015-12-06 03:20:03 -05:00
Al
dd8f8b4d7b [fix] prefix/suffix regexes 2015-12-05 18:41:22 -05:00
Al
5fcb6d2c30 [fix] typo 2015-12-05 16:23:58 -05:00
Al
3a7ba0288f [fix] .get 2015-12-05 16:13:15 -05:00
Al
c92a6de477 [fix] name 2015-12-05 15:49:50 -05:00
Al
2a4210f93f [osm] Stripping standard city prefixes/suffixes e.g. Township of 2015-12-05 15:42:22 -05:00
Al
596c5ffdd3 [fix] Tokenized trie search 2015-12-05 15:21:52 -05:00
Al
24208c209f [parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold). 2015-12-05 14:34:19 -05:00
Al
f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city 2015-12-05 14:21:07 -05:00
Al
7c26317903 [fix] osm components 2015-12-03 19:30:15 -05:00
Al
42a8890652 [osm] Only removing local language city if there are prior components from OSM 2015-12-03 19:11:03 -05:00