732 Commits

Author SHA1 Message Date
Al 08f39d6b80 [parser] Adding address_parser_rewind to make multiple passes through the file when compiling the phrase tries 2016-07-28 17:13:58 -04:00
Al 1b09b7f2e5 [fix] Adding country_region to address_parser_train 2016-07-28 16:18:32 -04:00
Al c6af5cc071 [parser] Adding country_region label to parser as a boundary component 2016-07-28 15:19:48 -04:00
Al 64f167f045 [tokenization] Re-generating scanner 2016-07-21 17:04:57 -04:00
Al 81b4a4a1cb [tokenization] Hyphens, etc. between non-ASCII digits (e.g. Unicode full-width numbers) should be single tokens 2016-07-21 17:04:57 -04:00
Al be5fd79a48 [expansion] Prefix/suffix expansions by default can apply to ADDRESS_ANY but also inherit the types of any dictionary that lists their canonical form (so we can add suffixes without worrying about whether they're for streets or place names, etc.) 2016-07-21 17:04:57 -04:00
Al 8926293063 [parser/cli] Using NFC normalization on the output in the parser client (closes #30). Optional command-line arg for parser output dir, useful for spot-checking different experiments 2016-07-21 17:04:57 -04:00
Al 44908ff95a [parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces 2016-07-21 17:04:57 -04:00
Al 41ae742285 [fix] tokenized trie search when falling off the trie at the start of a valid phrase 2016-07-21 17:04:57 -04:00
Al 6e60b3bbda [fix] semicolon in #define 2016-07-21 17:04:57 -04:00
Al b5d4dd6f37 [tokenization] Including full-width numbers in numeric tokens 2016-07-21 17:04:57 -04:00
Al dd7ef6fabf [dictionaries] Making new component for near/nearby prepositions 2016-07-21 17:04:57 -04:00
Al 2454b98c6d [tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing 2016-07-21 17:04:57 -04:00
Al 0a8f46bdc3 [parser] Using new geonames designations in parser features 2016-07-21 17:04:57 -04:00
Al c383f8af88 [parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections 2016-07-21 17:04:57 -04:00
Al c2ee5a45b3 [geodb] Adding separate bitset for geonames place types and using NFC normalization instead of NFD (requires retraining) 2016-07-21 17:04:57 -04:00
Al 6c39c663ff [normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization 2016-07-21 17:04:57 -04:00
Al 757c6147cb [tokenization] Adding ability to tokenize 's Gravenhage 2016-07-21 17:04:57 -04:00
Al 2e8888e331 [fix] warnings/size_t in libpostal.c 2016-07-21 17:04:57 -04:00
Al e800f21f06 [gazetteers] Adding new gazetteer types/address components 2016-07-21 17:04:57 -04:00
Al e5e0cf3b92 [fix] loading transliteration module in address_parser_test.c as well 2016-07-21 17:04:57 -04:00
Al b8d43dc601 [fix] cstring_array_split calls 2016-07-21 17:04:57 -04:00
Al b19cd3f60a [fix] brace 2016-07-21 17:04:57 -04:00
Al 994b2f18e4 [parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in Vulgar fractions e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent 2016-07-21 17:04:57 -04:00
Al b664ab1cea [utils] Adding cstring_array_split_ignore_consecutive 2016-07-21 17:04:57 -04:00
Al 8e90ee45d2 [fix] calls and NULL checks 2016-07-21 17:04:57 -04:00
Al e3cffaf0d1 [fix] tokenized_string_t should copy its source string 2016-07-21 17:04:57 -04:00
Al 16501aba17 [fix] Need to load transliteration module for Latin-ASCII normalization 2016-07-21 17:04:57 -04:00
Al a9ba61585b [fix] Adding set -e to data download script so it fails if any subcommands fail 2016-05-04 23:08:06 -04:00
Al 9819ebf949 [fix] always include expansions in the ambiguous expansion dictionary, no matter which component 2016-04-29 13:26:13 -04:00
Al 0bc3550c11 [expansion] Adding address_expansion_in_dictionary 2016-04-29 13:23:48 -04:00
Al 59e5fcd1b4 [fix] LC_ALL=C in data download script 2016-04-11 12:47:50 -04:00
Travis b8d4d71522 [auto][ci skip] Adding data files from Travis build #112 2016-03-30 20:04:52 +00:00
Al 14e8f50cf1 [fix] Expansions when passing in the address_components= option. Was only limiting results at the phrase level, should work at the individual expansion level 2016-03-29 16:46:29 -04:00
Travis 2795d258d1 [auto][ci skip] Adding data files from Travis build #108 2016-03-29 19:11:57 +00:00
Al 6dad58c696 [fix][ci skip] last remaining instance of vignt in libpostal 2016-03-29 12:51:19 -04:00
Travis 08d873ac15 [auto][ci skip] Adding data files from Travis build #105 2016-03-29 15:39:14 +00:00
Travis 49adcfe9b5 [auto][ci skip] Adding data files from Travis build #97 2016-03-22 14:33:13 +00:00
Al 25c8ba8603 [fix] Log more helpful error message in language_classifier if not loaded 2016-03-21 18:18:25 -04:00
Al 0356b45069 [fix] Log errors in numex module if not loaded 2016-03-21 18:15:53 -04:00
Al 943cd4443a [fix] Log errors if address dictionaries not loaded 2016-03-21 18:13:14 -04:00
Al 510f12ff96 [fix] Log error in transliteration if setup hasn't been called 2016-03-21 18:06:02 -04:00
Al 1b94727871 [fix] Check that parser is loaded in parse_address, log and return NULL instead of segfaulting 2016-03-21 18:04:26 -04:00
Al be7b696cb2 [fix] actually that temporary array is unnecessary altogether, eliminating 2016-03-21 17:00:11 -04:00
Al e0f7638372 [fix] Freeing up temporary char_array 2016-03-21 16:50:48 -04:00
Travis 14093a263d [auto][ci skip] Adding data files from Travis build #92 2016-03-21 16:43:23 +00:00
Travis 0dfd20f14d [auto][ci skip] Adding data files from Travis build #86 2016-03-16 20:37:31 +00:00
Travis 576e91d3fa [auto][ci skip] Adding data files from Travis build #84 2016-03-16 19:08:17 +00:00
Travis 2dc9643b29 [auto][ci skip] Adding data files from Travis build #82 2016-03-14 16:29:21 +00:00
Al 0d7f9f2032 [data] Using UTC dates for libpostal data file tracking for #38. Also silencing curl when checking if file was updated 2016-03-10 16:44:02 -05:00