726 Commits

Author SHA1 Message Date
Al 8926293063 [parser/cli] Using NFC normalization on the output in the parser client (closes #30). Optional command-line arg for parser output dir, useful for spot-checking different experiments 2016-07-21 17:04:57 -04:00
Al 44908ff95a [parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces 2016-07-21 17:04:57 -04:00
Al 41ae742285 [fix] tokenized trie search when falling off the trie at the start of a valid phrase 2016-07-21 17:04:57 -04:00
Al 6e60b3bbda [fix] semicolon in #define 2016-07-21 17:04:57 -04:00
Al b5d4dd6f37 [tokenization] Including full-width numbers in numeric tokens 2016-07-21 17:04:57 -04:00
Al dd7ef6fabf [dictionaries] Making new component for near/nearby prepositions 2016-07-21 17:04:57 -04:00
Al 2454b98c6d [tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing 2016-07-21 17:04:57 -04:00
Al 0a8f46bdc3 [parser] Using new geonames designations in parser features 2016-07-21 17:04:57 -04:00
Al c383f8af88 [parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections 2016-07-21 17:04:57 -04:00
Al c2ee5a45b3 [geodb] Adding separate bitset for geonames place types and using NFC normalization instead of NFD (requires retraining) 2016-07-21 17:04:57 -04:00
Al 6c39c663ff [normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization 2016-07-21 17:04:57 -04:00
Al 757c6147cb [tokenization] Adding ability to tokenize 's Gravenhage 2016-07-21 17:04:57 -04:00
Al 2e8888e331 [fix] warnings/size_t in libpostal.c 2016-07-21 17:04:57 -04:00
Al e800f21f06 [gazetteers] Adding new gazetteer types/address components 2016-07-21 17:04:57 -04:00
Al e5e0cf3b92 [fix] loading transliteration module in address_parser_test.c as well 2016-07-21 17:04:57 -04:00
Al b8d43dc601 [fix] cstring_array_split calls 2016-07-21 17:04:57 -04:00
Al b19cd3f60a [fix] brace 2016-07-21 17:04:57 -04:00
Al 994b2f18e4 [parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in Vulgar fractions e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent 2016-07-21 17:04:57 -04:00
Al b664ab1cea [utils] Adding cstring_array_split_ignore_consecutive 2016-07-21 17:04:57 -04:00
Al 8e90ee45d2 [fix] calls and NULL checks 2016-07-21 17:04:57 -04:00
Al e3cffaf0d1 [fix] tokenized_string_t should copy its source string 2016-07-21 17:04:57 -04:00
Al 16501aba17 [fix] Need to load transliteration module for Latin-ASCII normalization 2016-07-21 17:04:57 -04:00
Al a9ba61585b [fix] Adding set -e to data download script so it fails if any subcommands fail 2016-05-04 23:08:06 -04:00
Al 9819ebf949 [fix] always include expansions in the ambiguous expansion dictionary, no matter which component 2016-04-29 13:26:13 -04:00
Al 0bc3550c11 [expansion] Adding address_expansion_in_dictionary 2016-04-29 13:23:48 -04:00
Al 59e5fcd1b4 [fix] LC_ALL=C in data download script 2016-04-11 12:47:50 -04:00
Travis b8d4d71522 [auto][ci skip] Adding data files from Travis build #112 2016-03-30 20:04:52 +00:00
Al 14e8f50cf1 [fix] Expansions when passing in the address_components= option. Was only limiting results at the phrase level, should work at the individual expansion level 2016-03-29 16:46:29 -04:00
Travis 2795d258d1 [auto][ci skip] Adding data files from Travis build #108 2016-03-29 19:11:57 +00:00
Al 6dad58c696 [fix][ci skip] last remaining instance of vignt in libpostal 2016-03-29 12:51:19 -04:00
Travis 08d873ac15 [auto][ci skip] Adding data files from Travis build #105 2016-03-29 15:39:14 +00:00
Travis 49adcfe9b5 [auto][ci skip] Adding data files from Travis build #97 2016-03-22 14:33:13 +00:00
Al 25c8ba8603 [fix] Log more helpful error message in language_classifier if not loaded 2016-03-21 18:18:25 -04:00
Al 0356b45069 [fix] Log errors in numex module if not loaded 2016-03-21 18:15:53 -04:00
Al 943cd4443a [fix] Log errors if address dictionaries not loaded 2016-03-21 18:13:14 -04:00
Al 510f12ff96 [fix] Log error in transliteration if setup hasn't been called 2016-03-21 18:06:02 -04:00
Al 1b94727871 [fix] Check that parser is loaded in parse_address, log and return NULL instead of segfaulting 2016-03-21 18:04:26 -04:00
Al be7b696cb2 [fix] actually that temporary array is unnecessary altogether, eliminating 2016-03-21 17:00:11 -04:00
Al e0f7638372 [fix] Freeing up temporary char_array 2016-03-21 16:50:48 -04:00
Travis 14093a263d [auto][ci skip] Adding data files from Travis build #92 2016-03-21 16:43:23 +00:00
Travis 0dfd20f14d [auto][ci skip] Adding data files from Travis build #86 2016-03-16 20:37:31 +00:00
Travis 576e91d3fa [auto][ci skip] Adding data files from Travis build #84 2016-03-16 19:08:17 +00:00
Travis 2dc9643b29 [auto][ci skip] Adding data files from Travis build #82 2016-03-14 16:29:21 +00:00
Al 0d7f9f2032 [data] Using UTC dates for libpostal data file tracking for #38. Also silencing curl when checking if file was updated 2016-03-10 16:44:02 -05:00
Travis c4203c6ea9 [auto][ci skip] Adding data files from Travis build #63 2016-03-06 18:00:40 +00:00
Travis 73140a8239 [auto][ci skip] Adding data files from Travis build #62 2016-03-06 17:51:23 +00:00
Travis d8e0945d5b [auto][build] Adding data files from Travis build #57 2016-03-06 16:11:32 +00:00
Al b5807926bc [fix] Using PRId64 in all cases for int64_t printf formatting 2016-03-02 16:47:49 -05:00
Al 72fa6c0a6c [fix] numex_table builder program using new API (heap-allocated strings) 2016-03-02 16:28:28 -05:00
Al 999a9e24cb [numex] Regenerating numex_data.c 2016-03-02 16:11:09 -05:00