Commit Graph

798 Commits

Author SHA1 Message Date
Al
dbe801fa08 [ngrams] changing args to ngrams 2016-12-21 18:09:45 -05:00
Al
6f37f9ae86 [merge] merging in master changes 2016-12-21 15:40:25 -05:00
Al
09b4e2ba2f [build] pulling in change from parser-data that allows user to pass CFLAGS 2016-12-21 14:39:27 -05:00
Al
3ac2c93e1c [utils] renaming char_array_append_vjoined to char_array_add_vjoined to follow convention that add_* calls NUL-terminate while append_* calls do not 2016-12-18 15:26:58 -05:00
Al
3ed95a175e [ngrams] adding function to extract an array of ngrams from a string, with optional special prefixes/suffixes for the edges 2016-12-17 01:33:18 -05:00
Al
8f1e69960f [fix] loading transliteration module in address_parser_test.c as well 2016-12-12 11:37:27 -05:00
Al
3939dd0ca6 [fix] cstring_array_split calls 2016-12-12 11:37:27 -05:00
Al
a42d0e917a [fix] brace 2016-12-12 11:37:27 -05:00
Al
ced8f9ae27 [parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in Vulgar fractions e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent 2016-12-12 11:37:27 -05:00
Al
b1816e9b70 [utils] Adding cstring_array_split_ignore_consecutive 2016-12-12 11:37:27 -05:00
Al
6baa7087fe [fix] calls and NULL checks 2016-12-12 11:37:27 -05:00
Al
5e07f5e8c5 [fix] tokenized_string_t should copy its source string 2016-12-12 11:37:27 -05:00
Al
521a094a47 [fix] Need to load transliteration module for Latin-ASCII normalization 2016-12-12 11:37:27 -05:00
Al
d575caba8a [data] using UTC for libpostal data files on the Mac version of the download script as well 2016-12-09 19:43:05 -05:00
Al
c3f3896b48 [fix] update test for date function in data download script 2016-12-09 19:29:00 -05:00
Al
318773ffe7 [parser] header changes for the data set struct 2016-12-09 13:37:45 -05:00
Al
22c4e99ea0 [parser] As part of reading/tokenizing the address parser data set,
several copies of the same training example will be generated.

1. with only lowercasing
2. with simple Latin-ASCII normalization (no umlauts, only things that
are common to all languages)
3. basic UTF-8 normalizations (accent stripping)
4. language-specific Latin-ASCII transliteration (e.g. ü => ue in German)

This will apply both on the initial passes when building the phrase
gazetteers and during each iteration of training. In this way, only the
most basic normalizations like lowercasing need to be done at runtime.

May have a small effect on randomization as examples are created in a
deterministic order. However, this should not lead to cycles since the
base examples are shuffled, thus still satisfying the random permutation
requirement of an online/stochastic learning algorithm.
2016-12-02 13:09:03 -05:00
Al
4b35da629f [numex] regenerated numex data file 2016-11-30 15:58:55 -08:00
Al
4677874610 [parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations 2016-11-30 15:58:03 -08:00
Al
0e29cdd9fd [parser] fixing some uninitialized value issues during parser training 2016-11-30 15:42:09 -08:00
Al
f5a6bd0f36 [fix] sparse_matrix_new_from_matrix uses new matrix types 2016-11-30 10:15:12 -08:00
Al
b639fa5127 [utils] string_replace also creates a copy 2016-11-30 10:09:33 -08:00
Al
89f6611c4e [strings] string_trim makes a copy rather than modifying the pointer 2016-11-28 15:06:07 -08:00
Al
d922d9a60a [expansion] regenerated address_expansion_data.c 2016-11-28 10:47:15 -08:00
Al
f78281456a [fix] header definition 2016-11-27 01:00:25 -08:00
Al
eea11beb6a [expansion] using easier-to-access data structure for address dictionaries 2016-11-27 00:56:48 -08:00
Al
7298c895c8 [utils] adding a chunked shuffle as the concatenated file sizes may get larger than memory 2016-11-21 14:04:34 -05:00
Travis
04f8130c46 [auto][ci skip] Adding data files from Travis build #168 2016-10-07 00:46:48 +00:00
Al
01afbf80ef [data] Each curl process will retry the chunk up to 3 times 2016-08-25 23:18:39 -04:00
Travis
de1255af00 [auto][ci skip] Adding data files from Travis build #161 2016-08-23 22:48:20 +00:00
Travis
f19c9852aa [auto][ci skip] Adding data files from Travis build #160 2016-08-23 22:24:19 +00:00
Travis
d797d6c863 [auto][ci skip] Adding data files from Travis build #159 2016-08-23 22:14:07 +00:00
Al
58851a9088 [normalization] Adding NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so parser can normalize punctuation and HTML entities, etc. without touching the alphanumeric parts of the original input 2016-08-21 19:45:32 -04:00
Al
8b9702b43d [error handling] Checking that resize succeeded in transliterate.c 2016-08-21 19:43:09 -04:00
Al
2644fed18f [transliteration] Adding LATIN_ASCII_SIMPLE constant to transliterate.h 2016-08-21 19:42:10 -04:00
Al
4375bdea3b [transliteration] strduping transliterator name while building table 2016-08-21 19:41:34 -04:00
Al
bde8776bc2 [transliteration] Regenerating transliteration data files 2016-08-21 19:41:11 -04:00
Al
330edc2c93 [utils] cstring_array_get_phrase requires a char_array to be passed in so it doesn't have to do any memory allocation 2016-08-16 13:11:45 -04:00
Al
92e66fd60c [utils] string_next_hyphen_index 2016-08-16 12:49:52 -04:00
Al
3137ef5c6a [build] configure/Makefile changes to use SIMD exp and BLAS when available 2016-08-06 00:43:24 -04:00
Al
59e28c6c2a [math] double_array definition in collections.h to use new vectorized exp 2016-08-06 00:40:38 -04:00
Al
46cd725c13 [math] Generic dense matrix implementation using BLAS calls for matrix-matrix multiplication if available 2016-08-06 00:40:01 -04:00
Al
d4a792f33c [math] Adding fast SIMD exponent using the Remez algorithm for vectorized exp 2016-08-06 00:31:16 -04:00
Al
161f18575d [utils] Adding realloc checks to vector implementation 2016-08-05 23:02:52 -04:00
Al
20aad99a38 [parser] enum just lists boundary types 2016-07-30 17:07:23 -04:00
Al
965bac1833 [trie] Making methods to construct string phrases from phrase matches available through trie_search.h 2016-07-30 17:06:20 -04:00
Al
08f39d6b80 [parser] Adding address_parser_rewind to make multiple passes through the file when compiling the phrase tries 2016-07-28 17:13:58 -04:00
Al
1b09b7f2e5 [fix] Adding country_region to address_parser_train 2016-07-28 16:18:32 -04:00
Al
c6af5cc071 [parser] Adding country_region label to parser as a boundary component 2016-07-28 15:19:48 -04:00
Tom Davis
18c8e90eb3 Use xargs to start workers as soon as possible 2016-07-27 17:46:44 -04:00
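
The 2016-12-18 [utils] commit above mentions a naming convention: add_* calls NUL-terminate while append_* calls do not. The following is a minimal sketch of that convention only, not the project's code. The buf_t type and the buf_append/buf_add functions are hypothetical names invented for illustration, and the choice to strip an existing trailing NUL before writing is an assumption made so that repeated calls extend one string.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer, for illustration only. */
typedef struct {
    char *data;
    size_t len;  /* bytes in use, including a trailing NUL if present */
    size_t cap;  /* bytes allocated */
} buf_t;

/* Grow the buffer so that `extra` more bytes fit. Returns 0 on failure. */
static int buf_reserve(buf_t *b, size_t extra) {
    size_t need = b->len + extra;
    if (need <= b->cap) return 1;
    size_t cap = b->cap ? b->cap : 16;
    while (cap < need) cap *= 2;
    char *p = realloc(b->data, cap);
    if (p == NULL) return 0;  /* the old block is still valid on failure */
    b->data = p;
    b->cap = cap;
    return 1;
}

/* Drop a trailing NUL so the next write continues the same string. */
static void buf_strip_nul(buf_t *b) {
    if (b->len > 0 && b->data[b->len - 1] == '\0') b->len--;
}

/* append_* style: copy the bytes and leave the buffer unterminated. */
int buf_append(buf_t *b, const char *s) {
    size_t n = strlen(s);
    buf_strip_nul(b);
    if (n == 0) return 1;
    if (!buf_reserve(b, n)) return 0;
    memcpy(b->data + b->len, s, n);
    b->len += n;
    return 1;
}

/* add_* style: append the bytes, then NUL-terminate. */
int buf_add(buf_t *b, const char *s) {
    if (!buf_append(b, s)) return 0;
    if (!buf_reserve(b, 1)) return 0;
    b->data[b->len++] = '\0';
    return 1;
}
```

Under this sketch, a caller builds a string with any number of buf_append calls and finishes it with a single buf_add, after which data can be passed to functions that expect a C string.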