Al
|
4b35da629f
|
[numex] regenerated numex data file
|
2016-11-30 15:58:55 -08:00 |
|
Al
|
4677874610
|
[parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations
|
2016-11-30 15:58:03 -08:00 |
|
Al
|
0e29cdd9fd
|
[parser] fixing some uninitialized value issues during parser training
|
2016-11-30 15:42:09 -08:00 |
|
Al
|
f5a6bd0f36
|
[fix] sparse_matrix_new_from_matrix uses new matrix types
|
2016-11-30 10:15:12 -08:00 |
|
Al
|
b639fa5127
|
[utils] string_replace also creates a copy
|
2016-11-30 10:09:33 -08:00 |
|
Al
|
89f6611c4e
|
[strings] string_trim makes a copy rather than modifying the pointer
|
2016-11-28 15:06:07 -08:00 |
|
Al
|
d922d9a60a
|
[expansion] regenerated address_expansion_data.c
|
2016-11-28 10:47:15 -08:00 |
|
Al
|
f78281456a
|
[fix] header definition
|
2016-11-27 01:00:25 -08:00 |
|
Al
|
eea11beb6a
|
[expansion] using easier-to-access data structure for address dictionaries
|
2016-11-27 00:56:48 -08:00 |
|
Al
|
7298c895c8
|
[utils] adding a chunked shuffle as the concatenated file sizes may get larger than memory
|
2016-11-21 14:04:34 -05:00 |
|
Al
|
58851a9088
|
[normalization] Adding NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so parser can normalize punctuation and HTML entities, etc. without touching the alphanumeric parts of the original input
|
2016-08-21 19:45:32 -04:00 |
|
Al
|
8b9702b43d
|
[error handling] Checking that resize succeeded in transliterate.c
|
2016-08-21 19:43:09 -04:00 |
|
Al
|
2644fed18f
|
[transliteration] Adding LATIN_ASCII_SIMPLE constant to transliterate.h
|
2016-08-21 19:42:10 -04:00 |
|
Al
|
4375bdea3b
|
[transliteration] strduping transliterator name while building table
|
2016-08-21 19:41:34 -04:00 |
|
Al
|
bde8776bc2
|
[transliteration] Regenerating transliteration data files
|
2016-08-21 19:41:11 -04:00 |
|
Al
|
330edc2c93
|
[utils] cstring_array_get_phrase requires a char_array to be passed in so it doesn't have to do any memory allocation
|
2016-08-16 13:11:45 -04:00 |
|
Al
|
92e66fd60c
|
[utils] string_next_hyphen_index
|
2016-08-16 12:49:52 -04:00 |
|
Al
|
3137ef5c6a
|
[build] configure/Makefile changes to use SIMD exp and BLAS when available
|
2016-08-06 00:43:24 -04:00 |
|
Al
|
59e28c6c2a
|
[math] double_array definition in collections.h to use new vectorized exp
|
2016-08-06 00:40:38 -04:00 |
|
Al
|
46cd725c13
|
[math] Generic dense matrix implementation using BLAS calls for matrix-matrix multiplication if available
|
2016-08-06 00:40:01 -04:00 |
|
Al
|
d4a792f33c
|
[math] Adding fast SIMD exponent using the Remez algorithm for vectorized exp
|
2016-08-06 00:31:16 -04:00 |
|
Al
|
161f18575d
|
[utils] Adding realloc checks to vector implementation
|
2016-08-05 23:02:52 -04:00 |
|
Al
|
20aad99a38
|
[parser] enum just lists boundary types
|
2016-07-30 17:07:23 -04:00 |
|
Al
|
965bac1833
|
[trie] Making methods to construct string phrases from phrase matches available through trie_search.h
|
2016-07-30 17:06:20 -04:00 |
|
Al
|
08f39d6b80
|
[parser] Adding address_parser_rewind to make multiple passes through the file when compiling the phrase tries
|
2016-07-28 17:13:58 -04:00 |
|
Al
|
1b09b7f2e5
|
[fix] Adding country_region to address_parser_train
|
2016-07-28 16:18:32 -04:00 |
|
Al
|
c6af5cc071
|
[parser] Adding country_region label to parser as a boundary component
|
2016-07-28 15:19:48 -04:00 |
|
Al
|
64f167f045
|
[tokenization] Re-generating scanner
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
81b4a4a1cb
|
[tokenization] Hyphens, etc. between non-ASCII digits (e.g. Unicode full-width numbers) should be single tokens
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
be5fd79a48
|
[expansion] Prefix/suffix expansions by default can apply to ADDRESS_ANY but also inherit the types of any dictionary that lists their canonical form (so we can add suffixes without worrying about whether they're for streets or place names, etc.)
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
8926293063
|
[parser/cli] Using NFC normalization on the output in the parser client (closes #30). Optional command-line arg for parser output dir, useful for spot-checking different experiments
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
44908ff95a
|
[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
41ae742285
|
[fix] tokenized trie search when falling off the trie at the start of a valid phrase
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
6e60b3bbda
|
[fix] semicolon in #define
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
b5d4dd6f37
|
[tokenization] Including full-width numbers in numeric tokens
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
dd7ef6fabf
|
[dictionaries] Making new component for near/nearby prepositions
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
2454b98c6d
|
[tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
0a8f46bdc3
|
[parser] Using new geonames designations in parser features
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
c383f8af88
|
[parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
c2ee5a45b3
|
[geodb] Adding separate bitset for geonames place types and using NFC normalization instead of NFD (requires retraining)
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
6c39c663ff
|
[normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
757c6147cb
|
[tokenization] Adding ability to tokenize 's Gravenhage
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
2e8888e331
|
[fix] warnings/size_t in libpostal.c
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
e800f21f06
|
[gazetteers] Adding new gazetteer types/address components
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
e5e0cf3b92
|
[fix] loading transliteration module in address_parser_test.c as well
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
b8d43dc601
|
[fix] cstring_array_split calls
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
b19cd3f60a
|
[fix] brace
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
994b2f18e4
|
[parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in vulgar fractions, e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
b664ab1cea
|
[utils] Adding cstring_array_split_ignore_consecutive
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
8e90ee45d2
|
[fix] calls and NULL checks
|
2016-07-21 17:04:57 -04:00 |
|