Al | 7298c895c8 | [utils] adding a chunked shuffle as the concatenated file sizes may get larger than memory | 2016-11-21 14:04:34 -05:00
Al | 58851a9088 | [normalization] Adding NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so parser can normalize punctuation and HTML entities, etc. without touching the alphanumeric parts of the original input | 2016-08-21 19:45:32 -04:00
Al | 8b9702b43d | [error handling] Checking that resize succeeded in transliterate.c | 2016-08-21 19:43:09 -04:00
Al | 2644fed18f | [transliteration] Adding LATIN_ASCII_SIMPLE constant to transliterate.h | 2016-08-21 19:42:10 -04:00
Al | 4375bdea3b | [transliteration] strduping transliterator name while building table | 2016-08-21 19:41:34 -04:00
Al | bde8776bc2 | [transliteration] Regenerating transliteration data files | 2016-08-21 19:41:11 -04:00
Al | 330edc2c93 | [utils] cstring_array_get_phrase requires a char_array to be passed in so it doesn't have to do any memory allocation | 2016-08-16 13:11:45 -04:00
Al | 92e66fd60c | [utils] string_next_hyphen_index | 2016-08-16 12:49:52 -04:00
Al | 3137ef5c6a | [build] configure/Makefile changes to use SIMD exp and BLAS when available | 2016-08-06 00:43:24 -04:00
Al | 59e28c6c2a | [math] double_array definition in collections.h to use new vectorized exp | 2016-08-06 00:40:38 -04:00
Al | 46cd725c13 | [math] Generic dense matrix implementation using BLAS calls for matrix-matrix multiplication if available | 2016-08-06 00:40:01 -04:00
Al | d4a792f33c | [math] Adding fast SIMD exponent using the Remez algorithm for vectorized exp | 2016-08-06 00:31:16 -04:00
Al | 161f18575d | [utils] Adding realloc checks to vector implementation | 2016-08-05 23:02:52 -04:00
Al | 20aad99a38 | [parser] enum just lists boundary types | 2016-07-30 17:07:23 -04:00
Al | 965bac1833 | [trie] Making methods to construct string phrases from phrase matches available through trie_search.h | 2016-07-30 17:06:20 -04:00
Al | 08f39d6b80 | [parser] Adding address_parser_rewind to make multiple passes through the file when compiling the phrase tries | 2016-07-28 17:13:58 -04:00
Al | 1b09b7f2e5 | [fix] Adding country_region to address_parser_train | 2016-07-28 16:18:32 -04:00
Al | c6af5cc071 | [parser] Adding country_region label to parser as a boundary component | 2016-07-28 15:19:48 -04:00
Al | 64f167f045 | [tokenization] Re-generating scanner | 2016-07-21 17:04:57 -04:00
Al | 81b4a4a1cb | [tokenization] Hyphens, etc. between non-ASCII digits (e.g. Unicode full-width numbers) should be single tokens | 2016-07-21 17:04:57 -04:00
Al | be5fd79a48 | [expansion] Prefix/suffix expansions by default can apply to ADDRESS_ANY but also inherit the types of any dictionary that lists their canonical form (so we can add suffixes without worrying about whether they're for streets or place names, etc.) | 2016-07-21 17:04:57 -04:00
Al | 8926293063 | [parser/cli] Using NFC normalization on the output in the parser client (closes #30). Optional command-line arg for parser output dir, useful for spot-checking different experiments | 2016-07-21 17:04:57 -04:00
Al | 44908ff95a | [parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces | 2016-07-21 17:04:57 -04:00
Al | 41ae742285 | [fix] tokenized trie search when falling off the trie at the start of a valid phrase | 2016-07-21 17:04:57 -04:00
Al | 6e60b3bbda | [fix] semicolon in #define | 2016-07-21 17:04:57 -04:00
Al | b5d4dd6f37 | [tokenization] Including full-width numbers in numeric tokens | 2016-07-21 17:04:57 -04:00
Al | dd7ef6fabf | [dictionaries] Making new component for near/nearby prepositions | 2016-07-21 17:04:57 -04:00
Al | 2454b98c6d | [tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing | 2016-07-21 17:04:57 -04:00
Al | 0a8f46bdc3 | [parser] Using new geonames designations in parser features | 2016-07-21 17:04:57 -04:00
Al | c383f8af88 | [parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections | 2016-07-21 17:04:57 -04:00
Al | c2ee5a45b3 | [geodb] Adding separate bitset for geonames place types and using NFC normalization instead of NFD (requires retraining) | 2016-07-21 17:04:57 -04:00
Al | 6c39c663ff | [normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization | 2016-07-21 17:04:57 -04:00
Al | 757c6147cb | [tokenization] Adding ability to tokenize 's Gravenhage | 2016-07-21 17:04:57 -04:00
Al | 2e8888e331 | [fix] warnings/size_t in libpostal.c | 2016-07-21 17:04:57 -04:00
Al | e800f21f06 | [gazetteers] Adding new gazetteer types/address components | 2016-07-21 17:04:57 -04:00
Al | e5e0cf3b92 | [fix] loading transliteration module in address_parser_test.c as well | 2016-07-21 17:04:57 -04:00
Al | b8d43dc601 | [fix] cstring_array_split calls | 2016-07-21 17:04:57 -04:00
Al | b19cd3f60a | [fix] brace | 2016-07-21 17:04:57 -04:00
Al | 994b2f18e4 | [parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in Vulgar fractions e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent | 2016-07-21 17:04:57 -04:00
Al | b664ab1cea | [utils] Adding cstring_array_split_ignore_consecutive | 2016-07-21 17:04:57 -04:00
Al | 8e90ee45d2 | [fix] calls and NULL checks | 2016-07-21 17:04:57 -04:00
Al | e3cffaf0d1 | [fix] tokenized_string_t should copy its source string | 2016-07-21 17:04:57 -04:00
Al | 16501aba17 | [fix] Need to load transliteration module for Latin-ASCII normalization | 2016-07-21 17:04:57 -04:00
Al | a9ba61585b | [fix] Adding set -e to data download script so it fails if any subcommands fail | 2016-05-04 23:08:06 -04:00
Al | 9819ebf949 | [fix] always include expansions in the ambiguous expansion dictionary, no matter which component | 2016-04-29 13:26:13 -04:00
Al | 0bc3550c11 | [expansion] Adding address_expansion_in_dictionary | 2016-04-29 13:23:48 -04:00
Al | 59e5fcd1b4 | [fix] LC_ALL=C in data download script | 2016-04-11 12:47:50 -04:00
Travis | b8d4d71522 | [auto][ci skip] Adding data files from Travis build #112 | 2016-03-30 20:04:52 +00:00
Al | 14e8f50cf1 | [fix] Expansions when passing in the address_components= option. Was only limiting results at the phrase level, should work at the individual expansion level | 2016-03-29 16:46:29 -04:00
Travis | 2795d258d1 | [auto][ci skip] Adding data files from Travis build #108 | 2016-03-29 19:11:57 +00:00