b13462f8ef[language_classifier] Features for address language classification: quadgrams for most languages, unigrams for ideographic characters, script for single-script languages like Thai, Hebrew, etc.
Al
2016-01-09 03:42:57 -05:00
29930fa7b6[fix] sort hash keys by value
Al
2016-01-09 03:38:25 -05:00
62017fd33d[optimization] Using sparse updates in stochastic gradient descent. Decomposing the updates into the gradient of the loss function (zero for features not observed in the current batch) and the gradient of the regularization term. The derivative of the regularization term in L2-regularized models is equivalent to an exponential decay function. Before computing the gradient for the current batch, we bring the weights up to date only for the features observed in that batch, and update only those values.
Al
2016-01-09 03:12:54 -05:00
aa22db11b2[math] Matrix arithmetic
Al
2016-01-09 01:45:10 -05:00
197b18f3cf[fix] NULL check
Al
2016-01-09 01:43:25 -05:00
9c4b5ccbb1[math] Adding array_{op}_times_scalar methods
Al
2016-01-09 01:42:54 -05:00
2f1e2139ca[math] Unique columns as array for CSR sparse matrix
Al
2016-01-09 01:40:26 -05:00
023c04d78f[classification] Pre-allocating memory in logistic regression trainer, storing last updated timestamps for sparse stochastic gradient descent and using the new gradient API
Al
2016-01-09 01:39:24 -05:00
562cc06eaf[classification] Sparse version of logistic regression gradient which, given an array of the features/columns used in the input batch, only updates the gradient for that batch, even for the operations which otherwise would apply to the entire matrix (scaling by -1/m, regularization)
Al
2016-01-09 01:33:33 -05:00
5ca4bba1d5[fix] Writing matrix dimension as 64-bit
Al
2016-01-08 01:29:52 -05:00
8f054eeeb1[classification] Training structures for logistic regression and stochastic (minibatch) gradient descent update
Al
2016-01-08 01:06:02 -05:00
4acf10c3a4[classification] Multinomial logistic regression, gradient and cost function
Al
2016-01-08 01:03:09 -05:00
8b70529711[optimization] Stochastic gradient descent with gain schedule a la Leon Bottou
Al
2016-01-08 00:54:17 -05:00
6b164d263e[math] Sparse matrix from dense
Al
2016-01-08 00:48:57 -05:00
ba8fc716df[features] Functions for dealing with minibatches
Al
2016-01-08 00:48:11 -05:00
06638d2885[fix] only strdup when necessary in feature counting functions
Al
2016-01-08 00:46:41 -05:00
31a3a2a3fa[math] Matrix scalar arithmetic functions
Al
2016-01-08 00:44:33 -05:00
b6ce94166b[sparse] Only increase the size of the sparse matrix on finalize row if needed
Al
2016-01-07 13:19:22 -05:00
2e67afab09[fix] adding functions to string_utils header
Al
2016-01-06 23:03:16 -05:00
a8b9a2c153[fix] making *_hash_sort_keys_by_value static
Al
2016-01-06 23:01:00 -05:00
0d5cf0d6d7[utils] char_array_cat_printf was forcing a doubling of the size of the buffer, which is bad if calling many times. Now only initiates a realloc if the char_array is almost full. Also adding cstring_array_from_strings which takes a list of char *s
Al
2016-01-06 22:56:01 -05:00
8c019998d7[phrases] trie_num_keys
Al
2016-01-05 22:02:15 -05:00
22668945cb[mv] Moving trie_new_from_hash to a module
Al
2016-01-05 16:43:17 -05:00
33e9a05ebf[tokenization] is_whitespace
Al
2016-01-05 16:40:35 -05:00
6e1435ac48[features] No copy versions of feature counts functions
Al
2016-01-05 16:39:50 -05:00
a740417cab[utils] Adding hash sort by values for numeric types
Al
2016-01-05 14:37:38 -05:00
6ef7c90278[fix] using string_equals, handles NULLs
Al
2016-01-05 14:08:10 -05:00
c0214d6023[fix] free normalized string in address parser data set
Al
2016-01-05 14:06:03 -05:00
6a5ad96a17[math] Adding vector sort and vector argsort to numeric vectors
Al
2016-01-05 10:55:55 -05:00
7aea79281e[math] Floating point equality with relative epsilon comparisons
Al
2016-01-02 15:39:29 -05:00
81624f8b6d[dictionaries] All professional suffixes should use the abbreviated form as the canonical
Al
2015-12-31 13:14:20 -05:00
780966a59b[api] More spacing fixes and using language information in normalize string
Al
2015-12-31 03:52:14 -05:00
ff75c5cc50[normalize] Adding normalize_string_languages method which can use additional transliterators
Al
2015-12-31 03:50:33 -05:00
7906f5542d[dictionaries] ulitsa is the proper transliteration for Russian
Al
2015-12-31 03:49:51 -05:00
9335d26fbd[fix] spacing
Al
2015-12-31 01:48:38 -05:00
7bd1336b3b[fix] Freeing languages in Python
Al
2015-12-31 01:46:04 -05:00
cc89b768d8[dictionaries] New Japanese abbreviations from the OSM wiki
Al
2015-12-31 01:32:42 -05:00
ffe9c2a971[dictionaries] Santi/SS in Italian
Al
2015-12-31 01:32:21 -05:00
ecfdbc3ec2[dictionaries] New German toponym abbreviations from the OSM wiki
Al
2015-12-31 01:32:00 -05:00
a6f7924f12[dictionaries] Adding service road to English
Al
2015-12-31 01:31:27 -05:00
684c238ca0[dictionaries] Adding no to English ambiguous
Al
2015-12-31 01:31:01 -05:00
1b0567a881[fix] Ubuntu build
Al
2015-12-28 17:19:50 -05:00
77ccd975c4[fix] #endif
Al
2015-12-28 17:03:12 -05:00
d0b5985cb7[build] Adding /usr/local/lib and /usr/local/include to sparkey build
Al
2015-12-28 16:56:10 -05:00
508459a9f9[build] Adding -L/usr/local/lib to LDFLAGS before searching for snappy
Al
2015-12-28 16:54:13 -05:00
d6362ba0fc[docs] Fleshing out parser description, correcting city name in Russian address
Al
2015-12-28 15:46:49 -05:00
45b5e2dd6f[fix] array_zero
Al
2015-12-28 01:24:27 -05:00
fb4c984f15[math] sparse_matrix_new_shape
Al
2015-12-28 01:20:23 -05:00
72ad01cbc3[features] Using a str=>double hashtable for feature counts
Al
2015-12-28 01:18:49 -05:00
e4dba2297d[mv] Moving token type checking to header
Al
2015-12-28 01:16:56 -05:00
0fa1c2389c[fix] Leak in expanding strings that have a separable prefix and suffix. Other than that, ran through 78 million expansions with no discernible memory issues
Al
2015-12-26 17:19:52 -05:00
deeb8f007e[fix] Check for result.len > 0 in false start continuation numex parsing, plus additional safety check during replacement
Al
2015-12-24 02:26:29 -05:00
507dd631f8[build] Adding json_encode.c to the address parser client sources
Al
2015-12-23 19:37:28 -05:00
5e6d24ff7e[unicode] Upgrading to latest utf8proc from JuliaLang (Unicode 8)
Al
2015-12-23 19:30:52 -05:00
3fbb3c587a[fix] using a char_array instead of copying the string in normalize_string
Al
2015-12-23 19:21:54 -05:00
2eea999692[fix] Fixing false start continuations in numex parsing
Al
2015-12-23 19:19:04 -05:00
850d82de6e[fix] In trie search, moving fall-off and tail checks inside the inner character loop, adding tail position as a separate variable from offset in the string
Al
2015-12-23 19:16:43 -05:00
19173d3a6e[transliteration] In set match checks, use the current index, not current index - char_len
Al
2015-12-23 13:12:30 -05:00
e9e05bb929[transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules
Al
2015-12-23 13:04:39 -05:00
aaa1fc0387[fix] Stepping through codepoints first then through chars in trie_search_prefixes_from_index (used in transliteration and numex)
Al
2015-12-23 01:58:39 -05:00
baa8e3cc3f[fix] Compare the remaining part of the current UTF-8 character using simple string comparison, since it may be in the middle of a valid UTF-8 character
Al
2015-12-21 20:34:15 -05:00
57040b8733[docs] README fixes
Al
2015-12-21 17:45:49 -05:00
ceda863e9f[fix] Encode strings as JSON in address parser cli
Al
2015-12-21 17:45:06 -05:00
e55ff54be1[fix] Adding Korean-Latin-BGN to excluded transliterators
Al
2015-12-21 16:23:58 -05:00
c7fb7f685d[transliteration] Fixing group replacement in transliteration in the case of multiple groups, not adding to phrase length when checking context
Al
2015-12-21 16:06:04 -05:00
682c316775[transliteration] Removing Korean-Latin-BGN, not a great transliterator and, AFAICT, ICU doesn't use it either
Al
2015-12-21 12:45:45 -05:00
ab124465e6[fix] regenerating transliteration data
Al
2015-12-20 15:41:42 -05:00
ccf509edb1[fix] update to control characters for generating the transliteration rules
Al
2015-12-20 15:40:38 -05:00
5439f4679f[fix] Special tokens like emails/urls/phone numbers bypass normalization
Al
2015-12-20 03:07:36 -05:00
cf2a0efa11[fix] Prefixes and suffixes that are the same length as the original token should be handled as regular expansions
Al
2015-12-19 17:29:26 -05:00
aaecd7961a[fix] Options out of order
Al
2015-12-19 15:05:50 -05:00
48cb2b5c7b[api] Node was complaining about non-trivial designated initializers (probably the bit fields), so converting to old-school initializer
Al
2015-12-19 02:34:31 -05:00
97906c86a8[fix] Strip punctuation in final output in cases where there are no expansions
Al
2015-12-19 02:10:41 -05:00
4497c4501e[fix] do not add a token if prefix/suffix expansions are inseparable and canonical
Al
2015-12-19 01:36:02 -05:00
f8da44e8b0[fix] Making a copy even on pure Latin-script transliteration, since string_trim modifies in-place, which occasionally causes issues
Al
2015-12-19 01:31:52 -05:00
39e83961ef[fix] Bug in suffix expansion affecting inseparable suffixes like burg as well as ordinal suffixes like first=>1st
Al
2015-12-19 01:29:49 -05:00
b2a944830a[transliteration] Making sure the Python script to generate transliteration data works on the new CLDR format
Al
2015-12-19 00:34:30 -05:00
b4a8a69226[expansion] Fixing extra space on prefix/suffix expansions
Al
2015-12-18 20:28:59 -05:00
df47dad817[fix] Partial matches, ultimate misses in concatenated suffixes
Al
2015-12-18 17:36:58 -05:00
66073c17d5[fix] Handling case of concatenated suffixes like straße when they stand alone
Al
2015-12-18 17:17:35 -05:00
b71755bf7f[fix] Moving Python bindings up-front in the README
Al
2015-12-17 14:28:36 -05:00
31ed88bf6a[api] Adding a --json option to expand cli
Al
2015-12-17 13:46:55 -05:00
41ea105bb4[api] Simple JSON encoding for strings, UTF-8 rather than Unicode
Al
2015-12-17 12:24:40 -05:00
af78614f62[fix] Print usage info on -h/--help to libpostal cli
Al
2015-12-16 22:21:13 -05:00
f4ee9c2645[fix] task list
Al
2015-12-16 20:38:29 -05:00
54cc1b8b2d[fix] Python syntax highlighting for README instructions
Al
2015-12-16 02:25:56 -05:00
f3b4a4e894Merge pull request #11 from nvkelso/master
Al Barrentine
2015-12-16 02:22:55 -05:00
59cc6d3417[docs] README updates, better explanations of normalization and parsing
Al
2015-12-16 02:19:10 -05:00
11a9c47ceaMerge pull request #1 from nvkelso/nvkelso/readme-translit-typo
Nathaniel V. KELSO
2015-12-15 22:45:35 -08:00
7ff7027cdbandthus > and thus in Transliteration section
Nathaniel V. KELSO
2015-12-15 22:45:07 -08:00
3e44910664[fix] Note about ldconfig
Al
2015-12-16 00:48:22 -05:00
ef941a6634[fix] README parses
Al
2015-12-15 16:18:22 -05:00
c787821e96[fix] README
Al
2015-12-15 16:16:16 -05:00
6cccc3ee46[fix] README addition
Al
2015-12-15 16:07:21 -05:00
d1833a8f8f[docs] Updating README with parsing info/examples
Al
2015-12-15 16:00:58 -05:00
83ba053373[build] Removing setup.py fanciness. Install the C library first, then run setup.py or pip install
Al
2015-12-15 14:31:58 -05:00
e0c0ed2d04[numex] Return true if numex table already loaded
Al
2015-12-15 14:28:40 -05:00
7e04017851[fix] default for libdir
Al
2015-12-15 12:21:49 -05:00