libpostal

tommy/libpostal

0558475a50 [language_classifier] Language classifier structs, I/O and API Al 2016-01-10 01:20:17 -05:00
b85e454a58 [fix] var Al 2016-01-09 03:43:53 -05:00
b13462f8ef [language_classifier] Features for address languages classification, quadgrams for most languages, unigrams for ideographic characters, script for single-script languages like Thai, Hebrew, etc. Al 2016-01-09 03:42:57 -05:00
29930fa7b6 [fix] sort hash keys by value Al 2016-01-09 03:38:25 -05:00
62017fd33d [optimization] Using sparse updates in stochastic gradient descent. Decomposing the updates into the gradient of the loss function (zero for features not observed in the current batch) and the gradient of the regularization term. The derivative of the regularization term in L2-regularized models is equivalent to an exponential decay function. Before computing the gradient for the current batch, we bring the weights up to date only for the features observed in that batch, and update only those values Al 2016-01-09 03:12:54 -05:00
aa22db11b2 [math] Matrix arithmetic Al 2016-01-09 01:45:10 -05:00
197b18f3cf [fix] NULL check Al 2016-01-09 01:43:25 -05:00
9c4b5ccbb1 [math] Adding array_{op}_times_scalar methods Al 2016-01-09 01:42:54 -05:00
2f1e2139ca [math] Unique columns as array for CSR sparse matrix Al 2016-01-09 01:40:26 -05:00
023c04d78f [classification] Pre-allocating memory in logistic regression trainer, storing last updated timestamps for sparse stochastic gradient descent and using the new gradient API Al 2016-01-09 01:39:24 -05:00
562cc06eaf [classification] Sparse version of logistic regression gradient which, given an array of the features/columns used in the input batch, only updates the gradient for that batch, even for the operations which otherwise would apply to the entire matrix (scaling by -1/m, regularization) Al 2016-01-09 01:33:33 -05:00
5ca4bba1d5 [fix] Writing matrix dimension as 64-bit Al 2016-01-08 01:29:52 -05:00
8f054eeeb1 [classification] Training structures for logistic regression and stochastic (minibatch) gradient descent update Al 2016-01-08 01:06:02 -05:00
4acf10c3a4 [classification] Multinomial logistic regression, gradient and cost function Al 2016-01-08 01:03:09 -05:00
8b70529711 [optimization] Stochastic gradient descent with gain schedule a la Leon Bottou Al 2016-01-08 00:54:17 -05:00
6b164d263e [math] Sparse matrix from dense Al 2016-01-08 00:48:57 -05:00
ba8fc716df [features] Functions for dealing with minibatches Al 2016-01-08 00:48:11 -05:00
06638d2885 [fix] only strdup when necessary in feature counting functions Al 2016-01-08 00:46:41 -05:00
31a3a2a3fa [math] Matrix scalar arithmetic functions Al 2016-01-08 00:44:33 -05:00
b6ce94166b [sparse] Only increase size of sparse matrix on finalize row if it needs to be Al 2016-01-07 13:19:22 -05:00
2e67afab09 [fix] adding functions to string_utils header Al 2016-01-06 23:03:16 -05:00
a8b9a2c153 [fix] making *_hash_sort_keys_by_value static Al 2016-01-06 23:01:00 -05:00
0d5cf0d6d7 [utils] char_array_cat_printf was forcing a doubling of the size of the buffer, which is bad if calling many times. Now only initiates a realloc if the char_array is almost full. Also adding cstring_array_from_strings which takes a list of char *s Al 2016-01-06 22:56:01 -05:00
8c019998d7 [phrases] trie_num_keys Al 2016-01-05 22:02:15 -05:00
22668945cb [mv] Moving trie_new_from_hash to a module Al 2016-01-05 16:43:17 -05:00
33e9a05ebf [tokenization] is_whitespace Al 2016-01-05 16:40:35 -05:00
6e1435ac48 [features] No copy versions of feature counts functions Al 2016-01-05 16:39:50 -05:00
a740417cab [utils] Adding hash sort by values for numeric types Al 2016-01-05 14:37:38 -05:00
6ef7c90278 [fix] using string_equals, handles NULLs Al 2016-01-05 14:08:10 -05:00
c0214d6023 [fix] free normalized string in address parser data set Al 2016-01-05 14:06:03 -05:00
6a5ad96a17 [math] Adding vector sort and vector argsort to numeric vectors Al 2016-01-05 10:55:55 -05:00
7aea79281e [math] Floating point equality with relative epsilon comparisons Al 2016-01-02 15:39:29 -05:00
81624f8b6d [dictionaries] All professional suffixes should use the abbreviated form as the canonical Al 2015-12-31 13:14:20 -05:00
780966a59b [api] More spacing fixes and using language information in normalize string Al 2015-12-31 03:52:14 -05:00
ff75c5cc50 [normalize] Adding normalize_string_languages method which can use additional transliterators Al 2015-12-31 03:50:33 -05:00
7906f5542d [dictionaries] ulitsa is the proper transliteration for Russian Al 2015-12-31 03:49:51 -05:00
9335d26fbd [fix] spacing Al 2015-12-31 01:48:38 -05:00
7bd1336b3b [fix] Freeing languages in Python Al 2015-12-31 01:46:04 -05:00
cc89b768d8 [dictionaries] New Japanese abbreviations from the OSM wiki Al 2015-12-31 01:32:42 -05:00
ffe9c2a971 [dictionaries] Santi/SS in Italian Al 2015-12-31 01:32:21 -05:00
ecfdbc3ec2 [dictionaries] New German toponym abbreviations from the OSM wiki Al 2015-12-31 01:32:00 -05:00
a6f7924f12 [dictionaries] Adding service road to English Al 2015-12-31 01:31:27 -05:00
684c238ca0 [dictionaries] Adding no to English ambiguous Al 2015-12-31 01:31:01 -05:00
1b0567a881 [fix] Ubuntu build Al 2015-12-28 17:19:50 -05:00
77ccd975c4 [fix] #endif Al 2015-12-28 17:03:12 -05:00
d0b5985cb7 [build] Adding /usr/local/lib and /usr/local/include to sparkey build Al 2015-12-28 16:56:10 -05:00
508459a9f9 [build] Adding -L/usr/local/lib to LDFLAGS before searching for snappy Al 2015-12-28 16:54:13 -05:00
d6362ba0fc [docs] Fleshing out parser description, correcting city name in Russian address Al 2015-12-28 15:46:49 -05:00
45b5e2dd6f [fix] array_zero Al 2015-12-28 01:24:27 -05:00
fb4c984f15 [math] sparse_matrix_new_shape Al 2015-12-28 01:20:23 -05:00
72ad01cbc3 [features] Using a str=>double hashtable for feature counts Al 2015-12-28 01:18:49 -05:00
e4dba2297d [mv] Moving token type checking to header Al 2015-12-28 01:16:56 -05:00
0fa1c2389c [fix] Leak in expanding strings that have a separable prefix and suffix, other than that ran through 78 million expansions with no discernable memory issues Al 2015-12-26 17:19:52 -05:00
deeb8f007e [fix] Check for result.len > 0 in false start continuation numex parsing, plus additional safety check during replacement Al 2015-12-24 02:26:29 -05:00
507dd631f8 [build] Adding json_encode.c to the address parser client sources Al 2015-12-23 19:37:28 -05:00
5e6d24ff7e [unicode] Upgrading to latest utf8proc from JuliaLang (Unicode 8) Al 2015-12-23 19:30:52 -05:00
3fbb3c587a [fix] using a char_array instead of copying the string in normalize_string Al 2015-12-23 19:21:54 -05:00
2eea999692 [fix] Fixing false start continuations in numex parsing Al 2015-12-23 19:19:04 -05:00
850d82de6e [fix] In trie search, moving fall-off and tail checks inside the inner character loop, adding tail position as a separate variable from offset in the string Al 2015-12-23 19:16:43 -05:00
19173d3a6e [transliteration] In set match checks, use the current index, not current index - char_len Al 2015-12-23 13:12:30 -05:00
e9e05bb929 [transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules Al 2015-12-23 13:04:39 -05:00
aaa1fc0387 [fix] Stepping through codepoints first then through chars in trie_search_prefixes_from_index (used in transliteration and numex) Al 2015-12-23 01:58:39 -05:00
baa8e3cc3f [fix] Compare the remaining part of the current UTF-8 character using simple string comparison, since it may be in the middle of a valid UTF-8 character Al 2015-12-21 20:34:15 -05:00
57040b8733 [docs] README fixes Al 2015-12-21 17:45:49 -05:00
ceda863e9f [fix] Encode strings as JSON in address parser cli Al 2015-12-21 17:45:06 -05:00
e55ff54be1 [fix] Adding Korean-Latin-BGN to excluded transliterators Al 2015-12-21 16:23:58 -05:00
c7fb7f685d [transliteration] Fixing group replacement in transliteration in the case of multiple groups, not adding to phrase length when checking context Al 2015-12-21 16:06:04 -05:00
682c316775 [transliteration] Removing Korean-Latin-BGN, not a great transliterator and AFAICT, ICU doesn't use it either Al 2015-12-21 12:45:45 -05:00
ab124465e6 [fix] regenerating transliteration data Al 2015-12-20 15:41:42 -05:00
ccf509edb1 [fix] update to control characters for generating the transliteration rules Al 2015-12-20 15:40:38 -05:00
5439f4679f [fix] Special tokens like emails/urls/phone numbers bypass normalization Al 2015-12-20 03:07:36 -05:00
cf2a0efa11 [fix] Prefixes and suffixes that are the same length as the original token should be handled as regular expansions Al 2015-12-19 17:29:26 -05:00
aaecd7961a [fix] Options out of order Al 2015-12-19 15:05:50 -05:00
48cb2b5c7b [api] Node was complaining about non-trivial designated initializers (probably the bit fields), so converting to old-school initializer Al 2015-12-19 02:34:31 -05:00
97906c86a8 [fix] Strip punctuation in final output in cases where there are no expansions Al 2015-12-19 02:10:41 -05:00
4497c4501e [fix] do not add a token if prefix/suffix expansions are inseparable and canonical Al 2015-12-19 01:36:02 -05:00
f8da44e8b0 [fix] Making a copy even on pure Latin-script transliteration since string_trim modifies in-place, occasionally causes issues Al 2015-12-19 01:31:52 -05:00
39e83961ef [fix] Bug in suffix expansion affecting inseparable suffixes like burg as well as ordinal suffixes like first=>1st Al 2015-12-19 01:29:49 -05:00
b2a944830a [transliteration] Making sure the Python script to generate transliteration data works on the new CLDR format Al 2015-12-19 00:34:30 -05:00
b4a8a69226 [expansion] Fixing extra space on prefix/suffix expansions Al 2015-12-18 20:28:59 -05:00
df47dad817 [fix] Partial matches, ultimate misses in concatenated suffixes Al 2015-12-18 17:36:58 -05:00
66073c17d5 [fix] Handling case of concatenated suffixes like straße when they stand alone Al 2015-12-18 17:17:35 -05:00
b71755bf7f [fix] Moving Python bindings up-front in the README Al 2015-12-17 14:28:36 -05:00
31ed88bf6a [api] Adding a --json option to expand cli Al 2015-12-17 13:46:55 -05:00
41ea105bb4 [api] Simple JSON encoding for strings, UTF-8 rather than Unicode Al 2015-12-17 12:24:40 -05:00
af78614f62 [fix] Print usage info on -h/--help to libpostal cli Al 2015-12-16 22:21:13 -05:00
f4ee9c2645 [fix] task list Al 2015-12-16 20:38:29 -05:00
54cc1b8b2d [fix] Python syntax highlighting for README instructions Al 2015-12-16 02:25:56 -05:00
f3b4a4e894 Merge pull request #11 from nvkelso/master Al Barrentine 2015-12-16 02:22:55 -05:00
59cc6d3417 [docs] README updates, better explanations of normalization and parsing Al 2015-12-16 02:19:10 -05:00
11a9c47cea Merge pull request #1 from nvkelso/nvkelso/readme-translit-typo Nathaniel V. KELSO 2015-12-15 22:45:35 -08:00
7ff7027cdb andthus > and thus in Transliteration section Nathaniel V. KELSO 2015-12-15 22:45:07 -08:00
3e44910664 [fix] Note about ldconfig Al 2015-12-16 00:48:22 -05:00
ef941a6634 [fix] README parses Al 2015-12-15 16:18:22 -05:00
c787821e96 [fix] README Al 2015-12-15 16:16:16 -05:00
6cccc3ee46 [fix] README addition Al 2015-12-15 16:07:21 -05:00
d1833a8f8f [docs] Updating README with parsing info/examples Al 2015-12-15 16:00:58 -05:00
83ba053373 [build] Removing setup.py fanciness. Install the C library first, then run setup.py or pip install Al 2015-12-15 14:31:58 -05:00
e0c0ed2d04 [numex] Return true if numex table already loaded Al 2015-12-15 14:28:40 -05:00
7e04017851 [fix] default for libdir Al 2015-12-15 12:21:49 -05:00
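
Commit 62017fd33d above describes sparse updates in stochastic gradient descent: the L2 regularization term shrinks every weight by the same factor at each step, so a feature that is absent from the current batch can skip those steps and be caught up later using its last-updated timestamp. The sketch below illustrates that idea in Python under simplifying assumptions that are not taken from the repository: a constant learning rate (the commits mention a gain schedule), a squared-loss linear model rather than multinomial logistic regression, and one example per step instead of a minibatch. The names `dense_sgd`, `lazy_sgd`, and `EXAMPLES` are hypothetical and do not appear in libpostal.

```python
# Illustrative only: lazy (sparse) L2-regularized SGD vs. the dense equivalent.
# Each example is (sparse_features, target), with sparse_features = {index: value}.
EXAMPLES = [
    ({0: 1.0, 2: 2.0}, 1.0),
    ({1: 1.0}, -1.0),
    ({0: 0.5, 3: 1.0}, 2.0),
    ({2: 1.0}, 0.5),
    ({1: 2.0, 3: 1.0}, 1.0),
]


def dense_sgd(examples, n_features, lr=0.1, l2=0.01):
    """Reference version: every weight is touched at every step."""
    w = [0.0] * n_features
    decay = 1.0 - lr * l2  # per-step shrinkage from the L2 term
    for x, y in examples:
        err = sum(w[i] * v for i, v in x.items()) - y
        for j in range(n_features):
            # The loss gradient is zero for features absent from x.
            w[j] = w[j] * decay - lr * err * x.get(j, 0.0)
    return w


def lazy_sgd(examples, n_features, lr=0.1, l2=0.01):
    """Sparse version: only features observed in the example are touched."""
    w = [0.0] * n_features
    last_updated = [0] * n_features  # step at which w[j] is current
    decay = 1.0 - lr * l2
    for t, (x, y) in enumerate(examples):
        # Apply the skipped regularization steps to the observed features.
        for j in x:
            w[j] *= decay ** (t - last_updated[j])
        err = sum(w[i] * v for i, v in x.items()) - y
        # Regular update (loss gradient plus this step's decay).
        for j, v in x.items():
            w[j] = w[j] * decay - lr * err * v
            last_updated[j] = t + 1
    # Bring every weight up to date before the weights are used.
    n_steps = len(examples)
    for j in range(n_features):
        w[j] *= decay ** (n_steps - last_updated[j])
    return w
```

Under these assumptions both functions should return the same weights up to floating-point rounding, while the lazy version does work proportional to the number of observed features per step rather than the total number of features.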
