Al
|
2d49369e78
|
[utils] Adding read/write for 64-bit ints to file_utils
|
2015-05-13 17:51:03 -04:00 |
|
Al
|
6898f8ecd9
|
[transliteration] Adding context types back to transliteration rule struct since they don't matter in the actual transliteration table
|
2015-05-13 16:51:07 -04:00 |
|
Al
|
b777b60e07
|
[transliteration] new data file
|
2015-05-13 16:21:16 -04:00 |
|
Al
|
304dc9525a
|
[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han
|
2015-05-13 16:20:52 -04:00 |
|
Al
|
cbe83376f2
|
[transliteration] Adding new, even smaller, generated data file
|
2015-05-12 18:58:38 -04:00 |
|
Al
|
5bbf71ccbb
|
[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already
|
2015-05-12 18:57:57 -04:00 |
|
Al
|
b55db5fcda
|
[fix] usage text
|
2015-05-12 12:15:51 -04:00 |
|
Al
|
d5f9d8a29a
|
[mv] unicode_scripts => unicode_properties
|
2015-05-12 12:14:59 -04:00 |
|
Al
|
0984fb9ea4
|
[transliteration] new, more compact transliteration data file
|
2015-05-12 12:13:11 -04:00 |
|
Al
|
ff0e7cb9e1
|
[i18n] downloading several files from the Unicode Character Database
|
2015-05-12 12:12:17 -04:00 |
|
Al
|
3814af52ec
|
[transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie
|
2015-05-12 12:10:15 -04:00 |
|
Al
|
fe044cebef
|
[transliteration] char set mapping for some of the more complicated sets found in CLDR
|
2015-05-10 18:34:53 -04:00 |
|
Al
|
2a69488f9b
|
[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file.
|
2015-05-08 17:14:26 -04:00 |
|
Al
|
10ebaf147a
|
[transliteration] literal ^ and $ escaped
|
2015-05-01 19:16:36 -04:00 |
|
Al
|
ff851a464c
|
[fix] escaping curly braces for regex compilation
|
2015-04-30 13:27:17 -04:00 |
|
Al
|
fa43abd8d9
|
[transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key
|
2015-04-29 14:31:15 -04:00 |
|
Al
|
1c25238af7
|
[fix] string lengths on the various transliteration rules
|
2015-04-27 13:51:35 -04:00 |
|
Al
|
1373843b86
|
[fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't.
|
2015-04-27 01:49:08 -04:00 |
|
Al
|
b2ba629f95
|
[fix] trie_get methods just return node index rather than data value
|
2015-04-27 01:28:05 -04:00 |
|
Al
|
8fb9bacfa6
|
[phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs
|
2015-04-27 01:01:43 -04:00 |
|
Al
|
8bc77372ef
|
[phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries
|
2015-04-26 22:24:02 -04:00 |
|
Al
|
6ebea11640
|
[transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters
|
2015-04-26 19:47:54 -04:00 |
|
Al
|
ff9b6735f8
|
[transliteration] Adding header + generated C data file for simplified transliteration rules
|
2015-04-25 15:44:36 -04:00 |
|
Al
|
be29874f13
|
[transliteration] Parser for CLDR transforms to generate (simple) C transform rules
|
2015-04-25 15:42:21 -04:00 |
|
Al
|
1b33744956
|
[tokenization] Numeric tokens must end in number or letter
|
2015-04-22 14:55:18 -04:00 |
|
Al
|
9c0126a01c
|
[utils] two set types in collections.h
|
2015-04-19 09:32:53 -04:00 |
|
Al
|
908e3dc03c
|
[phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search
|
2015-04-19 09:32:20 -04:00 |
|
Al
|
606a669c01
|
[tokenization] breaking dashes or double hyphens break a word while other dashes don't
|
2015-04-17 19:14:42 -04:00 |
|
Al
|
6718182443
|
[tokenization] non-breaking dashes can be mid-word; em-dashes, etc. break words
|
2015-04-17 15:21:22 -04:00 |
|
Al
|
e21873635c
|
[utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions
|
2015-04-15 20:17:03 -04:00 |
|
Al
|
24e62b1c6c
|
[tokenization] Script to generate TR-29 ranges for re2c scanner
|
2015-04-14 15:50:50 -04:00 |
|
Al
|
5fa03587fb
|
[cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing
|
2015-04-14 15:49:24 -04:00 |
|
Al
|
efdcbc9eef
|
[project] adding a Python .gitignore for scripts, Python lib, etc.
|
2015-04-14 15:48:43 -04:00 |
|
Al
|
6e9295154a
|
[fix] local dirs for cldr data
|
2015-04-14 15:46:15 -04:00 |
|
Al
|
744231c148
|
[fix] cldr supplemental uses local copy
|
2015-04-13 19:03:44 -04:00 |
|
Al
|
a8b9981c9b
|
[fix] vars
|
2015-04-13 19:03:14 -04:00 |
|
Al
|
d1267145f7
|
[fix] args to wget
|
2015-04-13 19:02:50 -04:00 |
|
Al
|
d771da7c78
|
[i18n] unicode scripts file downloaded and cached locally
|
2015-04-13 19:02:29 -04:00 |
|
Al
|
cc4d2d08eb
|
[cldr] Adding script to download latest cldr release instead of pulling from the repo
|
2015-04-13 01:03:15 -04:00 |
|
Al
|
e241c1dfc8
|
[rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks
|
2015-04-12 18:07:33 -04:00 |
|
Al
|
83813bb980
|
[geodisambig] Models for geonames with msgpack serialization/deserialization
|
2015-04-12 16:47:01 -04:00 |
|
Al
|
acb575c84c
|
[fix] splitting out methods for unicode scripts
|
2015-04-12 15:21:23 -04:00 |
|
Al
|
1f9da05dd5
|
[geodisambig] C msgpack serialization dependency
|
2015-04-12 15:14:01 -04:00 |
|
Al
|
0234754c20
|
[fix] warnings in string_utils
|
2015-04-12 12:16:32 -04:00 |
|
Al
|
d50d7d182e
|
[fix] geonames import script for admin 1 codes
|
2015-04-12 12:16:08 -04:00 |
|
Al
|
888baa86f3
|
[fix] English dictionaries
|
2015-04-12 12:15:47 -04:00 |
|
Al
|
3a7f18581e
|
[utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header
|
2015-04-12 12:11:04 -04:00 |
|
Al
|
fdd0c489f3
|
[fix] refactoring unicode script fetching into more reusable functions
|
2015-04-09 02:18:13 -04:00 |
|
Al
|
4729dfe178
|
[utils] string_[rl]strip => string_[rl]trim, removing warning about allocation
|
2015-04-06 02:19:19 -04:00 |
|
Al
|
53844067b1
|
[fix] better allocation sizes for tokenized strings
|
2015-04-05 22:02:31 -04:00 |
|