Commit Graph

181 Commits

Author SHA1 Message Date
Al
c39a19a352 [transliteration] New data file with the Greek/Katakana additions 2015-05-17 17:59:39 -04:00
Al
d72348d47e [transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found 2015-05-17 17:42:37 -04:00
Al
30db201e8a [fix] NUM_CHARS => NUM_CODEPOINTS 2015-05-17 13:53:19 -04:00
Al
1348cc8906 [transliteration] Switching the begin/end set chars 2015-05-17 12:02:46 -04:00
Al
f1cfb30209 [transliteration] generated scripts file 2015-05-17 00:00:14 -04:00
Al
b983a83a89 [transliteration] transliteration struct definitions, memory allocation, builder methods and I/O, stubbing transliterate method for the moment 2015-05-16 23:23:25 -04:00
Al
3a74a8c179 [transliteration] script to build transliteration table, trie, C structures, etc. from the rules 2015-05-16 23:22:16 -04:00
Al
65624c8985 [fix] vector_*_pop returns the element 2015-05-16 23:20:28 -04:00
Al
4a67294fbf [phrases] adding get_prefix methods for tries, remove add_nodes_only, fixing a few things and inlining a few functions 2015-05-16 23:19:59 -04:00
Al
e8fdd4564d [utils] adding string_tree for listing sets of token alternatives and string_tree_iterator to generate permutations over the strings, needed for transliteration and ambiguous address elements/place names 2015-05-16 23:16:10 -04:00
Al
f151a2232c [transliteration] new transliteration rules data file 2015-05-16 23:14:47 -04:00
Al
99115fa53c [transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators. 2015-05-16 23:13:01 -04:00
Al
5983cb6af0 [i18n] Adding NUM_SCRIPTS to the end of the scripts enum 2015-05-16 12:19:40 -04:00
Al
8699409f15 [transliteration] resulting data file 2015-05-14 16:34:49 -04:00
Al
1f3ac0c3f9 [transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals 2015-05-14 16:34:03 -04:00
Al
2d49369e78 [utils] Adding read/write for 64-bit ints to file_utils 2015-05-13 17:51:03 -04:00
Al
6898f8ecd9 [transliteration] Adding context types back to transliteration rule struct since they don't matter in the actual transliteration table 2015-05-13 16:51:07 -04:00
Al
b777b60e07 [transliteration] new data file 2015-05-13 16:21:16 -04:00
Al
304dc9525a [transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han 2015-05-13 16:20:52 -04:00
Al
cbe83376f2 [transliteration] Adding new, even smaller, generated data file 2015-05-12 18:58:38 -04:00
Al
5bbf71ccbb [transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already 2015-05-12 18:57:57 -04:00
Al
b55db5fcda [fix] usage text 2015-05-12 12:15:51 -04:00
Al
d5f9d8a29a [mv] unicode_scripts => unicode_properties 2015-05-12 12:14:59 -04:00
Al
0984fb9ea4 [transliteration] new, more compact transliteration data file 2015-05-12 12:13:11 -04:00
Al
ff0e7cb9e1 [i18n] downloading several files from the Unicode Character Database 2015-05-12 12:12:17 -04:00
Al
3814af52ec [transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie 2015-05-12 12:10:15 -04:00
Al
fe044cebef [transliteration] char set mapping for some of the more complicated sets found in CLDR 2015-05-10 18:34:53 -04:00
Al
2a69488f9b [fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file. 2015-05-08 17:14:26 -04:00
Al
10ebaf147a [transliteration] literal ^ and $ escaped 2015-05-01 19:16:36 -04:00
Al
ff851a464c [fix] escaping curly braces for regex compilation 2015-04-30 13:27:17 -04:00
Al
fa43abd8d9 [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key 2015-04-29 14:31:15 -04:00
Al
1c25238af7 [fix] string lengths on the various transliteration rules 2015-04-27 13:51:35 -04:00
Al
1373843b86 [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. 2015-04-27 01:49:08 -04:00
Al
b2ba629f95 [fix] trie_get methods just return node index rather than data value 2015-04-27 01:28:05 -04:00
Al
8fb9bacfa6 [phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs 2015-04-27 01:01:43 -04:00
Al
8bc77372ef [phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries 2015-04-26 22:24:02 -04:00
Al
6ebea11640 [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters 2015-04-26 19:47:54 -04:00
Al
ff9b6735f8 [transliteration] Adding header + generated C data file for simplified transliteration rules 2015-04-25 15:44:36 -04:00
Al
be29874f13 [transliteration] Parser for CLDR transforms to generate (simple) C transform rules 2015-04-25 15:42:21 -04:00
Al
1b33744956 [tokenization] Numeric tokens must end in number or letter 2015-04-22 14:55:18 -04:00
Al
9c0126a01c [utils] two set types in collections.h 2015-04-19 09:32:53 -04:00
Al
908e3dc03c [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search 2015-04-19 09:32:20 -04:00
Al
606a669c01 [tokenization] breaking dashes or double hyphens break a word while other dashes don't 2015-04-17 19:14:42 -04:00
Al
6718182443 [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words 2015-04-17 15:21:22 -04:00
Al
e21873635c [utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions 2015-04-15 20:17:03 -04:00
Al
24e62b1c6c [tokenization] Script to generate TR-29 ranges for re2c scanner 2015-04-14 15:50:50 -04:00
Al
5fa03587fb [cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing 2015-04-14 15:49:24 -04:00
Al
efdcbc9eef [project] adding a Python .gitignore for scripts, Python lib, etc. 2015-04-14 15:48:43 -04:00
Al
6e9295154a [fix] local dirs for cldr data 2015-04-14 15:46:15 -04:00
Al
744231c148 [fix] cldr supplemental uses local copy 2015-04-13 19:03:44 -04:00