Commit Graph

154 Commits

Author SHA1 Message Date
Al
26ff3292d2 [fix] new script name, prefix result 2015-05-23 21:41:11 -04:00
Al
31cc2bb5d1 [fix] merging repeat codepoints in trie builder 2015-05-22 22:45:23 -04:00
Al
c00ecf6ea8 [fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string 2015-05-22 18:11:54 -04:00
Al
b2d15b29cf [fix] greek_latin_ungegn => greek-latin-ungegn 2015-05-22 09:52:08 -04:00
Al
27171e068d [phrases] constant for NULL prefix results 2015-05-22 09:08:07 -04:00
Al
cb14e5eef1 [phrases] trie_get_prefix_from_index takes an optional tail position 2015-05-21 06:16:14 -04:00
Al
91ccdf6f7b [phrases] trie_get_prefix_* methods return a struct including tail position 2015-05-21 05:38:21 -04:00
Al
395fbcb8b5 [fix] get_prefix on tries searches tail as well 2015-05-21 05:22:44 -04:00
Al
e84f3d93d2 [fix] get_prefix on tries searches tail as well 2015-05-20 20:57:14 -04:00
Al
c9ff3f278f [transliteration] new transform data file 2015-05-20 14:45:16 -04:00
Al
d65f7747f0 [transliteration] Adding html escapes as the first step in the Latin-ASCII transformation 2015-05-20 14:44:55 -04:00
Al
1fee0a3e35 [phrases] separating get_data_node from tail_match for tries 2015-05-20 13:51:04 -04:00
Al
bfb9aa21a1 [fix] unused var 2015-05-19 18:04:06 -04:00
Al
3d25378456 [transliteration] fixing a few warnings 2015-05-19 18:03:36 -04:00
Al
fdf988cb27 [phrases] adding a public get_data_node method for tries 2015-05-19 18:02:29 -04:00
Al
9d309ca9d3 [fix] moving constant 2015-05-18 14:25:21 -04:00
Al
eecee39904 [fix] giving constant trie node names more specificity 2015-05-18 14:24:39 -04:00
Al
c66f6f0fbe [transliteration] adding begin set token for regex character sets and fixing off-by-one in concatenated trie keys 2015-05-18 14:00:14 -04:00
Al
3c1e5c0471 [transliteration] new data file with the escaped German transliterations 2015-05-18 13:57:45 -04:00
Al
58571f70cc [utils] adding a boolean flag on string tree iterators for single path trees 2015-05-18 13:57:11 -04:00
Al
4694371cdc [fix] unicode escaping the German transliterations 2015-05-18 13:55:57 -04:00
Al
7eaa94d2fb [transliteration] new data file 2015-05-17 18:31:52 -04:00
Al
e25f039ee4 [transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff 2015-05-17 18:31:35 -04:00
Al
c39a19a352 [transliteration] New data file with the Greek/Katakana additions 2015-05-17 17:59:39 -04:00
Al
d72348d47e [transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found 2015-05-17 17:42:37 -04:00
Al
30db201e8a [fix] NUM_CHARS => NUM_CODEPOINTS 2015-05-17 13:53:19 -04:00
Al
1348cc8906 [transliteration] Switching the begin/end set chars 2015-05-17 12:02:46 -04:00
Al
f1cfb30209 [transliteration] generated scripts file 2015-05-17 00:00:14 -04:00
Al
b983a83a89 [transliteration] transliteration struct definitions, memory allocation, builder methods and I/O, stubbing transliterate method for the moment 2015-05-16 23:23:25 -04:00
Al
3a74a8c179 [transliteration] script to build transliteration table, trie, C structures, etc. from the rules 2015-05-16 23:22:16 -04:00
Al
65624c8985 [fix] vector_*_pop returns the element 2015-05-16 23:20:28 -04:00
Al
4a67294fbf [phrases] adding get_prefix methods for tries, remove add_nodes_only, fixing a few things and inlining a few functions 2015-05-16 23:19:59 -04:00
Al
e8fdd4564d [utils] adding string_tree for listing sets of token alternatives and string_tree_iterator to generate permutations over the strings, needed for transliteration and ambiguous address elements/place names 2015-05-16 23:16:10 -04:00
Al
f151a2232c [transliteration] new transliteration rules data file 2015-05-16 23:14:47 -04:00
Al
99115fa53c [transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators. 2015-05-16 23:13:01 -04:00
Al
5983cb6af0 [i18n] Adding NUM_SCRIPTS to the end of the scripts enum 2015-05-16 12:19:40 -04:00
Al
8699409f15 [transliteration] resulting data file 2015-05-14 16:34:49 -04:00
Al
1f3ac0c3f9 [transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals 2015-05-14 16:34:03 -04:00
Al
2d49369e78 [utils] Adding read/write for 64-bit ints to file_utils 2015-05-13 17:51:03 -04:00
Al
6898f8ecd9 [transliteration] Adding context types back to transliteration rule struct since they don't matter in the actual transliteration table 2015-05-13 16:51:07 -04:00
Al
b777b60e07 [transliteration] new data file 2015-05-13 16:21:16 -04:00
Al
304dc9525a [transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han 2015-05-13 16:20:52 -04:00
Al
cbe83376f2 [transliteration] Adding new, even smaller, generated data file 2015-05-12 18:58:38 -04:00
Al
5bbf71ccbb [transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already 2015-05-12 18:57:57 -04:00
Al
b55db5fcda [fix] usage text 2015-05-12 12:15:51 -04:00
Al
d5f9d8a29a [mv] unicode_scripts => unicode_properties 2015-05-12 12:14:59 -04:00
Al
0984fb9ea4 [transliteration] new, more compact transliteration data file 2015-05-12 12:13:11 -04:00
Al
ff0e7cb9e1 [i18n] downloading several files from the Unicode Character Database 2015-05-12 12:12:17 -04:00
Al
3814af52ec [transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie 2015-05-12 12:10:15 -04:00
Al
fe044cebef [transliteration] char set mapping for some of the more complicated sets found in CLDR 2015-05-10 18:34:53 -04:00