Commit Graph

22 Commits

Author SHA1 Message Date
Al
a278cfd12c [transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string, removing any duplicate keys in the table builder so that if any rules happen to overlap within a step, the first will take precedence 2015-05-29 16:54:05 -04:00
Al
a9d5b91ac0 [transliteration] Not counting repeat character in group capture 2015-05-28 19:36:25 -04:00
Al
c00ecf6ea8 [fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string 2015-05-22 18:11:54 -04:00
Al
b2d15b29cf [fix] greek_latin_ungegn => greek-latin-ungegn 2015-05-22 09:52:08 -04:00
Al
d65f7747f0 [transliteration] Adding html escapes as the first step in the Latin-ASCII transformation 2015-05-20 14:44:55 -04:00
Al
4694371cdc [fix] unicode escaping the German transliterations 2015-05-18 13:55:57 -04:00
Al
e25f039ee4 [transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff 2015-05-17 18:31:35 -04:00
Al
d72348d47e [transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found 2015-05-17 17:42:37 -04:00
Al
99115fa53c [transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators. 2015-05-16 23:13:01 -04:00
Al
1f3ac0c3f9 [transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals 2015-05-14 16:34:03 -04:00
Al
304dc9525a [transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han 2015-05-13 16:20:52 -04:00
Al
5bbf71ccbb [transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already 2015-05-12 18:57:57 -04:00
Al
d5f9d8a29a [mv] unicode_scripts => unicode_properties 2015-05-12 12:14:59 -04:00
Al
3814af52ec [transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie 2015-05-12 12:10:15 -04:00
Al
fe044cebef [transliteration] char set mapping for some of the more complicated sets found in CLDR 2015-05-10 18:34:53 -04:00
Al
2a69488f9b [fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file. 2015-05-08 17:14:26 -04:00
Al
10ebaf147a [transliteration] literal ^ and $ escaped 2015-05-01 19:16:36 -04:00
Al
ff851a464c [fix] escaping curly braces for regex compilation 2015-04-30 13:27:17 -04:00
Al
fa43abd8d9 [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key 2015-04-29 14:31:15 -04:00
Al
1c25238af7 [fix] string lengths on the various transliteration rules 2015-04-27 13:51:35 -04:00
Al
6ebea11640 [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters 2015-04-26 19:47:54 -04:00
Al
be29874f13 [transliteration] Parser for CLDR transforms to generate (simple) C transform rules 2015-04-25 15:42:21 -04:00