Commit Graph

18 Commits

Author SHA1 Message Date
Al
d65f7747f0 [transliteration] Adding html escapes as the first step in the Latin-ASCII transformation 2015-05-20 14:44:55 -04:00
Al
4694371cdc [fix] unicode escaping the German transliterations 2015-05-18 13:55:57 -04:00
Al
e25f039ee4 [transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff 2015-05-17 18:31:35 -04:00
Al
d72348d47e [transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found 2015-05-17 17:42:37 -04:00
Al
99115fa53c [transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators. 2015-05-16 23:13:01 -04:00
Al
1f3ac0c3f9 [transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals 2015-05-14 16:34:03 -04:00
Al
304dc9525a [transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han 2015-05-13 16:20:52 -04:00
Al
5bbf71ccbb [transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already 2015-05-12 18:57:57 -04:00
Al
d5f9d8a29a [mv] unicode_scripts => unicode_properties 2015-05-12 12:14:59 -04:00
Al
3814af52ec [transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie 2015-05-12 12:10:15 -04:00
Al
fe044cebef [transliteration] char set mapping for some of the more complicated sets found in CLDR 2015-05-10 18:34:53 -04:00
Al
2a69488f9b [fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file. 2015-05-08 17:14:26 -04:00
Al
10ebaf147a [transliteration] literal ^ and $ escaped 2015-05-01 19:16:36 -04:00
Al
ff851a464c [fix] escaping curly braces for regex compilation 2015-04-30 13:27:17 -04:00
Al
fa43abd8d9 [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key 2015-04-29 14:31:15 -04:00
Al
1c25238af7 [fix] string lengths on the various transliteration rules 2015-04-27 13:51:35 -04:00
Al
6ebea11640 [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters 2015-04-26 19:47:54 -04:00
Al
be29874f13 [transliteration] Parser for CLDR transforms to generate (simple) C transform rules 2015-04-25 15:42:21 -04:00