80 Commits

Author SHA1 Message Date
Al
7dfbcce9ec [languages] options for get_country_languages 2015-09-30 04:09:07 -04:00
Al
5417b4e602 [unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories 2015-09-25 23:59:38 -04:00
Al
abfb1d4a60 [transliteration] Wide char support in transliteration data generator 2015-09-23 03:56:12 -04:00
Al
13bcc35523 [unicode] Allowing wide chars in unicode properties 2015-09-23 00:34:07 -04:00
Al
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner 2015-09-23 00:33:14 -04:00
Al
a76831df7a [unicode] Wide version of word breaks 2015-09-22 18:55:33 -04:00
Al
a916668f28 [i18n] Local file for ISO 15924 2015-09-01 23:58:36 -04:00
Al
b8e4c19146 [mv] Moving the get regional/country languages logic out of language polygons 2015-08-23 14:25:33 -04:00
Al
122a81b610 [languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib 2015-08-23 02:26:06 -04:00
Al
0701bb6f08 [fix] import 2015-08-22 23:19:43 -04:00
Al
d97c725bbc [languages] Allowing specification of multiple regional languages 2015-08-18 03:18:52 -04:00
Al
03febc7e20 [scripts] Better script code aliasing 2015-08-13 18:25:55 -04:00
Al
b54ff95ecc [mv] csv_utils 2015-08-13 18:19:54 -04:00
Al
cf70615850 [transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps 2015-08-11 23:10:55 -04:00
Al
51addec5f2 [fix] check for local CLDR in unicode properties 2015-08-11 20:23:48 -04:00
Al
882e4c2ab8 [fix] ensure CLDR dir 2015-08-11 20:04:42 -04:00
Al
48566bf097 [fix] cldr languages dir 2015-08-11 20:04:25 -04:00
Al
dd391eabe5 [numex] Separating rules from keys for Linux gcc compilation 2015-08-09 01:00:57 -04:00
Al
1d39916aaa [fix] Fixing warnings in unicode script data 2015-08-02 21:30:54 -06:00
Al
87566bb6a5 [numex] Adding validation checks for numex JSON 2015-07-24 15:22:07 -04:00
Al
64a63fdf51 [mv] Moving all repo data files to a resources dir, data is only for runtime files 2015-07-21 18:11:36 -04:00
Al
076c07e21f [fix] Add minor languages to the language set 2015-07-16 00:58:58 -04:00
Al
95a6845a85 [i18n] Adding regional languages as valid country languages 2015-07-08 14:54:00 -04:00
Al
a580ed0b1b [transliteration] Adding numeric HTML escapes e.g. '&#38;' 2015-06-29 15:02:34 -04:00
Al
8fb6a28e9c [fix] using empty string instead of NULL for script languages so we can use fixed length arrays 2015-06-23 15:20:09 -05:00
Al
b21c3a3a2f [transliteration] using different struct in script data header file 2015-06-22 22:06:16 -05:00
Al
c2b4744f55 [transliteration] Using a data file instead of a header for transliteration scripts 2015-06-21 05:37:56 -05:00
Al
84b9a6ff33 [transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group 2015-06-17 23:42:31 -04:00
Al
f04fad0e93 [i18n] Generating Hangul syllable classes 2015-06-16 12:50:48 -04:00
Al
67bd9f1a31 [i18n] Adding languages.py 2015-06-15 17:48:47 -04:00
Al
fc735bb5c3 [numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500 2015-06-12 16:09:45 -04:00
Al
2d098fdab6 [numex] Adding ordinal_indicator rule type for CJK ordinals 2015-06-04 11:24:13 -04:00
Al
4c49f63caf [numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th 2015-06-04 03:09:39 -04:00
Al
b2fe9d4db0 [transliteration] Adding uppercase umlauts and Scandinavian a-ring 2015-06-03 22:55:45 -04:00
Al
2ea21dfffb [fix] constants 2015-06-02 13:44:25 -04:00
Al
208366af98 [fix] removing stopwords index 2015-06-02 12:43:48 -04:00
Al
9d0d83bc14 [numex] adding stopword rules with the regular numex rules 2015-06-02 12:37:22 -04:00
Al
4ad978f22c [numex] Using the new representation for generated data 2015-06-02 12:28:07 -04:00
Al
2dc870b3da [numex] Python script to generate numex data 2015-06-02 10:15:02 -04:00
Al
6b3d434c31 [fix] removing unnecessary definition 2015-06-01 17:13:57 -04:00
Al
9c935c9cc7 [fix] Base data dir path 2015-06-01 17:13:06 -04:00
Al
6ac4ff6021 [transliteration] Adding reverse/bidirectional transforms e.g. for Katakana-Latin 2015-05-31 02:07:36 -04:00
Al
9547c93a38 [fix] InterIndic-Latin is an internal transliterator, but needed for most of the Indic languages. Also fixing the string lengths for HTML entity replacements 2015-05-29 19:47:49 -04:00
Al
a278cfd12c [transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string, removing any duplicate keys in the table builder so that if any rules happen to overlap within a step, the first will take precedence 2015-05-29 16:54:05 -04:00
Al
a9d5b91ac0 [transliteration] Not counting repeat character in group capture 2015-05-28 19:36:25 -04:00
Al
c00ecf6ea8 [fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string 2015-05-22 18:11:54 -04:00
Al
b2d15b29cf [fix] greek_latin_ungegn => greek-latin-ungegn 2015-05-22 09:52:08 -04:00
Al
d65f7747f0 [transliteration] Adding html escapes as the first step in the Latin-ASCII transformation 2015-05-20 14:44:55 -04:00
Al
4694371cdc [fix] unicode escaping the German transliterations 2015-05-18 13:55:57 -04:00
Al
e25f039ee4 [transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff 2015-05-17 18:31:35 -04:00