Commit Graph

  • 7dcb4bf6f4 [numex] correct signature Al 2015-06-02 16:08:19 -04:00
  • 93d65d0186 [numex] numex table builder, fix to constant Al 2015-06-02 13:57:00 -04:00
  • a44997c71c [fix] new generated numex data file Al 2015-06-02 13:45:06 -04:00
  • 2ea21dfffb [fix] constants Al 2015-06-02 13:44:25 -04:00
  • 2d5d854754 [fix] compilation/warnings Al 2015-06-02 13:43:16 -04:00
  • 208366af98 [fix] removing stopwords index Al 2015-06-02 12:43:48 -04:00
  • 49816382c1 [numex] New generated data file Al 2015-06-02 12:37:39 -04:00
  • 9d0d83bc14 [numex] adding stopword rules with the regular numex rules Al 2015-06-02 12:37:22 -04:00
  • 816a0408ab [numex] numex_rule.h Al 2015-06-02 12:30:56 -04:00
  • 8ef3a50b79 [numex] Initial generated numex data file Al 2015-06-02 12:28:28 -04:00
  • 4ad978f22c [numex] Using the new representation for generated data Al 2015-06-02 12:28:07 -04:00
  • 958c219b88 [utils] constants.h Al 2015-06-02 12:25:58 -04:00
  • 2dc870b3da [numex] Python script to generate numex data Al 2015-06-02 10:15:02 -04:00
  • 6b3d434c31 [fix] removing unnecessary definition Al 2015-06-01 17:13:57 -04:00
  • 9c935c9cc7 [fix] Base data dir path Al 2015-06-01 17:13:06 -04:00
  • 505456d9d2 [fix] removing unnecessary header Al 2015-06-01 17:12:33 -04:00
  • 080f382065 [numex] Removing concatenated property from language struct as all numeric spellouts might be concatenated Al 2015-06-01 17:12:07 -04:00
  • a20b768237 [numex] Russian numex rules (a start at least, might need a native speaker to review the RBNF transform in CLDR) Al 2015-06-01 17:08:46 -04:00
  • 05ffbffb23 [numex] Latin numex rules i.e. Roman numerals, used for most languages Al 2015-06-01 17:07:58 -04:00
  • 028bb5a1aa [numex] German numex rules Al 2015-06-01 17:07:35 -04:00
  • 9bd75cee23 [numex] Romance language numex rules (Spanish, French, Italian, Portuguese) Al 2015-06-01 17:07:23 -04:00
  • 99aed992da [numex] English numex rules Al 2015-06-01 17:06:53 -04:00
  • 920e15bd4d [numex] Adding numex setup/IO methods Al 2015-06-01 15:42:44 -04:00
  • c0347a3431 [numex] numex header and structs Al 2015-06-01 15:41:34 -04:00
  • b74fa0da99 [config] Adding config header Al 2015-06-01 15:40:59 -04:00
  • 93172bd16d [transliteration] New transliterator_scripts file Al 2015-05-31 02:09:28 -04:00
  • 0575984144 [transliteration] New data file Al 2015-05-31 02:08:26 -04:00
  • 6ac4ff6021 [transliteration] Adding reverse/bidirectional transforms e.g. for Katakana-Latin Al 2015-05-31 02:07:36 -04:00
  • 664d5e90db [fix] Removing the stub comment and a few more random comments Al 2015-05-29 20:10:12 -04:00
  • 06318a6fab [fix] logging code Al 2015-05-29 20:08:49 -04:00
  • 55568e9ffa [fix] Removing commented out section Al 2015-05-29 20:01:12 -04:00
  • 583cadd44f [transliteration] transliterate implementation from trie (need to build/save the tables first) Al 2015-05-29 19:59:37 -04:00
  • 6239c2fcfc [transliteration] regenerated data file including InterIndic-Latin dependency Al 2015-05-29 19:48:19 -04:00
  • 9547c93a38 [fix] InterIndic-Latin is an internal transliterator, but needed for most of the Indic languages. Also fixing the string lengths for HTML entity replacements Al 2015-05-29 19:47:49 -04:00
  • 8b56d63fde [fix] only count non-set chars in parse_groups Al 2015-05-29 19:42:05 -04:00
  • a278cfd12c [transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string, removing any duplicate keys in the table builder so that if any rules happen to overlap within a step, the first will take precedence Al 2015-05-29 16:53:28 -04:00
  • a9d5b91ac0 [transliteration] Not counting repeat character in group capture Al 2015-05-28 19:36:25 -04:00
  • 0177fd4b13 [fix] trie_search using proper length in utf8proc_iterate Al 2015-05-27 16:08:09 -04:00
  • ad8e92182c [phrases] trie I/O using the uint APIs, fixes to trie_get_prefix_result_from_index Al 2015-05-27 16:06:35 -04:00
  • 897c29ccb8 [fix] transliterate.h Al 2015-05-27 16:04:18 -04:00
  • 17f88c3adc [utils] using unsigned ints in file_utils, adding doubles Al 2015-05-27 16:03:36 -04:00
  • 8ac8f83b7f [utils] changing signature of utf8proc_iterate_reversed so it takes the same arguments as utf8proc_iterate for function pointer purposes Al 2015-05-25 15:35:28 -04:00
  • 26ff3292d2 [fix] new script name, prefix result Al 2015-05-23 21:41:11 -04:00
  • 31cc2bb5d1 [fix] merging repeat codepoints in trie builder Al 2015-05-22 22:45:23 -04:00
  • c00ecf6ea8 [fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string Al 2015-05-22 18:11:54 -04:00
  • b2d15b29cf [fix] greek_latin_ungegn => greek-latin-ungegn Al 2015-05-22 09:52:08 -04:00
  • 27171e068d [phrases] constant for NULL prefix results Al 2015-05-22 09:08:07 -04:00
  • cb14e5eef1 [phrases] trie_get_prefix_from_index takes an optional tail position Al 2015-05-21 06:16:08 -04:00
  • 91ccdf6f7b [phrases] trie_get_prefix_* methods return a struct including tail position Al 2015-05-21 05:38:18 -04:00
  • 395fbcb8b5 [fix] get_prefix on tries searches tail as well Al 2015-05-20 20:57:14 -04:00
  • e84f3d93d2 [fix] get_prefix on tries searches tail as well Al 2015-05-20 20:57:14 -04:00
  • c9ff3f278f [transliteration] new transform data file Al 2015-05-20 14:45:16 -04:00
  • d65f7747f0 [transliteration] Adding html escapes as the first step in the Latin-ASCII transformation Al 2015-05-20 14:44:55 -04:00
  • 1fee0a3e35 [phrases] separating get_data_node from tail_match for tries Al 2015-05-20 13:51:04 -04:00
  • bfb9aa21a1 [fix] unused var Al 2015-05-19 18:04:06 -04:00
  • 3d25378456 [transliteration] fixing a few warnings Al 2015-05-19 18:03:36 -04:00
  • fdf988cb27 [phrases] adding a public get_data_node method for tries Al 2015-05-19 18:02:29 -04:00
  • 9d309ca9d3 [fix] moving constant Al 2015-05-18 14:25:21 -04:00
  • eecee39904 [fix] giving constant trie node names more specificity Al 2015-05-18 14:24:39 -04:00
  • c66f6f0fbe [transliteration] adding begin set token for regex character sets and fixing off-by-one in concatenated trie keys Al 2015-05-18 14:00:14 -04:00
  • 3c1e5c0471 [transliteration] new data file with the escaped German transliterations Al 2015-05-18 13:57:45 -04:00
  • 58571f70cc [utils] adding a boolean flag on string tree iterators for single path trees Al 2015-05-18 13:57:11 -04:00
  • 4694371cdc [fix] unicode escaping the German transliterations Al 2015-05-18 13:55:57 -04:00
  • 7eaa94d2fb [transliteration] new data file Al 2015-05-17 18:31:52 -04:00
  • e25f039ee4 [transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff Al 2015-05-17 18:31:35 -04:00
  • c39a19a352 [transliteration] New data file with the Greek/Katakana additions Al 2015-05-17 17:59:39 -04:00
  • d72348d47e [transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found Al 2015-05-17 17:42:29 -04:00
  • 30db201e8a [fix] NUM_CHARS => NUM_CODEPOINTS Al 2015-05-17 13:53:01 -04:00
  • 1348cc8906 [transliteration] Switching the begin/end set chars Al 2015-05-17 12:02:46 -04:00
  • f1cfb30209 [transliteration] generated scripts file Al 2015-05-17 00:00:14 -04:00
  • b983a83a89 [transliteration] transliteration struct definitions, memory allocation, builder methods and I/O, stubbing transliterate method for the moment Al 2015-05-16 23:23:23 -04:00
  • 3a74a8c179 [transliteration] script to build transliteration table, trie, C structures, etc. from the rules Al 2015-05-16 23:22:16 -04:00
  • 65624c8985 [fix] vector_*_pop returns the element Al 2015-05-16 23:20:28 -04:00
  • 4a67294fbf [phrases] adding get_prefix methods for tries, remove add_nodes_only, fixing a few things and inlining a few functions Al 2015-05-16 23:19:59 -04:00
  • e8fdd4564d [utils] adding string_tree for listing sets of token alternatives and string_tree_iterator to generate permutations over the strings, needed for transliteration and ambiguous address elements/place names Al 2015-05-16 23:16:10 -04:00
  • f151a2232c [transliteration] new transliteration rules data file Al 2015-05-16 23:14:47 -04:00
  • 99115fa53c [transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators. Al 2015-05-16 23:12:29 -04:00
  • 5983cb6af0 [i18n] Adding NUM_SCRIPTS to the end of the scripts enum Al 2015-05-16 12:19:40 -04:00
  • 8699409f15 [transliteration] resulting data file Al 2015-05-14 16:34:49 -04:00
  • 1f3ac0c3f9 [transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals Al 2015-05-14 16:34:03 -04:00
  • 2d49369e78 [utils] Adding read/write for 64-bit ints to file_utils Al 2015-05-13 17:51:03 -04:00
  • 6898f8ecd9 [transliteration] Adding context types back to transliteration rule struct since they don't matter in the actual transliteration table Al 2015-05-13 16:51:07 -04:00
  • b777b60e07 [transliteration] new data file Al 2015-05-13 16:21:16 -04:00
  • 304dc9525a [transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han Al 2015-05-13 16:20:52 -04:00
  • cbe83376f2 [transliteration] Adding new, even smaller, generated data file Al 2015-05-12 18:58:35 -04:00
  • 5bbf71ccbb [transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already Al 2015-05-12 18:57:57 -04:00
  • b55db5fcda [fix] usage text Al 2015-05-12 12:15:51 -04:00
  • d5f9d8a29a [mv] unicode_scripts => unicode_properties Al 2015-05-12 12:14:59 -04:00
  • 0984fb9ea4 [transliteration] new, more compact transliteration data file Al 2015-05-12 12:13:06 -04:00
  • ff0e7cb9e1 [i18n] downloading several files from the Unicode Character Database Al 2015-05-12 12:12:17 -04:00
  • 3814af52ec [transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie Al 2015-05-12 12:10:15 -04:00
  • fe044cebef [transliteration] char set mapping for some of the more complicated sets found in CLDR Al 2015-05-10 18:34:53 -04:00
  • 2a69488f9b [fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file. Al 2015-05-08 17:14:22 -04:00
  • 10ebaf147a [transliteration] literal ^ and $ escaped Al 2015-05-01 19:16:26 -04:00
  • ff851a464c [fix] escaping curly braces for regex compilation Al 2015-04-30 13:27:17 -04:00
  • fa43abd8d9 [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key Al 2015-04-29 14:31:15 -04:00
  • 1c25238af7 [fix] string lengths on the various transliteration rules Al 2015-04-27 13:51:35 -04:00
  • 1373843b86 [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. Al 2015-04-27 01:49:02 -04:00
  • b2ba629f95 [fix] trie_get methods just return node index rather than data value Al 2015-04-27 01:28:05 -04:00
  • 8fb9bacfa6 [phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs Al 2015-04-27 01:01:43 -04:00