Al
|
b2e201f297
|
[fix] trailing comma
|
2015-06-20 15:14:41 -05:00 |
|
Al
|
d4087be40c
|
[geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs
|
2015-06-20 11:54:47 -05:00 |
|
Al
|
ab1fb3669f
|
[geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id
|
2015-06-19 15:47:50 -05:00 |
|
Al
|
84b9a6ff33
|
[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group
|
2015-06-17 23:42:31 -04:00 |
|
Al
|
f04fad0e93
|
[i18n] Generating Hangul syllable classes
|
2015-06-16 12:50:48 -04:00 |
|
Al
|
cb2035867b
|
[fix] osm geodata imports
|
2015-06-15 18:36:01 -04:00 |
|
Al
|
d2d25ead6f
|
[utils] Adding unicode_csv module
|
2015-06-15 18:06:54 -04:00 |
|
Al
|
ccb64f7ac2
|
[polygons] Adding address_normalizer polygons package
|
2015-06-15 17:55:27 -04:00 |
|
Al
|
22fa81b33f
|
[fix] __init__.py
|
2015-06-15 17:54:27 -04:00 |
|
Al
|
41dbd97bf2
|
[geodisambig] quattroshapes download can use default or specified location, unzips files
|
2015-06-15 17:54:08 -04:00 |
|
Al
|
037d4575ae
|
[geodisambig] Modifying GeoNames TSV again. Using files again and sorting
|
2015-06-15 17:51:09 -04:00 |
|
Al
|
67bd9f1a31
|
[i18n] Adding languages.py
|
2015-06-15 17:48:47 -04:00 |
|
Al
|
073fe43698
|
[geodisambig] Adding quattroshapes download script
|
2015-06-15 17:46:11 -04:00 |
|
Al
|
73f37fe66b
|
[fix] Moving default Geonames DB path to a shared module
|
2015-06-15 12:53:00 -04:00 |
|
Al
|
7a4fa7d443
|
[geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming
|
2015-06-15 01:58:43 -04:00 |
|
Al
|
43e023077c
|
[fix] Changing logging to stderr for the Geonames scripts
|
2015-06-14 15:38:57 -04:00 |
|
Al
|
fc735bb5c3
|
[numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500
|
2015-06-12 16:09:45 -04:00 |
|
Al
|
2d098fdab6
|
[numex] Adding ordinal_indicator rule type for CJK ordinals
|
2015-06-04 11:24:13 -04:00 |
|
Al
|
4c49f63caf
|
[numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th
|
2015-06-04 03:09:39 -04:00 |
|
Al
|
b2fe9d4db0
|
[transliteration] Adding uppercase umlauts and Scandinavian a-ring
|
2015-06-03 22:55:45 -04:00 |
|
Al
|
2ea21dfffb
|
[fix] constants
|
2015-06-02 13:44:25 -04:00 |
|
Al
|
208366af98
|
[fix] removing stopwords index
|
2015-06-02 12:43:48 -04:00 |
|
Al
|
9d0d83bc14
|
[numex] adding stopword rules with the regular numex rules
|
2015-06-02 12:37:22 -04:00 |
|
Al
|
4ad978f22c
|
[numex] Using the new representation for generated data
|
2015-06-02 12:28:07 -04:00 |
|
Al
|
2dc870b3da
|
[numex] Python script to generate numex data
|
2015-06-02 10:15:02 -04:00 |
|
Al
|
6b3d434c31
|
[fix] removing unnecessary definition
|
2015-06-01 17:13:57 -04:00 |
|
Al
|
9c935c9cc7
|
[fix] Base data dir path
|
2015-06-01 17:13:06 -04:00 |
|
Al
|
6ac4ff6021
|
[transliteration] Adding reverse/bidirectional transforms e.g. for Katakana-Latin
|
2015-05-31 02:07:36 -04:00 |
|
Al
|
9547c93a38
|
[fix] InterIndic-Latin is an internal transliterator, but needed for most of the Indic languages. Also fixing the string lengths for HTML entity replacements
|
2015-05-29 19:47:49 -04:00 |
|
Al
|
a278cfd12c
|
[transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string, removing any duplicate keys in the table builder so that if any rules happen to overlap within a step, the first will take precedence
|
2015-05-29 16:54:05 -04:00 |
|
Al
|
a9d5b91ac0
|
[transliteration] Not counting repeat character in group capture
|
2015-05-28 19:36:25 -04:00 |
|
Al
|
c00ecf6ea8
|
[fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string
|
2015-05-22 18:11:54 -04:00 |
|
Al
|
b2d15b29cf
|
[fix] greek_latin_ungegn => greek-latin-ungegn
|
2015-05-22 09:52:08 -04:00 |
|
Al
|
d65f7747f0
|
[transliteration] Adding html escapes as the first step in the Latin-ASCII transformation
|
2015-05-20 14:44:55 -04:00 |
|
Al
|
4694371cdc
|
[fix] unicode escaping the German transliterations
|
2015-05-18 13:55:57 -04:00 |
|
Al
|
e25f039ee4
|
[transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff
|
2015-05-17 18:31:35 -04:00 |
|
Al
|
d72348d47e
|
[transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found
|
2015-05-17 17:42:37 -04:00 |
|
Al
|
30db201e8a
|
[fix] NUM_CHARS => NUM_CODEPOINTS
|
2015-05-17 13:53:19 -04:00 |
|
Al
|
99115fa53c
|
[transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators.
|
2015-05-16 23:13:01 -04:00 |
|
Al
|
5983cb6af0
|
[i18n] Adding NUM_SCRIPTS to the end of the scripts enum
|
2015-05-16 12:19:40 -04:00 |
|
Al
|
1f3ac0c3f9
|
[transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals
|
2015-05-14 16:34:03 -04:00 |
|
Al
|
304dc9525a
|
[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han
|
2015-05-13 16:20:52 -04:00 |
|
Al
|
5bbf71ccbb
|
[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already
|
2015-05-12 18:57:57 -04:00 |
|
Al
|
b55db5fcda
|
[fix] usage text
|
2015-05-12 12:15:51 -04:00 |
|
Al
|
d5f9d8a29a
|
[mv] unicode_scripts => unicode_properties
|
2015-05-12 12:14:59 -04:00 |
|
Al
|
ff0e7cb9e1
|
[i18n] downloading several files from the Unicode Character Database
|
2015-05-12 12:12:17 -04:00 |
|
Al
|
3814af52ec
|
[transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie
|
2015-05-12 12:10:15 -04:00 |
|
Al
|
fe044cebef
|
[transliteration] char set mapping for some of the more complicated sets found in CLDR
|
2015-05-10 18:34:53 -04:00 |
|
Al
|
2a69488f9b
|
[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file.
|
2015-05-08 17:14:26 -04:00 |
|
Al
|
10ebaf147a
|
[transliteration] literal ^ and $ escaped
|
2015-05-01 19:16:36 -04:00 |
|