libpostal

Author	SHA1	Message	Date
Al	d97c725bbc	[languages] Allowing specification of multiple regional languages	2015-08-18 03:18:52 -04:00
Al	03febc7e20	[scripts] Better script code aliasing	2015-08-13 18:25:55 -04:00
Al	b54ff95ecc	[mv] csv_utils	2015-08-13 18:19:54 -04:00
Al	cf70615850	[transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps	2015-08-11 23:10:55 -04:00
Al	51addec5f2	[fix] check for local CLDR in unicode properties	2015-08-11 20:23:48 -04:00
Al	882e4c2ab8	[fix] ensure CLDR dir	2015-08-11 20:04:42 -04:00
Al	48566bf097	[fix] cldr languages dir	2015-08-11 20:04:25 -04:00
Al	dd391eabe5	[numex] Separating rules from keys for Linux gcc compilation	2015-08-09 01:00:57 -04:00
Al	1d39916aaa	[fix] Fixing warnings in unicode script data	2015-08-02 21:30:54 -06:00
Al	87566bb6a5	[numex] Adding validation checks for numex JSON	2015-07-24 15:22:07 -04:00
Al	64a63fdf51	[mv] Moving all repo data files to a resources dir, data is only for runtime files	2015-07-21 18:11:36 -04:00
Al	076c07e21f	[fix] Add minor languages to the language set	2015-07-16 00:58:58 -04:00
Al	95a6845a85	[i18n] Adding regional languages as valid country languages	2015-07-08 14:54:00 -04:00
Al	a580ed0b1b	[transliteration] Adding numeric HTML escapes e.g. '&'	2015-06-29 15:02:34 -04:00
Al	8fb6a28e9c	[fix] using empty string instead of NULL for script languages so we can use fixed length arrays	2015-06-23 15:20:09 -05:00
Al	b21c3a3a2f	[transliteration] using different struct in script data header file	2015-06-22 22:06:16 -05:00
Al	c2b4744f55	[transliteration] Using a data file instead of a header for transliteration scripts	2015-06-21 05:37:56 -05:00
Al	84b9a6ff33	[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group	2015-06-17 23:42:31 -04:00
Al	f04fad0e93	[i18n] Generating Hangul syllable classes	2015-06-16 12:50:48 -04:00
Al	67bd9f1a31	[i18n] Adding languages.py	2015-06-15 17:48:47 -04:00
Al	fc735bb5c3	[numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500	2015-06-12 16:09:45 -04:00
Al	2d098fdab6	[numex] Adding ordinal_indicator rule type for CJK ordinals	2015-06-04 11:24:13 -04:00
Al	4c49f63caf	[numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th	2015-06-04 03:09:39 -04:00
Al	b2fe9d4db0	[transliteration] Adding uppercase umlauts and Scandinavian a-ring	2015-06-03 22:55:45 -04:00
Al	2ea21dfffb	[fix] constants	2015-06-02 13:44:25 -04:00
Al	208366af98	[fix] removing stopwords index	2015-06-02 12:43:48 -04:00
Al	9d0d83bc14	[numex] adding stopword rules with the regular numex rules	2015-06-02 12:37:22 -04:00
Al	4ad978f22c	[numex] Using the new representation for generated data	2015-06-02 12:28:07 -04:00
Al	2dc870b3da	[numex] Python script to generate numex data	2015-06-02 10:15:02 -04:00
Al	6b3d434c31	[fix] removing unnecessary definition	2015-06-01 17:13:57 -04:00
Al	9c935c9cc7	[fix] Base data dir path	2015-06-01 17:13:06 -04:00
Al	6ac4ff6021	[transliteration] Adding reverse/bidirectional transforms e.g. for Katakana-Latin	2015-05-31 02:07:36 -04:00
Al	9547c93a38	[fix] InterIndic-Latin is an internal transliterator, but needed for most of the Indic languages. Also fixing the string lengths for HTML entity replacements	2015-05-29 19:47:49 -04:00
Al	a278cfd12c	[transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string, removing any duplicate keys in the table builder so that if any rules happen to overlap within a step, the first will take precedence	2015-05-29 16:54:05 -04:00
Al	a9d5b91ac0	[transliteration] Not counting repeat character in group capture	2015-05-28 19:36:25 -04:00
Al	c00ecf6ea8	[fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string	2015-05-22 18:11:54 -04:00
Al	b2d15b29cf	[fix] greek_latin_ungegn => greek-latin-ungegn	2015-05-22 09:52:08 -04:00
Al	d65f7747f0	[transliteration] Adding html escapes as the first step in the Latin-ASCII transformation	2015-05-20 14:44:55 -04:00
Al	4694371cdc	[fix] unicode escaping the German transliterations	2015-05-18 13:55:57 -04:00
Al	e25f039ee4	[transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff	2015-05-17 18:31:35 -04:00
Al	d72348d47e	[transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found	2015-05-17 17:42:37 -04:00
Al	30db201e8a	[fix] NUM_CHARS => NUM_CODEPOINTS	2015-05-17 13:53:19 -04:00
Al	99115fa53c	[transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators.	2015-05-16 23:13:01 -04:00
Al	5983cb6af0	[i18n] Adding NUM_SCRIPTS to the end of the scripts enum	2015-05-16 12:19:40 -04:00
Al	1f3ac0c3f9	[transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals	2015-05-14 16:34:03 -04:00
Al	304dc9525a	[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han	2015-05-13 16:20:52 -04:00
Al	5bbf71ccbb	[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already	2015-05-12 18:57:57 -04:00
Al	b55db5fcda	[fix] usage text	2015-05-12 12:15:51 -04:00
Al	d5f9d8a29a	[mv] unicode_scripts => unicode_properties	2015-05-12 12:14:59 -04:00
Al	ff0e7cb9e1	[i18n] downloading several files from the Unicode Character Database	2015-05-12 12:12:17 -04:00

70 Commits