Commit Graph

308 Commits

Author SHA1 Message Date
Al
b58877ec6c [utils] string_is_lower/string_is_upper method 2015-07-01 14:49:22 -04:00
Al
58c6ff104a [fix] Russian feminine ordinals 2015-07-01 13:57:42 -04:00
Al
d0db015667 [geodisambig] Adding new fields to geonames struct, plus I/O 2015-07-01 13:02:00 -04:00
Al
af56c3cd09 [config] constants 2015-07-01 13:01:22 -04:00
Al
fa643f7a3a [utf8] Moving language length constant 2015-06-30 19:17:20 -04:00
Al
071d6bb392 [geodisambig] Adding presence of a Wikipedia link to the GeoNames output (an unqualified entry for the name in Wikipedia usually indicates a primary meaning). Ranking ambiguous entries for each term so that the top entry should be selected if no further information is available 2015-06-30 18:00:07 -04:00
Al
8d64c9301e [transliteration] Re-generating transliteration data file 2015-06-29 15:03:59 -04:00
Al
a580ed0b1b [transliteration] Adding numeric HTML escapes e.g. '&' 2015-06-29 15:02:34 -04:00
Al
3279b31b09 [tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens 2015-06-29 03:00:46 -04:00
Al
47efce4b7e [transliteration] Stopping set check loop on empty transition 2015-06-28 20:46:23 -04:00
Al
cc0401a8d1 [utf8] Adding a boolean struct member for string_script_t return values, set to true if the string is ASCII (no transliteration needed, should be frequent for English addresses) 2015-06-28 19:37:58 -04:00
Al
f0bf7e750c [transliteration] Fixing edge case in transliteration where a naked character fails context matching but the set-wrapped version matches 2015-06-28 15:19:19 -04:00
Al
a5dacf3d2b [utils] Adding method to get a particular token alternative from a string tree 2015-06-28 15:15:29 -04:00
Al
246237c1f1 [transliteration] Adding a get_transliteration_table() to foreach_transliterator macro since it lives in the header 2015-06-28 15:14:49 -04:00
Al
0f3bcaf49c [dictionaries] Flatter hierarchy for dictionaries 2015-06-26 13:14:14 -04:00
Al
7c161ee5b6 [numex] Regenerating numex data file 2015-06-26 12:36:40 -04:00
Al
d21f8135f3 [numex] Adding full stop ordinal indicators to German, Danish and Polish 2015-06-26 12:35:53 -04:00
Al
6a8ab48662 [numex] Adding method to get ordinal suffixes, using single representation 2015-06-25 17:28:06 -04:00
Al
9337bf9aea [phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes 2015-06-25 17:24:19 -04:00
Al
82e85732c4 [fix] Setting codepoint in utf8proc_iterate_reversed 2015-06-25 17:20:55 -04:00
Al
4fbcb72368 [fix] utf8proc option 2015-06-25 10:07:37 -04:00
Al
c376bcef3d [utils] get_string_script returns a struct rather than modifying a pointer for the length 2015-06-25 10:06:38 -04:00
Al
bcee9832b3 [utils] cstring_array_get_token=>cstring_array_get_string 2015-06-25 10:05:35 -04:00
Al
2b69c185fa [tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change) 2015-06-25 10:03:34 -04:00
Al
581cf406a6 [utf8] Adding length argument to string_script function 2015-06-24 13:39:09 -05:00
Al
5e71a9d805 [utf8] Adding method to get the script of a string and the length of the span (rolls Common script up with the previous script) 2015-06-24 13:29:40 -05:00
Al
85348e1178 [fix] enum value conflicted with existing name 2015-06-23 15:38:59 -05:00
Al
077e7fd5e2 [transliteration] Adding script/language lookups and I/O 2015-06-23 15:35:52 -05:00
Al
423d9ca7b7 [transliteration] table builder adds script/language rules 2015-06-23 15:35:16 -05:00
Al
c3143e5291 [transliteration] Adding structs/header stuff for transliterator lookup by script/language 2015-06-23 15:34:38 -05:00
Al
8fb6a28e9c [fix] using empty string instead of NULL for script languages so we can use fixed length arrays 2015-06-23 15:20:09 -05:00
Al
f2d03a7937 [fix] renaming structure 2015-06-23 02:12:24 -05:00
Al
7dd772de0f [fix] implementation of cstring_array_split 2015-06-23 02:11:24 -05:00
Al
d4cae97fd3 [transliteration] regenerated scripts data file 2015-06-23 02:10:10 -05:00
Al
b21c3a3a2f [transliteration] using different struct in script data header file 2015-06-22 22:06:16 -05:00
Al
2e54ca3575 [transliteration] including script data file, adding len to transliterate API for tokenized transliteration 2015-06-21 05:42:20 -05:00
Al
79530ae974 [transliteration] Adding transliteration script data file 2015-06-21 05:39:06 -05:00
Al
c2b4744f55 [transliteration] Using a data file instead of a header for transliteration scripts 2015-06-21 05:37:56 -05:00
Al
b2e201f297 [fix] trailing comma 2015-06-20 15:14:41 -05:00
Al
f8bff25948 [bloom] bloom filter I/O 2015-06-20 12:29:11 -05:00
Al
0ed80c3f6e [geonames] Geonames generic serialization/deserialization 2015-06-20 12:00:15 -05:00
Al
d4087be40c [geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs 2015-06-20 11:54:47 -05:00
Al
ab1fb3669f [geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id 2015-06-19 15:47:50 -05:00
Al
bc306fc6c8 [fix] removing unused vars 2015-06-18 00:33:03 -04:00
Al
8792c38b52 [transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token 2015-06-17 23:51:19 -04:00
Al
be8353ad9b [transliteration] Regenerated script data 2015-06-17 23:46:29 -04:00
Al
2408cfa6f0 [transliteration] Re-generating data file 2015-06-17 23:45:56 -04:00
Al
84b9a6ff33 [transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group 2015-06-17 23:42:31 -04:00
Al
880d444881 [tokenization] Re-generating scanner 2015-06-16 12:52:37 -04:00
Al
77760f207c [tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo 2015-06-16 12:52:04 -04:00