Commit Graph

267 Commits

Author SHA1 Message Date
Al
d4087be40c [geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs 2015-06-20 11:54:47 -05:00
Al
ab1fb3669f [geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id 2015-06-19 15:47:50 -05:00
Al
bc306fc6c8 [fix] removing unused vars 2015-06-18 00:33:03 -04:00
Al
8792c38b52 [transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token 2015-06-17 23:51:19 -04:00
Al
be8353ad9b [transliteration] Regenerated script data 2015-06-17 23:46:29 -04:00
Al
2408cfa6f0 [transliteration] Re-generating data file 2015-06-17 23:45:56 -04:00
Al
84b9a6ff33 [transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group 2015-06-17 23:42:31 -04:00
Al
880d444881 [tokenization] Re-generating scanner 2015-06-16 12:52:37 -04:00
Al
77760f207c [tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo 2015-06-16 12:52:04 -04:00
Al
f04fad0e93 [i18n] Generating Hangul syllable classes 2015-06-16 12:50:48 -04:00
Al
cb2035867b [fix] osm geodata imports 2015-06-15 18:36:01 -04:00
Al
d2d25ead6f [utils] Adding unicode_csv module 2015-06-15 18:06:54 -04:00
Al
651f91fc11 [polygons] Adding language exceptions, now including osm relation ids 2015-06-15 18:04:44 -04:00
Al
ccb64f7ac2 [polygons] Adding address_normalizer polygons package 2015-06-15 17:55:27 -04:00
Al
22fa81b33f [fix] __init__.py 2015-06-15 17:54:27 -04:00
Al
41dbd97bf2 [geodisambig] quattroshapes download can use default or specified location, unzips files 2015-06-15 17:54:08 -04:00
Al
037d4575ae [geodisambig] Modifying GeoNames TSV again. Using files again and sorting 2015-06-15 17:51:09 -04:00
Al
67bd9f1a31 [i18n] Adding languages.py 2015-06-15 17:48:47 -04:00
Al
073fe43698 [geodisambig] Adding quattroshapes download script 2015-06-15 17:46:11 -04:00
Al
73f37fe66b [fix] Moving default Geonames DB path to a shared module 2015-06-15 12:53:00 -04:00
Al
7a4fa7d443 [geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming 2015-06-15 01:58:43 -04:00
Al
43e023077c [fix] Changing logging to stderr for the Geonames scripts 2015-06-14 15:38:57 -04:00
Al
e3dffc177c [fix] gazetteers typo 2015-06-12 17:26:14 -04:00
Al
5f5efad6ac [numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good 2015-06-12 16:21:36 -04:00
Al
c159f83f9b [fix] trie_search logging 2015-06-12 16:17:41 -04:00
Al
a100cd83c9 [numex] Re-generated numex data file 2015-06-12 16:15:53 -04:00
Al
8520df96c8 [utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method 2015-06-12 16:11:40 -04:00
Al
5c2839e534 [numex] header and table builder changes to support whole words languages 2015-06-12 16:10:57 -04:00
Al
1c4657b631 [numex] Setting Latin to whole_words_only 2015-06-12 16:10:07 -04:00
Al
fc735bb5c3 [numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500 2015-06-12 16:09:45 -04:00
Al
6b60446dbe [phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie 2015-06-12 11:30:24 -04:00
Al
3442b9ad92 [utils] require at least one non-space/non-hyphen match in utf8_common_prefix_len_ignore_separators 2015-06-12 11:19:37 -04:00
Al
6841ed8fb3 [phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token 2015-06-11 11:05:56 -04:00
Al
ab5ea6d791 [utils] Common prefix-style return value instead of a utf8 strcmp 2015-06-11 10:59:51 -04:00
Al
aad5f3edd3 [utils] UTF-8 lowercasing and string comparison, including a version which ignores dashes/spaces 2015-06-10 18:27:14 -04:00
Al
cb603562e0 [phrases] Adding *_from_index methods to trie_search 2015-06-09 11:14:42 -04:00
Al
81be8e771e [numex] regen data file. utf8_is_hyphen requires a character, all other methods use category 2015-06-08 21:32:38 -04:00
Al
c1d0afa52c [fix] additional French numex 2015-06-08 21:30:32 -04:00
Al
c1bed8b410 [numex] header changes 2015-06-08 21:29:36 -04:00
Al
fd1ebba720 [numex] Initial implementation of multilingual numeric expression parser 2015-06-08 21:29:04 -04:00
Al
6267b3a431 [numex] Adding numex phrase structure to the API 2015-06-07 23:56:24 -04:00
Al
06835d5c37 [utils] string_utils category functions take a category instead of a codepoint 2015-06-06 20:41:07 -04:00
Al
fc250724e1 [numex] tercera=>3ra 2015-06-06 20:39:57 -04:00
Al
7c613a068f [dictionaries] English dictionary updates 2015-06-06 20:39:27 -04:00
Al
2856c2b401 [utils] string_utils category functions take a category instead of a codepoint 2015-06-05 16:55:21 -04:00
Al
3030dbe4be [fix] transliteration states 2015-06-05 00:09:29 -04:00
Al
e32916f3df [fix] closing file in numex table builder 2015-06-04 23:59:21 -04:00
Al
b244aa30f2 [numex] Setting numex_table to NULL during teardown, adding some logging 2015-06-04 23:57:52 -04:00
Al
3bd5172afd [numex] Adding NUMEX_NULL_RULE at the first index 2015-06-04 17:21:44 -04:00
Al
3400a59e1c [numex] adding a NUMEX_NULL_RULE 2015-06-04 17:21:16 -04:00