Commit Graph

248 Commits

Author SHA1 Message Date
Al
73f37fe66b [fix] Moving default Geonames DB path to a shared module 2015-06-15 12:53:00 -04:00
Al
7a4fa7d443 [geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming 2015-06-15 01:58:43 -04:00
Al
43e023077c [fix] Changing logging to stderr for the Geonames scripts 2015-06-14 15:38:57 -04:00
Al
e3dffc177c [fix] gazetteers typo 2015-06-12 17:26:14 -04:00
Al
5f5efad6ac [numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good 2015-06-12 16:21:36 -04:00
Al
c159f83f9b [fix] trie_search logging 2015-06-12 16:17:41 -04:00
Al
a100cd83c9 [numex] Re-generated numex data file 2015-06-12 16:15:53 -04:00
Al
8520df96c8 [utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method 2015-06-12 16:11:40 -04:00
Al
5c2839e534 [numex] header and table builder changes to support whole words languages 2015-06-12 16:10:57 -04:00
Al
1c4657b631 [numex] Setting Latin to whole_words_only 2015-06-12 16:10:07 -04:00
Al
fc735bb5c3 [numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500 2015-06-12 16:09:45 -04:00
Al
6b60446dbe [phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie 2015-06-12 11:30:24 -04:00
Al
3442b9ad92 [utils] require at least one non-space/non-hyphen match in utf8_common_prefix_len_ignore_separators 2015-06-12 11:19:37 -04:00
Al
6841ed8fb3 [phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token 2015-06-11 11:05:56 -04:00
Al
ab5ea6d791 [utils] Common prefix-style return value instead of a utf8 strcmp 2015-06-11 10:59:51 -04:00
Al
aad5f3edd3 [utils] UTF-8 lowercasing and string comparison, including a version which ignores dashes/spaces 2015-06-10 18:27:14 -04:00
Al
cb603562e0 [phrases] Adding *_from_index methods to trie_search 2015-06-09 11:14:42 -04:00
Al
81be8e771e [numex] regen data file. utf8_is_hyphen requires a character, all other methods use category 2015-06-08 21:32:38 -04:00
Al
c1d0afa52c [fix] additional French numex 2015-06-08 21:30:32 -04:00
Al
c1bed8b410 [numex] header changes 2015-06-08 21:29:36 -04:00
Al
fd1ebba720 [numex] Initial implementation of multilingual numeric expression parser 2015-06-08 21:29:04 -04:00
Al
6267b3a431 [numex] Adding numex phrase structure to the API 2015-06-07 23:56:24 -04:00
Al
06835d5c37 [utils] string_utils category functions take a category instead of a codepoint 2015-06-06 20:41:07 -04:00
Al
fc250724e1 [numex] tercera=>3ra 2015-06-06 20:39:57 -04:00
Al
7c613a068f [dictionaries] English dictionary updates 2015-06-06 20:39:27 -04:00
Al
2856c2b401 [utils] string_utils category functions take a category instead of a codepoint 2015-06-05 16:55:21 -04:00
Al
3030dbe4be [fix] transliteration states 2015-06-05 00:09:29 -04:00
Al
e32916f3df [fix] closing file in numex table builder 2015-06-04 23:59:21 -04:00
Al
b244aa30f2 [numex] Setting numex_table to NULL during teardown, adding some logging 2015-06-04 23:57:52 -04:00
Al
3bd5172afd [numex] Adding NUMEX_NULL_RULE at the first index 2015-06-04 17:21:44 -04:00
Al
3400a59e1c [numex] adding a NUMEX_NULL_RULE 2015-06-04 17:21:16 -04:00
Al
95a4bb8e7c [numex] teardown in numex table builder 2015-06-04 17:20:26 -04:00
Al
114b728f96 [fix] var 2015-06-04 17:18:05 -04:00
Al
528dd05983 [numex] Adding utf8_is_number_or_letter 2015-06-04 14:49:12 -04:00
Al
ca746304e3 [utils] Adding a few methods to string_utils for finding utf8proc category groups 2015-06-04 13:20:14 -04:00
Al
eac7a296ba [numex] New numex data file including top 15 languages in OSM 2015-06-04 11:55:07 -04:00
Al
6470cbe467 [numex] Catalan and Chinese numex rules converted from RBNF, now covering top 15 languages in OSM addresses 2015-06-04 11:53:43 -04:00
Al
e2c8c08772 [numex] 1era for Spanish feminine ordinal indicator 2015-06-04 11:52:50 -04:00
Al
0429db3507 [numex] Adding ordinal indicator type for Japanese 2015-06-04 11:52:25 -04:00
Al
d98c535c52 [numex] Adding ordinal indicator to enum 2015-06-04 11:25:24 -04:00
Al
2d098fdab6 [numex] Adding ordinal_indicator rule type for CJK ordinals 2015-06-04 11:24:13 -04:00
Al
3cb8b2d297 [numex] trie builder adding a separate suffix-based namespace for looking up ordinal indicators 2015-06-04 03:17:03 -04:00
Al
7d3ef39463 [numex] struct/method changes for new ordinal indicators 2015-06-04 03:15:51 -04:00
Al
ab802bc361 [numex] Changes to existing numex rules files. Adding Dutch, Japanese, Polish, Danish, Swedish and Finnish numex rules (priority based on frequency in OpenStreetMap) 2015-06-04 03:13:39 -04:00
Al
65abde908b [numex] New numex data file 2015-06-04 03:10:00 -04:00
Al
4c49f63caf [numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th 2015-06-04 03:09:39 -04:00
Al
3d95875a11 [phrases] trie_add_len 2015-06-04 02:41:48 -04:00
Al
fa784677f2 [phrases] trie_add_suffix_at_index method 2015-06-04 02:30:53 -04:00
Al
9bdf118423 [transliteration] Fix to transliteration in cases where the pre/post context doesn't match and we fall back to the no-context match 2015-06-03 22:58:29 -04:00
Al
48d2ca31c4 [transliteration] New generated data file with the German/Scandinavian additions 2015-06-03 22:56:50 -04:00