Commit Graph

  • a5dacf3d2b [utils] Adding method to get a particular token alternative from a string tree Al 2015-06-28 15:15:29 -04:00
  • 246237c1f1 [transliteration] Adding a get_transliteration_table() to foreach_transliterator macro since it lives in the header Al 2015-06-28 15:14:49 -04:00
  • 0f3bcaf49c [dictionaries] Flatter hierarchy for dictionaries Al 2015-06-26 13:14:14 -04:00
  • 7c161ee5b6 [numex] Regenerating numex data file Al 2015-06-26 12:36:40 -04:00
  • d21f8135f3 [numex] Adding full stop ordinal indicators to German, Danish and Polish Al 2015-06-26 12:35:53 -04:00
  • 6a8ab48662 [numex] Adding method to get ordinal suffixes, using single representation Al 2015-06-25 17:27:52 -04:00
  • 9337bf9aea [phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes Al 2015-06-25 17:24:19 -04:00
  • 82e85732c4 [fix] Setting codepoint in utf8proc_iterate_reversed Al 2015-06-25 17:20:55 -04:00
  • 4fbcb72368 [fix] utf8proc option Al 2015-06-25 10:07:37 -04:00
  • c376bcef3d [utils] get_string_script returns a struct rather than modifying a pointer for the length Al 2015-06-25 10:06:38 -04:00
  • bcee9832b3 [utils] cstring_array_get_token=>cstring_array_get_string Al 2015-06-25 10:05:35 -04:00
  • 2b69c185fa [tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change) Al 2015-06-25 10:03:34 -04:00
  • 581cf406a6 [utf8] Adding length argument to string_script function Al 2015-06-24 13:39:09 -05:00
  • 5e71a9d805 [utf8] Adding method to get the script of a string and the length of the span (rolls Common script up with the previous script) Al 2015-06-24 13:29:40 -05:00
  • 85348e1178 [fix] enum value conflicted with existing name Al 2015-06-23 15:38:59 -05:00
  • 077e7fd5e2 [transliteration] Adding script/language lookups and I/O Al 2015-06-23 15:35:52 -05:00
  • 423d9ca7b7 [transliteration] table builder adds script/language rules Al 2015-06-23 15:35:16 -05:00
  • c3143e5291 [transliteration] Adding structs/header stuff for transliterator lookup by script/language Al 2015-06-23 15:34:38 -05:00
  • 8fb6a28e9c [fix] using empty string instead of NULL for script languages so we can use fixed length arrays Al 2015-06-23 15:17:18 -05:00
  • f2d03a7937 [fix] renaming structure Al 2015-06-23 02:11:58 -05:00
  • 7dd772de0f [fix] implementation of cstring_array_split Al 2015-06-23 02:11:24 -05:00
  • d4cae97fd3 [transliteration] regenerated scripts data file Al 2015-06-23 02:10:10 -05:00
  • b21c3a3a2f [transliteration] using different struct in script data header file Al 2015-06-22 22:06:16 -05:00
  • 2e54ca3575 [transliteration] including script data file, adding len to transliterate API for tokenized transliteration Al 2015-06-21 05:42:10 -05:00
  • 79530ae974 [transliteration] Adding transliteration script data file Al 2015-06-21 05:39:06 -05:00
  • c2b4744f55 [transliteration] Using a data file instead of a header for transliteration scripts Al 2015-06-21 05:37:56 -05:00
  • b2e201f297 [fix] trailing comma Al 2015-06-20 15:14:41 -05:00
  • f8bff25948 [bloom] bloom filter I/O Al 2015-06-20 12:29:11 -05:00
  • 0ed80c3f6e [geonames] Geonames generic serialization/deserialization Al 2015-06-20 12:00:15 -05:00
  • d4087be40c [geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs Al 2015-06-20 11:54:47 -05:00
  • ab1fb3669f [geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id Al 2015-06-19 14:21:20 -05:00
  • bc306fc6c8 [fix] removing unused vars Al 2015-06-18 00:33:01 -04:00
  • 8792c38b52 [transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token Al 2015-06-17 23:51:19 -04:00
  • be8353ad9b [transliteration] Regenerated script data Al 2015-06-17 23:46:29 -04:00
  • 2408cfa6f0 [transliteration] Re-generating data file Al 2015-06-17 23:45:56 -04:00
  • 84b9a6ff33 [transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group Al 2015-06-17 23:33:51 -04:00
  • 880d444881 [tokenization] Re-generating scanner Al 2015-06-16 12:52:37 -04:00
  • 77760f207c [tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo Al 2015-06-16 12:52:04 -04:00
  • f04fad0e93 [i18n] Generating Hangul syllable classes Al 2015-06-16 12:50:42 -04:00
  • cb2035867b [fix] osm geodata imports Al 2015-06-15 18:36:01 -04:00
  • d2d25ead6f [utils] Adding unicode_csv module Al 2015-06-15 18:06:54 -04:00
  • 651f91fc11 [polygons] Adding language exceptions, now including osm relation ids Al 2015-06-15 18:04:44 -04:00
  • ccb64f7ac2 [polygons] Adding address_normalizer polygons package Al 2015-06-15 17:55:27 -04:00
  • 22fa81b33f [fix] __init__.py Al 2015-06-15 17:54:27 -04:00
  • 41dbd97bf2 [geodisambig] quattroshapes download can use default or specified location, unzips files Al 2015-06-15 17:54:08 -04:00
  • 037d4575ae [geodisambig] Modifying GeoNames TSV again. Using files again and sorting Al 2015-06-15 17:51:09 -04:00
  • 67bd9f1a31 [i18n] Adding languages.py Al 2015-06-15 17:48:47 -04:00
  • 073fe43698 [geodisambig] Adding quattroshapes download script Al 2015-06-15 17:46:11 -04:00
  • 73f37fe66b [fix] Moving default Geonames DB path to a shared module Al 2015-06-15 12:53:00 -04:00
  • 7a4fa7d443 [geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming Al 2015-06-15 01:58:43 -04:00
  • 43e023077c [fix] Changing logging to stderr for the Geonames scripts Al 2015-06-14 15:38:52 -04:00
  • e3dffc177c [fix] gazetteers typo Al 2015-06-12 17:26:14 -04:00
  • 5f5efad6ac [numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good Al 2015-06-12 16:21:36 -04:00
  • c159f83f9b [fix] trie_search logging Al 2015-06-12 16:17:41 -04:00
  • a100cd83c9 [numex] Re-generated numex data file Al 2015-06-12 16:15:53 -04:00
  • 8520df96c8 [utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method Al 2015-06-12 16:11:37 -04:00
  • 5c2839e534 [numex] header and table builder changes to support whole words languages Al 2015-06-12 16:10:53 -04:00
  • 1c4657b631 [numex] Setting Latin to whole_words_only Al 2015-06-12 16:10:02 -04:00
  • fc735bb5c3 [numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500 Al 2015-06-12 16:09:45 -04:00
  • 6b60446dbe [phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie Al 2015-06-12 11:29:19 -04:00
  • 3442b9ad92 [utils] require at least one non-space/non-hyphen match in utf8_common_prefix_len_ignore_separators Al 2015-06-12 11:13:49 -04:00
  • 6841ed8fb3 [phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token Al 2015-06-11 11:05:56 -04:00
  • ab5ea6d791 [utils] Common prefix-style return value instead of a utf8 strcmp Al 2015-06-11 10:59:51 -04:00
  • aad5f3edd3 [utils] UTF-8 lowercasing and string comparison, including a version which ignores dashes/spaces Al 2015-06-10 18:26:52 -04:00
  • cb603562e0 [phrases] Adding *_from_index methods to trie_search Al 2015-06-09 11:14:42 -04:00
  • 81be8e771e [numex] regen data file. utf8_is_hyphen requires a character, all other methods use category Al 2015-06-08 21:32:01 -04:00
  • c1d0afa52c [fix] additional French numex Al 2015-06-08 21:30:32 -04:00
  • c1bed8b410 [numex] header changes Al 2015-06-08 21:29:36 -04:00
  • fd1ebba720 [numex] Initial implementation of multilingual numeric expression parser Al 2015-06-08 21:29:04 -04:00
  • 6267b3a431 [numex] Adding numex phrase structure to the API Al 2015-06-07 23:56:24 -04:00
  • 06835d5c37 [utils] string_utils category functions take a category instead of a codepoint Al 2015-06-06 20:41:07 -04:00
  • fc250724e1 [numex] tercera=>3ra Al 2015-06-06 20:39:57 -04:00
  • 7c613a068f [dictionaries] English dictionary updates Al 2015-06-06 20:39:27 -04:00
  • 2856c2b401 [utils] string_utils category functions take a category instead of a codepoint Al 2015-06-05 16:55:21 -04:00
  • 3030dbe4be [fix] transliteration states Al 2015-06-05 00:09:29 -04:00
  • e32916f3df [fix] closing file in numex table builder Al 2015-06-04 23:59:21 -04:00
  • b244aa30f2 [numex] Setting numex_table to NULL during teardown, adding some logging Al 2015-06-04 23:57:52 -04:00
  • 3bd5172afd [numex] Adding NUMEX_NULL_RULE at the first index Al 2015-06-04 17:21:44 -04:00
  • 3400a59e1c [numex] adding a NUMEX_NULL_RULE Al 2015-06-04 17:21:16 -04:00
  • 95a4bb8e7c [numex] teardown in numex table builder Al 2015-06-04 17:20:26 -04:00
  • 114b728f96 [fix] var Al 2015-06-04 17:18:05 -04:00
  • 528dd05983 [numex] Adding utf8_is_number_or_letter Al 2015-06-04 14:49:12 -04:00
  • ca746304e3 [utils] Adding a few methods to string_utils for finding utf8proc category groups Al 2015-06-04 13:20:14 -04:00
  • eac7a296ba [numex] New numex data file including top 15 languages in OSM Al 2015-06-04 11:55:07 -04:00
  • 6470cbe467 [numex] Catalan and Chinese numex rules converted from RBNF, now covering top 15 languages in OSM addresses Al 2015-06-04 11:53:36 -04:00
  • e2c8c08772 [numex] 1era for Spanish feminine ordinal indicator Al 2015-06-04 11:52:50 -04:00
  • 0429db3507 [numex] Adding ordinal indicator type for Japanese Al 2015-06-04 11:52:23 -04:00
  • d98c535c52 [numex] Adding ordinal indicator to enum Al 2015-06-04 11:25:24 -04:00
  • 2d098fdab6 [numex] Adding ordinal_indicator rule type for CJK ordinals Al 2015-06-04 11:24:13 -04:00
  • 3cb8b2d297 [numex] trie builder adding a separate suffix-based namespace for looking up ordinal indicators Al 2015-06-04 03:17:03 -04:00
  • 7d3ef39463 [numex] struct/method changes for new ordinal indicators Al 2015-06-04 03:14:44 -04:00
  • ab802bc361 [numex] Changes to existing numex rules files. Adding Dutch, Japanese, Polish, Danish, Swedish and Finnish numex rules (priority based on frequency in OpenStreetMap) Al 2015-06-04 03:13:39 -04:00
  • 65abde908b [numex] New numex data file Al 2015-06-04 03:10:00 -04:00
  • 4c49f63caf [numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th Al 2015-06-04 03:09:39 -04:00
  • 3d95875a11 [phrases] trie_add_len Al 2015-06-04 02:41:48 -04:00
  • fa784677f2 [phrases] trie_add_suffix_at_index method Al 2015-06-04 02:30:53 -04:00
  • 9bdf118423 [transliteration] Fix to transliteration in cases where the pre/post context doesn't match and we fall back to the no-context match Al 2015-06-03 22:58:29 -04:00
  • 48d2ca31c4 [transliteration] New generated data file with the German/Scandinavian additions Al 2015-06-03 22:56:43 -04:00
  • b2fe9d4db0 [transliteration] Adding uppercase umlauts and Scandinavian a-ring Al 2015-06-03 22:55:45 -04:00
  • 760714a234 [fix] warnings in transliterate.c Al 2015-06-03 19:29:35 -04:00