a5dacf3d2b[utils] Adding method to get a particular token alternative from a string tree
Al
2015-06-28 15:15:29 -04:00
246237c1f1[transliteration] Adding a get_transliteration_table() to foreach_transliterator macro since it lives in the header
Al
2015-06-28 15:14:49 -04:00
0f3bcaf49c[dictionaries] Flatter hierarchy for dictionaries
Al
2015-06-26 13:14:14 -04:00
7c161ee5b6[numex] Regenerating numex data file
Al
2015-06-26 12:36:40 -04:00
d21f8135f3[numex] Adding full stop ordinal indicators to German, Danish and Polish
Al
2015-06-26 12:35:53 -04:00
6a8ab48662[numex] Adding method to get ordinal suffixes, using single representation
Al
2015-06-25 17:27:52 -04:00
9337bf9aea[phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes
Al
2015-06-25 17:24:19 -04:00
82e85732c4[fix] Setting codepoint in utf8proc_iterate_reversed
Al
2015-06-25 17:20:55 -04:00
4fbcb72368[fix] utf8proc option
Al
2015-06-25 10:07:37 -04:00
c376bcef3d[utils] get_string_script returns a struct rather than modifying a pointer for the length
Al
2015-06-25 10:06:38 -04:00
bcee9832b3[utils] cstring_array_get_token=>cstring_array_get_string
Al
2015-06-25 10:05:35 -04:00
2b69c185fa[tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change)
Al
2015-06-25 10:03:34 -04:00
581cf406a6[utf8] Adding length argument to string_script function
Al
2015-06-24 13:39:09 -05:00
5e71a9d805[utf8] Adding method to get the script of a string and the length of the span (rolls Common script up with the previous script)
Al
2015-06-24 13:29:40 -05:00
85348e1178[fix] enum value conflicted with existing name
Al
2015-06-23 15:38:59 -05:00
077e7fd5e2[transliteration] Adding script/language lookups and I/O
Al
2015-06-23 15:35:52 -05:00
423d9ca7b7[transliteration] table builder adds script/language rules
Al
2015-06-23 15:35:16 -05:00
c3143e5291[transliteration] Adding structs/header stuff for transliterator lookup by script/language
Al
2015-06-23 15:34:38 -05:00
8fb6a28e9c[fix] using empty string instead of NULL for script languages so we can use fixed length arrays
Al
2015-06-23 15:17:18 -05:00
f2d03a7937[fix] renaming structure
Al
2015-06-23 02:11:58 -05:00
7dd772de0f[fix] implementation of cstring_array_split
Al
2015-06-23 02:11:24 -05:00
d4cae97fd3[transliteration] regenerated scripts data file
Al
2015-06-23 02:10:10 -05:00
b21c3a3a2f[transliteration] using different struct in script data header file
Al
2015-06-22 22:06:16 -05:00
2e54ca3575[transliteration] including script data file, adding len to transliterate API for tokenized transliteration
Al
2015-06-21 05:42:10 -05:00
79530ae974[transliteration] Adding transliteration script data file
Al
2015-06-21 05:39:06 -05:00
c2b4744f55[transliteration] Using a data file instead of a header for transliteration scripts
Al
2015-06-21 05:37:56 -05:00
b2e201f297[fix] trailing comma
Al
2015-06-20 15:14:41 -05:00
f8bff25948[bloom] bloom filter I/O
Al
2015-06-20 12:29:11 -05:00
0ed80c3f6e[geonames] Geonames generic serialization/deserialization
Al
2015-06-20 12:00:15 -05:00
d4087be40c[geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs
Al
2015-06-20 11:54:47 -05:00
ab1fb3669f[geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id
Al
2015-06-19 14:21:20 -05:00
bc306fc6c8[fix] removing unused vars
Al
2015-06-18 00:33:01 -04:00
8792c38b52[transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token
Al
2015-06-17 23:51:19 -04:00
be8353ad9b[transliteration] Regenerated script data
Al
2015-06-17 23:46:29 -04:00
2408cfa6f0[transliteration] Re-generating data file
Al
2015-06-17 23:45:56 -04:00
84b9a6ff33[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group
Al
2015-06-17 23:33:51 -04:00
880d444881[tokenization] Re-generating scanner
Al
2015-06-16 12:52:37 -04:00
77760f207c[tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo
Al
2015-06-16 12:52:04 -04:00
f04fad0e93[i18n] Generating Hangul syllable classes
Al
2015-06-16 12:50:42 -04:00
cb2035867b[fix] osm geodata imports
Al
2015-06-15 18:36:01 -04:00
d2d25ead6f[utils] Adding unicode_csv module
Al
2015-06-15 18:06:54 -04:00
651f91fc11[polygons] Adding language exceptions, now including osm relation ids
Al
2015-06-15 18:04:44 -04:00
ccb64f7ac2[polygons] Adding address_normalizer polygons package
Al
2015-06-15 17:55:27 -04:00
22fa81b33f[fix] __init__.py
Al
2015-06-15 17:54:27 -04:00
41dbd97bf2[geodisambig] quattroshapes download can use default or specified location, unzips files
Al
2015-06-15 17:54:08 -04:00
037d4575ae[geodisambig] Modifying GeoNames TSV again. Using files again and sorting
Al
2015-06-15 17:51:09 -04:00
67bd9f1a31[i18n] Adding languages.py
Al
2015-06-15 17:48:47 -04:00
073fe43698[geodisambig] Adding quattroshapes download script
Al
2015-06-15 17:46:11 -04:00
73f37fe66b[fix] Moving default Geonames DB path to a shared module
Al
2015-06-15 12:53:00 -04:00
7a4fa7d443[geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming
Al
2015-06-15 01:58:43 -04:00
43e023077c[fix] Changing logging to stderr for the Geonames scripts
Al
2015-06-14 15:38:52 -04:00
e3dffc177c[fix] gazetteers typo
Al
2015-06-12 17:26:14 -04:00
5f5efad6ac[numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good
Al
2015-06-12 16:21:36 -04:00
c159f83f9b[fix] trie_search logging
Al
2015-06-12 16:17:41 -04:00
a100cd83c9[numex] Re-generated numex data file
Al
2015-06-12 16:15:53 -04:00
8520df96c8[utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method
Al
2015-06-12 16:11:37 -04:00
5c2839e534[numex] header and table builder changes to support whole words languages
Al
2015-06-12 16:10:53 -04:00
1c4657b631[numex] Setting Latin to whole_words_only
Al
2015-06-12 16:10:02 -04:00
fc735bb5c3[numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500
Al
2015-06-12 16:09:45 -04:00
6b60446dbe[phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie
Al
2015-06-12 11:29:19 -04:00
3442b9ad92[utils] require at least one non-space/non-hyphen match in utf8_common_prefix_len_ignore_separators
Al
2015-06-12 11:13:49 -04:00
6841ed8fb3[phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token
Al
2015-06-11 11:05:56 -04:00
ab5ea6d791[utils] Common prefix-style return value instead of a utf8 strcmp
Al
2015-06-11 10:59:51 -04:00
aad5f3edd3[utils] UTF-8 lowercasing and string comparison, including a version which ignores dashes/spaces
Al
2015-06-10 18:26:52 -04:00
cb603562e0[phrases] Adding *_from_index methods to trie_search
Al
2015-06-09 11:14:42 -04:00
81be8e771e[numex] regen data file. utf8_is_hyphen requires a character, all other methods use category
Al
2015-06-08 21:32:01 -04:00
c1d0afa52c[fix] additional French numex
Al
2015-06-08 21:30:32 -04:00
c1bed8b410[numex] header changes
Al
2015-06-08 21:29:36 -04:00
fd1ebba720[numex] Initial implementation of multilingual numeric expression parser
Al
2015-06-08 21:29:04 -04:00
6267b3a431[numex] Adding numex phrase structure to the API
Al
2015-06-07 23:56:24 -04:00
06835d5c37[utils] string_utils category functions take a category instead of a codepoint
Al
2015-06-06 20:41:07 -04:00
fc250724e1[numex] tercera=>3ra
Al
2015-06-06 20:39:57 -04:00
7c613a068f[dictionaries] English dictionary updates
Al
2015-06-06 20:39:27 -04:00
2856c2b401[utils] string_utils category functions take a category instead of a codepoint
Al
2015-06-05 16:55:21 -04:00
3030dbe4be[fix] transliteration states
Al
2015-06-05 00:09:29 -04:00
e32916f3df[fix] closing file in numex table builder
Al
2015-06-04 23:59:21 -04:00
b244aa30f2[numex] Setting numex_table to NULL during teardown, adding some logging
Al
2015-06-04 23:57:52 -04:00
3bd5172afd[numex] Adding NUMEX_NULL_RULE at the first index
Al
2015-06-04 17:21:44 -04:00
3400a59e1c[numex] adding a NUMEX_NULL_RULE
Al
2015-06-04 17:21:16 -04:00
95a4bb8e7c[numex] teardown in numex table builder
Al
2015-06-04 17:20:26 -04:00
528dd05983[numex] Adding utf8_is_number_or_letter
Al
2015-06-04 14:49:12 -04:00
ca746304e3[utils] Adding a few methods to string_utils for finding utf8proc category groups
Al
2015-06-04 13:20:14 -04:00
eac7a296ba[numex] New numex data file including top 15 languages in OSM
Al
2015-06-04 11:55:07 -04:00
6470cbe467[numex] Catalan and Chinese numex rules converted from RBNF, now covering top 15 languages in OSM addresses
Al
2015-06-04 11:53:36 -04:00
e2c8c08772[numex] 1era for Spanish feminine ordinal indicator
Al
2015-06-04 11:52:50 -04:00
0429db3507[numex] Adding ordinal indicator type for Japanese
Al
2015-06-04 11:52:23 -04:00
d98c535c52[numex] Adding ordinal indicator to enum
Al
2015-06-04 11:25:24 -04:00
2d098fdab6[numex] Adding ordinal_indicator rule type for CJK ordinals
Al
2015-06-04 11:24:13 -04:00
3cb8b2d297[numex] trie builder adding a separate suffix-based namespace for looking up ordinal indicators
Al
2015-06-04 03:17:03 -04:00
7d3ef39463[numex] struct/method changes for new ordinal indicators
Al
2015-06-04 03:14:44 -04:00
ab802bc361[numex] Changes to existing numex rules files. Adding Dutch, Japanese, Polish, Danish, Swedish and Finnish numex rules (priority based on frequency in OpenStreetMap)
Al
2015-06-04 03:13:39 -04:00
65abde908b[numex] New numex data file
Al
2015-06-04 03:10:00 -04:00
4c49f63caf[numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th
Al
2015-06-04 03:09:39 -04:00
3d95875a11[phrases] trie_add_len
Al
2015-06-04 02:41:48 -04:00
fa784677f2[phrases] trie_add_suffix_at_index method
Al
2015-06-04 02:30:53 -04:00
9bdf118423[transliteration] Fix to transliteration in cases where the pre/post context doesn't match and we fall back to the no-context match
Al
2015-06-03 22:58:29 -04:00
48d2ca31c4[transliteration] New generated data file with the German/Scandinavian additions
Al
2015-06-03 22:56:43 -04:00
b2fe9d4db0[transliteration] Adding uppercase umlauts and Scandinavian a-ring
Al
2015-06-03 22:55:45 -04:00
760714a234[fix] warnings in transliterate.c
Al
2015-06-03 19:29:35 -04:00