Al
|
bc306fc6c8
|
[fix] removing unused vars
|
2015-06-18 00:33:03 -04:00 |
|
Al
|
8792c38b52
|
[transliteration] Getting pre-context matching correct for contexts longer than 1 char, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character when no rule applies, e.g. if there are Latin characters in a Hangul token
|
2015-06-17 23:51:19 -04:00 |
|
Al
|
be8353ad9b
|
[transliteration] Regenerated script data
|
2015-06-17 23:46:29 -04:00 |
|
Al
|
2408cfa6f0
|
[transliteration] Re-generating data file
|
2015-06-17 23:45:56 -04:00 |
|
Al
|
84b9a6ff33
|
[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group
|
2015-06-17 23:42:31 -04:00 |
|
Al
|
880d444881
|
[tokenization] Re-generating scanner
|
2015-06-16 12:52:37 -04:00 |
|
Al
|
77760f207c
|
[tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo
|
2015-06-16 12:52:04 -04:00 |
|
Al
|
f04fad0e93
|
[i18n] Generating Hangul syllable classes
|
2015-06-16 12:50:48 -04:00 |
|
Al
|
cb2035867b
|
[fix] osm geodata imports
|
2015-06-15 18:36:01 -04:00 |
|
Al
|
d2d25ead6f
|
[utils] Adding unicode_csv module
|
2015-06-15 18:06:54 -04:00 |
|
Al
|
651f91fc11
|
[polygons] Adding language exceptions, now including osm relation ids
|
2015-06-15 18:04:44 -04:00 |
|
Al
|
ccb64f7ac2
|
[polygons] Adding address_normalizer polygons package
|
2015-06-15 17:55:27 -04:00 |
|
Al
|
22fa81b33f
|
[fix] __init__.py
|
2015-06-15 17:54:27 -04:00 |
|
Al
|
41dbd97bf2
|
[geodisambig] quattroshapes download can use default or specified location, unzips files
|
2015-06-15 17:54:08 -04:00 |
|
Al
|
037d4575ae
|
[geodisambig] Modifying GeoNames TSV again. Using files again and sorting
|
2015-06-15 17:51:09 -04:00 |
|
Al
|
67bd9f1a31
|
[i18n] Adding languages.py
|
2015-06-15 17:48:47 -04:00 |
|
Al
|
073fe43698
|
[geodisambig] Adding quattroshapes download script
|
2015-06-15 17:46:11 -04:00 |
|
Al
|
73f37fe66b
|
[fix] Moving default GeoNames DB path to a shared module
|
2015-06-15 12:53:00 -04:00 |
|
Al
|
7a4fa7d443
|
[geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming
|
2015-06-15 01:58:43 -04:00 |
|
Al
|
43e023077c
|
[fix] Changing logging to stderr for the GeoNames scripts
|
2015-06-14 15:38:57 -04:00 |
|
Al
|
e3dffc177c
|
[fix] gazetteers typo
|
2015-06-12 17:26:14 -04:00 |
|
Al
|
5f5efad6ac
|
[numex] Working numex implementation. Tested on most languages: Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good
|
2015-06-12 16:21:36 -04:00 |
|
Al
|
c159f83f9b
|
[fix] trie_search logging
|
2015-06-12 16:17:41 -04:00 |
|
Al
|
a100cd83c9
|
[numex] Re-generated numex data file
|
2015-06-12 16:15:53 -04:00 |
|
Al
|
8520df96c8
|
[utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method
|
2015-06-12 16:11:40 -04:00 |
|
Al
|
5c2839e534
|
[numex] header and table builder changes to support whole-words-only languages
|
2015-06-12 16:10:57 -04:00 |
|
Al
|
1c4657b631
|
[numex] Setting Latin to whole_words_only
|
2015-06-12 16:10:07 -04:00 |
|
Al
|
fc735bb5c3
|
[numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500
|
2015-06-12 16:09:45 -04:00 |
|
Al
|
6b60446dbe
|
[phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie
|
2015-06-12 11:30:24 -04:00 |
|
Al
|
3442b9ad92
|
[utils] require at least one non-space/non-hyphen match in utf8_common_prefix_len_ignore_separators
|
2015-06-12 11:19:37 -04:00 |
|
Al
|
6841ed8fb3
|
[phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token
|
2015-06-11 11:05:56 -04:00 |
|
Al
|
ab5ea6d791
|
[utils] Common prefix-style return value instead of a utf8 strcmp
|
2015-06-11 10:59:51 -04:00 |
|
Al
|
aad5f3edd3
|
[utils] UTF-8 lowercasing and string comparison, including a version which ignores dashes/spaces
|
2015-06-10 18:27:14 -04:00 |
|
Al
|
cb603562e0
|
[phrases] Adding *_from_index methods to trie_search
|
2015-06-09 11:14:42 -04:00 |
|
Al
|
81be8e771e
|
[numex] regen data file. utf8_is_hyphen requires a character, all other methods use category
|
2015-06-08 21:32:38 -04:00 |
|
Al
|
c1d0afa52c
|
[fix] additional French numex
|
2015-06-08 21:30:32 -04:00 |
|
Al
|
c1bed8b410
|
[numex] header changes
|
2015-06-08 21:29:36 -04:00 |
|
Al
|
fd1ebba720
|
[numex] Initial implementation of multilingual numeric expression parser
|
2015-06-08 21:29:04 -04:00 |
|
Al
|
6267b3a431
|
[numex] Adding numex phrase structure to the API
|
2015-06-07 23:56:24 -04:00 |
|
Al
|
06835d5c37
|
[utils] string_utils category functions take a category instead of a codepoint
|
2015-06-06 20:41:07 -04:00 |
|
Al
|
fc250724e1
|
[numex] tercera=>3ra
|
2015-06-06 20:39:57 -04:00 |
|
Al
|
7c613a068f
|
[dictionaries] English dictionary updates
|
2015-06-06 20:39:27 -04:00 |
|
Al
|
2856c2b401
|
[utils] string_utils category functions take a category instead of a codepoint
|
2015-06-05 16:55:21 -04:00 |
|
Al
|
3030dbe4be
|
[fix] transliteration states
|
2015-06-05 00:09:29 -04:00 |
|
Al
|
e32916f3df
|
[fix] closing file in numex table builder
|
2015-06-04 23:59:21 -04:00 |
|
Al
|
b244aa30f2
|
[numex] Setting numex_table to NULL during teardown, adding some logging
|
2015-06-04 23:57:52 -04:00 |
|
Al
|
3bd5172afd
|
[numex] Adding NUMEX_NULL_RULE at the first index
|
2015-06-04 17:21:44 -04:00 |
|
Al
|
3400a59e1c
|
[numex] adding a NUMEX_NULL_RULE
|
2015-06-04 17:21:16 -04:00 |
|
Al
|
95a4bb8e7c
|
[numex] teardown in numex table builder
|
2015-06-04 17:20:26 -04:00 |
|
Al
|
114b728f96
|
[fix] var
|
2015-06-04 17:18:05 -04:00 |
|