Al
|
9337bf9aea
|
[phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes
|
2015-06-25 17:24:19 -04:00 |
|
Al
|
82e85732c4
|
[fix] Setting codepoint in utf8proc_iterate_reversed
|
2015-06-25 17:20:55 -04:00 |
|
Al
|
4fbcb72368
|
[fix] utf8proc option
|
2015-06-25 10:07:37 -04:00 |
|
Al
|
c376bcef3d
|
[utils] get_string_script returns a struct rather than modifying a pointer for the length
|
2015-06-25 10:06:38 -04:00 |
|
Al
|
bcee9832b3
|
[utils] cstring_array_get_token=>cstring_array_get_string
|
2015-06-25 10:05:35 -04:00 |
|
Al
|
2b69c185fa
|
[tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change)
|
2015-06-25 10:03:34 -04:00 |
|
Al
|
581cf406a6
|
[utf8] Adding length argument to string_script function
|
2015-06-24 13:39:09 -05:00 |
|
Al
|
5e71a9d805
|
[utf8] Adding method to get the script of a string and the length of the span (rolls Common script up with the previous script)
|
2015-06-24 13:29:40 -05:00 |
|
Al
|
85348e1178
|
[fix] enum value conflicted with existing name
|
2015-06-23 15:38:59 -05:00 |
|
Al
|
077e7fd5e2
|
[transliteration] Adding script/language lookups and I/O
|
2015-06-23 15:35:52 -05:00 |
|
Al
|
423d9ca7b7
|
[transliteration] table builder adds script/language rules
|
2015-06-23 15:35:16 -05:00 |
|
Al
|
c3143e5291
|
[transliteration] Adding structs/header stuff for transliterator lookup by script/language
|
2015-06-23 15:34:38 -05:00 |
|
Al
|
8fb6a28e9c
|
[fix] using empty string instead of NULL for script languages so we can use fixed length arrays
|
2015-06-23 15:20:09 -05:00 |
|
Al
|
f2d03a7937
|
[fix] renaming structure
|
2015-06-23 02:12:24 -05:00 |
|
Al
|
7dd772de0f
|
[fix] implementation of cstring_array_split
|
2015-06-23 02:11:24 -05:00 |
|
Al
|
d4cae97fd3
|
[transliteration] regenerated scripts data file
|
2015-06-23 02:10:10 -05:00 |
|
Al
|
b21c3a3a2f
|
[transliteration] using different struct in script data header file
|
2015-06-22 22:06:16 -05:00 |
|
Al
|
2e54ca3575
|
[transliteration] including script data file, adding len to transliterate API for tokenized transliteration
|
2015-06-21 05:42:20 -05:00 |
|
Al
|
79530ae974
|
[transliteration] Adding transliteration script data file
|
2015-06-21 05:39:06 -05:00 |
|
Al
|
c2b4744f55
|
[transliteration] Using a data file instead of a header for transliteration scripts
|
2015-06-21 05:37:56 -05:00 |
|
Al
|
b2e201f297
|
[fix] trailing comma
|
2015-06-20 15:14:41 -05:00 |
|
Al
|
f8bff25948
|
[bloom] bloom filter I/O
|
2015-06-20 12:29:11 -05:00 |
|
Al
|
0ed80c3f6e
|
[geonames] Geonames generic serialization/deserialization
|
2015-06-20 12:00:15 -05:00 |
|
Al
|
d4087be40c
|
[geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs
|
2015-06-20 11:54:47 -05:00 |
|
Al
|
ab1fb3669f
|
[geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id
|
2015-06-19 15:47:50 -05:00 |
|
Al
|
bc306fc6c8
|
[fix] removing unused vars
|
2015-06-18 00:33:03 -04:00 |
|
Al
|
8792c38b52
|
[transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token
|
2015-06-17 23:51:19 -04:00 |
|
Al
|
be8353ad9b
|
[transliteration] Regenerated script data
|
2015-06-17 23:46:29 -04:00 |
|
Al
|
2408cfa6f0
|
[transliteration] Re-generating data file
|
2015-06-17 23:45:56 -04:00 |
|
Al
|
84b9a6ff33
|
[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group
|
2015-06-17 23:42:31 -04:00 |
|
Al
|
880d444881
|
[tokenization] Re-generating scanner
|
2015-06-16 12:52:37 -04:00 |
|
Al
|
77760f207c
|
[tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo
|
2015-06-16 12:52:04 -04:00 |
|
Al
|
f04fad0e93
|
[i18n] Generating Hangul syllable classes
|
2015-06-16 12:50:48 -04:00 |
|
Al
|
cb2035867b
|
[fix] osm geodata imports
|
2015-06-15 18:36:01 -04:00 |
|
Al
|
d2d25ead6f
|
[utils] Adding unicode_csv module
|
2015-06-15 18:06:54 -04:00 |
|
Al
|
651f91fc11
|
[polygons] Adding language exceptions, now including osm relation ids
|
2015-06-15 18:04:44 -04:00 |
|
Al
|
ccb64f7ac2
|
[polygons] Adding address_normalizer polygons package
|
2015-06-15 17:55:27 -04:00 |
|
Al
|
22fa81b33f
|
[fix] __init__.py
|
2015-06-15 17:54:27 -04:00 |
|
Al
|
41dbd97bf2
|
[geodisambig] quattroshapes download can use default or specified location, unzips files
|
2015-06-15 17:54:08 -04:00 |
|
Al
|
037d4575ae
|
[geodisambig] Modifying GeoNames TSV again. Using files again and sorting
|
2015-06-15 17:51:09 -04:00 |
|
Al
|
67bd9f1a31
|
[i18n] Adding languages.py
|
2015-06-15 17:48:47 -04:00 |
|
Al
|
073fe43698
|
[geodisambig] Adding quattroshapes download script
|
2015-06-15 17:46:11 -04:00 |
|
Al
|
73f37fe66b
|
[fix] Moving default Geonames DB path to a shared module
|
2015-06-15 12:53:00 -04:00 |
|
Al
|
7a4fa7d443
|
[geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming
|
2015-06-15 01:58:43 -04:00 |
|
Al
|
43e023077c
|
[fix] Changing logging to stderr for the Geonames scripts
|
2015-06-14 15:38:57 -04:00 |
|
Al
|
e3dffc177c
|
[fix] gazetteers typo
|
2015-06-12 17:26:14 -04:00 |
|
Al
|
5f5efad6ac
|
[numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good
|
2015-06-12 16:21:36 -04:00 |
|
Al
|
c159f83f9b
|
[fix] trie_search logging
|
2015-06-12 16:17:41 -04:00 |
|
Al
|
a100cd83c9
|
[numex] Re-generated numex data file
|
2015-06-12 16:15:53 -04:00 |
|
Al
|
8520df96c8
|
[utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method
|
2015-06-12 16:11:40 -04:00 |
|