Commit Graph

13 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Al | b1e178b7b2 | [fix] is_numeric_token includes IDEOGRAPHIC_NUMBER | 2017-02-12 15:11:56 -05:00 |
| Al | 83381e9d8a | [expand] Adding exception for a few types of special punctuation (ampersand, plus, pound sign) which should be left in the original string and separated by whitespace. Closes #84. Closes #85 | 2016-07-17 15:02:47 -04:00 |
| Al | 33e9a05ebf | [tokenization] is_whitespace | 2016-01-05 16:40:35 -05:00 |
| Al | 77ccd975c4 | [fix] #endif | 2015-12-28 17:03:12 -05:00 |
| Al | e4dba2297d | [mv] Moving token type checking to header | 2015-12-28 01:17:33 -05:00 |
| Al | aa39c45b87 | [tokenization] skipping control characters in tokenization, comes up in OSM surprisingly | 2015-10-04 18:25:50 -04:00 |
| Al | 9b69d1f67a | [fix] Removing C++ checks from all but the main API functions | 2015-08-07 17:15:39 -04:00 |
| Al | 3279b31b09 | [tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens | 2015-06-29 03:00:46 -04:00 |
| Al | 77760f207c | [tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo | 2015-06-16 12:52:04 -04:00 |
| Al | 2d1c24a6e9 | [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types | 2015-03-24 16:43:53 -04:00 |
| Al | a446290829 | [fix] IDEOGRAM class name | 2015-03-11 17:33:53 -04:00 |
| Al | 3ed5795cff | [fix] fixing some formatting | 2015-03-03 12:54:27 -05:00 |
| Al | 5216aba1b6 | [utils] string utils, file utils, contiguous arrays of strings used for storing tokenized strings, klib for generic hashtables and vectors, antirez's sds for certain types of string building, utf8proc for iterating over utf-8 strings and unicode normalization | 2015-03-03 12:33:13 -05:00 |