Commit Graph

26 Commits

Author SHA1 Message Date
Al
64f167f045 [tokenization] Re-generating scanner 2016-07-21 17:04:57 -04:00
Al
b5d4dd6f37 [tokenization] Including full-width numbers in numeric tokens 2016-07-21 17:04:57 -04:00
Al
2454b98c6d [tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing 2016-07-21 17:04:57 -04:00
Al
757c6147cb [tokenization] Adding ability to tokenize 's Gravenhage 2016-07-21 17:04:57 -04:00
Al
63d239eef0 [tokenization] Using the new re2c 0.16 generates a 75% smaller DFA for the scanner, which should speed up compile times on gcc 2016-01-30 02:20:01 -05:00
Al
de240d2b94 [fix] tokenize_add_tokens respects specified length 2016-01-17 20:51:47 -05:00
Al
0eb9ef5bdf [tokenization] Regenerating scanner.c 2015-10-05 01:41:48 -04:00
Al
0aa6950b6c [fix] abbreviations 2015-10-02 23:48:21 -04:00
Al
01856dd36d [fix] acronyms 2015-10-01 00:24:04 -04:00
Al
562aeb497d [tokenization] Regenerating scanner.c 2015-09-30 11:32:38 -04:00
Al
856198a352 [tokenization] Regenerated scanner.c 2015-09-26 02:27:45 -04:00
Al
f13e9fad90 [tokenization] Regenerated scanner.c 2015-09-23 00:33:27 -04:00
Al
71be52275d [tokenization] Adding a version of tokenize which keeps whitespace tokens 2015-07-21 17:26:20 -04:00
Al
a8b2fb5b90 [tokenization] Regenerating scanner file 2015-07-14 18:16:24 -04:00
Al
3279b31b09 [tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens 2015-06-29 03:00:46 -04:00
Al
2b69c185fa [tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change) 2015-06-25 10:03:34 -04:00
Al
880d444881 [tokenization] Re-generating scanner 2015-06-16 12:52:37 -04:00
Al
1b33744956 [tokenization] Numeric tokens must end in number or letter 2015-04-22 14:55:18 -04:00
Al
606a669c01 [tokenization] breaking dashes or double hyphens break a word while other dashes don't 2015-04-17 19:14:42 -04:00
Al
6718182443 [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words 2015-04-17 15:21:22 -04:00
Al
79fd7a8ded [tokenization/trie] Simpler URL regex reduces the scanner file size; accounting for a few more variations in word tokens; making trie suffix search use iteration instead of malloc'ing a new string 2015-04-05 16:33:14 -04:00
Al
2d1c24a6e9 [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types 2015-03-24 16:43:53 -04:00
Al
f794ef7222 [tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation 2015-03-17 18:38:30 -04:00
Al
a446290829 [fix] IDEOGRAM class name 2015-03-11 17:33:53 -04:00
Al
94805fb1a7 [tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters 2015-03-11 17:29:37 -04:00
Al
0689f936c9 [tokenization] scanner/tokenizer (generated with re2c) 2015-03-03 12:35:22 -05:00