libpostal

Author	SHA1	Message	Date
Al	81b4a4a1cb	[tokenization] Hyphens, etc. between non-ASCII digits (e.g. Unicode full-width numbers) should be single tokens	2016-07-21 17:04:57 -04:00
Al	b5d4dd6f37	[tokenization] Including full-width numbers in numeric tokens	2016-07-21 17:04:57 -04:00
Al	2454b98c6d	[tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing	2016-07-21 17:04:57 -04:00
Al	757c6147cb	[tokenization] Adding ability to tokenize 's Gravenhage	2016-07-21 17:04:57 -04:00
Al	de240d2b94	[fix] tokenize_add_tokens respects specified length	2016-01-17 20:51:47 -05:00
Al	aa39c45b87	[tokenization] skipping control characters in tokenization, comes up in OSM surprisingly	2015-10-04 18:25:50 -04:00
Al	0aa6950b6c	[fix] abbreviations	2015-10-02 23:48:21 -04:00
Al	01856dd36d	[fix] acronyms	2015-10-01 00:24:04 -04:00
Al	689b830ad2	[tokenization] Acronym vs abbreviation	2015-09-30 04:10:04 -04:00
Al	172263af58	[tokenization] Adding updated token classes to scanner.re	2015-09-26 00:05:23 -04:00
Al	b4593b6f88	[unicode/tokenization] Using new character classes including wide chars in scanner	2015-09-23 00:33:14 -04:00
Al	71be52275d	[tokenization] Adding a version of tokenize which keeps whitespace tokens	2015-07-21 17:26:20 -04:00
Al	43293d0ae3	[tokenization] Fixing a tokenization where mid-number characters appear in the middle of a word+numeric sequence e.g. Zigor,2 should be 3 separate tokens. Sequences like 35,37,39 are still treated as a single token for the moment.	2015-07-14 18:15:58 -04:00
Al	3279b31b09	[tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens	2015-06-29 03:00:46 -04:00
Al	2b69c185fa	[tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change)	2015-06-25 10:03:34 -04:00
Al	77760f207c	[tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo	2015-06-16 12:52:04 -04:00
Al	1b33744956	[tokenization] Numeric tokens must end in number or letter	2015-04-22 14:55:18 -04:00
Al	606a669c01	[tokenization] breaking dashes or double hyphens break a word while other dashes don't	2015-04-17 19:14:42 -04:00
Al	6718182443	[tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words	2015-04-17 15:21:22 -04:00
Al	79fd7a8ded	[tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string	2015-04-05 16:33:14 -04:00
Al	2d1c24a6e9	[tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types	2015-03-24 16:43:53 -04:00
Al	d2ceb5f418	[fix] removing struct definition from scanner.re for future generation of scanner.c	2015-03-17 19:46:40 -04:00
Al	f794ef7222	[tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation	2015-03-17 18:38:30 -04:00
Al	a446290829	[fix] IDEOGRAM class name	2015-03-11 17:33:53 -04:00
Al	94805fb1a7	[tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters	2015-03-11 17:29:37 -04:00
Al	0689f936c9	[tokenization] scanner/tokenizer (generated with re2c)	2015-03-03 12:35:22 -05:00

26 Commits