[similarity] adding a stopword-aware acronym alignment method for matching "U.N." with "United Nations", "Museum of Modern Art" with "MoMA", and names like "University of California - Los Angeles" with "UCLA". All of these should work across languages, including non-Latin scripts such as Cyrillic (but not ideographic scripts, where the concept of an acronym makes less sense). Skipping tokens like "of" or "the" requires only that a stopwords dictionary be defined for the given language.
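
To make the idea of stopword-aware alignment concrete, here is a minimal sketch. It is not the code from this commit: it handles ASCII only, and `acronym_matches`, `match_from`, `is_stopword`, and the inline `STOPWORDS` list are hypothetical names invented for illustration (the commit message says the real stopwords come from a per-language dictionary). The one design point it tries to show is that a stopword is optional: it may be skipped (as "of" is for UCLA) or contribute its initial (the "o" in MoMA).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-language stopwords dictionary. */
static const char *STOPWORDS[] = {"of", "the", "and", "for"};

/* Case-insensitive check of a token (not NUL-terminated) against the list. */
static bool is_stopword(const char *tok, size_t len) {
    for (size_t i = 0; i < sizeof(STOPWORDS) / sizeof(STOPWORDS[0]); i++) {
        const char *sw = STOPWORDS[i];
        if (strlen(sw) != len) continue;
        size_t j = 0;
        while (j < len && tolower((unsigned char)tok[j]) == sw[j]) j++;
        if (j == len) return true;
    }
    return false;
}

/* Recursively align the remaining acronym letters `a` with the remaining
   phrase `p`. Periods in the acronym and non-alphanumeric separators in
   the phrase are ignored. */
static bool match_from(const char *a, const char *p) {
    while (*a == '.') a++;
    while (*p != '\0' && !isalnum((unsigned char)*p)) p++;
    if (*p == '\0') return *a == '\0';  /* both sides must be used up */

    const char *start = p;
    while (isalnum((unsigned char)*p)) p++;

    bool initial_ok = *a != '\0' &&
        toupper((unsigned char)*a) == toupper((unsigned char)*start);

    if (is_stopword(start, (size_t)(p - start))) {
        /* Either skip the stopword, or let it contribute its initial. */
        return match_from(a, p) || (initial_ok && match_from(a + 1, p));
    }
    /* A regular token must supply the next acronym letter. */
    return initial_ok && match_from(a + 1, p);
}

bool acronym_matches(const char *acronym, const char *phrase) {
    return match_from(acronym, phrase);
}
```

Under these assumptions, `acronym_matches("MoMA", "Museum of Modern Art")` and `acronym_matches("UCLA", "University of California - Los Angeles")` both succeed, while `acronym_matches("UN", "United States")` does not. A multilingual version would need to iterate over Unicode codepoints rather than bytes.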
@@ -314,6 +314,12 @@ inline bool utf8_is_hyphen(int32_t ch) {
     return cat == UTF8PROC_CATEGORY_PD || ch == 0x2212;
 }
 
+#define PERIOD_CODEPOINT 46
+
+inline bool utf8_is_period(int32_t codepoint) {
+    return codepoint == PERIOD_CODEPOINT;
+}
+
 inline bool utf8_is_punctuation(int cat) {
     return cat == UTF8PROC_CATEGORY_PD || cat == UTF8PROC_CATEGORY_PE \
            || cat == UTF8PROC_CATEGORY_PF || cat == UTF8PROC_CATEGORY_PI \
@@ -703,8 +709,6 @@ ssize_t string_next_codepoint(char *str, uint32_t codepoint) {
     return string_next_codepoint_len(str, codepoint, strlen(str));
 }
 
-#define PERIOD_CODEPOINT 46
-
 ssize_t string_next_period_len(char *str, size_t len) {
     return string_next_codepoint_len(str, PERIOD_CODEPOINT, len);
 }
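
As a rough illustration of why a period predicate is useful for acronym matching, the sketch below uses `utf8_is_period` to drop periods so that "U.N." can be compared as "UN". The `strip_periods` helper is hypothetical and does not appear in this commit. It scans bytes rather than decoded codepoints, which is safe here only because the period (codepoint 46) is ASCII and that byte value never occurs inside a multi-byte UTF-8 sequence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PERIOD_CODEPOINT 46

/* Same predicate as in the diff; declared `static inline` here so this
   standalone sketch links cleanly as a single translation unit. */
static inline bool utf8_is_period(int32_t codepoint) {
    return codepoint == PERIOD_CODEPOINT;
}

/* Hypothetical helper: copy `src` into `dst` without periods and return
   the number of bytes written (excluding the terminating NUL).
   `dst` must have room for at least strlen(src) + 1 bytes. */
size_t strip_periods(const char *src, char *dst) {
    size_t n = 0;
    for (; *src != '\0'; src++) {
        if (!utf8_is_period((int32_t)(unsigned char)*src)) {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Ignoring periods during alignment, rather than copying the string, would avoid the extra buffer; the copy is used here only to keep the example short.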