[dedupe] adding a near-dupe hash for acronyms both with and without stopwords. This will create basic acronyms for institutions like MoMA, UCLA, the NAACP, as well as human initials, etc. It also handles sub-acronyms, so when either at every other non-contiguous stopword (University of Texas at Austin) or punctuation (University of Texas, Austin), it cuts a new sub-acronym (so UT). All of the acronyms for Latin script use a double metaphone as well, so can potentially catch many cases. It does not handle all possible acronyms (e.g. where some of the letters are word-internal as in medical acronyms), but should do relatively well on many common variations.

2018-01-10 22:23:40 -05:00
parent c29557c16b
commit 6ba0403748
3 changed files with 191 additions and 63 deletions
--- a/src/acronyms.h
+++ b/src/acronyms.h
@@ -9,6 +9,8 @@
 #include "tokens.h"
 #include "token_types.h"

+bool stopword_positions(uint32_array *stopwords_array, const char *str, token_array *tokens, size_t num_languages, char **languages);
+
 phrase_array *acronym_token_alignments(const char *s1, token_array *tokens1, const char *s2, token_array *tokens2, size_t num_languages, char **languages);