[similarity] adding a stopword-aware acronym alignment method for matching U.N. with United Nations, Museum of Modern Art with MoMA, and University of California - Los Angeles with UCLA. All of these should work across languages, including non-Latin scripts like Cyrillic (but not ideograms, where the concept of an acronym makes less sense). Skipping tokens like "of" or "the" depends only on a stopwords dictionary being defined for the given language.
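To illustrate the idea described in the commit message, here is a minimal sketch of stopword-aware acronym matching. It is not the code from this commit: `acronym_matches`, `align`, `is_stopword`, and the hard-coded `STOPWORDS` list are hypothetical names invented for the example. It also assumes ASCII input, whereas the commit's method is meant to work on tokenized UTF-8 text with per-language stopword dictionaries. A stopword may either contribute its initial (the "o" in MoMA) or be skipped (the "of" in UCLA), so the sketch tries both.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stopword list for the example; a real implementation
 * would look tokens up in a per-language dictionary. */
static const char *STOPWORDS[] = {"of", "the", "and", "for"};
static const size_t NUM_STOPWORDS = sizeof(STOPWORDS) / sizeof(STOPWORDS[0]);

/* Case-insensitive check of tok[0..len) against the stopword list. */
static bool is_stopword(const char *tok, size_t len) {
    for (size_t i = 0; i < NUM_STOPWORDS; i++) {
        if (strlen(STOPWORDS[i]) != len) continue;
        size_t j = 0;
        while (j < len && tolower((unsigned char)tok[j]) == STOPWORDS[i][j]) j++;
        if (j == len) return true;
    }
    return false;
}

/* Skip periods, spaces, hyphens and other separators (ASCII only). */
static const char *skip_non_alnum(const char *p) {
    while (*p && !isalnum((unsigned char)*p)) p++;
    return p;
}

/* Recursively align the remaining acronym letters with the remaining
 * phrase tokens. Every non-stopword token must supply the next acronym
 * letter; a stopword token may supply it or be skipped. */
static bool align(const char *acr, const char *phrase) {
    acr = skip_non_alnum(acr);
    phrase = skip_non_alnum(phrase);
    if (*phrase == '\0') return *acr == '\0';

    /* Find the end of the current phrase token. */
    const char *end = phrase;
    while (*end && isalnum((unsigned char)*end)) end++;
    size_t len = (size_t)(end - phrase);

    /* Option 1: the token's initial matches the next acronym letter. */
    if (*acr &&
        tolower((unsigned char)*acr) == tolower((unsigned char)*phrase) &&
        align(acr + 1, end)) {
        return true;
    }
    /* Option 2: the token is a stopword and contributes no letter. */
    return is_stopword(phrase, len) && align(acr, end);
}

/* True if every letter of `acronym` aligns with the initials of `phrase`. */
bool acronym_matches(const char *acronym, const char *phrase) {
    if (*skip_non_alnum(acronym) == '\0') return false; /* empty acronym */
    return align(acronym, phrase);
}
```

Under these assumptions, `acronym_matches("U.N.", "United Nations")`, `acronym_matches("MoMA", "Museum of Modern Art")`, and `acronym_matches("UCLA", "University of California - Los Angeles")` all return true, while `acronym_matches("UN", "United States")` returns false. Trying both options for a stopword avoids a greedy mismatch in cases such as "UO" against "University of Oregon", where the "o" of "of" must not consume the acronym's final letter.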
15
src/acronyms.h
Normal file
@@ -0,0 +1,15 @@
#ifndef ACRONYMS_H
#define ACRONYMS_H

#include <stdio.h>
#include <stdlib.h>

#include "address_dictionary.h"
#include "collections.h"
#include "tokens.h"
#include "token_types.h"

phrase_array *acronym_token_alignments(const char *s1, token_array *tokens1, const char *s2, token_array *tokens2, size_t num_languages, char **languages);


#endif