[fix] phrasing
16
README.md
@@ -41,7 +41,7 @@ The Common Crawl is published monthly, and so even merging the results of
 two crawls produces significant duplicates.
 
 Deduping is a relatively well-studied field, and for text documents like web
-pages, academic papers, etc. we've arrived at pretty decent approximate
+pages, academic papers, etc. there exist pretty decent approximate
 similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).
 
 However, for physical addresses, the frequent use of conventional abbreviations
@@ -72,9 +72,9 @@ actually match and make automated server-side decisions about dupes.
 Isn't that geocoding?
 ---------------------
 
-If the above sounds a lot like geocoding, that's because it's very similar,
+If the above sounds a lot like geocoding, that's because it is in a way,
 only in the OpenVenues case, we do it without a UI or a user to select the
-correct address in an autocomplete. It's server-side batch geocoding
+correct address in an autocomplete. libpostal does server-side batch geocoding
 (and you can too!)
 
 Now, instead of fiddling with giant Elasticsearch synonyms files, scripting,
@@ -142,12 +142,12 @@ implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
 languages chracter by character instead of on whitespace.
 
 - **UTF8 normalization**: optionally decompose UTF8 to NFD normalization form,
-strips accent marks e.g. à => a and/or apply Latin-ASCII transliteration.
+strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.
 
 - **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
-[CLDR transforms][http://www.unicode.org/repos/cldr/trunk/common/transforms/], which is what ICU uses,
+[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same as used by ICU,
-but libpostal doesn't require pulling in all of ICU (possibly conflicting with
+though libpostal doesn't require pulling in all of ICU (might conflict
-your system's version). Note: some languages, particularly Hebrew, Arabic
+with your system's version). Note: some languages, particularly Hebrew, Arabic
 and Thai may not include vowels andthus will not often match a transliteration
 done by a human. It may be possible to implement statistical transliterators
 for some of these languages.
@@ -223,7 +223,7 @@ Design philosophy
 
 libpostal is written in modern, legible, C99.
 
-- Keep it object-oriented(-ish)
+- Keep it roughly object-oriented, as allowed by C
 - Confine almost all mallocs to *name*_new and all frees to *name*_destroy
 - Don't write custom hashtables, sorting algorithms, other undergrad CS stuff
 - Use generic containers from klib where possible
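
Note on the last hunk: the `*name*_new` / `*name*_destroy` convention in the design philosophy list can be illustrated with a minimal C99 sketch. The type `scratch_buffer_t` and its functions are hypothetical names invented for this illustration and are not taken from the libpostal sources. The sketch only shows the general shape of the pattern, with allocation confined to the constructor and release confined to the destructor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration only; not part of libpostal. */
typedef struct scratch_buffer {
    size_t len;  /* bytes currently used, excluding the terminating NUL */
    size_t cap;  /* maximum number of bytes the buffer can hold */
    char *data;  /* NUL-terminated storage owned by this object */
} scratch_buffer_t;

/* Constructor: the only place this object allocates memory. */
scratch_buffer_t *scratch_buffer_new(size_t cap) {
    scratch_buffer_t *buf = malloc(sizeof(*buf));
    if (buf == NULL) return NULL;

    buf->data = malloc(cap + 1); /* +1 for the terminating NUL */
    if (buf->data == NULL) {
        free(buf);
        return NULL;
    }
    buf->len = 0;
    buf->cap = cap;
    buf->data[0] = '\0';
    return buf;
}

/* Destructor: the only place this object frees memory.
 * Accepting NULL keeps error-handling paths in callers simple. */
void scratch_buffer_destroy(scratch_buffer_t *buf) {
    if (buf == NULL) return;
    free(buf->data);
    free(buf);
}

/* "Method": takes the object as its first argument and never allocates.
 * Returns false if the string does not fit in the remaining capacity. */
bool scratch_buffer_append(scratch_buffer_t *buf, const char *s) {
    assert(buf != NULL && s != NULL);
    size_t n = strlen(s);
    if (n > buf->cap - buf->len) return false;
    memcpy(buf->data + buf->len, s, n + 1); /* copies the NUL as well */
    buf->len += n;
    return true;
}
```

A caller pairs each `scratch_buffer_new` with exactly one `scratch_buffer_destroy`, so ownership stays easy to audit. The fixed capacity is a deliberate simplification that keeps the sketch from hand-rolling a growable container, which the list above advises against in favor of klib.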