[readme] README changes

Al
2015-09-24 23:16:07 -04:00
parent 3e256404b9
commit f62cfb9551


@@ -1,3 +1,4 @@
+<pre>
 ___ __ __ ___
 /\_ \ __/\ \ /\ \__ /\_ \
 \//\ \ /\_\ \ \____ _____ ___ ____\ \ ,_\ __ \//\ \
@@ -8,6 +9,7 @@
 \ \_\
 \/_/
 ---------------------------------------------------------------------
+</pre>
 **N.B.**: libpostal is not publicly released yet and the APIs may change. We
 encourage folks to hold off on including it as a dependency for now.
@@ -33,9 +35,9 @@ libpostal's raison d'être
 -------------------------
 libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
-the problem of place deduping. In OpenVenues, we have a data set of millions of
+the problem of venue deduping. In OpenVenues, we have a data set of millions of
 places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
-The Common Crawl is published every month, and so even merging the results of
+The Common Crawl is published monthly, and so even merging the results of
 two crawls produces significant duplicates.
 Deduping is a relatively well-studied field, and for text documents like web
@@ -75,8 +77,8 @@ only in the OpenVenues case, we do it without a UI or a user to select the
 correct address in an autocomplete. It's server-side batch geocoding
 (and you can too!)
-Now, instead of giant Elasticsearch synonyms files, etc.
-geocoding can look like this:
+Now, instead of fiddling with giant Elasticsearch synonyms files, scripting,
+analyzers, tokenizers, and the like, geocoding can look like this:
 1. Run the addresses in your index through libpostal
 2. Store the canonical strings
@@ -183,8 +185,8 @@ challenges libpostal can handle:
 For further reading and some less intuitive examples of addresses, see
 "[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
-Why C (you crazy person)?
--------------------------
+Why C (i.e. are you crazy)?
+---------------------------
 libpostal is written in C for three reasons (in order of importance):
@@ -230,7 +232,7 @@ libpostal is written in modern, legible, C99.
 - Throughly test for memory leaks before pushing
 - Keep it reasonably cross-platform compatible, particularly for *nix
-Language dictinonaries
+Language dictionaries
 ----------------------
 It's easy to add new languages/synonyms to libpostal by modifying a few text
@@ -287,12 +289,12 @@ In most cases better to leave these alone
 Most of the dictionaries have been derived with the following process:
-1. Tokenize all the streets in OSM for a particular language
-2. Count the words
+1. Tokenize all the streets in OSM for language x
+2. Count the most common N tokens
 3. Optionally use frequent item set mining to get frequent phrases
 4. Run the most frequent words/phrases through Google Translate
 5. Add the ones that mean "street" to dictionaries
-6. Research thoroughfare types in a given country
+6. Augment by researching addresses in countries speaking language x
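Review note: steps 1 and 2 of the revised list (tokenize the streets, then count the most common N tokens) can be sketched roughly as below. This is only an illustration under stated assumptions: the street names are assumed to be already extracted from OSM into a list of strings, and `top_tokens`, the regex tokenizer, and the sample data are hypothetical and not part of libpostal's actual tooling.

```python
import re
from collections import Counter


def top_tokens(street_names, n):
    """Return the n most common lowercase word tokens across street names.

    Hypothetical helper: a simple regex split stands in for whatever
    tokenizer the real pipeline uses.
    """
    counts = Counter()
    for name in street_names:
        # \w+ matches runs of Unicode word characters, dropping punctuation.
        counts.update(re.findall(r"\w+", name.lower()))
    return counts.most_common(n)


# Made-up sample input standing in for OSM street names.
streets = [
    "Main Street",
    "Elm Street",
    "Oak Street",
    "Park Avenue",
    "Fifth Avenue",
    "Rue de la Paix",
]

print(top_tokens(streets, 2))  # [('street', 3), ('avenue', 2)]
```

The most frequent tokens would then be the candidates to translate and review by hand in steps 4 and 5.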
 In the future it might be beneficial to move the dictionaries to a wiki
 so they can be crowdsourced by native speakers regardless of whether or not
@@ -321,6 +323,8 @@ To install via Python, just use:
 pip install https://github.com/openvenues/libpostal.git
 ```
+**Note**: The Python bindings don't implement libpostal's full API currently.
 Command-line usage
 ------------------
@@ -331,7 +335,6 @@ cd src/
 ./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
 #12 345th avenue, suite number 678
 ```
 Currently libpostal requires two input strings, the address text and a language