[readme] README changes
25
README.md
@@ -1,3 +1,4 @@
|
|||||||
|
<pre>
|
||||||
___ __ __ ___
|
___ __ __ ___
|
||||||
/\_ \ __/\ \ /\ \__ /\_ \
|
/\_ \ __/\ \ /\ \__ /\_ \
|
||||||
\//\ \ /\_\ \ \____ _____ ___ ____\ \ ,_\ __ \//\ \
|
\//\ \ /\_\ \ \____ _____ ___ ____\ \ ,_\ __ \//\ \
|
||||||
@@ -8,6 +9,7 @@
|
|||||||
\ \_\
|
\ \_\
|
||||||
\/_/
|
\/_/
|
||||||
---------------------------------------------------------------------
|
---------------------------------------------------------------------
|
||||||
|
</pre>
|
||||||
|
|
||||||
**N.B.**: libpostal is not publicly released yet and the APIs may change. We
|
**N.B.**: libpostal is not publicly released yet and the APIs may change. We
|
||||||
encourage folks to hold off on including it as a dependency for now.
|
encourage folks to hold off on including it as a dependency for now.
|
||||||
@@ -33,9 +35,9 @@ libpostal's raison d'être
|
|||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
|
libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
|
||||||
the problem of place deduping. In OpenVenues, we have a data set of millions of
|
the problem of venue deduping. In OpenVenues, we have a data set of millions of
|
||||||
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
|
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
|
||||||
The Common Crawl is published every month, and so even merging the results of
|
The Common Crawl is published monthly, and so even merging the results of
|
||||||
two crawls produces significant duplicates.
|
two crawls produces significant duplicates.
|
||||||
|
|
||||||
Deduping is a relatively well-studied field, and for text documents like web
|
Deduping is a relatively well-studied field, and for text documents like web
|
||||||
@@ -75,8 +77,8 @@ only in the OpenVenues case, we do it without a UI or a user to select the
|
|||||||
correct address in an autocomplete. It's server-side batch geocoding
|
correct address in an autocomplete. It's server-side batch geocoding
|
||||||
(and you can too!)
|
(and you can too!)
|
||||||
|
|
||||||
Now, instead of giant Elasticsearch synonyms files, etc.
|
Now, instead of fiddling with giant Elasticsearch synonyms files, scripting,
|
||||||
geocoding can look like this:
|
analyzers, tokenizers, and the like, geocoding can look like this:
|
||||||
|
|
||||||
1. Run the addresses in your index through libpostal
|
1. Run the addresses in your index through libpostal
|
||||||
2. Store the canonical strings
|
2. Store the canonical strings
|
||||||
@@ -183,8 +185,8 @@ challenges libpostal can handle:
|
|||||||
For further reading and some less intuitive examples of addresses, see
|
For further reading and some less intuitive examples of addresses, see
|
||||||
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
|
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
|
||||||
|
|
||||||
Why C (you crazy person)?
|
Why C (i.e. are you crazy)?
|
||||||
-------------------------
|
---------------------------
|
||||||
|
|
||||||
libpostal is written in C for three reasons (in order of importance):
|
libpostal is written in C for three reasons (in order of importance):
|
||||||
|
|
||||||
@@ -230,7 +232,7 @@ libpostal is written in modern, legible, C99.
|
|||||||
- Throughly test for memory leaks before pushing
|
- Throughly test for memory leaks before pushing
|
||||||
- Keep it reasonably cross-platform compatible, particularly for *nix
|
- Keep it reasonably cross-platform compatible, particularly for *nix
|
||||||
|
|
||||||
Language dictinonaries
|
Language dictionaries
|
||||||
----------------------
|
----------------------
|
||||||
|
|
||||||
It's easy to add new languages/synonyms to libpostal by modifying a few text
|
It's easy to add new languages/synonyms to libpostal by modifying a few text
|
||||||
@@ -287,12 +289,12 @@ In most cases better to leave these alone
|
|||||||
|
|
||||||
Most of the dictionaries have been derived with the following process:
|
Most of the dictionaries have been derived with the following process:
|
||||||
|
|
||||||
1. Tokenize all the streets in OSM for a particular language
|
1. Tokenize all the streets in OSM for language x
|
||||||
2. Count the words
|
2. Count the most common N tokens
|
||||||
3. Optionally use frequent item set mining to get frequent phrases
|
3. Optionally use frequent item set mining to get frequent phrases
|
||||||
4. Run the most frequent words/phrases through Google Translate
|
4. Run the most frequent words/phrases through Google Translate
|
||||||
5. Add the ones that mean "street" to dictionaries
|
5. Add the ones that mean "street" to dictionaries
|
||||||
6. Research thoroughfare types in a given country
|
6. Augment by researching addresses in countries speaking language x
|
||||||
|
|
||||||
In the future it might be beneficial to move the dictionaries to a wiki
|
In the future it might be beneficial to move the dictionaries to a wiki
|
||||||
so they can be crowdsourced by native speakers regardless of whether or not
|
so they can be crowdsourced by native speakers regardless of whether or not
|
||||||
@@ -321,6 +323,8 @@ To install via Python, just use:
|
|||||||
pip install https://github.com/openvenues/libpostal.git
|
pip install https://github.com/openvenues/libpostal.git
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Note**: The Python bindings don't implement libpostal's full API currently.
|
||||||
|
|
||||||
Command-line usage
|
Command-line usage
|
||||||
------------------
|
------------------
|
||||||
|
|
||||||
@@ -331,7 +335,6 @@ cd src/
|
|||||||
|
|
||||||
./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
|
./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
|
||||||
#12 345th avenue, suite number 678
|
#12 345th avenue, suite number 678
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Currently libpostal requires two input strings, the address text and a language
|
Currently libpostal requires two input strings, the address text and a language
|
||||||
|
|||||||