[docs] README changes, code examples

Al
2016-02-01 17:16:48 -05:00
parent 2100b80f81
commit 6dcc71d87b

136
README.md

@@ -1,9 +1,8 @@
# libpostal
# libpostal: international address parsing and normalization
[![Build Status](https://travis-ci.org/openvenues/libpostal.svg?branch=master)](https://travis-ci.org/openvenues/libpostal)
libpostal is a fast NLP library for parsing and normalizing street addresses
anywhere in the world.
libpostal is a fast statistical parser/normalizer for international street addresses.
:jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
@@ -16,24 +15,28 @@ designed for document indexing. This library helps convert the free-form
addresses that humans use into clean normalized forms suitable for machine
comparison and full-text indexing.
libpostal is not itself a full geocoder, but should be a ubiquitous
preprocessing step before indexing/searching with free text geographic strings.
It is written in C for maximum portability and performance.
While not itself a full geocoder, libpostal can be used as a preprocessing step to make any geocoding application simpler and more consistent internationally.
Bindings for [Python](https://github.com/openvenues/pypostal) and [NodeJS](https://github.com/openvenues/node-postal) are officially supported, and it's easy to write bindings in other languages.
The core library is written in pure C. Language bindings for [Python](https://github.com/openvenues/pypostal) and [NodeJS](https://github.com/openvenues/node-postal) are officially supported, and it's easy to write bindings in other languages.
Examples of normalization
-------------------------
Address normalization may sound trivial initially, especially when thinking
only about the US (if that's where you happen to reside), but it only takes
a few examples to realize how complex natural language addresses can get
internationally. Here's a short list of some less straightforward normalizations
in various languages. The left/right columns in this table are equivalent
strings under libpostal, the left column being user input and the right column
being the indexed (normalized) string. Note that libpostal automatically
detects the language(s) used in an address and applies the appropriate expansions.
The only input needed is the raw address string:
The expand_address API converts messy real-world addresses into normalized
equivalents suitable for search indexing, hashing, etc. Here's a code example
using the Python API for succinctness:
```python
from postal.expand import expand_address
expansions = expand_address('Quatre vingt douze Ave des Champs-Élysées')
assert '92 avenue des champs-elysees' in set(expansions)
```
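One way to use such expansions for hashing is to group addresses that share at least one normalized form. The sketch below is only an illustration: `candidate_duplicates` and `toy_expand` are hypothetical names, not part of libpostal. `toy_expand` is a stand-in that only lowercases and collapses whitespace, so the example runs without libpostal installed. In real use you would pass `expand_address` as the `expand` argument.

```python
from collections import defaultdict


def candidate_duplicates(addresses, expand):
    """Return sorted groups of addresses that share at least one expansion.

    `expand` is any callable mapping an address string to a list of
    normalized strings (e.g. postal.expand.expand_address).
    """
    groups = defaultdict(set)
    for address in addresses:
        for expansion in expand(address):
            groups[expansion].add(address)
    # Several expansions may yield the same group, so deduplicate the groups.
    unique = {frozenset(group) for group in groups.values() if len(group) > 1}
    return sorted(sorted(group) for group in unique)


def toy_expand(address):
    # Hypothetical stand-in, NOT libpostal: lowercase + collapse whitespace.
    return [" ".join(address.lower().split())]


print(candidate_duplicates(["30 W 26th St", "30 w 26th st", "1 Main St"], toy_expand))
```

Keying on every expansion rather than a single canonical string means two inputs match as soon as any of their normalized forms coincide.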
libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given
address so it can apply the appropriate normalizations. The only input needed is the raw address string.
Here's a short list of some less straightforward normalizations in various languages.
| Input | Output (may be multiple in libpostal) |
| ----------------------------------- |-----------------------------------------|
@@ -46,8 +49,8 @@ The only input needed is the raw address string:
| Marktstrasse 14 | markt straße 14 |
libpostal currently supports these types of normalization in *60+ languages*,
and you can add more (without having to write any C).
and you can [add more](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries)
(without having to write any C).
Now, instead of trying to bake address-specific conventions into traditional
document search engines like Elasticsearch using giant synonyms files, scripting,
@@ -74,12 +77,18 @@ languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
address format templates at: https://github.com/OpenCageData/address-formatting
to construct the training data, supplementing with containing polygons and
perturbing the inputs in a number of ways to make the parser as robust as possible
to messy real-world input.
to messy real-world input. Here's a code example, again using the Python API:
These example parses are taken from the interactive address_parser program
that builds with libpostal on make. Note that the parser doesn't care about commas
vs. no commas, casing, or different permutations of components (if components are
left out e.g. just city or just city/postcode).
```python
from postal.parser import parse_address
parse_address('The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom')
```
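For downstream code it is often convenient to look components up by label. The sketch below assumes that the Python binding returns a list of `(value, label)` pairs. `parse_to_dict` is a hypothetical helper, not part of the library, and the sample pairs are copied by hand from the Szimpla Kert result shown in this README rather than produced by a live call.

```python
def parse_to_dict(pairs):
    """Collapse (value, label) pairs into a {label: value} dict.

    Repeated labels are joined with a space in their original order.
    """
    components = {}
    for value, label in pairs:
        components.setdefault(label, []).append(value)
    return {label: " ".join(values) for label, values in components.items()}


# Assumed output shape; values taken from the README's example result.
sample = [
    ("szimpla kert", "house"),
    ("kazinczy utca", "road"),
    ("14", "house_number"),
    ("budapest", "city"),
    ("1075", "postcode"),
    ("magyarország", "country"),
]

parsed = parse_to_dict(sample)
print(parsed["road"], parsed["house_number"])
```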
These example parse results are taken from the interactive address_parser program
that builds with libpostal when you run make. Note that the parser doesn't care about commas
vs. no commas, casing, or different permutations of components (if the input is e.g. just
a city or just city/postcode).
```
> 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
@@ -113,20 +122,21 @@ Result:
"country": "united kingdom"
}
> Eschenbräu Bräurei Triftstraße 67, 13353 Berlin, Deutschland
> Museo del Prado C. de Ruiz de Alarcón, 23 28014 Madrid Madrid, Spain
Result:
{
"house": "eschenbraeu braeurei",
"road": "triftstrasse",
"house_number": "67",
"postcode": "13353",
"city": "berlin",
"country": "deutschland"
"house": "museo del prado",
"road": "c. de ruiz de alarcón",
"house_number": "23",
"postcode": "28014",
"state": "madrid",
"city": "madrid",
"country": "spain"
}
> Double Shot Tea & Coffee 15 Melle St. Braamfontein Johannesburg, 2000, South Africa
> Double Shot Tea & Coffee 15 Melle St. Braamfontein Johannesburg, 2000, South Africa
Result:
@@ -140,19 +150,20 @@ Result:
"country": "south africa"
}
> Le Polikarpov 24 cours Honoré d'Estienne d'Orves, 13001 Marseille, France
> Szimpla Kert Kazinczy utca 14 Budapest 1075, Magyarország
Result:
{
"house": "le polikarpov",
"house_number": "24",
"road": "cours honoré d'estienne d'orves",
"postcode": "13001",
"city": "marseille",
"country": "france"
"house": "szimpla kert",
"road": "kazinczy utca",
"house_number": "14",
"city": "budapest",
"postcode": "1075",
"country": "magyarország"
}
> Государственный Эрмитаж Дворцовая наб., 34 191186, St. Petersburg, Russia
Result:
@@ -218,7 +229,7 @@ After building libpostal:
```
cd src/
./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
./libpostal "Quatre vingt douze Ave des Champs-Élysées"
```
Currently libpostal requires two input strings, the address text and a language
@@ -241,7 +252,7 @@ parse them and print the result.
Tests
-----
libpostal uses [greatest](https://github.com/silentbicycle/greatest) for automated testing. To run the tests, run:
libpostal uses [greatest](https://github.com/silentbicycle/greatest) for automated testing. To run the tests, use:
```
make check
@@ -305,26 +316,6 @@ languages. Handles languages with concatenated expressions e.g.
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
language (IX => 9) which occur in the names of many monarchs, popes, etc.
- **Geographic name aliasing (coming soon)**: New York, NYC and Nueva York alias
to New York City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
names added by contributors can automatically improve libpostal.
- **Geographic disambiguation (coming soon)**: There are several equally
likely Springfields in the US (formally known as The Simpsons problem), and
some context like a state is required to disambiguate. There are also > 1200
distinct San Franciscos in the world but the term "San Francisco" almost always
refers to the one in California. Williamsburg can refer to a neighborhood in
Brooklyn or a city in Virginia. Geo disambiguation is a subset of Word Sense
Disambiguation, and attempts to resolve place names in a string to GeoNames
entities. This can be useful for city-level geocoding suitable for polygon/area
lookup. By default, if there is no other context, as in the San Francisco case,
the most populous entity will be selected.
- **Ambiguous token classification (coming soon)**: e.g. "dr" => "doctor" or
"drive" for an English address depending on the context. Multiclass logistic
regression trained on OSM addresses, where abbreviations are discouraged,
giving us many examples of fully qualified addresses on which to train.
- **Fast, accurate tokenization/lexing**: clocked at > 1M tokens / sec,
implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
languages character by character instead of on whitespace.
@@ -346,6 +337,29 @@ Latin scripts in the same address). In transliteration we can use all
applicable transliterators for a given Unicode script (Greek can for instance
be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
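The Roman numeral normalization mentioned in the feature list (IX => 9) can be illustrated with a short sketch. This is not libpostal's implementation, which lives in the C core. `roman_to_int` is a hypothetical helper that applies the usual subtractive rule and does no validation of malformed numerals.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'IX' to an integer.

    Scans right to left: a symbol smaller than the largest one seen so far
    is subtracted, otherwise it is added. Raises KeyError on other characters.
    """
    total = 0
    largest_seen = 0
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


print(roman_to_int("IX"))
```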
Roadmap
-------
- **Geographic name aliasing (coming soon)**: New York, NYC and Nueva York alias
to New York City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
names added by contributors can automatically improve libpostal.
- **Geographic disambiguation (coming soon)**: There are several equally
likely Springfields in the US (formally known as The Simpsons problem), and
some context like a state is required to disambiguate. There are also > 1200
distinct San Franciscos in the world but the term "San Francisco" almost always
refers to the one in California. Williamsburg can refer to a neighborhood in
Brooklyn or a city in Virginia. Geo disambiguation is a subset of Word Sense
Disambiguation, and attempts to resolve place names in a string to GeoNames
entities. This can be useful for city-level geocoding suitable for polygon/area
lookup. By default, if there is no other context, as in the San Francisco case,
the most populous entity will be selected.
- **Ambiguous token classification (coming soon)**: e.g. "dr" => "doctor" or
"drive" for an English address depending on the context. Multiclass logistic
regression trained on OSM addresses, where abbreviations are discouraged,
giving us many examples of fully qualified addresses on which to train.
Non-goals
---------
@@ -355,7 +369,7 @@ Non-goals
Raison d'être
-------------
libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of
libpostal was originally created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
The Common Crawl is published monthly, and so even merging the results of
two crawls produces significant duplicates.