[docs] README changes, code examples
136
README.md
@@ -1,9 +1,8 @@
|
|||||||
# libpostal
|
# libpostal: international address parsing and normalization
|
||||||
|
|
||||||
[](https://travis-ci.org/openvenues/libpostal)
|
[](https://travis-ci.org/openvenues/libpostal)
|
||||||
|
|
||||||
libpostal is a fast NLP library for parsing and normalizing street addresses
|
libpostal is a fast statistical parser/normalizer for international street addresses.
|
||||||
anywhere in the world.
|
|
||||||
|
|
||||||
:jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
|
:jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
|
||||||
|
|
||||||
@@ -16,24 +15,28 @@ designed for document indexing. This library helps convert the free-form
|
|||||||
addresses that humans use into clean normalized forms suitable for machine
|
addresses that humans use into clean normalized forms suitable for machine
|
||||||
comparison and full-text indexing.
|
comparison and full-text indexing.
|
||||||
|
|
||||||
libpostal is not itself a full geocoder, but should be a ubiquitous
|
While not itself a full geocoder, libpostal can be used as a preprocessing step to make any geocoding application simpler and more consistent internationally.
|
||||||
preprocessing step before indexing/searching with free text geographic strings.
|
|
||||||
It is written in C for maximum portability and performance.
|
|
||||||
|
|
||||||
Bindings for [Python](https://github.com/openvenues/pypostal) and [NodeJS](https://github.com/openvenues/node-postal) are officially supported, and it's easy to write bindings in other languages.
|
The core library is written in pure C. Language bindings for [Python](https://github.com/openvenues/pypostal) and [NodeJS](https://github.com/openvenues/node-postal) are officially supported and it's easy to write bindings in other languages.
|
||||||
|
|
||||||
Examples of normalization
|
Examples of normalization
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
Address normalization may sound trivial initially, especially when thinking
|
The expand_address API converts messy real-world addresses into normalized
|
||||||
only about the US (if that's where you happen to reside), but it only takes
|
equivalents suitable for search indexing, hashing, etc. Here's a code example
|
||||||
a few examples to realize how complex natural language addresses can get
|
using the Python API for succinctness:
|
||||||
internationally. Here's a short list of some less straightforward normalizations
|
|
||||||
in various languages. The left/right columns in this table are equivalent
|
```python
|
||||||
strings under libpostal, the left column being user input and the right column
|
|
||||||
being the indexed (normalized) string. Note that libpostal automatically
|
from postal.expand import expand_address
|
||||||
detects the language(s) used in an address and applies the appropriate expansions.
|
expansions = expand_address('Quatre vingt douze Ave des Champs-Élysées')
|
||||||
The only input needed is the raw address string:
|
|
||||||
|
assert '92 avenue des champs-elysees' in set(expansions)
|
||||||
|
```
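
Because one input can yield several expansions, one way to use them for matching is to treat two addresses as candidate duplicates when their expansion sets overlap. The sketch below assumes that approach. `share_expansion` and `toy_expand` are hypothetical names, not part of pypostal; the expansion function is passed in as an argument so the logic can be tried without the library installed.

```python
from typing import Callable, Iterable


def share_expansion(a: str, b: str, expand: Callable[[str], Iterable[str]]) -> bool:
    """Return True if the two inputs have at least one normalized form in common."""
    return not set(expand(a)).isdisjoint(expand(b))


def toy_expand(address: str) -> list:
    # Toy stand-in for a real expansion function: lowercase and drop commas.
    # Real expansions handle abbreviations, numerals, languages and much more.
    return [address.lower().replace(",", "")]


print(share_expansion("781 Franklin Ave, Brooklyn", "781 franklin ave brooklyn", toy_expand))
```

With pypostal installed, passing `expand_address` from `postal.expand` as the `expand` argument would apply the same check to real expansions.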
|
||||||
|
|
||||||
|
libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given
|
||||||
|
address so it can apply the appropriate normalizations. The only input needed is the raw address string.
|
||||||
|
Here's a short list of some less straightforward normalizations in various languages.
|
||||||
|
|
||||||
| Input | Output (may be multiple in libpostal) |
|
| Input | Output (may be multiple in libpostal) |
|
||||||
| ----------------------------------- |-----------------------------------------|
|
| ----------------------------------- |-----------------------------------------|
|
||||||
@@ -46,8 +49,8 @@ The only input needed is the raw address string:
|
|||||||
| Marktstrasse 14 | markt straße 14 |
|
| Marktstrasse 14 | markt straße 14 |
|
||||||
|
|
||||||
libpostal currently supports these types of normalization in *60+ languages*,
|
libpostal currently supports these types of normalization in *60+ languages*,
|
||||||
and you can add more (without having to write any C).
|
and you can [add more](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries)
|
||||||
|
(without having to write any C).
|
||||||
|
|
||||||
Now, instead of trying to bake address-specific conventions into traditional
|
Now, instead of trying to bake address-specific conventions into traditional
|
||||||
document search engines like Elasticsearch using giant synonyms files, scripting,
|
document search engines like Elasticsearch using giant synonyms files, scripting,
|
||||||
@@ -74,12 +77,18 @@ languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
|
|||||||
address format templates at: https://github.com/OpenCageData/address-formatting
|
address format templates at: https://github.com/OpenCageData/address-formatting
|
||||||
to construct the training data, supplementing with containing polygons and
|
to construct the training data, supplementing with containing polygons and
|
||||||
perturbing the inputs in a number of ways to make the parser as robust as possible
|
perturbing the inputs in a number of ways to make the parser as robust as possible
|
||||||
to messy real-world input.
|
to messy real-world input. Here's a code example, again using the Python API:
|
||||||
|
|
||||||
These example parses are taken from the interactive address_parser program
|
```python
|
||||||
that builds with libpostal on make. Note that the parser doesn't care about commas
|
|
||||||
vs. no commas, casing, or different permutations of components (if components are
|
from postal.parser import parse_address
|
||||||
left out e.g. just city or just city/postcode).
|
parse_address('The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom')
|
||||||
|
```
|
||||||
|
|
||||||
|
These example parse results are taken from the interactive address_parser program
|
||||||
|
that builds with libpostal when you run make. Note that the parser doesn't care about commas
|
||||||
|
vs. no commas, casing, or different permutations of components (if the input is e.g. just
|
||||||
|
a city or just city/postcode).
|
||||||
|
|
||||||
```
|
```
|
||||||
> 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
|
> 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
|
||||||
@@ -113,20 +122,21 @@ Result:
|
|||||||
"country": "united kingdom"
|
"country": "united kingdom"
|
||||||
}
|
}
|
||||||
|
|
||||||
> Eschenbräu Bräurei Triftstraße 67, 13353 Berlin, Deutschland
|
> Museo del Prado C. de Ruiz de Alarcón, 23 28014 Madrid Madrid, Spain
|
||||||
|
|
||||||
Result:
|
Result:
|
||||||
|
|
||||||
{
|
{
|
||||||
"house": "eschenbraeu braeurei",
|
"house": "museo del prado",
|
||||||
"road": "triftstrasse",
|
"road": "c. de ruiz de alarcón",
|
||||||
"house_number": "67",
|
"house_number": "23",
|
||||||
"postcode": "13353",
|
"postcode": "28014",
|
||||||
"city": "berlin",
|
"state": "madrid",
|
||||||
"country": "deutschland"
|
"city": "madrid",
|
||||||
|
"country": "spain"
|
||||||
}
|
}
|
||||||
|
|
||||||
> Double Shot Tea & Coffee 15 Melle St. Braamfontein Johannesburg, 2000, South Africa
|
> Double Shot Tea & Coffee 15 Melle St. Braamfontein Johannesburg, 2000, South Africa
|
||||||
|
|
||||||
Result:
|
Result:
|
||||||
|
|
||||||
@@ -140,19 +150,20 @@ Result:
|
|||||||
"country": "south africa"
|
"country": "south africa"
|
||||||
}
|
}
|
||||||
|
|
||||||
> Le Polikarpov 24 cours Honoré d'Estienne d'Orves, 13001 Marseille, France
|
> Szimpla Kert Kazinczy utca 14 Budapest 1075, Magyarország
|
||||||
|
|
||||||
Result:
|
Result:
|
||||||
|
|
||||||
{
|
{
|
||||||
"house": "le polikarpov",
|
"house": "szimpla kert",
|
||||||
"house_number": "24",
|
"road": "kazinczy utca",
|
||||||
"road": "cours honoré d'estienne d'orves",
|
"house_number": "14",
|
||||||
"postcode": "13001",
|
"city": "budapest",
|
||||||
"city": "marseille",
|
"postcode": "1075",
|
||||||
"country": "france"
|
"country": "magyarország"
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
> Государственный Эрмитаж Дворцовая наб., 34 191186, St. Petersburg, Russia
|
> Государственный Эрмитаж Дворцовая наб., 34 191186, St. Petersburg, Russia
|
||||||
|
|
||||||
Result:
|
Result:
|
||||||
@@ -218,7 +229,7 @@ After building libpostal:
|
|||||||
```
|
```
|
||||||
cd src/
|
cd src/
|
||||||
|
|
||||||
./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
|
./libpostal "Quatre vingt douze Ave des Champs-Élysées"
|
||||||
```
|
```
|
||||||
|
|
||||||
Currently libpostal requires two input strings, the address text and a language
|
Currently libpostal requires two input strings, the address text and a language
|
||||||
@@ -241,7 +252,7 @@ parse them and print the result.
|
|||||||
Tests
|
Tests
|
||||||
-----
|
-----
|
||||||
|
|
||||||
libpostal uses [greatest](https://github.com/silentbicycle/greatest) for automated testing. To run the tests, run:
|
libpostal uses [greatest](https://github.com/silentbicycle/greatest) for automated testing. To run the tests, use:
|
||||||
|
|
||||||
```
|
```
|
||||||
make check
|
make check
|
||||||
@@ -305,26 +316,6 @@ languages. Handles languages with concatenated expressions e.g.
|
|||||||
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
|
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
|
||||||
language (IX => 9) which occur in the names of many monarchs, popes, etc.
|
language (IX => 9) which occur in the names of many monarchs, popes, etc.
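
To illustrate the Roman numeral case (IX => 9), here is a minimal sketch of the standard subtractive-notation conversion. It is not libpostal's implementation; `roman_to_int` is a hypothetical helper, and it does not validate that the input is a well-formed numeral.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'IX' to an integer."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the largest one seen
    # so far to its right is subtracted, otherwise it is added.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]  # raises KeyError on non-Roman characters
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


print(roman_to_int("IX"))
```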
|
||||||
|
|
||||||
- **Geographic name aliasing (coming soon)**: New York, NYC and Nueva York alias
|
|
||||||
to New York City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
|
|
||||||
names added by contributors can automatically improve libpostal.
|
|
||||||
|
|
||||||
- **Geographic disambiguation (coming soon)**: There are several equally
|
|
||||||
likely Springfields in the US (formally known as The Simpsons problem), and
|
|
||||||
some context like a state is required to disambiguate. There are also > 1200
|
|
||||||
distinct San Franciscos in the world but the term "San Francisco" almost always
|
|
||||||
refers to the one in California. Williamsburg can refer to a neighborhood in
|
|
||||||
Brooklyn or a city in Virginia. Geo disambiguation is a subset of Word Sense
|
|
||||||
Disambiguation, and attempts to resolve place names in a string to GeoNames
|
|
||||||
entities. This can be useful for city-level geocoding suitable for polygon/area
|
|
||||||
lookup. By default, if there is no other context, as in the San Francisco case,
|
|
||||||
the most populous entity will be selected.
|
|
||||||
|
|
||||||
- **Ambiguous token classification (coming soon)**: e.g. "dr" => "doctor" or
|
|
||||||
"drive" for an English address depending on the context. Multiclass logistic
|
|
||||||
regression trained on OSM addresses, where abbreviations are discouraged,
|
|
||||||
giving us many examples of fully qualified addresses on which to train.
|
|
||||||
|
|
||||||
- **Fast, accurate tokenization/lexing**: clocked at > 1M tokens / sec,
|
- **Fast, accurate tokenization/lexing**: clocked at > 1M tokens / sec,
|
||||||
implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
|
implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
|
||||||
languages character by character instead of on whitespace.
|
languages character by character instead of on whitespace.
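
The character-by-character idea can be sketched with a regular expression that emits one token per ideograph or kana and otherwise splits on whitespace. This is a simplification for illustration only: it is neither a TR-29 implementation nor libpostal's lexer, and the Unicode ranges chosen (Hiragana, Katakana, CJK Unified Ideographs) are an assumption.

```python
import re

# Hiragana + Katakana (U+3040-U+30FF) and CJK Unified Ideographs (U+4E00-U+9FFF).
CJK = "\u3040-\u30ff\u4e00-\u9fff"

# Either a single character from those ranges, or a run of
# non-whitespace characters that are outside them.
TOKEN_RE = re.compile(f"[{CJK}]|[^\\s{CJK}]+")


def tokenize(text: str) -> list:
    """Split text into tokens, one per CJK character, else on whitespace."""
    return TOKEN_RE.findall(text)


print(tokenize("東京都 Shibuya 1-2"))
```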
|
||||||
@@ -346,6 +337,29 @@ Latin scripts in the same address). In transliteration we can use all
|
|||||||
applicable transliterators for a given Unicode script (Greek can for instance
|
applicable transliterators for a given Unicode script (Greek can for instance
|
||||||
be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
|
be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
|
||||||
|
|
||||||
|
Roadmap
|
||||||
|
-------
|
||||||
|
|
||||||
|
- **Geographic name aliasing (coming soon)**: New York, NYC and Nueva York alias
|
||||||
|
to New York City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
|
||||||
|
names added by contributors can automatically improve libpostal.
|
||||||
|
|
||||||
|
- **Geographic disambiguation (coming soon)**: There are several equally
|
||||||
|
likely Springfields in the US (formally known as The Simpsons problem), and
|
||||||
|
some context like a state is required to disambiguate. There are also > 1200
|
||||||
|
distinct San Franciscos in the world but the term "San Francisco" almost always
|
||||||
|
refers to the one in California. Williamsburg can refer to a neighborhood in
|
||||||
|
Brooklyn or a city in Virginia. Geo disambiguation is a subset of Word Sense
|
||||||
|
Disambiguation, and attempts to resolve place names in a string to GeoNames
|
||||||
|
entities. This can be useful for city-level geocoding suitable for polygon/area
|
||||||
|
lookup. By default, if there is no other context, as in the San Francisco case,
|
||||||
|
the most populous entity will be selected.
|
||||||
|
|
||||||
|
- **Ambiguous token classification (coming soon)**: e.g. "dr" => "doctor" or
|
||||||
|
"drive" for an English address depending on the context. Multiclass logistic
|
||||||
|
regression trained on OSM addresses, where abbreviations are discouraged,
|
||||||
|
giving us many examples of fully qualified addresses on which to train.
|
||||||
|
|
||||||
Non-goals
|
Non-goals
|
||||||
---------
|
---------
|
||||||
|
|
||||||
@@ -355,7 +369,7 @@ Non-goals
|
|||||||
Raison d'être
|
Raison d'être
|
||||||
-------------
|
-------------
|
||||||
|
|
||||||
libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of
|
libpostal was originally created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of
|
||||||
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
|
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
|
||||||
The Common Crawl is published monthly, and so even merging the results of
|
The Common Crawl is published monthly, and so even merging the results of
|
||||||
two crawls produces significant duplicates.
|
two crawls produces significant duplicates.
|
||||||
|
|||||||