[docs][ci skip] README updates, usage GIFs
249
README.md
@@ -23,44 +23,11 @@ Examples of normalization
 -------------------------
 
 The expand_address API converts messy real-world addresses into normalized
-equivalents suitable for search indexing, hashing, etc. The C API is simple:
+equivalents suitable for search indexing, hashing, etc.
 
-```c
-#include <stdio.h>
-#include <stdlib.h>
-#include <libpostal/libpostal.h>
+Here's an interactive example using the Python binding:
 
-int main(int argc, char **argv) {
-    // Setup (only called once at the beginning of your program)
-    if (!libpostal_setup() || !libpostal_setup_language_classifier()) {
-        exit(EXIT_FAILURE);
-    }
-
-    size_t num_expansions;
-    normalize_options_t options = get_libpostal_default_options();
-    char **expansions = expand_address("Quatre vignt douze Ave des Champs-Élysées", options, &num_expansions);
-
-    for (size_t i = 0; i < num_expansions; i++) {
-        printf("%s\n", expansions[i]);
-    }
-
-    // Free expansions
-    expansion_array_destroy(expansions, num_expansions);
-
-    // Teardown (only called once at the end of your program)
-    libpostal_teardown();
-    libpostal_teardown_language_classifier();
-}
-```
-
-Here's a more succinct example using the Python API:
-
-```python
-from postal.expand import expand_address
-expansions = expand_address('Quatre vignt douze Ave des Champs-Élysées')
-
-assert '92 avenue des champs-elysees' in set(expansions)
-```
+[usage GIF; image link not preserved]
 
 libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given
 address so it can apply the appropriate normalizations. The only input needed is the raw address string.
@@ -80,22 +47,51 @@ libpostal currently supports these types of normalizations in *60+ languages*,
 and you can [add more](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries)
 (without having to write any C).
 
-Now, instead of trying to bake address-specific conventions into traditional
-document search engines like Elasticsearch using giant synonyms files, scripting,
-custom analyzers, tokenizers, and the like, geocoding can look like this:
-
-1. Run the addresses in your database through libpostal's expand_address
-2. Store the normalized string(s) in your favorite search engine, DB,
-   hashtable, etc.
-3. Run your user queries or fresh imports through libpostal and search
-   the existing database using those strings
-
-In this way, libpostal can perform fuzzy address matching in constant time
-relative to the size of the data set.
-
 For further reading and some bizarre address edge-cases, see:
 [Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/).
 
+Usage (normalization)
+---------------------
+
+Here's an example using the Python bindings for succinctness (most of the higher-level language bindings are similar):
+
+```python
+from postal.expand import expand_address
+expansions = expand_address('Quatre-vignt-douze Ave des Champs-Élysées')
+
+assert '92 avenue des champs-elysees' in set(expansions)
+```
+
+The C API equivalent is a few more lines, but still fairly simple:
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpostal/libpostal.h>
+
+int main(int argc, char **argv) {
+    // Setup (only called once at the beginning of your program)
+    if (!libpostal_setup() || !libpostal_setup_language_classifier()) {
+        exit(EXIT_FAILURE);
+    }
+
+    size_t num_expansions;
+    normalize_options_t options = get_libpostal_default_options();
+    char **expansions = expand_address("Quatre-vignt-douze Ave des Champs-Élysées", options, &num_expansions);
+
+    for (size_t i = 0; i < num_expansions; i++) {
+        printf("%s\n", expansions[i]);
+    }
+
+    // Free expansions
+    expansion_array_destroy(expansions, num_expansions);
+
+    // Teardown (only called once at the end of your program)
+    libpostal_teardown();
+    libpostal_teardown_language_classifier();
+}
+```
+
 Examples of parsing
 -------------------
 
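Review note: the three-step recipe removed in the hunk above (and re-added further down in the file) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_expand` is a hypothetical stand-in so the snippet runs without libpostal installed, and `build_index` and `lookup` are made-up helper names, not part of any libpostal binding. With the Python binding available, `postal.expand.expand_address` could be passed as the `expand` argument instead.

```python
from collections import defaultdict


def toy_expand(address):
    """Hypothetical stand-in for postal.expand.expand_address.

    Lowercases, drops commas and expands two abbreviations. The real
    function handles far more cases and usually returns several variants.
    """
    abbreviations = {"ave": "avenue", "st": "street"}
    tokens = address.lower().replace(",", " ").split()
    return [" ".join(abbreviations.get(token, token) for token in tokens)]


def build_index(addresses, expand=toy_expand):
    """Steps 1 and 2: store every normalized string with the IDs that produced it."""
    index = defaultdict(set)
    for record_id, address in addresses.items():
        for normalized in expand(address):
            index[normalized].add(record_id)
    return index


def lookup(index, query, expand=toy_expand):
    """Step 3: normalize the query the same way and probe the index."""
    matches = set()
    for normalized in expand(query):
        matches |= index.get(normalized, set())
    return matches


addresses = {1: "781 Franklin Ave, Brooklyn", 2: "100 Leonard St, London"}
index = build_index(addresses)
print(lookup(index, "781 franklin avenue brooklyn"))  # {1}
```

Each lookup is a handful of hash-table probes (one per expansion of the query), which is where the "constant time relative to the size of the data set" claim comes from: the cost depends on the number of expansions, not on how many addresses were indexed.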
@@ -105,7 +101,23 @@ languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
 address format templates at: https://github.com/OpenCageData/address-formatting
 to construct the training data, supplementing with containing polygons and
 perturbing the inputs in a number of ways to make the parser as robust as possible
-to messy real-world input. Here's a C example:
+to messy real-world input.
 
+These example parse results are taken from the interactive address_parser program
+that builds with libpostal when you run ```make```. Note that the parser is robust to
+commas vs. no commas, casing, different permutations of components (if the input
+is e.g. just city or just city/postcode).
+
+[usage GIF; image link not preserved]
+
+The parser achieves very high accuracy on held-out data, currently 98.9%
+correct full parses (meaning a 1 in the numerator for getting *every* token
+in the address correct).
+
+Usage (parse_address)
+---------------------
+
+Here's a C example of the parser API:
+
 ```c
 #include <stdio.h>
@@ -134,7 +146,7 @@ int main(int argc, char **argv) {
 }
 ```
 
-And the Python API version:
+And the equivalent using the Python bindings:
 
 ```python
 
@@ -142,115 +154,6 @@ from postal.parser import parse_address
 parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
 ```
 
-These example parse results are taken from the interactive address_parser program
-that builds with libpostal when you run make. Note that the parser doesn't care about commas
-vs. no commas, casing, or different permutations of components (if the input is e.g. just
-a city or just city/postcode).
-
-```
-> 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
-
-Result:
-
-{
-  "house_number": "781",
-  "road": "franklin ave",
-  "suburb": "crown heights",
-  "city_district": "brooklyn",
-  "city": "nyc",
-  "state": "ny",
-  "postcode": "11216",
-  "country": "usa"
-}
-
-> The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, England, EC2A 4RH, United Kingdom
-
-Result:
-
-{
-  "house": "the book club",
-  "house_number": "100-106",
-  "road": "leonard st",
-  "suburb": "shoreditch",
-  "city": "london",
-  "state_district": "greater london",
-  "state": "england",
-  "postcode": "ec2a 4rh",
-  "country": "united kingdom"
-}
-
-> Museo del Prado C. de Ruiz de Alarcón, 23 28014 Madrid Madrid, Spain
-
-Result:
-
-{
-  "house": "museo del prado",
-  "road": "c. de ruiz de alarcón",
-  "house_number": "23",
-  "postcode": "28014",
-  "state": "madrid",
-  "city": "madrid",
-  "country": "spain"
-}
-
-> Double Shot Tea & Coffee 15 Melle St. Braamfontein Johannesburg, 2000, South Africa
-
-Result:
-
-{
-  "house": "double shot tea & coffee",
-  "house_number": "15",
-  "road": "melle st.",
-  "suburb": "braamfontein",
-  "city": "johannesburg",
-  "postcode": "2000",
-  "country": "south africa"
-}
-
-> Eschenbräu Bräurei Triftstraße 67, 13353 Berlin, Deutschland
-
-Result:
-
-{
-  "house": "eschenbraeu braeurei",
-  "road": "triftstrasse",
-  "house_number": "67",
-  "postcode": "13353",
-  "city": "berlin",
-  "country": "deutschland"
-}
-
-> Szimpla Kert Kazinczy utca 14 Budapest 1075, Magyarország
-
-Result:
-
-{
-  "house": "szimpla kert",
-  "road": "kazinczy utca",
-  "house_number": "14",
-  "city": "budapest",
-  "postcode": "1075",
-  "country": "magyarország"
-}
-
-> Государственный Эрмитаж Дворцовая наб., 34 191186, St. Petersburg, Russia
-
-Result:
-
-{
-  "house": "государственный эрмитаж",
-  "road": "дворцовая наб.",
-  "house_number": "34",
-  "postcode": "191186",
-  "city": "st. petersburg",
-  "country": "russia"
-}
-```
-
-The parser achieves very high accuracy on held-out data, currently 98.9%
-correct full parses (meaning a 1 in the numerator for getting *every* token
-in the address correct).
-
 Installation
 ------------
 
@@ -502,6 +405,19 @@ of source addresses such as OpenAddresses or OpenStreetMap (or all of the above)
 libpostal can be used to implement things like address deduping and server-side
 batch geocoding in settings like MapReduce or stream processing.
 
+Now, instead of trying to bake address-specific conventions into traditional
+document search engines like Elasticsearch using giant synonyms files, scripting,
+custom analyzers, tokenizers, and the like, geocoding can look like this:
+
+1. Run the addresses in your database through libpostal's expand_address
+2. Store the normalized string(s) in your favorite search engine, DB,
+   hashtable, etc.
+3. Run your user queries or fresh imports through libpostal and search
+   the existing database using those strings
+
+In this way, libpostal can perform fuzzy address matching in constant time
+relative to the size of the data set.
+
 Why C?
 ------
 
@@ -649,7 +565,7 @@ be a better measure than simply looking at whether each token was correct.
 Improving the address parser
 ----------------------------
 
-Though the current parser is quite good for most standard addresses, there
+Though the current parser works quite well for most standard addresses, there
 is still room for improvement, particularly in making sure the training data
 we use is as close as possible to addresses in the wild. There are four primary
 ways the address parser can be improved even further (in order of difficulty):
@@ -676,7 +592,12 @@ ways the address parser can be improved even further (in order of difficulty):
 label sequence not just the token. This may slow down training significantly
 although runtime performance would be relatively unaffected.
 
-Todos
------
+Contributing
+------------
 
-- [ ] Hosted documentation
+Bug reports and pull requests are welcome on GitHub at https://github.com/openvenues/libpostal.
+
+License
+-------
+
+The software is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).