Merge pull request #699 from le0pard/patch-1
Update README.md with new server
@@ -11,7 +11,7 @@ libpostal is a C library for parsing/normalizing street addresses around the wor
- **Original post**: [Statistical NLP on OpenStreetMap](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86)
- **Follow-up for 1.0 release**: [Statistical NLP on OpenStreetMap: Part 2](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718)
<span>🇧🇷</span> <span>🇫🇮</span> <span>🇳🇬</span> :jp: <span>🇽🇰 </span> <span>🇧🇩 </span> <span>🇵🇱 </span> <span>🇻🇳 </span> <span>🇧🇪 </span> <span>🇲🇦 </span> <span>🇺🇦 </span> <span>🇯🇲 </span> :ru: <span>🇮🇳 </span> <span>🇱🇻 </span> <span>🇧🇴 </span> :de: <span>🇸🇳 </span> <span>🇦🇲 </span> :kr: <span>🇳🇴 </span> <span>🇲🇽 </span> <span>🇨🇿 </span> <span>🇹🇷 </span> :es: <span>🇸🇸 </span> <span>🇪🇪 </span> <span>🇧🇭 </span> <span>🇳🇱 </span> :cn: <span>🇵🇹 </span> <span>🇵🇷 </span> :gb: <span>🇵🇸 </span>
Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
@@ -225,7 +225,7 @@ Examples of parsing
libpostal's international address parser uses machine learning (Conditional Random Fields) and is trained on over 1 billion addresses in every inhabited country on Earth. We use [OpenStreetMap](https://openstreetmap.org) and [OpenAddresses](https://openaddresses.io) as sources of structured addresses, and the OpenCage address format templates at: https://github.com/OpenCageData/address-formatting to construct the training data, supplementing with containing polygons, and generating sub-building components like apartment/floor numbers and PO boxes. We also add abbreviations, drop out components at random, etc. to make the parser as robust as possible to messy real-world input.
These example parse results are taken from the interactive address_parser program
that builds with libpostal when you run ```make```. Note that the parser can handle
commas vs. no commas as well as various casings and permutations of components (if the input
is e.g. just city or just city/postcode).
@@ -306,14 +306,14 @@ Examples of normalization
-------------------------
The expand_address API converts messy real-world addresses into normalized
equivalents suitable for search indexing, hashing, etc.

Here's an interactive example using the Python binding:

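The screenshot above shows the Python binding's expand_address in action. As a rough sketch of what expansion does conceptually (the function name and tiny abbreviation table below are invented for illustration, not libpostal's API or dictionaries):

```python
# Toy sketch of address expansion: map common abbreviations to canonical
# words and emit every normalized candidate string. libpostal's real
# expand_address uses large per-language dictionaries and likewise returns
# multiple normalized forms; this stand-in knows a few English tokens only.
ABBREVIATIONS = {
    "st": ["street", "saint"],  # ambiguous: expansion keeps every reading
    "ave": ["avenue"],
    "rd": ["road"],
    "w": ["west"],
}

def expand_address_sketch(address):
    """Return all normalized expansions of an address string."""
    candidates = [[]]
    for token in address.lower().replace(",", " ").split():
        token = token.rstrip(".")
        options = ABBREVIATIONS.get(token, [token])
        candidates = [c + [opt] for c in candidates for opt in options]
    return [" ".join(c) for c in candidates]
```

Returning every plausible expansion (both "street" and "saint" for "st") is the point: a search index that stores all of them can match whichever form a query uses.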
libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given
address so it can apply the appropriate normalizations. The only input needed is the raw address string.
Here's a short list of some less straightforward normalizations in various languages.

| Input | Output (may be multiple in libpostal) |
@@ -437,6 +437,7 @@ Libpostal is designed to be used by higher-level languages. If you don't see yo
**Unofficial servers**
- Libpostal REST Go server (needs ~4 GB of memory) with basic security: [postal_server](https://github.com/le0pard/postal_server)
- Libpostal REST Go Docker: [libpostal-rest-docker](https://github.com/johnlonganecker/libpostal-rest-docker)
- Libpostal REST FastAPI Docker: [libpostal-fastapi](https://github.com/alpha-affinity/libpostal-fastapi)
- Libpostal ZeroMQ Docker: [libpostal-zeromq](https://github.com/pasupulaphani/libpostal-docker)
@@ -460,7 +461,7 @@ Data files
libpostal needs to download some data files from S3. The basic files are on-disk
representations of the data structures necessary to perform expansion. For address
parsing, since model training takes a few days, we publish the fully trained model
to S3 and will update it automatically as new addresses get added to OSM, OpenAddresses, etc. Same goes for the language classifier model.

Data files are automatically downloaded when you run make. To check for and download
@@ -511,7 +512,7 @@ optionally be separated so Rosenstraße and Rosen Straße are equivalent.
- **International address parsing**: [Conditional Random Field](https://web.archive.org/web/20240104172655/http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/) which parses
"123 Main Street New York New York" into {"house_number": 123, "road":
"Main Street", "city": "New York", "state": "New York"}. The parser works
for a wide variety of countries and languages, not just US/English.
The model is trained on over 1 billion addresses and address-like strings, using the
templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
tagged training examples for every inhabited country in the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/addresses/components.py)
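To make the output shape concrete, here is a toy rule-based stand-in (not the CRF; the function and its heuristics are invented, and only handle simple "number road-with-suffix city state" input) that produces the component labels shown above:

```python
# Toy stand-in for libpostal's parser: label tokens of a simple US-style
# address string. The real parser is a Conditional Random Field trained
# on over 1 billion addresses; this sketch exists only to show the
# {component: value} output shape.
STREET_SUFFIXES = {"street", "st", "avenue", "ave", "road", "rd"}

def parse_address_sketch(address):
    tokens = address.split()
    result = {}
    i = 0
    if tokens and tokens[0].isdigit():
        result["house_number"] = tokens[0]
        i = 1
    # the road runs through the first street-suffix token
    for j in range(i, len(tokens)):
        if tokens[j].lower() in STREET_SUFFIXES:
            result["road"] = " ".join(tokens[i:j + 1])
            rest = tokens[j + 1:]
            break
    else:
        rest = tokens[i:]
    # naive split of the remainder into city and state halves
    if rest:
        half = len(rest) // 2
        result["city"] = " ".join(rest[:half])
        result["state"] = " ".join(rest[half:])
    return result
```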
@@ -522,13 +523,13 @@ trained (using the [FTRL-Proximal](https://research.google.com/pubs/archive/4115
addresses. Labels are derived using point-in-polygon tests for both OSM countries
and official/regional languages for countries and admin 1 boundaries
respectively. So, for example, Spanish is the default language in Spain but
in different regions e.g. Catalunya, Galicia, the Basque region, the respective
regional languages are the default. Dictionary-based disambiguation is employed in
cases where the regional language is non-default e.g. Welsh, Breton, Occitan.
The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/"
(performed on both the language classifier and the address parser training sets)
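As a toy illustration of dictionary-based disambiguation (the word lists and function below are invented for this sketch, not libpostal's actual dictionaries or classifier), marker phrases can pick a regional language out of an otherwise ambiguous address:

```python
# Toy dictionary-based language disambiguation: score each candidate
# language by how many of its marker words appear in the address.
# libpostal uses real per-language dictionaries plus a trained classifier;
# these tiny word lists are illustrative only.
MARKERS = {
    "es": {"calle", "avenida", "plaza"},
    "ca": {"carrer", "avinguda", "rambla"},
    "cy": {"heol", "stryd", "ffordd"},
}

def guess_language(address, default="es"):
    tokens = set(address.lower().split())
    scores = {lang: len(tokens & words) for lang, words in MARKERS.items()}
    best = max(scores, key=scores.get)
    # fall back to the region's default language when no marker matches
    return best if scores[best] > 0 else default
```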
- **Numeric expression parsing** ("twenty first" => 21st,
"quatre-vingt-douze" => 92, again using data provided in CLDR), supports > 30
languages. Handles languages with concatenated expressions e.g.
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
@@ -543,9 +544,9 @@ strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.
- **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same source data as used by [ICU](http://site.icu-project.org/),
though libpostal doesn't require pulling in all of ICU (might conflict
with your system's version). Note: some languages, particularly Hebrew, Arabic
and Thai may not include vowels and thus will not often match a transliteration
done by a human. It may be possible to implement statistical transliterators
for some of these languages.
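The accent-stripping case (à => a) mentioned above can be sketched with Unicode normalization alone; note this is narrower than a full Latin-ASCII transliteration (it leaves one-to-many cases like ß => ss untouched), and the helper name is ours:

```python
import unicodedata

def strip_accents(text):
    # Decompose characters (NFD), then drop the combining accent marks.
    # This covers a-grave => a, e-acute => e, etc., but NOT one-to-many
    # mappings like German eszett => ss, which need a full Latin-ASCII
    # transform as provided by CLDR/ICU.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Champs-Élysées"))  # Champs-Elysees
```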
@@ -570,7 +571,7 @@ places derived from terabytes of web pages from the [Common Crawl](http://common
The Common Crawl is published monthly, and so even merging the results of
two crawls produces significant duplicates.

Deduping is a relatively well-studied field, and for text documents
like web pages, academic papers, etc. there exist pretty decent approximate
similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).
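A minimal MinHash sketch (helper names are ours; a real system would tune the shingle size and add banding/LSH to avoid pairwise comparisons):

```python
import hashlib

def shingles(text, k=3):
    """Character k-grams of a string, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _hash(seed, item):
    # deterministic seeded hash so signatures are reproducible across runs
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def minhash_signature(items, num_hashes=64):
    # One minimum per seeded hash function. For any two sets, the chance
    # that a slot agrees equals their Jaccard similarity, so the fraction
    # of agreeing slots estimates it.
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# near-duplicate addresses agree on far more slots than unrelated ones
a = minhash_signature(shingles("123 main street new york"))
b = minhash_signature(shingles("123 main st new york"))
c = minhash_signature(shingles("456 elm avenue chicago"))
```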
@@ -603,9 +604,9 @@ So it's not a geocoder?
-----------------------
If the above sounds a lot like geocoding, that's because it is, in a way;
only in the OpenVenues case, we have to geocode without a UI or a user
to select the correct address in an autocomplete dropdown. Given a database
of source addresses such as OpenAddresses or OpenStreetMap (or all of the above),
libpostal can be used to implement things like address deduping and server-side
batch geocoding in settings like MapReduce or stream processing.
@@ -614,7 +615,7 @@ document search engines like Elasticsearch using giant synonyms files, scripting
custom analyzers, tokenizers, and the like, geocoding can look like this:

1. Run the addresses in your database through libpostal's expand_address
2. Store the normalized string(s) in your favorite search engine, DB,
hashtable, etc.
3. Run your user queries or fresh imports through libpostal and search
the existing database using those strings
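The three steps above can be sketched end-to-end with an in-memory dict standing in for the search engine (the `normalize` function below is a crude invented stand-in for expand_address, which would return several normalized strings, each of which you would index):

```python
# End-to-end sketch of the indexing/query flow above. `normalize` is a
# toy substitute for libpostal's expand_address.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "w": "west"}

def normalize(address):
    tokens = address.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(t.rstrip("."), t.rstrip(".")) for t in tokens)

# Steps 1-2: normalize each source address, store it under its normalized key
database = ["30 W 26th St, New York", "781 Franklin Ave, Brooklyn"]
index = {}
for address in database:
    index.setdefault(normalize(address), []).append(address)

# Step 3: normalize incoming queries the same way and look them up
def lookup(query):
    return index.get(normalize(query), [])
```

Because both sides pass through the same normalization, "30 West 26th Street, New York" and "30 W 26th St, New York" land on the same key, with no synonyms files or custom analyzers involved.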