<pre>
___ __ __ ___
/\_ \ __/\ \ /\ \__ /\_ \
\//\ \ /\_\ \ \____ _____ ___ ____\ \ ,_\ __ \//\ \
\ \ \ \/\ \ \ '__`\/\ '__`\ / __`\ /',__\\ \ \/ /'__`\ \ \ \
\_\ \_\ \ \ \ \L\ \ \ \L\ \/\ \L\ \/\__, `\\ \ \_/\ \L\.\_ \_\ \_
/\____\\ \_\ \_,__/\ \ ,__/\ \____/\/\____/ \ \__\ \__/.\_\/\____\
\/____/ \/_/\/___/ \ \ \/ \/___/ \/___/ \/__/\/__/\/_/\/____/
\ \_\
\/_/
---------------------------------------------------------------------
</pre>

**N.B.**: libpostal is not publicly released yet and the APIs may change. We
encourage folks to hold off on including it as a dependency for now.
Stay tuned...

libpostal is a fast, multilingual, all-i18n-everything NLP library for
normalizing and parsing physical addresses.

Addresses and the geographic coordinates they represent are essential for any
location-based application (map search, transportation, on-demand/delivery
services, check-ins, reviews). Yet even the simplest addresses are packed with
local conventions, abbreviations and context, making them difficult to
index/query effectively with traditional full-text search engines, which are
designed for document indexing. This library helps convert the free-form
addresses that humans use into clean normalized forms suitable for machine
comparison and full-text indexing.

libpostal is not itself a full geocoder, but should be a ubiquitous
preprocessing step before indexing/searching with free text geographic strings.
It is written in C for maximum portability and performance.

Raison d'être
-------------

libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
the problem of venue deduping. In OpenVenues, we have a data set of millions of
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
The Common Crawl is published monthly, and so even merging the results of
two crawls produces significant duplicates.

Deduping is a relatively well-studied field, and for text documents like web
pages, academic papers, etc. there exist pretty decent approximate
similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).

However, for physical addresses, the frequent use of conventional abbreviations
such as Road == Rd, California == CA, or New York City == NYC complicates
matters a bit. Even using a technique like MinHash, which is well suited for
approximate matches and estimates the Jaccard similarity of two sets, we
have to work with very short texts, and it's often the case that two equivalent
addresses, one abbreviated and one fully specified, will not match very closely
in terms of n-gram set overlap. Across scripts, say a Russian address and
its transliterated equivalent, it's conceivable that two addresses referring to
the same place may not share even a single character.

As a motivating example, consider the following two equivalent ways to write a
particular Manhattan street address with varying conventions and degrees
of verbosity:

- 30 W 26th St Fl #7
- 30 West Twenty-sixth Street Floor Number 7

Obviously '30 W 26th St Fl #7' != '30 West Twenty-sixth Street Floor Number 7'
in a string comparison sense, but a human can grok that these two addresses
refer to the same physical location.
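
To make the n-gram overlap problem concrete, here is a minimal Python sketch that computes the exact Jaccard similarity of character trigram sets for the two Manhattan addresses. It is only an illustration: it computes Jaccard directly rather than a MinHash estimate, and the helper names `char_ngrams` and `jaccard` are made up for this example, not part of libpostal.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


short = "30 W 26th St Fl #7"
verbose = "30 West Twenty-sixth Street Floor Number 7"

# Only a handful of trigrams ("30 ", "st ", " fl", ...) appear in both strings.
similarity = jaccard(char_ngrams(short), char_ngrams(verbose))
print(f"trigram Jaccard similarity: {similarity:.2f}")
```

The two strings share only 8 of 48 distinct trigrams, a similarity of about 0.17, even though they describe the same place. Expanding the abbreviations first is what lets a set-overlap measure see them as near-duplicates.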

libpostal aims to create normalized geographic strings, parsed into components,
such that we can more effectively reason about how well two addresses
actually match and make automated server-side decisions about dupes.

Isn't that geocoding?
---------------------

If the above sounds a lot like geocoding, that's because it is in a way,
only in the OpenVenues case, we do it without a UI or a user to select the
correct address in an autocomplete. libpostal does server-side batch geocoding
(and you can too!).

Now, instead of fiddling with giant Elasticsearch synonyms files, scripting,
analyzers, tokenizers, and the like, geocoding can look like this:

1. Run the addresses in your index through libpostal
2. Store the canonical strings
3. Run your user queries through libpostal and search with those strings
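
The three steps above can be sketched in a few lines of Python. This is a toy under stated assumptions: `toy_normalize` and its four-entry expansion table are hypothetical stand-ins for libpostal, and a plain dict stands in for a real search index.

```python
from collections import defaultdict

# Hypothetical stand-in for libpostal: a tiny English-only expansion table.
TOY_EXPANSIONS = {"w": "west", "st": "street", "rd": "road", "fl": "floor"}


def toy_normalize(address):
    """Lowercase, split on whitespace and expand known abbreviations."""
    tokens = address.lower().replace(",", " ").split()
    return " ".join(TOY_EXPANSIONS.get(tok, tok) for tok in tokens)


def build_index(records):
    """Steps 1 and 2: normalize each address and store the canonical string."""
    index = defaultdict(list)
    for record_id, address in records:
        index[toy_normalize(address)].append(record_id)
    return index


def search(index, query):
    """Step 3: normalize the user query the same way, then look it up."""
    return index.get(toy_normalize(query), [])


records = [(1, "30 W 26th St"), (2, "12 Main Rd")]
index = build_index(records)
print(search(index, "30 west 26th street"))  # [1]
```

The point of the design is that the index and the query go through the same normalizer, so no synonym handling is needed at search time.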

Features
--------

- **Abbreviation expansion**: e.g. expanding "rd" => "road" but for almost any
language. libpostal supports > 50 languages and it's easy to add new languages
or expand the current dictionaries. Ideographic languages (not separated by
whitespace e.g. Chinese) are supported, as are Germanic languages where
thoroughfare types are concatenated onto the end of the string, and may
optionally be separated so Rosenstraße and Rosen Straße are equivalent.

- **International address parsing (coming soon)**: sequence model which parses
"123 Main Street New York New York" into {"house_number": 123, "road":
"Main Street", "city": "New York", "region": "New York"}. Unlike the majority
of parsers out there, it works for a wide variety of countries and languages,
not just US/English. The model is trained on > 40M OSM addresses, using the
templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
tagged training examples for most countries around the world.

- **Language classification (coming soon)**: multinomial logistic regression
trained on all of OpenStreetMap ways, addr:* tags, toponyms and formatted
addresses. Labels are derived using point-in-polygon tests in Quattroshapes
and official/regional languages for countries and admin 1 boundaries
respectively. So, for example, Spanish is the default language in Spain but
in different regions e.g. Catalunya, Galicia, the Basque region, regional
languages are the default. Dictionary-based disambiguation is employed in
cases where the regional language is non-default e.g. Welsh, Breton, Occitan.

- **Numeric expression parsing** ("twenty first" => 21st,
"quatre-vingt-douze" => 92, using data provided in CLDR), supports > 30
languages. Handles languages with concatenated expressions e.g.
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
language (IX => 9) which occur in the names of many monarchs, popes, etc.

- **Geographic name aliasing**: New York, NYC and Nueva York alias to New York
City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
names added by contributors can automatically improve libpostal.

- **Geographic disambiguation (coming soon)**: There are several equally
likely Springfields in the US (formally known as The Simpsons problem), and
some context like a state is required to disambiguate. There are also > 1200
distinct San Franciscos in the world but the term "San Francisco" almost always
refers to the one in California. Williamsburg can refer to a neighborhood in
Brooklyn or a city in Virginia. Geo disambiguation is a subset of Word Sense
Disambiguation, and attempts to resolve place names in a string to GeoNames
entities. This can be useful for city-level geocoding suitable for polygon/area
lookup. By default, if there is no other context, as in the San Francisco case,
the most populous entity will be selected.

- **Ambiguous token classification (coming soon)**: e.g. "dr" => "doctor" or
"drive" for an English address depending on the context. Multiclass logistic
regression trained on OSM addresses, where abbreviations are discouraged,
giving us many examples of fully qualified addresses on which to train.

- **Fast, accurate tokenization/lexing**: clocked at > 1M tokens / sec,
implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
languages character by character instead of on whitespace.

- **UTF8 normalization**: optionally decomposes UTF8 to NFD normalization form,
strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.

- **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same as used by ICU,
though libpostal doesn't require pulling in all of ICU (might conflict
with your system's version). Note: some languages, particularly Hebrew, Arabic
and Thai may not include vowels and thus will not often match a transliteration
done by a human. It may be possible to implement statistical transliterators
for some of these languages.

- **Script detection**: Detects which script a given string uses (can be
multiple e.g. a free-form Hong Kong or Macau address may use both Han and
Latin scripts in the same address). In transliteration we can use all
applicable transliterators for a given Unicode script (Greek can for instance
be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
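
As a rough illustration of the UTF8 normalization and script detection features, the sketch below uses Python's standard `unicodedata` module. It is not how libpostal does it internally. In particular, `detect_scripts` is a heuristic that reads the first word of each character's Unicode name, so Han characters are reported as `CJK` rather than through the real Unicode Script property.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def detect_scripts(text):
    """Guess scripts from the first word of each letter's Unicode name.

    Heuristic only: relies on names such as 'LATIN SMALL LETTER A' or
    'CJK UNIFIED IDEOGRAPH-7687', not on the Unicode Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


print(strip_accents("à"))                  # a
print(detect_scripts("улица"))             # {'CYRILLIC'}
print(sorted(detect_scripts("皇后大道 Queen's Road")))  # ['CJK', 'LATIN']
```

Note that NFD stripping also affects non-Latin text: "й" decomposes into "и" plus a combining breve, so "каретный" becomes "каретныи".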

Non-goals
---------

- Verifying that a location is a valid address
- Street-level geocoding

Examples of expansion
---------------------

Like many problems in information extraction and NLP, address normalization
may sound trivial initially, but in fact can be quite complicated in real
natural language texts. Here are some examples of the kinds of address-specific
challenges libpostal can handle:

| Input                               | Output                                |
| ----------------------------------- |---------------------------------------|
| One-hundred twenty E 96th St        | 120 east 96th street                  |
| C/ Ocho, P.I. 4                     | calle 8, polígono industrial 4        |
| V XX Settembre, 20                  | via 20 settembre, 20                  |
| Quatre vingt douze Rue de l'Église  | 92 rue de l' église                   |
| ул Каретный Ряд, д 4, строение 7    | улица каретныи ряд, дом 4, строение 7 |
| ул Каретный Ряд, д 4, строение 7    | ulica karetnyj rad, dom 4, stroenie 7 |
| Marktstrasse 14                     | markt straße 14                       |
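
The Roman numeral in the "V XX Settembre, 20" row (XX => 20) hints at one of the smaller building blocks. Below is a minimal Python sketch of strict Roman numeral parsing. The function name `roman_to_int` and the validation regex are assumptions for this example, not libpostal's implementation.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Accepts only well-formed numerals in the range 1..3999.
ROMAN_PATTERN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def roman_to_int(token):
    """Return the integer value of a Roman numeral, or None if it is not one."""
    token = token.upper()
    if not token or not ROMAN_PATTERN.match(token):
        return None
    total = 0
    for i, ch in enumerate(token):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtractive (IX = 9).
        if i + 1 < len(token) and value < ROMAN_VALUES[token[i + 1]]:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("XX"))         # 20
print(roman_to_int("IX"))         # 9
print(roman_to_int("Settembre"))  # None
```

The same row shows why this cannot be applied blindly: the leading "V" is itself a valid numeral (5), yet in that Italian address it is an abbreviation that expands to "via". Deciding between the two needs language and context, which a standalone parser like this does not have.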

For further reading and some less intuitive examples of addresses, see
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".

Why C (i.e. are you crazy)?
---------------------------

libpostal is written in C for three reasons (in order of importance):

1. **Portability/ubiquity**: libpostal targets higher-level languages that
people actually use day-to-day: Python, Go, Ruby, NodeJS, etc. The beauty of C
is that just about any programming language can bind to it and C compilers are
everywhere, so pick your favorite, write a binding, and you can use libpostal
directly in your application without having to stand up a separate server. We
support Mac/Linux (Windows is not a priority but happy to accept patches), have
a standard autotools build and an endianness-agnostic file format for the data
files. The Python bindings are maintained as part of this repo since they're
needed to construct the training data.

2. **Memory-efficiency**: libpostal is designed to run in a MapReduce setting
where we may be limited to < 1GB of RAM per process depending on the machine
configuration. As much as possible libpostal uses contiguous arrays, tries
(built on contiguous arrays), bloom filters and compressed sparse matrices to
keep memory usage low. It's conceivable that libpostal could even be used on
a mobile device, although that's not an explicit goal of the project.

3. **Performance**: this is last on the list for a reason. Most of the
optimizations in libpostal are for memory usage rather than performance.
libpostal is quite fast given the amount of work it does. It can process
10-30k addresses / second in a single thread/process on the platforms we've
tested (that means processing every address in OSM planet in a little over
an hour). Check out the simple benchmark program to test on your environment
and various types of input. In the MapReduce setting, per-core performance
isn't as important because everything's being done in parallel, but there are
some streaming ingestion applications at Mapzen where this needs to
run in-process.
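
To give a feel for why the structures named under memory-efficiency are compact, here is a minimal Bloom filter sketch in Python. It only illustrates the general technique. The class, its parameters and the use of salted SHA-256 are assumptions for this example and say nothing about libpostal's actual C implementation.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash functions; no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one contiguous buffer

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add("street")
print("street" in bf)  # True
print(len(bf.bits))    # 128 bytes, no matter how many items are added
```

Membership tests can return false positives but never false negatives, and the memory cost is fixed up front, which is the trade-off that makes the structure attractive under a tight per-process RAM budget.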

Design philosophy
-----------------

libpostal is written in modern, legible C99.

- Keep it roughly object-oriented, as allowed by C
- Confine almost all mallocs to *name*_new and all frees to *name*_destroy
- Don't write custom hashtables, sorting algorithms, other undergrad CS stuff
- Use generic containers from klib where possible
- Take advantage of sparsity in all data structures
- Use char_array (inspired by [sds](https://github.com/antirez/sds)) when possible instead of C strings.
- Thoroughly test for memory leaks before pushing
- Keep it reasonably cross-platform compatible, particularly for *nix

Language dictionaries
----------------------

It's easy to add new languages/synonyms to libpostal by modifying a few text
files. The format of each dictionary file roughly resembles a
Lucene/Elasticsearch synonyms file:

```
drive|dr
street|st|str
road|rd
```

The leftmost string is treated as the canonical/normalized version. Synonyms,
if any, are appended to the right, delimited by the pipe character.
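
The dictionary format described above is simple enough to parse in a few lines. The following Python sketch reads pipe-delimited lines into a lookup table and applies it token by token. The function names `parse_dictionary` and `expand` are hypothetical, and real expansion also has to deal with multi-word phrases and ambiguity, which this sketch ignores.

```python
def parse_dictionary(lines):
    """Map every phrase on a line (canonical or synonym) to its canonical form."""
    expansions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The leftmost field is canonical; any further fields are synonyms.
        canonical, *synonyms = line.split("|")
        for phrase in (canonical, *synonyms):
            expansions[phrase] = canonical
    return expansions


def expand(text, expansions):
    """Lowercase, split on whitespace and replace known single-token synonyms."""
    tokens = text.lower().split()
    return " ".join(expansions.get(tok, tok) for tok in tokens)


street_types = parse_dictionary(["drive|dr", "street|st|str", "road|rd"])
print(expand("12 Main St", street_types))  # 12 main street
```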

The supported languages can be found in the [resources/dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries) directory.

Each language can define one or more dictionaries (sometimes called "gazetteers" in NLP) to help with address parsing and normalizing abbreviations. The dictionary types are:

- **academic_degrees.txt**: for post-nominal strings like "M.D.", "Ph.D.", etc.
- **ambiguous_expansions.txt**: e.g. "E" could be expanded to "East" or could
be "E Street", so if the string is encountered, it can either be left alone or expanded
- **building_types.txt**: strings indicating a building/house
- **company_types.txt**: company suffixes like "Inc" or "GmbH"
- **concatenated_prefixes_separable.txt**: things like "Hinter..." which can
be written either concatenated or as separate tokens
- **concatenated_suffixes_inseparable.txt**: Things like "...bg." => "...burg"
where the suffix cannot be separated from the main token, but either has an
abbreviated equivalent or simply can help identify the token in parsing as,
say, part of a street name
- **concatenated_suffixes_separable.txt**: Things like "...straße" where the
suffix can be either concatenated to the main token or separated
- **directionals.txt**: strings indicating directions (cardinal and
lower/central/upper, etc.)
- **level_types.txt**: strings indicating a particular floor
- **no_number.txt**: strings like "no fixed address"
- **nulls.txt**: strings meaning "not applicable"
- **personal_suffixes.txt**: post-nominal suffixes, usually generational
like Jr/Sr
- **personal_titles.txt**: civilian, royal and military titles
- **place_names.txt**: strings found in names of places e.g. "theatre",
"aquarium", "restaurant". See [Nominatim Special Phrases](http://wiki.openstreetmap.org/wiki/Nominatim/Special_Phrases)
- **post_office.txt**: strings like "p.o. box"
- **qualifiers.txt**: strings like "township"
- **stopwords.txt**: prepositions and articles mostly, very common words
which may be ignored in some contexts
- **street_types.txt**: words like "street", "road", "drive" which indicate
a thoroughfare and their respective abbreviations.
- **synonyms.txt**: any miscellaneous synonyms/abbreviations e.g. "bros"
expands to "brothers", etc. These have no special meaning and will essentially
just be treated as string replacement.
- **toponyms.txt**: abbreviations relating to
toponyms like regions, places, etc. Note: GeoNames covers most of these.
In most cases it is better to leave these alone.
- **unit_types.txt**: strings indicating an apartment or unit number
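
As an example of what the separable-suffix dictionaries enable, the Python sketch below splits a concatenated German street suffix so that "Rosenstraße" and "Rosen Straße" normalize to the same string. The two-entry suffix table and the function names are assumptions made for this illustration. They are not the contents of libpostal's dictionary files or its algorithm.

```python
# Assumed sample entries: suffix variant -> canonical suffix.
SEPARABLE_SUFFIXES = {"straße": "straße", "strasse": "straße"}


def split_separable_suffix(token, suffixes=SEPARABLE_SUFFIXES):
    """Split 'rosenstraße' into 'rosen straße'; leave other tokens alone."""
    lowered = token.lower()
    for variant, canonical in suffixes.items():
        # Require a non-empty stem so a bare 'straße' is not split.
        if lowered.endswith(variant) and len(lowered) > len(variant):
            return lowered[: -len(variant)] + " " + canonical
    return lowered


def normalize(address):
    """Apply the suffix split to every whitespace-separated token."""
    return " ".join(split_separable_suffix(tok) for tok in address.split())


print(normalize("Rosenstraße 14"))   # rosen straße 14
print(normalize("Rosen Straße 14"))  # rosen straße 14
print(normalize("Marktstrasse 14"))  # markt straße 14
```

An inseparable suffix such as "...bg." => "...burg" would instead be rewritten in place without inserting a space.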

Most of the dictionaries have been derived with the following process:

1. Tokenize all the streets in OSM for language x
2. Count the most common N tokens
3. Optionally use frequent item set mining to get frequent phrases
4. Run the most frequent words/phrases through Google Translate
5. Add the ones that mean "street" to dictionaries
6. Augment by researching addresses in countries speaking language x
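
Steps 1 and 2 of the process above amount to a token frequency count. A minimal Python sketch follows. The street names are made-up sample data rather than an OSM extract, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def most_common_tokens(street_names, n=10):
    """Steps 1 and 2: whitespace-tokenize each name and count the tokens."""
    counts = Counter()
    for name in street_names:
        counts.update(name.lower().split())
    return counts.most_common(n)


# Hypothetical sample; the real input would be every street name for a language.
sample = ["Calle Mayor", "Calle de Alcalá", "Avenida de América", "Calle Ocho"]
print(most_common_tokens(sample, n=2))  # [('calle', 3), ('de', 2)]
```

The output shows why the later steps are needed: "calle" is a street type, but "de" is a stopword that only a translation or manual review step can tell apart.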

In the future it might be beneficial to move the dictionaries to a wiki
so they can be crowdsourced by native speakers regardless of whether or not
they use git.

Installation
------------

For C users or those writing bindings (if you've written a language
binding, please let us know!):

```
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make
sudo make install
```

libpostal needs to download some data files from S3. This is done automatically
when you run make. Mapzen maintains an S3 bucket containing said data files
but they can also be built manually.

To install via Python, just use:

```
pip install git+https://github.com/openvenues/libpostal.git
```

**Note**: The Python bindings don't implement libpostal's full API currently.

Command-line usage
------------------

After building libpostal:

```
cd src/

./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
#12 345th avenue, suite number 678
```

Currently libpostal requires two input strings, the address text and a language
code (ISO 639-1).

Todos
-----

1. Finish debugging/fully train address parser and publish model
2. Port language classification from Python, train and publish model
3. Python bindings and documentation
4. Publish tests (currently not on Github) and set up continuous integration
5. Hosted documentation