Language dictionaries

It's easy to add new languages/synonyms to libpostal by modifying a few text files. The format of each dictionary file roughly resembles a Lucene/Elasticsearch synonyms file:

drive|dr
street|st|str
road|rd

The leftmost string is treated as the canonical/normalized version. Synonyms, if any, are appended to the right, delimited by the pipe character.
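
To make the format concrete, here is a minimal Python sketch that reads lines in this format and maps every phrase back to its canonical form. It is an illustration only, not libpostal's own loader: the function names `parse_dictionary_line` and `build_lookup` are hypothetical, and keeping a list of canonical forms per phrase is an assumption made so that one abbreviation can appear on more than one line.

```python
def parse_dictionary_line(line):
    """Split one dictionary line into (canonical, [synonyms]).

    Returns None for a blank line.
    """
    phrases = [p.strip() for p in line.strip().split("|")]
    phrases = [p for p in phrases if p]
    if not phrases:
        return None
    # The leftmost phrase is the canonical form, the rest are synonyms.
    return phrases[0], phrases[1:]


def build_lookup(lines):
    """Map each phrase (canonical or synonym) to a list of canonical forms."""
    lookup = {}
    for line in lines:
        parsed = parse_dictionary_line(line)
        if parsed is None:
            continue
        canonical, synonyms = parsed
        for phrase in [canonical] + synonyms:
            forms = lookup.setdefault(phrase, [])
            if canonical not in forms:
                forms.append(canonical)
    return lookup


# The three example lines shown above.
SAMPLE_LINES = ["drive|dr", "street|st|str", "road|rd"]
LOOKUP = build_lookup(SAMPLE_LINES)
```

With the sample lines, `LOOKUP["str"]` is `["street"]` and the canonical forms map to themselves.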

The supported languages can be found in the resources/dictionaries directory, which has one subdirectory per language code.

Each language can define one or more dictionaries (sometimes called "gazetteers" in NLP) to help with address parsing and abbreviation normalization. The dictionary types are:

  • academic_degrees.txt: for post-nominal strings like "M.D.", "Ph.D.", etc.
  • ambiguous_expansions.txt: abbreviations whose expansion is uncertain, e.g. "E" could be expanded to "East" but could also be "E Street", so when the string is encountered it can either be left alone or expanded. In general, single-letter abbreviations in most languages should also be added to ambiguous_expansions.txt, since single letters are often initials.
  • building_types.txt: strings indicating a building/house
  • categories.txt: category strings e.g. from Nominatim Special Phrases expected to be used in searches like "restaurants in Manhattan". Singular and plural forms can be included here.
  • company_types.txt: company suffixes like "Inc" or "GmbH"
  • concatenated_prefixes_separable.txt: things like "Hinter..." which can be written either concatenated or as separate tokens
  • concatenated_suffixes_inseparable.txt: things like "...bg." => "...burg" where the suffix cannot be separated from the main token, but either has an abbreviated equivalent or simply can help identify the token in parsing as, say, part of a street name
  • concatenated_suffixes_separable.txt: things like "...straße" where the suffix can be either concatenated to the main token or separated
  • directionals.txt: strings indicating directions (cardinal and lower/central/upper, etc.)
  • entrance.txt: strings indicating an entrance, usually just the word "entrance" and its appropriate abbreviations.
  • house_number.txt: strings that may be added as part of the house/building number (for languages like Spanish where it's common to say "No. 123" or "No. Ext. 123" for the house/building number instead of just "123" as in English).
  • level_types_basement.txt: strings indicating a basement level.
  • level_types_mezzanine.txt: strings indicating a mezzanine level.
  • level_types_numbered.txt: strings indicating a numbered level of a building, i.e. ones that are accompanied by a number.
  • level_types_standalone.txt: strings indicating a level/floor of a building that can stand on their own without a number like "ground floor", etc.
  • level_types_sub_basement.txt: strings indicating a sub-basement level.
  • no_number.txt: strings like "sin número" used for houses with no number.
  • nulls.txt: strings meaning "not applicable" e.g. in spreadsheets or database fields that might have missing values
  • personal_suffixes.txt: post-nominal suffixes, usually generational e.g. Junior/Senior in English or der Jüngere in German.
  • personal_titles.txt: civilian, royal, clerical, and military titles e.g. "Saint", "General", etc.
  • place_names.txt: strings found in names of places e.g. "theatre", "aquarium", "restaurant". Nominatim Special Phrases is a great resource for this.
  • post_office.txt: strings like "p.o. box"
  • qualifiers.txt: strings like "township"
  • staircase.txt: strings indicating a staircase, usually just the word "staircase" or "stair".
  • stopwords.txt: prepositions and articles mostly, very common words which may be ignored in some contexts
  • street_types.txt: words like "street", "road", "drive" which indicate a thoroughfare and their respective abbreviations.
  • synonyms.txt: any miscellaneous synonyms/abbreviations e.g. "bros" expands to "brothers", etc. These have no special meaning and will essentially just be treated as string replacement.
  • toponyms.txt: abbreviations relating to toponyms like regions, places, etc. Note: GeoNames covers most of these, so in most cases it is better to leave these alone.
  • unit_directions.txt: phrases to indicate which side of the building the apartment/unit is on, usually along the lines of "left", "right", "front", "rear".
  • unit_types_numbered.txt: strings indicating an apartment or unit where we expect a number to follow (or in some languages, precede), e.g. "flat", "apt", "unit", etc.
  • unit_types_standalone.txt: unit types that can stand on their own without an accompanying number e.g. "penthouse".
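
As an illustration of how an ambiguous_expansions.txt entry differs from an ordinary one, the following Python sketch returns the candidate forms for a single token: an ambiguous abbreviation yields both the original token and its expansion, while an unambiguous one yields only the expansion. This is a simplified sketch, not libpostal's actual expansion logic. The function `candidate_forms`, the in-memory `EXPANSIONS` and `AMBIGUOUS` structures, and the lowercase matching are all assumptions made for the example.

```python
def candidate_forms(token, expansions, ambiguous):
    """Return the possible normalized forms of a single token.

    expansions: dict mapping an abbreviation to its canonical form
    ambiguous:  set of abbreviations that may also be left alone
    """
    key = token.lower()  # assumption: matching is case-insensitive
    canonical = expansions.get(key)
    if canonical is None:
        # Not in any dictionary: keep the token unchanged.
        return [token]
    if key in ambiguous:
        # Could be an abbreviation or a literal (e.g. an initial): keep both.
        return [token, canonical]
    return [canonical]


# Hypothetical in-memory dictionaries for the example.
EXPANSIONS = {"e": "east", "st": "street"}
AMBIGUOUS = {"e"}
```

Here `candidate_forms("E", EXPANSIONS, AMBIGUOUS)` keeps both "E" and "east", which covers both "East" and "E Street" readings, whereas "st" is simply expanded to "street".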

Most of the dictionaries have been derived using the following process:

  1. Tokenize every street/venue name in OSM for language x using libpostal
  2. Count the most common tokens
  3. Use the Apriori algorithm to extract multi-word phrases
  4. Run the most frequent words/phrases through Google Translate
  5. Add the ones that mean "street" (or other relevant words) to dictionaries
  6. Augment by researching addresses in countries speaking language x
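
Steps 1 to 3 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the script that was actually used: a lowercase whitespace split stands in for libpostal's tokenizer, the phrase extraction is an Apriori-style pruning over contiguous n-grams (a phrase of length n is only counted if both of its length n-1 sub-phrases were frequent), and `frequent_phrases`, `min_count`, and `max_len` are hypothetical names. The sample names are made up for the example rather than taken from OSM.

```python
from collections import Counter


def tokenize(name):
    # Assumption: a whitespace split stands in for libpostal's tokenizer.
    return name.lower().split()


def frequent_phrases(names, min_count=2, max_len=3):
    """Return {phrase_tuple: count} for contiguous phrases of up to
    max_len tokens that occur at least min_count times."""
    tokenized = [tokenize(name) for name in names]
    frequent = {}
    previous = None  # frequent phrases of length n - 1
    for n in range(1, max_len + 1):
        counts = Counter()
        for tokens in tokenized:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Apriori-style pruning: a phrase can only be frequent if
                # both of its (n - 1)-token sub-phrases are frequent.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= min_count}
        if not current:
            break
        frequent.update(current)
        previous = current
    return frequent


# Made-up sample names for illustration.
NAMES = [
    "Main Street",
    "High Street",
    "Martin Luther King Boulevard",
    "Martin Luther King Drive",
    "Park Avenue",
]
PHRASES = frequent_phrases(NAMES)
```

On this sample, "street" and the phrase "martin luther king" both survive the threshold, while "main street" is pruned because "main" occurs only once. The surviving words and phrases are what would then be reviewed and translated in steps 4 and 5.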

Contributing

If you're a native speaker of one or more languages in libpostal, we'd love your contribution! It's as simple as editing the text files under this directory and submitting a pull request. Dictionaries are organized by language code, so feel free to find any language you speak and start editing! If you don't see your language, just add a directory - there's no additional configuration needed.

To get started adding new language dictionaries or improving support for existing languages, check out the address_languages repo, where we've published lists of 1-5 word phrases found in street/venue names in every language in OSM.

In the future it might be beneficial to move these dictionaries to a wiki so they can be crowdsourced by native speakers regardless of whether or not they use git.