Modifying README and config parameter, based on code review.

Oskar Thorbjornsson
2023-02-14 21:02:51 -08:00
parent 0c0818c683
commit 00568da290
3 changed files with 15 additions and 12 deletions


@@ -178,13 +178,20 @@ lib.exe /def:libpostal.def /out:libpostal.lib /machine:x64
Installation with an alternative data model
-------------------------------------------
An alternative data model is available for libposta. It is created by Senzing Inc. for improved parsing on US, UK and Singapore addresses and improved US rural route address handling.
To enable this add `--enable-senzing-datamodel` to the configure line during installation:
An alternative data model is available for libpostal. It is created by Senzing Inc. for improved parsing on US, UK and Singapore addresses and improved US rural route address handling.
To enable this add `MODEL=senzing` to the configure line during installation:
```
./configure --datadir=[...some dir with a few GB of space...] --enable-senzing-datamodel
./configure --datadir=[...some dir with a few GB of space...] MODEL=senzing
```
Further information about this data model can be found at: https://github.com/Senzing/libpostal-data
The data for this model comes from [OpenAddresses](https://openaddresses.io/), [OpenStreetMap](https://www.openstreetmap.org/) and data generated by Senzing based on customer feedback (a few hundred records), for a total of about 1.2 billion records from over 230 countries, in 100+ languages. The data from OpenStreetMap and OpenAddresses is good but not perfect, so the data set was modified by filtering out badly formed addresses, correcting misclassified address tokens and removing tokens that didn't belong in the addresses, whenever these conditions were encountered.
Senzing created a data set of 12950 addresses from 89 countries that it uses to test and verify the quality of its models. The data set was generated using random addresses from OSM, at least 50 per country. Hard-to-parse addresses were collected from the Senzing support team and customers and from the libpostal GitHub page and added to this set. On this test set, the Senzing model produced 4.3% better parsing results than the default model.
The size of this model is about 2.2GB compared to 1.8GB for the default model, so keep that in mind if storage space is important.
Further information about this data model can be found at: https://github.com/Senzing/libpostal-data
If you run into any issues with this model, whether they have to do with parses, installation or any other problems, then please report them at https://github.com/Senzing/libpostal-data
Examples of parsing
-------------------


@@ -145,13 +145,9 @@ AC_ARG_ENABLE([data-download],
*) AC_MSG_ERROR([bad value ${enableval} for --disable-data-download]) ;;
esac], [DOWNLOAD_DATA=true])
AC_ARG_ENABLE([senzing-datamodel],
AS_HELP_STRING([[[--enable-senzing-datamodel]]],
[Use Senzing data model in lieu of the default one]),
[
DATAMODEL="senzing"
AC_SUBST([LIBPOSTAL_DATA_MODEL], [$DATAMODEL])
])
AC_ARG_VAR(MODEL, [Option to use alternative data models. Currently available is "senzing" (MODEL=senzing). If this option is not set the default libpostal data model is used.])
AS_VAR_IF([MODEL], [], [],
[AS_VAR_IF([MODEL], [senzing], [], [AC_MSG_FAILURE([Invalid MODEL value set])])])
AM_CONDITIONAL([DOWNLOAD_DATA], [test "x$DOWNLOAD_DATA" = "xtrue"])
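
For readers less familiar with M4sh, the nested `AS_VAR_IF` check on `MODEL` in this hunk behaves roughly like the plain shell below. This is a minimal sketch of the logic, not the code that `configure` actually generates; the function name `validate_model` is hypothetical and introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the MODEL check; not part of configure.ac.
validate_model() {
    case "$1" in
        "")
            # MODEL unset or empty: fall back to the default libpostal model.
            echo "default"
            ;;
        senzing)
            # The only alternative model currently accepted.
            echo "senzing"
            ;;
        *)
            # Anything else is rejected, as AC_MSG_FAILURE does in configure.
            echo "Invalid MODEL value set" >&2
            return 1
            ;;
    esac
}
```

Under this reading, `./configure MODEL=senzing` and a plain `./configure` both pass the check, while any other value (for example a misspelled model name) stops configuration early instead of failing later during the data download.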


@@ -14,7 +14,7 @@ LIBPOSTAL_DATA_DIR=$3
MB=$((1024*1024))
CHUNK_SIZE=$((64*$MB))
DATAMODEL="@LIBPOSTAL_DATA_MODEL@"
DATAMODEL="@MODEL@"
# Not loving this approach but there appears to be no way to query the size
# of a release asset without using the Github API
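
The two constants in this hunk work out to a chunk size of 64 MiB (67108864 bytes). As a rough sketch of how such a constant can be used, the snippet below computes how many chunks are needed to cover a file of a given size using ceiling division. The helper name `num_chunks` and the ceiling-division approach are assumptions for illustration and are not taken from the script itself.

```shell
#!/bin/sh
MB=$((1024*1024))
CHUNK_SIZE=$((64*MB))

# Hypothetical helper: number of CHUNK_SIZE pieces needed to cover
# a file of the given size in bytes (rounds up for a partial last chunk).
num_chunks() {
    size=$1
    echo $(( (size + CHUNK_SIZE - 1) / CHUNK_SIZE ))
}
```

For example, `num_chunks 67108865` reports 2, since a single extra byte past one full chunk still requires a second chunk.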