From d2732922c249f55a18550a7305642a2521bb938c Mon Sep 17 00:00:00 2001
From: Al
Date: Mon, 17 Apr 2017 14:11:44 -0400
Subject: [PATCH] [data] deployed model files and training data to CloudFront for easier downloading around the world and in places like China where the Great Fire Wall may prevent large downloads from abroad. TTL is set to 0 so it still caches the files themselves but checks with origin for the If-Modified-Since headers, allowing the files to be updated dynamically

---
 README.md          | 2 +-
 src/libpostal_data | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index af9d0108..b3f8d724 100644
--- a/README.md
+++ b/README.md
@@ -405,7 +405,7 @@ Libpostal is a bit different because it's trained on open data that's available
 Training data are stored on S3 by the date they were created. There's also a file stored on S3 to point to the most recent training data. To always point to the latest data, use something like: ```latest=$(curl https://s3.amazonaws.com/libpostal/training_data/latest)``` and use that variable in place of the date.
 ### Parser training sets ###
-All files can be found under s3://libpostal/training_data/YYYY-MM-DD/parser/ as gzip'd tab-separated values (TSV) files formatted like:```language\tcountry\taddress```.
+All files can be found at https://d1p366rbd94x8u.cloudfront.net/training_data/$YYYY-MM-DD/parser/$FILE as gzip'd tab-separated values (TSV) files formatted like:```language\tcountry\taddress```.
 - **formatted_addresses_tagged.random.tsv.gz** (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
 - **formatted_places_tagged.random.tsv.gz** (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
diff --git a/src/libpostal_data b/src/libpostal_data
index 1b879a4a..9b337750 100755
--- a/src/libpostal_data
+++ b/src/libpostal_data
@@ -11,7 +11,8 @@ LIBPOSTAL_VERSION_STRING="v1"
 LIBPOSTAL_S3_BUCKET_NAME="libpostal"
 LIBPOSTAL_S3_KEY="s3://$LIBPOSTAL_S3_BUCKET_NAME"
-LIBPOSTAL_S3_BUCKET_URL="http://$LIBPOSTAL_S3_BUCKET_NAME.s3.amazonaws.com"
+LIBPOSTAL_S3_BUCKET_URL="https://$LIBPOSTAL_S3_BUCKET_NAME.s3.amazonaws.com"
+LIBPOSTAL_CLOUDFRONT_URL="https://d1p366rbd94x8u.cloudfront.net"
 LIBPOSTAL_DATA_FILE="libpostal_data.tar.gz"
 LIBPOSTAL_PARSER_FILE="parser.tar.gz"
 LIBPOSTAL_LANG_CLASS_FILE="language_classifier.tar.gz"
@@ -112,7 +113,7 @@ download_file() {
     echo "Checking for new libpostal $name..."
-    url=$LIBPOSTAL_S3_BUCKET_URL/$prefix/$filename
+    url=$LIBPOSTAL_CLOUDFRONT_URL/$prefix/$filename
     if [ $(curl -sI $url -z "$(cat $updated_path)" --remote-time -w %{http_code} -o /dev/null | grep "^200$") ]; then
         echo "New libpostal $name available"