Info |
---|
At this point, the following steps should already have been completed by the Data Integration (DI) team: |
The load request is made for every document JSONL (following the required Data Structure) needed by your Data Integration process, once the files are generated.
For a product data sync, make sure the following tables have been prepared and are provided: doc_product, doc_attributes, doc_attribute_values and doc_languages
/load
The load request will do the following:
It saves your JSONL (for every data type required by the data sync type) in a Google Cloud Storage (GCS) bucket in the Boxalino project
the files are available for 30 days in the client's GCS bucket
It loads every JSONL into a BigQuery table in the Boxalino project
for Instant Updates: the tables have a retention policy of 1 day
for Delta Updates: the tables have a retention policy of 7 days
for Full Updates: the tables have a retention policy of 15 days
Boxalino will provide access to the data Storage Buckets and BigQuery datasets that store your data.
The service response is a JSON payload, as documented in Status Review.
Warning |
---|
If the service response is an error like: |
Request Definition
Parameter | Value | Notes
---|---|---
Endpoint | full data sync | https://boxalino-di-full-krceabfwya-ew.a.run.app
 | delta data sync | https://boxalino-di-delta-krceabfwya-ew.a.run.app
 | instant-update data sync | https://boxalino-di-instant-update-krceabfwya-ew.a.run.app
 | stage / testing | https://boxalino-di-stage-krceabfwya-ew.a.run.app
Action | /load |
Method | POST |
Body | the document JSONL |
Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>>
 | Content-Type | application/json
 | doc | the document type (as described in Data Structure)
 | client | your Boxalino account name
 | dev | only add it if the dev data index must be updated
 | mode | D for delta, I for instant update, F for full; technical: the constants from the GcpClientInterface https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L18
 | type | integration type (product, order, etc.) https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L22
 | tm | time, in the format YmdHis; technical: used to identify the version of the documents and create the content
 | ts | (optional) timestamp, millisecond-based UNIX timestamp; technical: calculated from the time the data has been exported to BQ; the update requests will be applied in ts version order; https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/DiRequestTrait.php#L140
 | chunk | (optional) for loading content by chunks
 | dataset | (optional) the dataset in which the doc_X tables must be stored; if not provided, the service will check the <index>_<mode> dataset in the Boxalino project, to which you will have access
 | project | (optional) the project name where the documents are to be stored
 | bucket | (optional) the storage bucket where the doc_X will be loaded; if not provided, the service will use the Boxalino project
A LOAD REQUEST code sample is available in the data-integration-doc-php library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadTrait.php
Optionally, if you use an advanced REST client, you can configure it as follows:
...
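For quick reference, below is a minimal curl sketch of a /load call assembled from the Request Definition above. The tm value and the doc_language.jsonl file name are examples only; the dev header is omitted and should be added only when targeting the dev data index.
Code Block |
---|
curl "https://boxalino-di-full-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20240101093000" \
  -H "mode: F" \
  -H "type: product" \
  -H "doc: doc_language" \
  -H "Authorization: Basic <base64 of DATASYNC API key:DATASYNC API Secret>" \
  --data-binary @doc_language.jsonl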
Integration Strategy
doc_attribute, doc_attribute_value, doc_language can be synchronized with a call to the /load endpoint
because GCP has a size limit of 32MB for POST requests, do not use the /load service for big data exports; if the data is too big, the service returns a 413 ENTITY TOO LARGE response. In this case, switch to the batch load
doc_product / doc_order / doc_user / doc_content / etc. and other content bigger than 32MB must be loaded in batches (max size of 32MB each) or via a GCS Public Signed URL, using the /load/chunk endpoint
NOTE: when using the /load/chunk service to generate a Google Cloud Storage Public URL, there is no size limit; it is possible to load all the data at once using one GCS Public URL link.
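A small sketch of the 32MB decision described above, assuming GNU stat and a hypothetical doc_product.jsonl export; small payloads can go straight to /load, larger ones should use the batch flow:
Code Block |
---|
LIMIT=$((32 * 1024 * 1024))              # GCP POST size limit (32MB)
SIZE=$(stat -c%s doc_product.jsonl)      # GNU stat; use stat -f%z on macOS
if [ "$SIZE" -le "$LIMIT" ]; then
  echo "use the /load endpoint"
else
  echo "use /load/chunk batches or a GCS Signed URL, then /load/bq"
fi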
Load in Batches / data > 32 MB
For content over 32MB, we provide an endpoint that returns a Signed GCS URL, to which you can stream all your content into a single file (currently there is no defined file size limit in GCS).
Read more about Google Cloud Signed URLs (response samples, uses, etc.): https://cloud.google.com/storage/docs/access-control/signed-urls
1. Make a request for a public upload link
This is the generic POST request (a code sample is available in the data-integration-doc-php library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/GcsLoadUrlTrait.php):
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-<mode>-krceabfwya-ew.a.run.app/load/chunk" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: <mode>" \
-H "chunk: <id>" \
-H "doc: doc_product|doc_order|doc_user|doc_content|.." \
-H "Authorization: Basic <encode of the account>" |
The response will be an upload link that can only be used to create the file in the client's GCS bucket. The link is valid for 30 minutes.
Note |
---|
NOTE: The HTTP headers must include a new property - chunk. The chunk value (numeric or textual) is the order of the batch / pagination / SEEK strategy used for content segmentation, etc. It is part of the final file name in GCS, to ensure that the Signed GCS URL is unique. |
2. Upload the content on the public link
A code sample is available in our generic PHP library
Code Block |
---|
curl --connect-timeout 60 <GCS-signed-url> \
-X PUT \
-H "Content-Type: application/octet-stream" \
--data-binary "<YOUR DOCUMENT JSONL CONTENT (STREAM)>" |
Panel |
---|
The use of the header |
Tip |
---|
Steps #1 - #2 must be repeated for every batch that needs to be added for the given process (same tm, mode & type). Only after all the data has been uploaded can you move on to step #3 (see the sketch below). |
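For orientation, here is a minimal bash sketch of the batch loop (steps #1 and #2 repeated per chunk) for a full product sync. The doc_product-*.jsonl file names, the tm value and the credentials placeholder are hypothetical, and it assumes the /load/chunk response body is the signed URL itself; adjust the parsing if your response is structured differently.
Code Block |
---|
TM="20240101093000"                  # example tm value; keep it identical for every batch
AUTH="Basic <base64 of DATASYNC API key:DATASYNC API Secret>"
CHUNK=1
for FILE in doc_product-*.jsonl; do  # hypothetical pre-split JSONL batches (< 32MB each)
  # step #1: request a signed upload URL for this chunk
  SIGNED_URL=$(curl -s -X POST "https://boxalino-di-full-krceabfwya-ew.a.run.app/load/chunk" \
    -H "Content-Type: application/json" \
    -H "client: <account>" -H "tm: $TM" -H "mode: F" -H "type: product" \
    -H "doc: doc_product" -H "chunk: $CHUNK" \
    -H "Authorization: $AUTH")
  # step #2: upload the batch content to the signed URL
  curl -s -X PUT "$SIGNED_URL" \
    -H "Content-Type: application/octet-stream" \
    --data-binary @"$FILE"
  CHUNK=$((CHUNK + 1))
done
# only after all batches are uploaded, continue with step #3 (/load/bq)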
3. Load data to BigQuery
Make an HTTP POST call to the load/bq endpoint; it will instruct BigQuery to load the stored GCS content into your dataset. A code sample is available in our generic PHP library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadBqTrait.php
This is the generic POST request:
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-<mode>-krceabfwya-ew.a.run.app/load/bq" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: <mode>" \
-H "doc: doc_product|doc_order|doc_user|doc_content|.." \
-H "Authorization: Basic <encode of the account>" |
Info |
---|
The HTTP headers must include the required properties as described in the Request Definition (see DiRequestTrait.php at 3.2.0 in boxalino/data-integration-doc-php). |
Fallback / retry policies
We do our best to ensure that our services are continuously available to receive requests. The endpoints have a timeout of 3600s. This means that your POST request will receive a response when the service has finished processing your request (either an error or a success) or when the timeout is reached.
The /load, /load/chunk and /load/bq endpoints respond successfully in about 10 seconds on average (so the wait time is minimal). However, for your fallback/retry policy, we recommend using a timeout configuration of 3-5 minutes.
Depending on your system, the timeout is represented by different parameters of the HTTP request.
Code Block |
---|
curl --max-time 10 --connect-timeout 30 <SERVICE> -X POST -d <data> |
--max-time is the response timeout: wait 10s for a response from the server, then stop the connection
--connect-timeout is the connection timeout: wait 30s to establish a connection to the service, then stop the connection
Instead of simply stopping the connection, we recommend integrating a fallback policy with a retry (see the sketch below).
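As an illustration, a minimal retry sketch around a load call, assuming a 60s connection timeout, a 5-minute response timeout and up to 3 attempts (all values are examples, not requirements):
Code Block |
---|
for ATTEMPT in 1 2 3; do
  if curl --fail --connect-timeout 60 --max-time 300 "<SERVICE>" -X POST \
       -H "Authorization: Basic <credentials>" \
       --data-binary @doc_product.jsonl; then
    break                                # success: stop retrying
  fi
  echo "attempt $ATTEMPT failed, retrying in 30s" >&2
  sleep 30
done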
Using private GCP resources for DI
Note |
---|
The following requirements must be met if you want to save the JSONL (#1) and load the JSONL (#2) in your custom project. If you are happy to use the Boxalino service for content storage & processing, you can ignore this requirements list. |
If the client wants to use their own GCP resources for the Data Integration with Boxalino, the following requirements must be met:
BigQuery Datasets
Using different datasets for different data integration processes (delta, full, instant) allows you to set a default table expiration for each. By doing so, you follow the BigQuery storage optimization best practices described in the documentation.
1. Naming
The BigQuery dataset in which your account documents are loaded is named after the data index and the process it is meant to synchronize to:
if the request is for the dev data index - <client>_dev_<mode>
if the request is for the production data index - <client>_<mode>
, where:
<client> is the account name provided by Boxalino.
<mode> is the process F (for full), D (for delta), I (for instant)
For example, for an account boxalino_client, the following datasets must exist (depending on your integration use cases):
for dev index: boxalino_client_dev_I, boxalino_client_dev_F, boxalino_client_dev_D
for production data index: boxalino_client_I, boxalino_client_F, boxalino_client_D
The above datasets must exist in your project.
2. Location
Upon the creation of your dataset, please use the Data Location: EU
...
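A minimal sketch of creating one such dataset with the bq CLI, assuming a hypothetical project my-gcp-project, the boxalino_client account and a 30-day default table expiration (all values are examples; adjust per mode):
Code Block |
---|
# create the full-mode production dataset in the EU location
# with a default table expiration of 30 days (in seconds)
bq mk --dataset \
  --location=EU \
  --default_table_expiration=2592000 \
  my-gcp-project:boxalino_client_F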
3. Permissions
In order to have read & write access to your private dataset, please provide the following permissions to the Data Integration Service Account (DISA) 55483703770-compute@developer.gserviceaccount.com:
BigQuery Data Viewer
BigQuery Metadata Viewer
BigQuery Data Editor / Owner
BigQuery Job User
or BigQuery Admin on the created datasets.
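One possible way to grant these roles is with gcloud at project level (a sketch, assuming a hypothetical project my-gcp-project; you can instead grant the same roles on the individual datasets via the console):
Code Block |
---|
DISA="55483703770-compute@developer.gserviceaccount.com"
# grant the BigQuery roles listed above to the DISA
for ROLE in roles/bigquery.dataViewer roles/bigquery.metadataViewer \
            roles/bigquery.dataEditor roles/bigquery.jobUser; do
  gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:${DISA}" \
    --role="${ROLE}"
done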
Storage Bucket
1. Naming
The storage buckets in which step #1 can be done (loading your generated JSONL documents) must be available in your custom project.
Follow the Google documentation on how to create storage buckets.
The storage bucket names must be unique within the scope of the integration. Please use the following naming formula:
<your-custom-project>_<your-boxalino-account>_dev
<your-custom-project>_<your-boxalino-account>
2. Location
For content storage, please use either Multi-region (eu) or Region (europe-west1).
...
3. Permissions
In order to have load & read access to your private Google Cloud Storage buckets, please provide the following permissions to the Data Integration Service Account (DISA) 55483703770-compute@developer.gserviceaccount.com:
Storage Object Creator
Storage Object Admin
The above permissions can be replaced with the Storage Admin role.
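A minimal gcloud sketch, assuming hypothetical project and bucket names, of creating the production bucket in the EU multi-region and granting the DISA object access (Storage Object Admin covers both creating and reading objects):
Code Block |
---|
DISA="55483703770-compute@developer.gserviceaccount.com"
# create the production bucket in the EU multi-region
gcloud storage buckets create gs://my-gcp-project_boxalino_client --location=eu
# grant the DISA object-level access on the bucket
gcloud storage buckets add-iam-policy-binding gs://my-gcp-project_boxalino_client \
  --member="serviceAccount:${DISA}" \
  --role="roles/storage.objectAdmin"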