
Info

At this point, the Data Integration (DI) team should already have completed the following steps:

Create the JSONL files https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252149803/Data+Integration#1.-Export-JSONL-files

Once the JSONL files are generated, a load request is made for every document JSONL (following the required Data Structure) that your Data Integration process needs.

For a product data sync, make sure the following tables have been prepared and are provided: doc_product, doc_attributes, doc_attribute_values and doc_languages

/load

The load request does the following with your content:

saves your JSONL (for every data type required for the data sync type) in a Google Cloud Storage (GCS) bucket in the Boxalino project
1. the files are available for 30 days in the client's GCS bucket
loads every JSONL into a BigQuery table in the Boxalino project
1. for Instant Updates: the tables have a retention policy of 1 day
2. for Delta Updates: the tables have a retention policy of 7 days
3. for Full Updates: the tables have a retention policy of 30 days

Boxalino will provide access to the data Storage Buckets and BigQuery datasets that store your data.

...

Request Definition

Endpoint	full data sync	https://boxalino-di-full-krceabfwya-ew.a.run.app
	delta data sync	https://boxalino-di-delta-krceabfwya-ew.a.run.app
	instant-update data sync	https://boxalino-di-instant-update-krceabfwya-ew.a.run.app
	stage / testing	https://boxalino-di-stage-krceabfwya-ew.a.run.app
Action	/load
Method	POST
Body	the document JSONL
Headers	Authorization	Basic base64<<DATASYNC API key : DATASYNC API Secret>>. Note: only the API credentials with role=ADMIN are valid. https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/713785345/API+Credentials#Roles
	Content-Type	application/json
	project	(optional) the project name where the documents are to be stored
	dataset	(optional) the dataset in which the doc_X tables must be stored; if not provided, the service will check the <index>_<mode> dataset in the Boxalino project, to which you will have access
	bucket	(optional) the storage bucket where the doc_X will be loaded; if not provided, the service will use the Boxalino project
	doc	the document type (as described in Data Structure)
	client	your Boxalino account name
	dev	only add it if the dev data index must be updated
	mode	D for delta, I for instant update, F for full. Technical: the constants from the GcpRequestInterface https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L18
	tm	time, in format: YmdHis. Requirement: the same tm value must be used from the beginning of the DI process until the end, for all files. Technical: used to identify the version of the documents and create the content.
	ts	timestamp, in milliseconds (UNIX timestamp). Technical: calculated from the time the data has been exported to BQ; the update requests will be applied in version ts order; https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/DiRequestTrait.php#L140
	type	integration type (product, order, etc) https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L22
	chunk	(optional) for loading content by chunks, see https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Load-By-Chunks

A LOAD REQUEST code-sample is available in the data-integration-doc-php library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadTrait.php
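
For readers not using the PHP library, the request definition above can be sketched in Python with the standard library only. This is a minimal illustration, not the official client: the function names (make_tm, make_ts, basic_auth, build_headers, build_load_request) are hypothetical, the use of UTC for tm is an assumption, and the value "1" for the dev header is an assumption (the table only says to add the header when the dev index must be updated).

```python
import base64
import time
import urllib.request

# Endpoints copied from the Request Definition table, keyed by the mode header value.
ENDPOINTS = {
    "F": "https://boxalino-di-full-krceabfwya-ew.a.run.app",
    "D": "https://boxalino-di-delta-krceabfwya-ew.a.run.app",
    "I": "https://boxalino-di-instant-update-krceabfwya-ew.a.run.app",
}


def make_tm(now: float) -> str:
    """Format a UNIX time as YmdHis (assumption: UTC is acceptable)."""
    return time.strftime("%Y%m%d%H%M%S", time.gmtime(now))


def make_ts(now: float) -> int:
    """Convert a UNIX time in seconds to a millisecond timestamp."""
    return int(now * 1000)


def basic_auth(api_key: str, api_secret: str) -> str:
    """Build the Authorization value: Basic base64(key:secret)."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_headers(client, doc, mode, data_type, tm, ts, api_key, api_secret, dev=False):
    """Collect the required headers listed in the Request Definition."""
    headers = {
        "Authorization": basic_auth(api_key, api_secret),
        "Content-Type": "application/json",
        "client": client,
        "doc": doc,
        "mode": mode,
        "type": data_type,
        "tm": tm,
        "ts": str(ts),
    }
    if dev:
        headers["dev"] = "1"  # assumption: the header value itself is not documented here
    return headers


def build_load_request(endpoint: str, jsonl_body: bytes, headers: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request to the /load action."""
    return urllib.request.Request(
        endpoint + "/load", data=jsonl_body, headers=headers, method="POST"
    )
```

The request object can then be sent with urllib.request.urlopen(request, timeout=...). Generate tm and ts once at the start of the DI process and reuse them for every document of that run.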

Optionally, if you use an advanced REST client, you can configure it as follows:

...

doc_attribute, doc_attribute_value, doc_language can be synchronized with a call to the load endpoint
- because GCP has a size limit of 32MB for POST requests, do not use the /load service for big data exports
doc_product / doc_order / doc_user / doc_content and any other content bigger than 32 MB must be loaded in batches / via a public GCS link
- GCP has a size limit of 256MB for POST requests
- we recommend avoiding it by requesting a public GCS load URL (steps below)

...

(max size of 32MB) or via GCS Public Signed URL /load/chunk endpoint

Load in Batches / data > 32 MB

To upload the content in chunks or in large volumes (GB of data), the following requests are required:

1. Make an HTTP POST call to the /load/chunk endpoint.
  - The HTTP headers must include an additional property: chunk.
    - the chunk value (numeric or textual) can be the order of the batch, the pagination index, the SEEK strategy used for content segmentation, etc.
  - This endpoint returns a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls). It is used to load the content: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/GcsLoadUrlTrait.php
    - the received link is generated by GCS and is unique for each loaded segment.
2. With the response from step #1, load the document content (PUT).
  - https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadByChunkTrait.php#L24
3. Repeat step #1 and step #2 (in a loop) until all your product/order/customer content has been uploaded.
  - the chunk value is updated (as part of the iteration)
  - the same tm value must be used
4. Make an HTTP POST call to the /load/bq endpoint.
  - It tells BQ to load the stored GCS content into your dataset.
  - https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadBqTrait.php

Info

The HTTP headers must have the required properties as described in the Request Definition.

https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/DiRequestTrait.php#L175
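
The chunked flow described above (request a signed URL, PUT the segment, repeat, then trigger the BigQuery load) can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's implementation: load_in_chunks and urllib_send are hypothetical names, the /load/chunk response body is assumed to contain the signed URL as plain text, and the Content-Type used for the PUT is an assumption that may need to match what the signed URL expects.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Optional

# Hypothetical transport signature: (method, url, headers, body) -> response body as text.
Send = Callable[[str, str, Dict[str, str], Optional[bytes]], str]


def urllib_send(method: str, url: str, headers: Dict[str, str],
                body: Optional[bytes], timeout: float = 300) -> str:
    """Default transport based on the standard library."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def load_in_chunks(endpoint: str, headers: Dict[str, str],
                   chunks: Iterable[bytes], send: Send = urllib_send) -> int:
    """Upload every chunk through a signed URL, then ask the service to load them into BQ.

    `headers` must already hold the required properties (client, doc, mode, type,
    tm, ts, Authorization); the same tm is reused for every chunk.
    """
    uploaded = 0
    for number, body in enumerate(chunks, start=1):
        # Step 1: request a signed URL for this segment (chunk header changes per iteration).
        chunk_headers = dict(headers, chunk=str(number))
        # Assumption: the response body is the signed URL itself.
        signed_url = send("POST", endpoint + "/load/chunk", chunk_headers, None).strip()
        # Step 2: PUT the segment content to the signed URL.
        # Assumption: this Content-Type is accepted by the signed URL.
        send("PUT", signed_url, {"Content-Type": "application/octet-stream"}, body)
        uploaded += 1
    if uploaded:
        # Final step: tell the service to load the stored GCS content into the dataset.
        send("POST", endpoint + "/load/bq", dict(headers), None)
    return uploaded
```

Passing the transport as a parameter keeps the loop testable without network access and makes it easy to wrap each call in your own timeout and retry policy.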

Fallback / retry policies

We do our best to ensure that our services are continuously available to receive requests. The endpoints have a timeout of 3600s. This means that your POST request will receive a response either when the service has finished processing your request (with an error or a success) or when the timeout is reached.

The /load, /load/chunk and /load/bq endpoints return a successful response in 10 seconds on average (so the wait time is minimal). However, for your fallback/retry policy, we recommend a timeout configuration of 3-5 min.

Based on your system, the timeout is represented by different parameters for the HTTP request.

Code Block
curl --max-time 300 --connect-timeout 30 <SERVICE> -X POST -d <data>

--max-time is the overall timeout: wait up to 300s (5 min) for the whole request, including the response from the server, then stop the connection
--connect-timeout is the connection timeout: wait up to 30s to establish a connection to the service, then stop the connection

Instead of only stopping the connection, we recommend integrating a fallback policy with a retry.
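
One possible shape for such a retry policy is sketched below in Python. It is only an illustration: with_retry is a hypothetical helper, and the number of attempts and the exponential delays are assumptions to tune for your system. It retries on OSError, which covers socket timeouts and urllib.error.URLError; you may want to stop retrying on authentication or validation errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(action: Callable[[], T], attempts: int = 3, base_delay: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `action`, retrying on network errors with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last error is re-raised once all attempts are used.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            # Timeouts and urllib.error.URLError are OSError subclasses.
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The action would typically be a function that sends one request with a 3-5 min timeout, for example a call to urllib.request.urlopen(request, timeout=300).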

Using private GCP resources for DI

Note

The following requirements must be followed if you want to save the JSONL (#1) and load the JSONL (#2) in your custom project.

If you are happy to use the Boxalino service for content storage & processing, you can ignore the requirements list.

If you want to use your own GCP resources for the Data Integration with Boxalino, the following requirements must be met:

BigQuery Datasets

Using different datasets for the different data integration processes (delta, full, instant) lets you set a default table expiration per dataset. By doing so, you follow the BigQuery Storage Optimization best practices as described in the documentation.

1. Naming

The BigQuery dataset in which your account documents are loaded can be named after the data index and process it is meant to synchronize to:

if the request is for the dev data index - <client>_dev_<mode>
if the request is for the production data index - <client>_<mode>

where:

<client> is the account name provided by Boxalino.
<mode> is the process F (for full), D (for delta), I (for instant)

For example, for an account boxalino_client, the following datasets must exist (depending on your integration use-cases):

for dev index: boxalino_client_dev_I, boxalino_client_dev_F, boxalino_client_dev_D
for production data index: boxalino_client_I, boxalino_client_F, boxalino_client_D

The above datasets must exist in your project.
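
The naming rule can be expressed as a small helper, shown here as a Python sketch. The function name dataset_name is hypothetical; the pattern and the mode letters come from the rule above.

```python
# Mode letters as defined in the naming rule.
MODES = {"F": "full", "D": "delta", "I": "instant"}


def dataset_name(client: str, mode: str, dev: bool = False) -> str:
    """Return <client>_dev_<mode> for the dev index, <client>_<mode> for production."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}, got {mode!r}")
    return f"{client}_dev_{mode}" if dev else f"{client}_{mode}"


def required_datasets(client: str, dev: bool = False) -> list:
    """List the datasets for all three processes of one data index."""
    return [dataset_name(client, mode, dev) for mode in ("I", "F", "D")]
```

For the account boxalino_client, required_datasets("boxalino_client", dev=True) yields the three dev datasets listed in the example.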

2. Location

When creating your dataset, please use the Data Location: EU

...

3. Permissions

In order to have read & write access to your private dataset, please provide the following permissions to the Data Integration Service Account (DISA) 55483703770-compute@developer.gserviceaccount.com:

BigQuery Data Viewer
BigQuery Metadata Viewer
BigQuery Data Editor / Owner
BigQuery Job User

or BigQuery Admin on the created datasets.

Storage Bucket

1. Naming

The storage buckets used for step #1 (loading your generated JSONL document) must be available in your custom project.

Follow the Google documentation on how to create storage buckets.

The storage bucket names must be unique within the scope of the integration. Please use the following naming formula:

<your-custom-project>_<your-boxalino-account>_dev
<your-custom-project>_<your-boxalino-account>

2. Location

For content storage, please use either Multi-region (eu) or Region (europe-west1).

...

3. Permissions

In order to have load & read access to your private Google Cloud Storage buckets, please provide the following permissions to the Data Integration Service Account (DISA) 55483703770-compute@developer.gserviceaccount.com:

Storage Object Creator
Storage Object Admin

The above permissions can be replaced with the Storage Admin role.

Version	Old Version 12	New Version 16
Changes made by	Dana Negrescu	Dana Negrescu
Saved on	Jan 27, 2022	Feb 06, 2023
