Info

At this point, the following steps should already have been completed by the Data Integration (DI) team:

  1. Create the JSONL files https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252149803/Data+Integration#1.-Export-JSONL-files

The load request is made for every document JSONL (following the required Data Structure) needed by your Data Integration process, once the files are generated.

For a product data sync, make sure the following tables have been prepared and are provided: doc_product, doc_attributes, doc_attribute_values and doc_languages.


/load

The load request will do the following things:

  1. It saves your JSONL (for every data type required by the data sync type) in a Google Cloud Storage (GCS) bucket in the Boxalino project

    1. the files are available for 30 days in the client's GCS bucket

  2. It loads every JSONL into a BigQuery table in the Boxalino project

    1. for Instant Updates: the tables have a retention policy of 1 day

    2. for Delta Updates: the tables have a retention policy of 7 days

    3. for Full Updates: the tables have a retention policy of 15 days

Boxalino will provide access to the data Storage Buckets and BigQuery datasets that store your data.

The service response is a JSON payload, as documented under Status Review.

Warning

If the service responds with an error such as 413 Request Entity Too Large, please use the /load/chunk flow.

Request Definition

Endpoint

full data sync

https://boxalino-di-full-krceabfwya-ew.a.run.app


delta data sync

https://boxalino-di-delta-krceabfwya-ew.a.run.app


instant-update data sync

https://boxalino-di-instant-update-krceabfwya-ew.a.run.app


stage / testing

https://boxalino-di-stage-krceabfwya-ew.a.run.app


Action

/load


Method

POST


Body

the document JSONL


Headers

Authorization

Basic base64<<DATASYNC API key : DATASYNC API Secret>>

Note: only the API credentials with role=ADMIN are valid. https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/713785345/API+Credentials#Roles


Content-Type

application/json


doc

the document type (as described in Data Structure)

client

your Boxalino account name


dev

only add it if the dev data index must be updated


mode

D for delta, I for instant update, F for full

technical: the constants from the GcpClientInterface https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L18

type

integration type (product, order, etc)
https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L22



tm

time, in format: YmdHis
requirement: the same tm value must be used from the beginning of the DI process until the end, for all files.

technical: used to identify the version of the documents and create the content.


ts

(optional) timestamp, must be a millisecond-based UNIX timestamp

technical: calculated from the time the data has been exported to BQ; the update requests will be applied in version ts order; if it is not set, our system sets it; https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/


chunk

(optional) for loading content by chunks
see https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Load-By-Chunks


dataset

(optional) the dataset in which the doc_X tables must be stored;

if not provided, the service will check the <index>_<mode> dataset in the Boxalino project, to which you will have access


project

(optional) the project name where the documents are to be stored;


bucket

(optional) the storage bucket where the doc_X will be loaded;

if not provided, the service will use the Boxalino project.

A LOAD REQUEST code sample is available in the data-integration-doc-php library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadTrait.php
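
For illustration, a minimal /load call with curl might look like the sketch below (the account, credentials and JSONL file name are placeholders; the Authorization value is the base64 encoding of your DATASYNC API key and secret):

Code Block
# sketch of a /load request for a full product sync; all values are placeholders
# the same tm value must be reused for every document of the same DI process
TM=$(date +%Y%m%d%H%M%S)
AUTH=$(echo -n "<DATASYNC API key>:<DATASYNC API Secret>" | base64)

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-full-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: $TM" \
  -H "type: product" \
  -H "mode: F" \
  -H "doc: doc_product" \
  -H "Authorization: Basic $AUTH" \
  --data-binary "@doc_product.jsonl"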


Optionally, if you use an advanced REST client, you can configure it as follows:

...

Integration Strategy

  • doc_attribute, doc_attribute_value, doc_language can be synchronized with a call to the load endpoint

    • because GCP has a size limit of 32MB for POST requests, do not use the /load service for big data exports

    • if the data is too big, the service returns a 413 ENTITY TOO LARGE response. In this case, switch to the batch load (see the sketch after this list)

  • doc_product / doc_order / doc_user / doc_content / etc and other content bigger than 32 MB must be loaded in batches (max size of 32MB) or via GCS Public Signed URL /load/chunk endpoint

    • NOTE: when using /load/chunk service to generate a Google Cloud Storage Public URL, there is no size limit! It is possible to load all data at once using one GCS Public URL link.
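
A minimal sketch of this decision, assuming one JSONL file per document type and GNU coreutils stat to read the file size:

Code Block
# sketch: route small exports to /load and anything above 32MB to the /load/chunk flow
FILE="doc_product.jsonl"
LIMIT=$((32 * 1024 * 1024))
SIZE=$(stat -c%s "$FILE")    # GNU stat; on macOS use: stat -f%z "$FILE"

if [ "$SIZE" -le "$LIMIT" ]; then
  echo "$FILE (${SIZE} bytes): POST it directly to /load"
else
  echo "$FILE (${SIZE} bytes): use the /load/chunk + /load/bq flow"
fi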

Load in Batches / data > 32 MB

For content over 32MB, we provide an endpoint that returns a Signed GCS URL to which you can stream all your content into a single file (currently there is no defined file size limit in GCS).

Read more about Google Cloud Signed URLs at https://cloud.google.com/storage/docs/access-control/signed-urls (response samples, uses, etc.)

This is the generic POST request:

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-<mode>-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: <mode>" \
  -H "chunk: <id>" \
  -H "doc: doc_product|doc_order|doc_user|doc_content|.." \
  -H "Authorization: Basic <encode of the account>"

The response will be an upload link that can only be used to create the file in the client's GCS bucket. The link is valid for 30 minutes.

Note

NOTE: The HTTP headers must include a new property - chunk. The chunk value (number or textual) is the order of the batch / pagination / SEEK strategy used for content segmentation / etc. It is part of the final file available in GCS, to ensure that the Signed GCS Url is unique.

A code sample is available in our generic PHP library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadByChunkTrait.php#L24

Code Block
curl --connect-timeout 60 <GCS-signed-url> \
    -X PUT \
    -H "Content-Type: application/octet-stream" \
    --data-binary "<YOUR DOCUMENT JSONL CONTENT (STREAM)>"
Panel

The use of the header chunk is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /load/chunk request.

Tip

Steps #1 and #2 must be repeated for every batch that needs to be added for the given process (same tm, mode & type).

Only after all the data has been loaded can you move on to step #3.
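
To illustrate the loop of steps #1 and #2, here is a minimal bash sketch; the endpoint, account, credentials and file names are placeholders, and the /load/chunk response is assumed here to be the signed upload URL returned as plain text:

Code Block
#!/bin/bash
# sketch only: placeholder account, credentials and file names;
# the /load/chunk response is assumed to be the signed upload URL as plain text
ACCOUNT="<account>"
AUTH="Basic <base64 of key:secret>"
TM="20230715120000"
CHUNK=1

for FILE in doc_product_part_*.jsonl; do
  # step #1: request a signed GCS upload URL for this chunk
  SIGNED_URL=$(curl -s --connect-timeout 60 --max-time 300 \
    "https://boxalino-di-full-krceabfwya-ew.a.run.app/load/chunk" \
    -X POST \
    -H "Content-Type: application/json" \
    -H "client: $ACCOUNT" \
    -H "dev: false" \
    -H "tm: $TM" \
    -H "type: product" \
    -H "mode: F" \
    -H "chunk: $CHUNK" \
    -H "doc: doc_product" \
    -H "Authorization: $AUTH")

  # step #2: upload the chunk content to the signed URL
  curl --connect-timeout 60 "$SIGNED_URL" \
    -X PUT \
    -H "Content-Type: application/octet-stream" \
    --data-binary "@$FILE"

  # increment the chunk value for the next /load/chunk request
  CHUNK=$((CHUNK + 1))
done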

3. Load data to BigQuery

Make an HTTP POST call to the load/bq endpoint; this informs BigQuery to load the stored GCS content into your dataset. A code sample is available in our generic PHP library

This is the generic POST request:

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-<mode>-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: <mode>" \
  -H "doc: doc_product|doc_order|doc_user|doc_content|.." \
  -H "Authorization: Basic <encode of the account>"

Info

The HTTP headers must have the required properties as described in the Request Definition

data-integration-doc-php/DiRequestTrait.php at 3.2.0 · boxalino/data-integration-doc-php

...

The /load, /load/chunk and /load/bq requests return a successful response in about 10 seconds on average (so the wait time is minimal). However, for your fallback/retry policy we recommend a timeout configuration of 3-5 minutes.

Based on your system, the timeout is represented by different parameters for the HTTP request.
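
For example, with curl the recommended timeouts and a simple retry policy could be expressed with the following flags (all values and headers are illustrative):

Code Block
# illustrative values: 60s connect timeout, 5 min overall timeout, up to 3 retries on transient errors
curl --connect-timeout 60 --max-time 300 --retry 3 --retry-delay 30 \
  "https://boxalino-di-full-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: <tm>" \
  -H "type: product" \
  -H "mode: F" \
  -H "doc: doc_product" \
  -H "Authorization: Basic <base64 of key:secret>"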

...

The BigQuery dataset into which your account documents are loaded is named after the data index and the process it synchronizes:

  1. if the request is for the dev data index - <client>_dev_<mode>

  2. if the request is for the production data index - <client>_<mode>

where:

  • <client> is the account name provided by Boxalino.

  • <mode> is the process: F (for full), D (for delta), I (for instant)

For example, for an account boxalino_client, the following datasets must exist (depending on your integration use-cases):

  • for dev index: boxalino_client_dev_I, boxalino_client_dev_F, boxalino_client_dev_D

  • for production data index: boxalino_client_I, boxalino_client_F, boxalino_client_D

The above datasets must exist in your project.
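
As a quick check (a sketch assuming the bq CLI from the Google Cloud SDK is installed and authorized for the project hosting the datasets), you can list the datasets and inspect one of them:

Code Block
# list all datasets in the project (placeholder project id)
bq ls --project_id=<your-project-id>

# show details of one expected dataset, e.g. the production full-sync dataset from the example above
bq show <your-project-id>:boxalino_client_F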

...