Info

At this point, the following steps should already have been completed by the Data Integration (DI) team:

  1. Create the JSONL files https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252149803/Data+Integration#1.-Export-JSONL-files

...

  1. saving your JSONL files (one for every data type required by the data sync type) in a Google Cloud Storage bucket in the Boxalino project

    1. for Instant Updates: the files have a retention policy of 1 day

    2. for Delta Updates: the files have a retention policy of 7 days

    3. for Full Updates: the files have a retention policy of 30 days

  2. loading every JSONL file into a BigQuery table in the Boxalino project

Boxalino will provide access to the data Storage Buckets and BigQuery datasets that store your data.

...

The BigQuery dataset in which your account documents are loaded can be named after the data index and process it is meant to synchronize to:

  1. if the request is for the dev data index - <client>_dev_<mode>

  2. if the request is for the production data index - <client>_<mode>

where:

  • <client> is the account name provided by Boxalino.

  • <mode> is the process type: F (for full), D (for delta), I (for instant)

For example, for an account named boxalino_client, the following datasets must exist (depending on your integration use cases):

  • for dev index: boxalino_client_dev_I, boxalino_client_dev_F, boxalino_client_dev_D

  • for production data index: boxalino_client_I, boxalino_client_F, boxalino_client_D

The above datasets must exist in your project.
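
The naming rule above can be sketched as a small helper. This is only an illustration of the <client>_dev_<mode> / <client>_<mode> convention; the function name dataset_name is hypothetical and not part of any Boxalino library.

```python
def dataset_name(client: str, mode: str, dev: bool = False) -> str:
    """Return the BigQuery dataset name for an account, data index and mode."""
    if mode not in ("F", "D", "I"):
        raise ValueError(f"unknown mode {mode!r}; expected F, D or I")
    # dev data index: <client>_dev_<mode>; production data index: <client>_<mode>
    return f"{client}_dev_{mode}" if dev else f"{client}_{mode}"


print(dataset_name("boxalino_client", "I", dev=True))
print(dataset_name("boxalino_client", "F"))
```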

...

Endpoint

  • full data sync: https://boxalino-di-full-krceabfwya-ew.a.run.app

  • delta data sync: https://boxalino-di-delta-krceabfwya-ew.a.run.app

  • instant-update data sync: https://boxalino-di-instant-update-krceabfwya-ew.a.run.app

  • stage / testing: https://boxalino-di-stage-krceabfwya-ew.a.run.app

Action: /load

Method: POST

Body: the document JSONL

Headers

  • Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>

  • Content-Type: application/json

  • project: (optional) the project name where the documents are to be stored

  • dataset: (optional) the dataset in which the doc_X tables must be stored; if not provided, the service will check the <index>_<mode> dataset in the Boxalino project, to which you will have access

  • bucket: (optional) the storage bucket where the doc_X will be loaded; if not provided, the service will use the Boxalino project

  • doc: the document type (as described in Data Structure)

  • client: your Boxalino account name

  • dev: only add it if the dev data index must be updated

  • mode: D for delta, I for instant update, F for full

    • technical: the constants from the GcpRequestInterface https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L18

  • tm: time, in format: YmdHis

    • requirement: the same tm value must be used from the beginning of the DI process until the end, for all files.

    • technical: used to identify the version of the documents and create the content.

  • ts: timestamp, in milliseconds (UNIX timestamp)

    • technical: calculated from the time the data has been exported to BQ; the update requests will be applied in version ts order; https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/DiRequestTrait.php#L140

  • type: (optional) integration type (product, order, etc); https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/GcpRequestInterface.php#L22

  • chunk: (optional) for loading content by chunks; see https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Load-By-Chunks
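
As an illustration of the tm and ts headers described above, the following sketch derives both values once per DI process so they can be reused for every file. It assumes UTC and takes both values from the same instant; the function name new_sync_version is hypothetical, and the linked PHP library remains the reference for how the values are actually calculated.

```python
from datetime import datetime, timezone


def new_sync_version(now=None):
    """Return (tm, ts) for one DI process: tm as YmdHis, ts in milliseconds."""
    # Assumption: UTC is used; the source does not state a timezone.
    now = now or datetime.now(timezone.utc)
    tm = now.strftime("%Y%m%d%H%M%S")   # same layout as PHP's "YmdHis"
    ts = int(now.timestamp() * 1000)    # UNIX timestamp in milliseconds
    return tm, ts


# Compute once at the start of the process and reuse for all files.
tm, ts = new_sync_version()
print(tm, ts)
```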

A LOAD REQUEST code-sample is available in the data-integration-doc-php library: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadTrait.php
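
For readers not using PHP, a load request could look roughly like the Python sketch below, built only from the request definition above. Several details are assumptions rather than documented behavior: the credentials are encoded as standard HTTP Basic auth (key:secret), the dev header is sent with the value "true", and the doc value and JSONL body are placeholders. The helper names load_headers and build_load_request are hypothetical.

```python
import base64
import urllib.request

# Endpoint for full data sync, as listed in the request definition.
FULL_ENDPOINT = "https://boxalino-di-full-krceabfwya-ew.a.run.app"


def load_headers(api_key, api_secret, client, doc, mode, tm, ts, dev=False):
    """Build the HTTP headers for a /load request (required properties only)."""
    # Assumption: standard HTTP Basic auth, i.e. base64("key:secret").
    token = base64.b64encode(f"{api_key}:{api_secret}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "client": client,
        "doc": doc,
        "mode": mode,    # "F", "D" or "I"
        "tm": tm,        # YmdHis, identical for all files of one DI process
        "ts": str(ts),   # UNIX timestamp in milliseconds
    }
    if dev:
        headers["dev"] = "true"  # assumption: the header value is not documented here
    return headers


def build_load_request(endpoint, headers, jsonl):
    """Prepare (but do not send) the POST request carrying the JSONL body."""
    return urllib.request.Request(endpoint + "/load", data=jsonl, headers=headers, method="POST")


# Placeholder values throughout; see Data Structure for real document types.
headers = load_headers("key", "secret", "boxalino_client", "doc_language", "F",
                       "20240102030405", 1704164645000)
request = build_load_request(FULL_ENDPOINT, headers, b'{"example": "placeholder"}\n')
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```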

...

  • doc_attribute, doc_attribute_value, doc_language can be synchronized with a call to the load endpoint

  • doc_product / doc_order / doc_user must be loaded in batches (especially when synchronized in F/full mode) via a public GCS link

    • GCP has a size limit of 256MB for POST requests

    • we recommend avoiding this limit by requesting a public GCS load URL (steps below)
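
One possible way to prepare such batches is to group JSONL lines until a byte budget is reached. The sketch below is illustrative only: the function name batch_jsonl and the choice of budget are assumptions, and any segmentation strategy (pagination, SEEK, etc.) works equally well.

```python
def batch_jsonl(lines, max_bytes):
    """Yield JSONL payloads (bytes), each at most max_bytes where possible.

    A single line larger than max_bytes is still emitted, alone in its batch.
    """
    batch, size = [], 0
    for line in lines:
        encoded = line.rstrip("\n").encode("utf-8") + b"\n"
        # Close the current batch before it would exceed the budget.
        if batch and size + len(encoded) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(encoded)
        size += len(encoded)
    if batch:
        yield b"".join(batch)


docs = ['{"id":1}', '{"id":2}', '{"id":3}']
for payload in batch_jsonl(docs, max_bytes=20):
    print(len(payload), payload)
```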

Load By Chunks

In order to upload the content in chunks, the following requests are required:

  1. make an HTTP POST call to the /load/chunk endpoint.

    1. The HTTP headers must include a new property - chunk.

      1. the chunk value (number or textual) can be the order of the batch / pagination / SEEK strategy used for content segmentation / etc

    2. This endpoint returns a public URL, which is used to load the content: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/GcsLoadUrlTrait.php

      1. the received link is generated by GCS. It will be unique per each loaded segment.

  2. use the URL returned in step 1 to upload the document content with an HTTP PUT request

    1. https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadByChunkTrait.php#L24

  3. repeat steps 1 and 2 (in a loop) until all your product/order/customer content has been uploaded

    1. the chunk value is updated (as part of the iteration)

    2. the same tm value must be used

  4. make an HTTP POST call to the /load/bq endpoint

    1. This tells BigQuery to load the stored GCS content into your dataset

    2. https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/LoadBqTrait.php
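
The four steps above can be sketched as a single loop. This is a minimal illustration, not the library's implementation: it assumes the /load/chunk response body is the upload URL as plain text, that the PUT accepts a Content-Type of application/json, and that chunk is simply the 1-based batch number. The names load_by_chunks, http_send and the send callback are hypothetical; the transport is injected so the flow can be tried without network access.

```python
import urllib.request
from typing import Callable, Dict, Iterable

# send(method, url, headers, body) -> response body as text
Send = Callable[[str, str, Dict[str, str], bytes], str]


def http_send(method: str, url: str, headers: Dict[str, str], body: bytes) -> str:
    """Default transport based on urllib; performs a real HTTP call."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def load_by_chunks(endpoint: str, base_headers: Dict[str, str],
                   batches: Iterable[bytes], send: Send = http_send) -> int:
    """Upload every batch via a GCS URL, then trigger the BigQuery load."""
    uploaded = 0
    for chunk, payload in enumerate(batches, start=1):
        # Step 1: same headers (same tm) plus the chunk property.
        headers = dict(base_headers, chunk=str(chunk))
        # Assumption: the response body is the upload URL as plain text.
        upload_url = send("POST", endpoint + "/load/chunk", headers, b"").strip()
        # Step 2: PUT the batch content to the URL returned for this chunk.
        send("PUT", upload_url, {"Content-Type": "application/json"}, payload)
        uploaded += 1
    # Step 4: ask the service to load the stored GCS content into BigQuery.
    send("POST", endpoint + "/load/bq", dict(base_headers), b"")
    return uploaded
```

Passing a fake send function makes it easy to check the call order before pointing the loop at a real endpoint.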

Info

The HTTP headers must have the required properties as described in the Request Definition

https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/DiRequestTrait.php#L175

...