Info

At this point, the following steps should already have been completed by the Data Integration (DI) team:

  1. Create the JSONL files https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252149803/Data+Integration#1.-Export-JSONL-files

...

  1. saving your JSONL (for every data type required for the data sync type) in a Google Cloud Storage (GCS) Bucket from the Boxalino project

    1. the files are available for 30 days in the client's GCS bucket

  2. loading every JSONL in a BigQuery table in Boxalino project

    1. for Instant Updates: the tables have a retention policy of 1 day

    2. for Delta Updates: the tables have a retention policy of 7 days

    3. for Full Updates: the tables have a retention policy of 30 days

...

  • doc_attribute, doc_attribute_value, doc_language can be synchronized with a call to the load endpoint

    • because GCP has a size limit of 32MB for POST requests, do not use the /load service for big data exports

  • doc_product / doc_order / doc_user must be loaded in batches / via public GCS link

    • GCP has a size limit of 32MB for POST requests

    • we recommend working around this limit by using a public GCS load URL (steps below)
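The rule above can be sketched as a small shell helper. This is only an illustration: the function name `choose_load_endpoint` is hypothetical, the 32MB limit is assumed to mean 32 * 1024 * 1024 bytes, and `/load/chunk` is assumed to be the path used for batched / GCS signed URL loading. The helper only prints the suggested path; it does not call any service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a load path for a JSONL export.
# Assumption: the 32MB POST limit is treated as 32 * 1024 * 1024 bytes.
MAX_POST_BYTES=$((32 * 1024 * 1024))

# Usage: choose_load_endpoint <doc_type> <jsonl_file>
choose_load_endpoint() {
  local doc_type="$1" file="$2" size
  case "$doc_type" in
    # These documents are loaded in batches regardless of size.
    doc_product|doc_order|doc_user)
      echo "/load/chunk"
      return 0
      ;;
  esac
  # wc -c reports the byte count on both GNU and BSD systems.
  size=$(wc -c < "$file") || return 1
  size=$((size))  # strips the padding some wc implementations add
  if [ "$size" -lt "$MAX_POST_BYTES" ]; then
    echo "/load"
  else
    echo "/load/chunk"
  fi
}
```

For example, `choose_load_endpoint doc_language doc_language.jsonl` prints `/load` for a small file, while any `doc_product` export is routed to `/load/chunk`.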

...

  • (max size of 32MB) or via GCS Public Signed URL /load/chunk endpoint

Load in Batches / data > 32 MB

To upload the content in chunks or in large volumes (GBs of data), the following requests are required:

...

Info

The HTTP headers must have the required properties as described in the Request Definition

data-integration-doc-php/DiRequestTrait.php at 3.2.0 · boxalino/data-integration-doc-php: https://github.com/boxalino/data-integration-doc-php/blob/3.0.0/src/Service/Flow/DiRequestTrait.php#L175

Fallback / retry policies

We do our best to ensure that our services are continuously available to receive requests. The endpoints have a timeout of 3600s. This means that your POST request will receive a response when the service has finished processing your request (either error or success) or when the timeout is reached.

The /load, /load/chunk, and /load/bq endpoints return a successful response in about 10 seconds on average, so the wait time is minimal. However, we recommend a timeout of 3-5 minutes for your fallback/retry policy.

Depending on your system, the timeout is configured through different parameters of the HTTP request.

Code Block
curl --max-time 300 --connect-timeout 30 <SERVICE> -X POST -d <data>

--max-time is the overall timeout: wait at most 300s (5 min) for the whole request, connection included, or stop the connection
--connect-timeout is the connection timeout: wait at most 30s to establish a connection to the service or stop the connection

Instead of simply stopping the connection, we recommend implementing a fallback policy with a retry.
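A minimal sketch of such a retry policy is shown below. It is not part of the Boxalino tooling: the function name `post_with_retry`, the number of attempts, and the doubling delay are assumptions chosen for illustration, and the required HTTP headers from the Request Definition are omitted for brevity. The sketch retries on any curl failure, including HTTP error responses (via `--fail`), so you may want to restrict it to timeouts and 5xx errors in a real integration.

```shell
#!/usr/bin/env bash
# Hypothetical retry wrapper around curl; all names and defaults are illustrative.
CURL_BIN="${CURL_BIN:-curl}"              # overridable, e.g. for testing
MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"         # assumed retry budget
MAX_TIME="${MAX_TIME:-300}"               # 5 min, within the recommended 3-5 min
CONNECT_TIMEOUT="${CONNECT_TIMEOUT:-30}"  # seconds to establish the connection
RETRY_DELAY="${RETRY_DELAY:-5}"           # seconds, doubled after each failure

# Usage: post_with_retry <service_url> <data_file>
post_with_retry() {
  local url="$1" data_file="$2"
  local attempt=1 delay="$RETRY_DELAY"
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    # --fail makes curl exit non-zero on HTTP error responses.
    if "$CURL_BIN" --fail --silent --show-error \
         --max-time "$MAX_TIME" --connect-timeout "$CONNECT_TIMEOUT" \
         -X POST --data-binary "@$data_file" "$url"; then
      return 0
    fi
    echo "attempt $attempt of $MAX_ATTEMPTS failed" >&2
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$MAX_ATTEMPTS" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
  done
  return 1
}
```

A call such as `post_with_retry "<SERVICE>" doc_language.jsonl` returns 0 as soon as one attempt succeeds and 1 once all attempts are exhausted, which is the point where a fallback (for example, alerting or re-queuing the export) would take over.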

Using private GCP resources for DI

...

The BigQuery dataset in which your account documents are loaded can be named after the data index and process it is meant to synchronize to:

  1. if the request is for the dev data index - <client>_dev_<mode>

  2. if the request is for the production data index - <client>_<mode>

where:

  • <client> is the account name provided by Boxalino.

  • <mode> is the process: F (full), D (delta), or I (instant)

For example, for an account boxalino_client, the following datasets must exist (depending on your integration use cases):

  • for dev index: boxalino_client_dev_I, boxalino_client_dev_F, boxalino_client_dev_D

  • for production data index: boxalino_client_I, boxalino_client_F, boxalino_client_D

The above datasets must exist in your project.
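The naming convention above can be expressed as a short shell function, which may help when scripting the dataset setup. The function name `di_dataset_name` and the `dev` / `production` argument values are hypothetical; only the resulting dataset names follow the convention described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the BigQuery dataset name for a DI process.
# Usage: di_dataset_name <client> <dev|production> <F|D|I>
di_dataset_name() {
  local client="$1" index="$2" mode="$3"
  case "$mode" in
    F|D|I) ;;  # full, delta, instant
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  case "$index" in
    dev)        echo "${client}_dev_${mode}" ;;
    production) echo "${client}_${mode}" ;;
    *) echo "unknown index: $index" >&2; return 1 ;;
  esac
}
```

For instance, `di_dataset_name boxalino_client dev I` prints `boxalino_client_dev_I`, and `di_dataset_name boxalino_client production F` prints `boxalino_client_F`.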

...