Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

For the purpose of account launch and best-practices configuration, the following data must be synchronized fully once:

...

product

...

order (online or/and offline)

...

Tip

At Boxalino we are strong advisors for ELT flows (vs ETL). The major benefit is speed, transparency and maintenence: data is loaded directly into a destination system (BQ), and transformed in-parallel (Dataform).

The ELT flow is recommended when wanting to load your data as is (no transformation) in BQ and then perform the transformation logic (SQL) in Dataform. Integration-effort is minimal (push all data to GCS - Boxalino-owned or private client-owned, load data to BQ, write SQL in Dataform).

More information about the Integration Strategies is available on Confluence.
Integration Strategies

...

The Generic Data Integration (DI) Flow expects for the client, themselves to :

...

create each JSONL document

...

load to Boxalino GCS

...

load the files to Boxalino BQ

...

execute the steps described in Data Integration .

For certain platforms (Shopware6, Magento2), the step #1 can be done with a Boxalino Data Integration plugin.

...

Tip

As a integrator, you can do the steps #1-#4 Data Integration by following our recommended integration flowIntegration Flow

...

Code Block
[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|boxalino",
      "options": {
          // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1(CSV)|0(JSONL),
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": ""
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]

1. Data Connector & Data Load

The access to the data is managed by the client. These are a few guidelines for naming patterns:

  1. The relevant data sources are available in .csv or JSONL (prefered) format

  2. The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl)

    1. this will help automating the integration

  3. The files required to update a certain data type (ex: product, order, etc) are available in the same path

  4. The files are available on an endpoint (SFTP, GCS, public) to which Boxalino has access

    1. for GCS/ GCP sources: access being shared to Boxalino`s Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com

    2. for AWS / SFTP : a Boxalino user & credentials are must be provided

Expand
titleGCS connector properties (with JSONL files)
Code Block
"connector": {
  "type": "gcs",
    "options": {
      "source": {
        "bucket": "<GCS-bucket-name>",
        "pattern": "path/to/data/__NODASHDATE__/"
      }
  },
  "load": {
    "options": {
      "format": "NEWLINE_DELIMITED_JSON",
      "autodetect": true,
      "schema": "",
      "skipRows": 0,
      "max_bad_records": 0,
      "write_disposition": "WRITE_TRUNCATE"
    },
    "doc_attribute": {
      "properties": [
        {
          "source": "file1___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_attribute_value": {
      "properties": [
        {
          "source": "file2___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_product": {
      "entity": [
        {
          "source": "file3___NODASHDATE__.jsonl",
          "autodetect": 0,
          "schema": "SCHEMA FOR LOAD TO BQ"
          "name": ""
        }
      ]
    }
}

...

Panel
panelIconIdatlassian-warning
panelIcon:warning:
panelIconText:warning:
bgColor#EAE6FF

The load configuration defines the properties for loading content to BQ.

The requirements specified above (#1-#4) are necessary if the data is accessed from a remote (outside Boxalino) scope.

If your integration exports the data directly in Boxalino (as described https://boxalino-internal.atlassian.net/wiki/spaces/DOC/pages/2606792705/Boxalino+Data+Integration+DI-SAAS+-+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), please continue with the Data Transformation step.

2. Data Transformation

Once the data is loaded in GCS and BQ, it is time to transform it in the necessary data structure.

...

test stage3

Endpoint

production

https://boxalino-di-saas-krceabfwya-ew.a.run.app

1

 Action

/

https://boxalino-di-transformer-stage-krceabfwya-ew.a.run.app

2

Action

/di

sync

2

Method

POST

43

Body

https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

54

Headers

Authorization

Basic base64<<DATASYNC API key : DATASYNC API Secret>>

note: use the API credentials from your Boxalino account that have the ADMIN role assigned

65

 

Content-Type

application/json

76

 

client

account name

87

 

mode

data sync mode: F for full, D for delta

98

 

type

product, user, content, user_content, order.

if left empty - it will check for all tables with the given tm

109

 

tm

(optional) time , in format: YmdHis;

Note

if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ) - must re-use the tm from the load step

technical: used to identify the documents version

1110

 

ts

(optional) timestamp, must be millisecond based in UNIX timestamp

1211

 

dev

(optional) use this to mark the content for the dev data index

Expand
titleSample

This is a sample of a triggered full product sync (minimal data):

Code Block
curl "https://boxalino-di-transformersaas-krceabfwya-ew.a.run.app/disync" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"gcs\",\"options\":{\"source\":{\"bucket\":\"my-account-bucket\",\"pattern\":\"boxalino/product/__NODASHDATE__/\"}},\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"write_disposition\":\"WRITE_TRUNCATE\"},\"doc_product\":{\"entity\":[{\"source\":\"product.jsonl\",\"autodetect\":0,\"schema\":\"ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME\"}],\"product_relations\":[{\"source\":\"crosssell.jsonl\",\"name\":\"crosssell\"}],\"link\":[{\"source\":\"urlrecord.jsonl\"}]}}},\"di\":{\"configuration\":{\"languages\":[\"de\",\"fr\"],\"currencies\":[\"CHF\"],\"mapping\":{\"languagesCountryCode\":{\"de\":\"CH_de\",\"fr\":\"CH_fr\"}},\"default\":{\"language\":\"de\",\"currency\":\"CHF\"}}}}]" \
  -H "Content-Type: application/json" \
  -H "account: <boxalino-account-name>" \
  -H "mode: F" \
  -H "tm: 202303112000" \
  -H "type: product" \
  -H "Authorization: <base64_encode(api_key:api_secret)>" 

The request above created the following resources:

  1. GCS (raw data, as migrated from the connector)

    1. gs://prod_rtux-data-integration_<account>/product/202303112000/F_product.jsonl

    2. gs://prod_rtux-data-integration_<account>/product/202303112000/F_crossell.jsonl

    3. gs://prod_rtux-data-integration_<account>/product/202303112000/F_urlrecord.jsonl

  2. BQ (the T dataset - raw data as loaded from BQ)

    1. rtux-data-integration.<account>_T.202303112000-doc_product-entity

    2. rtux-data-integration.<account>_T.202303112000-doc_product-product_relations-crossell

    3. rtux-data-integration.<account>_T.202303112000-doc_product-link

  3. GCS (transformed doc_X JSONL)

    1. gs://prod_rtux-data-integration_<account>/doc_product_F_202303112000.jsonl

    2. gs://prod_rtux-data-integration_<account>/doc_language_F_202303112000.jsonl

  4. BQ (the F dataset - transformed data to doc_X data structure)

    1. rtux-data-integration.<account>_F.doc_product_F_202303112000

    2. rtux-data-integration.<account>_F.doc_language_F_202303112000

Expand
titleSample request with Boxalino connector (ex: data is already available in Boxalino GCP project, in client`s GCS bucket)
Code Block
[
  {
    "connector": {
      "type": "boxalino",
       "load": {
        "options": {
          "format": "NEWLINE_DELIMITED_JSON",
          "autodetect": true,
          "schema": "",
          "skipRows": 0,
          "max_bad_records": 0,
          "write_disposition": "WRITE_TRUNCATE"
        },
        "doc_product": {
          "entity": [
            {
              "source": "product.jsonl",
              "autodetect": 0,
              "schema": "ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME"
            }
          ],
          "product_relations": [
            {
              "source": "crosssell.jsonl",
              "name": "crosssell"
            }
          ],
          "link": [
            {
              "source": "urlrecord.jsonl"
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]

...

 }
  }
]

Loading content to Boxalino GCS (connector: boxalino)

Note

By following these steps, you can push your data (JSONL or CSV) directly in your client`s GCS bucket in the Boxalino scope. After all data has been loaded in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino.

There are 2 available flows, based on the size of your data:

...

Expand
titleSample

For example, the request bellow would create a gs://prod_rtux-data-integration_<account>/product/20230301161554/T_doc_language.json in your clients` GCS bucket. This is a necessary data for the type:product integration.

note
Code Block
curl --connect-timeout 30 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "doc: doc_language.json" \
  -d "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}" \
  -H "Authorization: Basic <encode of the account>" 
Note

The same tm value must be used across your other requests. This identifies the timestamp of your computation process.

...

Panel
panelIconId1f44c
panelIcon:ok_hand:
panelIconText👌
bgColor#FFEBE6

The use of the header chunk is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunkproperty for each /transformer/load/url request.

Tip

Step #1 - #2 must be repeated for every file that is required to be added for the given process (same tm, mode & type)

Only after the full content is available in GCS, you can move on to step#3.

Tip

After all required documents (doc) for the given type data sync (ex: product, order, etc) have been made available in GCS, the DI-SAAS request can be called . https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino.

Technical notes

  1. Every request to our nodes returns a response (200 or 400). This can be used internally for testing purposes as well.

  2. It is possible to review the STATUS of the requests at any given time Status Review

...