Tip |
---|
At Boxalino we are strong advisors for ELT flows (vs ETL). The major benefit is speed, transparency and maintenence: data is loaded directly into a destination system (BQ), and transformed in-parallel (Dataform). |
...
Data connector & Data load
downloads the files from remote source to GCS
loads the raw content to the transform dataset (<account>_T)*
*if
options
->load_dataset
&load_gcp_project
values are set - the loaded raw content will be available inload_gcp_project.load_dataset
instead ofrtux-data-integration.<account>_T
Data transformation (with logic/SQL provided by the client)
Steps #1-#4 from the Generic Flow
DI automation
...
Code Block |
---|
[ { "connector": { "type": "sftp|gcs|plentymarket|plytix|boxalino", "options": { // specific for each connector type }, "load": { "options": { // for loading the data in BQ (BQ parameters) "format": "CSV|NEWLINE_DELIMITED_JSON", "field_delimiter": ";", "autodetect": true, "schema": "", "skipRows": 1(CSV)|0(JSONL), "max_bad_records": 0, "quote": "", "write_disposition": "WRITE_TRUNCATE", "create_disposition": "", "encoding": "", "allow_quoted_newlines" : }1, "allow_jagged_rows" : 0 }, "doc_X": { "property_node_from_doc_data_structure": [ { "source": "file___NODASHDATE__.jsonl", "name": "<used to create suffix for BQ table>", "schema": "" } ] } } }, "di": { "configuration": { "languages": [ "de", "fr" ], "currencies": [ "CHF" ], "mapping": { "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"} }, "default": { "language": "de", "currency": "CHF" } } } } ] |
...
The relevant data sources are available in .csv or JSONL (prefered) format
The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl)
this will help automating the integration
The files required to update a certain data type (ex: product, order, etc) are available in the same path
The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access
for GCS/ GCP sources: access being shared to Boxalino`s Service Account
boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com
must be given https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Access-sharing-(BQ-and-GCS)for AWS / SFTP : the client`s own AWS/SFTP environment with a Boxalino user & credentials
Expand | ||
---|---|---|
| ||
|
...
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
It is possible to configure dynamic The supported variations are: For example, if there are 3 chunks for |
...
The client has access to a GCP project
The client will create a Dataform repository https://cloud.google.com/dataform/docs/repositories
The client has access to a GitHub or GitLab repository (to connect it to the Dataform repository) https://cloud.google.com/dataform/docs/connect-repository
The client has given “Dataform Admin” permission to Boxalino Service Account
boxalino-dataform@rtux-data-integration.iam.gserviceaccount.com
The DI-SAAS SYNC request
...
it will load the files to BQ in a T (transform) dataset (ex:
SELECT * FROM rtux-data-integration.<account>_T.<tm>_<doc_X>_<property>_<name>
)it will run the transform flow (step #1-#3 from Data Integration )
generating each doc_X JSONL content
loading the doc_X JSONL in your GCS bucket
loading the doc_X JSONL to BQ
it will run the SYNC request Sync Request for the process
loads the content in core tables in BQ
loads the content in the respective data index (for products)
REQUEST DEFINITION
As an integrator, please create the bellow request to the provided endpoint.
Info |
---|
There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino. |
...
Endpoint
...
production
...
https://boxalino-di-saas-krceabfwya-ew.a.run.app
...
Action
...
/sync
...
Method
...
POST
...
Body
...
...
Headers
...
Authorization
...
Basic base64<<DATASYNC API key : DATASYNC API Secret>>
note: use the API credentials from your Boxalino account that have the ADMIN role assigned
...
...
Content-Type
...
application/json
...
...
client
...
account name
...
...
mode
...
data sync mode: F for full, D for delta, E for enrichments
...
...
type
...
product, user, content, user_content, order.
if left empty - it will check for all tables with the given tm
...
...
tm
(optional) time , in format: YmdHis;
...
When using dataform
for transforming the exported data (your custom CSV/JSONL files) into Boxalino Data Structure, the JSON body (for the SYNC REQUEST) has the following information (next to connector
and di
):
Code Block |
---|
"transform": {
"vars": {
"var1": "value"
},
"tags": [
"order"
],
"dataform": {
"project": "rtux-data-integration",
"location": "europe-west1",
"repository": "boxalino-di-saas-elt",
"workspace": "dataform"
}
} |
Note |
---|
The values for |
The DI-SAAS SYNC request
The DI request will use the same headers (client, tm, mode, type, authorization)and a JSON request body that would provide mapping details between the loaded .jsonl files and data meaning.
it will load the files to BQ in a T (transform) dataset (ex:
SELECT * FROM rtux-data-integration.<account>_T.<tm>_<doc_X>_<property>_<name>
)it will run the transform flow (step #1-#3 from Data Integration )
generating each doc_X JSONL content
loading the doc_X JSONL in your GCS bucket
loading the doc_X JSONL to BQ
it will run the SYNC request Sync Request for the process
loads the content in core tables in BQ
loads the content in the respective data index (for products)
REQUEST DEFINITION
As an integrator, please create the bellow request to the provided endpoint.
Info |
---|
There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino. |
Code Block |
---|
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync" \
-X POST \
-d "[{\"connector\":{\"type\":\"gcs\",\"options\":{\"source\":{\"bucket\":\"my-account-bucket\",\"pattern\":\"boxalino/product/__NODASHDATE__/\"}},\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"write_disposition\":\"WRITE_TRUNCATE\"},\"doc_product\":{\"entity\":[{\"source\":\"product.jsonl\",\"autodetect\":0,\"schema\":\"ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME\"}],\"product_relations\":[{\"source\":\"crosssell.jsonl\",\"name\":\"crosssell\"}],\"link\":[{\"source\":\"urlrecord.jsonl\"}]}}},\"di\":{\"configuration\":{\"languages\":[\"de\",\"fr\"],\"currencies\":[\"CHF\"],\"mapping\":{\"languagesCountryCode\":{\"de\":\"CH_de\",\"fr\":\"CH_fr\"}},\"default\":{\"language\":\"de\",\"currency\":\"CHF\"}}}}]" \
-H "Content-Type: application/json" \
-H "account: <boxalino-account-name>" \
-H "mode: F" \
-H "tm: 202303112000" \
-H "type: product" \
-H "Authorization: <base64_encode(api_key:api_secret)>" |
The request above created the following resources:
...
GCS (raw data, as migrated from the connector)
gs://prod_rtux-data-integration_<account>/product/202303112000/F_product.jsonl
gs://prod_rtux-data-integration_<account>/product/202303112000/F_crossell.jsonl
gs://prod_rtux-data-integration_<account>/product/202303112000/F_urlrecord.jsonl
BQ (the T dataset - raw data as loaded from BQ)
...
Endpoint | production | ||
---|---|---|---|
1 | Action | /sync | |
2 | Method | POST | |
3 | Body | https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+ |
technical: used to identify the documents version
ts
(optional) timestamp, must be millisecond based in UNIX timestamp
dev
(optional) use this to mark the content for the dev data index
...
title | Sample |
---|
This is a sample of a triggered full product sync (minimal data):
4 | Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>> note: use the API credentials from your Boxalino account that have the ADMIN role assigned | ||
---|---|---|---|---|---|
5 |
| Content-Type | application/json | ||
6 |
| client | account name | ||
7 |
| mode | data sync mode: F for full, D for delta, E for enrichments | ||
8 |
| type | product, user, content, user_content, order. if left empty - it will check for all tables with the given tm | ||
9 |
| tm | (optional) time , in format: YmdHis;
technical: used to identify the documents version | ||
10 |
| ts | (optional) timestamp, must be millisecond based in UNIX timestamp | ||
11 |
| dev | (optional) use this to mark the content for the dev data index | ||
12 |
| (optional) if true, the data will only be loaded to BQ | |||
13 |
| (optional) if true, the data will only be transformed (ex: dataform tag execution) |
Expand | ||
---|---|---|
| ||
This is a sample of a triggered full product sync (minimal data):
The request above created the following resources:
|
Expand | ||
---|---|---|
| ||
|
Loading content to Boxalino GCS (connector: boxalino)
Note |
---|
By following these steps, you can push your data (JSONL or CSV) directly in your client`s GCS bucket in the Boxalino scope. After all data has been loaded in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino. |
There are 2 available flows, based on the size of your data:
The content is exported as the
body
of yourPOST
requestThe content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )
Option #1 is recommended for data volume less than 32MB.
Option #2 is allowed for any data size.
The described steps can be followed if you want to export your data in Boxalino GCP project, in your clients` account.
The LOAD request will create a GCS file in your project`s GCS buckte:
|
The DI-SAAS CORE request
Use this flow if the purpose is to expose data -as is- from your system to Boxalino. The data will be available in the clients` _core
dataset in the Boxalino ecosystem.
In order to achieve this, there are 2 requests to be made:
load data to GCS https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino)
load data from GCS to
<client>_core.<destination>
table
Note |
---|
The data payload (CSV or JSONL) must have a field to be used as primary field (ex: id). |
REQUEST DEFINITION
As an integrator, please create the bellow request to the provided endpoint.
Info |
---|
There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino. |
Endpoint | production | ||||
---|---|---|---|---|---|
1 | Action | /core | |||
2 | Method | POST | |||
3 | Body | ||||
4 | Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>> note: use the API credentials from your Boxalino account that have the ADMIN role assigned | ||
5 |
| Content-Type | application/json | ||
6 |
| client | account name | ||
7 |
| mode | data sync mode: F for full, D for delta, E for enrichments | ||
8 |
| type | product, user, content, user_content, order. if left empty - it will check for all tables with the given tm | ||
9 |
| tm | (optional) time , in format: YmdHis;
technical: used to identify the documents version | ||
10 |
| dev | (optional) use this to mark the content for the dev data index | ||
11 |
Expand | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
For the integration bellow, the rti JSON payload is loaded in the client`s dataset.
|
Loading content to GCS (connector: boxalino, gcs)
Note |
---|
By following these steps, you can push your data (JSONL or CSV) directly in your client`s GCS bucket in the Boxalino scope or in a GCS bucket in a private GCP project (access share required). After all data has been loaded in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino or connector → type: gcs |
There are 2 available flows, based on the size of your data:
The content is exported as the
body
of yourPOST
requestThe content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )
Option #1 is recommended for data volume less than 32MB.
Option #2 is allowed for any data size.
The described steps can be followed if you want to export your data in Boxalino GCP project, in your clients` account.
The LOAD request will create a GCS file in your project`s GCS buckte: gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<doc>
there is no content validation at this step
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
When loading data to be transformed, make sure to use the |
A. Loading content less than 32 MB
Files under 32MB can be loaded directly as REQUEST BODY CONTENT in Boxalino`s GCS.
The sample request bellow would create the languagefeed_20220317.csv in your GCS bucket gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<filename>
Code Block |
---|
curl "https://boxalino-di-process-krceabfwya-ew.a.run.app/saas/load" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: T" \
-H "gcp_project: <private-client-GCP-project>" \
-H "gcs_bucket: <private-client-GCS-bucket>" \
-H "doc: <filename>" \
-d "<JSONL>" \
-H "Authorization: Basic <encode of the account>" |
Tip |
---|
Step #1 must be repeated for every file that is required to be added for the given process (same tm, mode & type) |
Expand | ||
---|---|---|
| ||
For example, the request bellow would create a |
...
|
...
|
...
there is no content validation at this step
A. Loading content less than 32 MB
Files under 32MB can be loaded directly as REQUEST BODY CONTENT in Boxalino`s GCS.
The sample request bellow would create the languagefeed_20220317.csv in your GCS bucket gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<filename>
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
curl
|
Tip |
---|
Step #1 must be repeated for every file that is required to be added for the given process (same tm, mode & type) |
Expand | ||
---|---|---|
| ||
Note | ||
The same |
Warning |
---|
If the service response is an error like: |
B. Loading undefined data size
For content over 32MB, we provide an endpoint to access a Signed GCS Url that would put all your streamed content into a file (currently there is no defined file size limit in GCS)
Read more about Google Cloud Signed URL https://cloud.google.com/storage/docs/access-control/signed-urls (response samples, uses, etc)
1. Make a request for public upload link
This is the generic POST request:
Code Block |
---|
curl --connect-timeout 3060 --max-time 300 "https://boxalino-di-stage process-krceabfwya-ew.a.run.app/transformer saas/load" \ -X POST \ -H "Content-Type: application/json" \ -H "client: <account>" \ -H "tm: 20230301161554" \ -H "type: product /url" \ -H X"mode: F" POST \ -H "doc Content-Type:doc_language. application/json" \ -d H "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0} client: <account>" \ -H "dev: true|false" \ -H "tm: YYYYmmddHHiiss" \ -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \ -H "mode: T" \ -H "gcp_project: <private-client-GCP-project>" \ -H "gcs_bucket: <private-client-GCS-bucket>" \ -H "chunk: <id>" \ -H "Authorization doc:Basic <filename>"<encode \of the account>" |
Note |
---|
The same |
Warning |
---|
If the service response is an error like: |
B. Loading undefined data size
For content over 32MB, we provide an endpoint to access a Signed GCS Url that would put all your streamed content into a file (currently there is no defined file size limit in GCS)
Read more about Google Cloud Signed URL https://cloud.google.com/storage/docs/access-control/signed-urls (response samples, uses, etc)
1. Make a request for public upload link
...
-H "Authorization: Basic <encode of the account>" |
The response will be an upload link that can only be used in order to create the file in the clients` GCS bucket. The link is valid for 30 minutes.
Expand | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Lets say, we need a public link to upload a document called product-entity.jsonl. The following request can be used to generate the link:
|
2. Upload the content on the public link
Code Block |
---|
curl --connect-timeout 60 --timeout 0 <GCS-signed-url> \ -X PUT \ -H "docContent-Type: <filename>application/octet-stream" \ -d "<YOUR DOCUMENT JSONL -H "Authorization: Basic <encode of the account>" |
The response will be an upload link that can only be used in order to create the file in the clients` GCS bucket. The link is valid for 30 minutes.
Expand | ||
---|---|---|
| ||
Lets say, we need a public link to upload a document called product-entity.jsonl. The following request can be used to generate the link:
|
2. Upload the content on the public link
Code Block |
---|
curl --connect-timeout 60 --timeout 0 <GCS-signed-url> \
-X PUT \
-H "Content-Type: application/octet-stream" \
-d "<YOUR DOCUMENT JSONL CONTENT (STREAM)>" |
...
panelIconId | 1f44c |
---|---|
panelIcon | :ok_hand: |
panelIconText | 👌 |
bgColor | #FFEBE6 |
...
Tip |
---|
Step #1 - #2 must be repeated for every file that is required to be added for the given process (same tm, mode & type) Only after all the files are available in GCS, you can move on to step#3. |
Tip |
---|
After all required documents (doc) for the given |
Technical notes
...
Every request to our nodes returns a response (200 or 400). This can be used internally for testing purposes as well.
...
CONTENT (STREAM)>" |
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
The use of the header |
Tip |
---|
Step #1 - #2 must be repeated for every file that is required to be added for the given process (same tm, mode & type) Only after all the files are available in GCS, you can move on to step#3. |
Tip |
---|
After all required documents (doc) for the given |
Technical notes
Every request to our nodes returns a response (200 or 400). This can be used internally for testing purposes as well.
It is possible to review the STATUS of the requests at any given time Status Review
Access sharing (BQ and GCS)
The DI-SAAS flow can be used to access privately owned files in GCS and to update/create privately owned BigQuery resources as well.
Tip |
---|
If your project already shared access to your Boxalino users group, |
Google Cloud Storage GCS
If your connector is gcs and you want to push and read data to a privately owned resource, share access to boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com
or bi-with-ai.<client>@boxalino.com
(your Boxalino users group)
Required permissions:
storage.objects.get
,storage.buckets.get
storage.objects.list
This is achieved through Storage Legacy Bucket Reader & Storage Legacy Object Reader roles.
The gcs bucket used in the connector definition, must be located in europe-west6 or eu zone.
If you use the Boxalino services to load data to your private GCS bucket (<endpoint>/saas/load or <endpoint>/saas/load/url ), you must also give WRITE permissions to the SA.
...
BigQuery Access (BQ)
If the goal is to load the raw GCS file to a privately owned BQ resource (defined as: load
→ options
→ load_dataset
), you have to ensure that the dataset EXISTS and it is using location EU.
We recommend to create a dedicated dataset for this purpose. To avoid BQ storage costs, we recommend to set an expiration date for the tables (ex: 7 days)
The same Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com
or bi-with-ai.<client>@boxalino.com
(your Boxalino users group) must have read/write access to the dataset where the GCS data is loaded. For simplicity, BigQuery Admin is suggested.
Note |
---|
Share access only on the resources that Boxalino is allowed to access. For example, if your integrations only work on a temporary dataset - |
...