Tip |
---|
At Boxalino we strongly advocate ELT flows (vs. ETL). The major benefits are speed, transparency and maintenance: data is loaded directly into a destination system (BQ) and transformed in parallel (Dataform). |
...
Code Block |
---|
[
{
"connector": {
"type": "sftp|gcs|plentymarket|plytix|boxalino",
"options": {
// specific for each connector type
},
"load": {
"options": {
// for loading the data in BQ (BQ parameters)
"format": "CSV|NEWLINE_DELIMITED_JSON",
"field_delimiter": ";",
"autodetect": true,
"schema": "",
"skipRows": 1(CSV)|0(JSONL),
"max_bad_records": 0,
"quote": "",
"write_disposition": "WRITE_TRUNCATE",
"create_disposition": "",
"encoding": ""
},
"doc_X": {
"property_node_from_doc_data_structure": [
{
"source": "file___NODASHDATE__.jsonl",
"name": "<used to create suffix for BQ table>",
"schema": ""
}
]
}
}
},
"di": {
"configuration": {
"languages": [
"de",
"fr"
],
"currencies": [
"CHF"
],
"mapping": {
"languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
},
"default": {
"language": "de",
"currency": "CHF"
}
}
}
}
] |
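The format-dependent values in the template above (`"format"` and `"skipRows"`) can be derived rather than typed by hand. The following is a minimal Python sketch, not part of the Boxalino tooling: the helper name `build_load_options` is hypothetical, and leaving out `field_delimiter` for JSONL is an assumption based on that option only being meaningful for CSV.

```python
import json

# Rows to skip per format, as stated in the template: 1 for CSV, 0 for JSONL.
SKIP_ROWS = {"CSV": 1, "NEWLINE_DELIMITED_JSON": 0}


def build_load_options(file_format, field_delimiter=";"):
    """Return a "load.options" node for one of the two formats in the template.

    Hypothetical helper for illustration; the keys are copied from the template.
    """
    if file_format not in SKIP_ROWS:
        raise ValueError(f"unsupported format: {file_format!r}")
    options = {
        "format": file_format,
        "autodetect": True,
        "skipRows": SKIP_ROWS[file_format],
        "max_bad_records": 0,
        "write_disposition": "WRITE_TRUNCATE",
    }
    if file_format == "CSV":
        # Assumption: the delimiter is only sent for CSV sources.
        options["field_delimiter"] = field_delimiter
    return options


if __name__ == "__main__":
    print(json.dumps(build_load_options("CSV"), indent=2))
```

Keeping the per-format values in one lookup table avoids mismatches such as a JSONL load that skips its first record.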
...
1. The relevant data sources are available in .csv or JSONL (preferred) format.
2. The files have a timestamp in the file name or in the access path (ex: product_20221206.jsonl).
   This helps automate the integration.
3. The files required to update a certain data type (ex: product, order, etc.) are available in the same path.
4. The files are available on an endpoint (SFTP, GCS, public 3rd-party API, public URL) to which Boxalino has access.
   For GCS / GCP sources: access is shared with Boxalino's Service Account
   boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com
   For AWS / SFTP: the client's own AWS/SFTP environment must provide a Boxalino user & credentials.
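The file-naming requirement above can be met on the export side with a few lines. This is a minimal Python sketch under stated assumptions: the helper names `export_filename` and `to_jsonl` are hypothetical, and the date layout (`YYYYMMDD`, no dashes) is inferred from the `product_20221206.jsonl` example.

```python
import json
from datetime import date


def export_filename(data_type, day):
    """Build a name such as product_20221206.jsonl (date without dashes)."""
    return f"{data_type}_{day.strftime('%Y%m%d')}.jsonl"


def to_jsonl(rows):
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


if __name__ == "__main__":
    name = export_filename("product", date(2022, 12, 6))
    with open(name, "w", encoding="utf-8") as handle:
        handle.write(to_jsonl([{"id": "1", "title": "Sample"}]))
    print(name)
```

Using the same data-type prefix for every file of a given type keeps them grouped under one path.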
...
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
The requirements specified above (#1-#4) are necessary if the data is accessed from a remote scope (outside Boxalino). If your integration exports the data directly to Boxalino (as described at https://boxalino-internal.atlassian.net/wiki/spaces/DOC/pages/2606792705/Boxalino+Data+Integration+DI-SAAS+-+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), please continue with the Data Transformation step. |
2. Data Transformation
...
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
It is possible to configure dynamic The supported variations are: For example, if there are 3 chunks for |
2. Data Transformation
Once the data is loaded into GCS and BQ, it must be transformed into the required data structure.
...
The client has access to a GCP project
The client will create a Dataform repository https://cloud.google.com/dataform/docs/repositories
The client has access to a GitHub or GitLab repository (to connect it to the Dataform repository) https://cloud.google.com/dataform/docs/connect-repository
The client has given the “Dataform Admin” permission to the Boxalino Service Account
boxalino-dataform@rtux-data-integration.iam.gserviceaccount.com
The DI-SAAS SYNC request
The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that provides the mapping between the loaded .jsonl files and the meaning of the data.
...
Endpoint | production | |
---|---|---|
Action | /sync | |
Method | POST | |
Body | | |
Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>> note: use the API credentials from your Boxalino account that have the ADMIN role assigned |
Headers | Content-Type | application/json |
Headers | client | account name |
Headers | mode | data sync mode: F for full, D for delta, E for enrichments |
Headers | type | product, user, content, user_content, order; if left empty, all tables with the given tm are checked |
Headers | tm | (optional) time, in format YmdHis; technical: used to identify the document version |
Headers | ts | (optional) UNIX timestamp in milliseconds |
Headers | dev | (optional) use this to mark the content for the dev data index |
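The headers in the table above can be assembled as follows. This is a minimal Python sketch, not an official client: the function name `build_sync_headers` is hypothetical, the credentials are assumed to be joined as `key:secret` (standard HTTP Basic form, without the spaces shown in the table), and UTC is assumed for `tm` and `ts` because the page does not state a timezone. The production endpoint host is not given here, so the sketch stops at building the headers.

```python
import base64
from datetime import datetime, timezone

VALID_MODES = {"F", "D", "E"}  # full, delta, enrichments


def build_sync_headers(api_key, api_secret, client, mode, data_type="", now=None):
    """Build the header set for a POST /sync request (illustrative only)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    now = now or datetime.now(timezone.utc)  # assumption: UTC
    token = base64.b64encode(f"{api_key}:{api_secret}".encode("utf-8")).decode("ascii")
    return {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "client": client,
        "mode": mode,
        "type": data_type,  # empty: all tables with the given tm are checked
        "tm": now.strftime("%Y%m%d%H%M%S"),  # YmdHis
        "ts": str(int(now.timestamp() * 1000)),  # UNIX time in milliseconds
    }


if __name__ == "__main__":
    headers = build_sync_headers("my-key", "my-secret", "my-account", "F", "product")
    for name, value in headers.items():
        print(f"{name}: {value}")
```

The same `tm`, `mode` and `type` values used when loading the files should be reused here, since the request identifies the documents by them.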
...
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
The use of the header |
Tip |
---|
Steps #1 and #2 must be repeated for every file required by the given process (same tm, mode & type). Only after all the files are available in GCS can you move on to step #3. |
Tip |
---|
After all required documents (doc) for the given |
...