Tip |
---|
At Boxalino we are strong advocates of ELT flows (vs. ETL). The major benefits are speed, transparency and maintenance: data is loaded directly into the destination system (BigQuery) and transformed in parallel (Dataform). |
...
Code Block |
---|
[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|plytix|boxalino|shopify|wordpress",
      "options": {
        // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1 (CSV) | 0 (JSONL),
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": "",
          "allow_quoted_newlines": 1,
          "allow_jagged_rows": 0
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": ["de", "fr"],
        "currencies": ["CHF"],
        "mapping": {
          "languagesCountryCode": {"de": "CH_de", "fr": "CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]
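As a sketch only, such a request body can be assembled in plain Python before serialization. The connector type, the file name and the doc_product/entity nodes below are illustrative placeholders, not values confirmed by this page:

```python
import json

# Illustrative DI-SAAS connector body: connector type, file name and the
# doc_product/entity nodes are placeholder assumptions, not values from this page.
body = [{
    "connector": {
        "type": "gcs",
        "options": {},  # connector-specific options
        "load": {
            "options": {
                "format": "CSV",
                "field_delimiter": ";",
                "autodetect": True,
                "skipRows": 1,
                "write_disposition": "WRITE_TRUNCATE",
            },
            "doc_product": {
                "entity": [
                    {"source": "product___NODASHDATE__.csv", "name": "product", "schema": ""}
                ]
            },
        },
    },
    "di": {
        "configuration": {
            "languages": ["de", "fr"],
            "currencies": ["CHF"],
            "default": {"language": "de", "currency": "CHF"},
        }
    },
}]

serialized = json.dumps(body, indent=2)
print(serialized.splitlines()[0])  # "["
```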
...
The DI-SAAS SYNC request
The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that maps the loaded .jsonl files to their data meaning.
...
Info |
---|
There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino. |
| Endpoint | production | |
|---|---|---|
| Action | /sync | |
| Method | POST | |
| Body | | |
| Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>> note: use the API credentials from your Boxalino account that have the ADMIN role assigned |
| | Content-Type | application/json |
| | client | account name |
| | mode | data sync mode: F for full, D for delta, E for enrichments |
| | type | product, user, content, user_content, order; if left empty, all tables with the given tm are checked |
| | tm | (optional) time, in format YmdHis; technical: used to identify the document version |
| | ts | (optional) timestamp, must be a millisecond-based UNIX timestamp |
| | dev | (optional) use this to mark the content for the dev data index |
| | | (optional) if true, the data will only be loaded to BQ |
| | | (optional) if true, the data will only be transformed (ex: dataform tag execution) |
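A minimal, stdlib-only sketch of building this header set in Python (the key, secret and account name are placeholders; sending the request is left to your HTTP client of choice):

```python
import base64
import time

def sync_headers(api_key, api_secret, client, mode="F", doc_type="product", tm=""):
    """Build the DI-SAAS /sync header set described in the table above."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "client": client,
        "mode": mode,        # F = full, D = delta, E = enrichments
        "type": doc_type,    # product, user, content, user_content, order
        # YmdHis timestamp identifying the document version
        "tm": tm or time.strftime("%Y%m%d%H%M%S"),
    }

headers = sync_headers("key", "secret", "my-account")
print(headers["Authorization"])  # Basic a2V5OnNlY3JldA==
```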
...
Info |
---|
There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino. |
| Endpoint | production | |
|---|---|---|
| Action | /core | |
| Method | POST | |
| Body | | |
| Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>> note: use the API credentials from your Boxalino account that have the ADMIN role assigned |
| | Content-Type | application/json |
| | client | account name |
| | mode | data sync mode: F for full, D for delta, E for enrichments |
| | type | product, user, content, user_content, order; if left empty, all tables with the given tm are checked |
| | tm | (optional) time, in format YmdHis; technical: used to identify the document version |
| | dev | (optional) use this to mark the content for the dev data index |
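Assuming Python's urllib, a /core call could be prepared as below; the host name is a placeholder (the production endpoint is not shown in this excerpt), and the request is built but deliberately not sent:

```python
import base64
import json
import urllib.request

# Placeholder host: the production endpoint is not shown in this excerpt.
ENDPOINT = "https://di-saas.example.com/core"

token = base64.b64encode(b"API_KEY:API_SECRET").decode()
payload = json.dumps([{"connector": {"type": "gcs", "options": {}}}]).encode()

req = urllib.request.Request(
    ENDPOINT,
    data=payload,
    method="POST",
    headers={
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "client": "my-account",
        "mode": "F",
        "type": "product",
        "tm": "20240101093000",
    },
)
# urllib.request.urlopen(req) would actually send it; omitted here.
print(req.get_method(), req.full_url)
```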
Expand |
---|
For the integration below, the rti JSON payload is loaded in the client's dataset. |
...
Panel |
---|
The use of the header |
Tip |
---|
Steps #1 and #2 must be repeated for every file that has to be added for the given process (same tm, mode & type). Only after all the files are available in GCS can you move on to step #3. |
Tip |
---|
After all required documents (doc) for the given |
...
The DI-SAAS flow can also be used to access privately owned files in GCS and to update or create privately owned BigQuery resources.
Tip |
---|
The Service Account that manages this flow ( |
Google Cloud Storage (GCS)
If your connector is gcs and you want to push data to and read it from a privately owned resource, share access with boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group).
Required permissions (https://cloud.google.com/storage/docs/access-control/iam-roles ):
storage.buckets.get
storage.objects.get
storage.objects.list
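As one illustrative option (an assumption about role mapping, not a prescription from this page): roles/storage.objectViewer covers storage.objects.get and storage.objects.list, and roles/storage.legacyBucketReader adds storage.buckets.get. The snippet below builds the policy bindings in the shape accepted by the GCS buckets.setIamPolicy JSON API, with a placeholder bucket-level grant:

```python
import json

# Service account taken from this page; the role mapping is an assumption:
# objectViewer = storage.objects.get/list, legacyBucketReader = storage.buckets.get.
SA = "serviceAccount:boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com"

# Bindings in the shape used by the GCS buckets.setIamPolicy JSON API.
policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer", "members": [SA]},
        {"role": "roles/storage.legacyBucketReader", "members": [SA]},
    ]
}
print(json.dumps(policy, indent=2))
```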
...
If the goal is to load the raw GCS file into a privately owned BQ resource (defined as: load
→ options
→ load_dataset_stage
), you have to ensure that the dataset exists and uses location EU.
We recommend creating a dedicated dataset for this purpose. To avoid BQ storage costs, we recommend setting an expiration date for the tables (ex: 7 days).
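For illustration, this is roughly the dataset resource you would send to the BigQuery REST API (datasets.insert) to get an EU dataset with a 7-day default table expiration; the project and dataset ids are placeholders:

```python
import json

# Dataset resource as accepted by the BigQuery REST API (datasets.insert).
# Project/dataset ids are placeholders; 7 days = 604800000 ms.
SEVEN_DAYS_MS = 7 * 24 * 3600 * 1000

dataset = {
    "datasetReference": {"projectId": "my-project", "datasetId": "boxalino_di_stage"},
    "location": "EU",
    "defaultTableExpirationMs": str(SEVEN_DAYS_MS),  # the API expects a string
}
print(json.dumps(dataset))
```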
The same Service Account, boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group), must have read/write access to the dataset where the GCS data is loaded. For simplicity, the BigQuery Admin role is suggested.
Note |
---|
Share access only to the resources that Boxalino is allowed to access. For example, if your integrations only work on a temporary dataset - |
...