How to orient your Company towards Data Science?

Yes, we want to do Data Science!

Have you decided to begin your journey towards Data Science, and are you considering (or have you already decided on) contracting or hiring a Data Scientist? Great! But how can you get started? What results can you expect and by when? What should the Data Scientist’s first priorities be? What kind of instructions are needed to get maximum results?

First, it is important to remind ourselves what Data Science actually is: an interdisciplinary field.

Data Science combines IT, Math and Business.

The first questions to ask yourself are:

  • Does my Data Scientist have what’s needed set up by IT? (If your e-shop is already driven by Boxalino, the answer is a quick YES.)

  • How can we share the Domains/Business Knowledge quickly so that the Data Scientist understands enough and starts efficiently?

 

But let’s start at the beginning: how to select the right Data Scientist.

What are the different types of Data Scientists?

Here are the differences between the roles of a data engineer, ML engineer, data scientist, and data analyst:

  1. Data Engineer:

    • Focus: Data infrastructure and pipelines.

    • Responsibilities:

      • Designing, building, and maintaining data systems and databases.

      • Developing and implementing data pipelines to efficiently process and transform data.

      • Ensuring data quality, reliability, and security.

      • Collaborating with other teams to understand data requirements and optimize data flow.

    • Skills:

      • Strong programming skills (e.g., Python, SQL).

      • Knowledge of data warehousing and ETL (Extract, Transform, Load) processes.

      • Experience with big data technologies (e.g., Hadoop, Spark).

      • Familiarity with data modeling and database design.

      • Understanding of cloud platforms and distributed computing.

  2. ML Engineer (Machine Learning Engineer):

    • Focus: Deploying and maintaining machine learning models.

    • Responsibilities:

      • Developing and implementing machine learning models and algorithms.

      • Integrating models into production systems and applications.

      • Optimizing and scaling models for performance and efficiency.

      • Collaborating with data scientists and software engineers to bridge the gap between research and production.

    • Skills:

      • Strong programming skills (e.g., Python, R).

      • Expertise in machine learning frameworks and libraries (e.g., TensorFlow, PyTorch, scikit-learn).

      • Knowledge of software engineering principles and best practices.

      • Understanding of data preprocessing and feature engineering.

      • Familiarity with cloud platforms and distributed computing.

  3. Data Scientist:

    • Focus: Extracting insights and building predictive models.

    • Responsibilities:

      • Analyzing complex datasets to identify patterns, trends, and insights.

      • Developing statistical models and machine learning algorithms.

      • Conducting experiments and A/B tests to drive data-driven decision-making.

      • Communicating findings to stakeholders through visualizations and reports.

    • Skills:

      • Strong background in mathematics, statistics, and data analysis.

      • Proficiency in programming languages (e.g., Python, R, MATLAB).

      • Experience with statistical modeling and machine learning techniques.

      • Knowledge of data visualization tools and storytelling.

      • Domain expertise and strong business acumen.

  4. Data Analyst:

    • Focus: Analyzing and interpreting data to support decision-making.

    • Responsibilities:

      • Collecting, cleaning, and transforming data for analysis.

      • Performing exploratory data analysis and generating descriptive statistics.

      • Creating dashboards, reports, and visualizations to present findings.

      • Collaborating with stakeholders to understand business requirements.

    • Skills:

      • Proficiency in SQL and data querying.

      • Familiarity with data analysis tools (e.g., Excel, Tableau).

      • Basic programming skills (e.g., Python, R).

      • Knowledge of statistical concepts and analysis techniques.

      • Strong problem-solving and communication skills.

While these roles share some overlapping skills and responsibilities, each has its own distinct focus and expertise within the data ecosystem.

How to select the right profile for your Data Scientist?

A good way to distinguish the different types of Data Science is to split the Data Science Pyramid of needs (on the right) into 3 Data Scientist profiles:

  1. Data Engineer

  2. Data Science Analyst

  3. Core Data Scientist

As Boxalino can be your key partner covering most of your Data Engineering needs, we recommend focusing the work of your Data Scientist on the middle part: Data Science Analytics.

Doing advanced AI and Deep Learning sounds great (and it is great), but it is rarely where the low-hanging fruit of Data Science is for your business, at least not at first.

If you need help understanding the different Data Science profiles and what it means to select the right Data Scientist for your business, don’t hesitate to contact Boxalino for guidance!

Start small, but start now!

Investing in Data Science compounds over time. What matters most is how early you start working on meaningful projects, not how much money you invest or how fast.

As in the example of compound interest on the left, what matters is to start early and to do it steadily.

Once you have done it for a few months, and then a few years, you will look back and see that it would be almost impossible to reach the same state quickly, even by investing many more resources. So start small, but start as early and steadily as you can!
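The compounding argument can be made concrete with a little arithmetic. The sketch below only illustrates the analogy: the 5% growth rate per period, the amounts and the function name `future_value` are assumptions chosen for the example, not figures from Boxalino.

```python
def future_value(contribution: float, rate: float, periods: int) -> float:
    """Value after paying `contribution` at the end of each period, while
    everything already accumulated grows by `rate` per period."""
    value = 0.0
    for _ in range(periods):
        value = value * (1 + rate) + contribution
    return value


# Hypothetical numbers, for illustration only.
# Both scenarios pay in the same total amount (3,600).
early_and_steady = future_value(contribution=100, rate=0.05, periods=36)
late_and_big = future_value(contribution=300, rate=0.05, periods=12)

print(f"Start now, 100 per period for 36 periods: {early_and_steady:,.0f}")
print(f"Start 24 periods later, 300 per period for 12 periods: {late_and_big:,.0f}")
```

With these assumed numbers, the early and steady scenario ends up with roughly twice as much as the late and intensive one, although the total paid in is identical.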

Establish an efficient Data Warehouse Environment

The first important thing is to make sure it is possible for your Data Scientist to work efficiently, and this mainly relies on 4 aspects:

  1. How can your Data Scientist access good quality data?

  2. How and where will the developments made by your Data Scientist be deployed?

  3. How will the reports generated by your Data Scientist be accessed and shared within your company?

  4. How will the outcomes of your Data Scientist’s developments be integrated into your operations?

Give your Data Scientist access to good quality data

Why is it important?

In general, as shown in this diagram on the left, Data Scientists spend most of their time (79%) on what is called Data Engineering (collecting and cleaning data).

While you want your Data Scientist to focus on important business imperatives quickly and efficiently, it is not possible to avoid addressing this issue first. Luckily, you might already be ahead if you use Boxalino.

If your e-shop is already driven by the Boxalino Real-Time platform, you automatically have access to a well-structured data warehouse in Google BigQuery, populated with data from your e-shop, including products, customers, transactions and online behaviors, all ready to be accessed and analyzed. This will cut massively into that scary 79%!

The Boxalino module to access your Data Warehouse is called Open Data Warehouse (CODW). Contact your Boxalino Account Manager to learn more about how to activate your access.

In addition, to integrate 3rd party data into your Data Warehouse (from your ERP, PIM, CRM, …), follow our guide: Integrate your Data in Boxalino BigQuery Data Science Eco-System

Define the environment where your Data Science developments will run

If you are using the Open Data Warehouse (CODW) of Boxalino, your entire data warehouse is already running in the Google Cloud Platform (GCP). We therefore highly recommend using GCP as your Data Science environment for all activities:

  • BigQuery for relational data

  • Google Cloud Storage for files

  • Compute Engine (Virtual Machines) for running your applications

  • Airflow to orchestrate your workflow

Choose the right Reporting environment

To get started, we recommend using Google Data Studio, because it does not add any costs, is easy to use, doesn’t require advanced technical skills and is well integrated with Google BigQuery.

However, over time, you might want to consider other systems, as Data Studio comes with some limitations (e.g., it doesn’t have an API). But while alternatives will be more powerful, they will likely be more costly and require more development effort.

Ensure the outcomes of Data Science are easily and quickly actionable

The most important requirement is that all the data outcomes are stored in Google BigQuery. This will make the integration and deployment in the Boxalino Real-Time platform seamless, quick and pain-free.

If your Data Science outcomes are not only data, but also predictive models from machine learning, Boxalino also supports the loading of trained predictive models in the PMML format.

The Boxalino module that lets you quickly deploy your Data Science outcomes into the Boxalino Real-Time Platform is called Open Data Science Lab (CDSL). Contact your Boxalino Account Manager to learn more about how to activate it.

Give efficient access to your Domains/Business Knowledge

Understanding all the specifics of your business takes time. A common mistake is to assume that a Data Scientist you consider great because they are “smart in math” (so good in the “Math and Statistics” section above) will automatically “get it” when it comes to your Domains/Business Knowledge.

If you are contracting an external Data Scientist

Even if you might do several projects over time, we recommend keeping the warm-up phase minimal and focusing on the very specific knowledge needed for the project. If well prepared, one complete briefing on the specifics can be sufficient and should be complemented with regular meetings during the project.

If you are hiring an internal Data Scientist

In this case, we suggest planning regular meetings with your management and operational teams (marketing, sales, …) to make sure that:

  1. Information about your business and what is important to you is provided repeatedly

  2. Feedback is provided quickly when (not if) your new Data Scientist misunderstands details important for your business

  3. You share not only your top-level ideas (which are of course important) but also spend enough time discussing, in some detail, your operations, how the outcome of the Data Science projects could be used, what can and cannot be easily automated, which issues you hope to see resolved or at least reduced, etc.

Define and prioritize your Data Science Focus Areas

Refrain from giving only high-level goals as guidelines to your new Data Scientist (like “we should do whatever is most impactful, fast”). While this is true and important, it will not be particularly helpful.

Start by defining a list of strategic priorities, each connected to a specific Focus Area.

This is a part that should be defined directly (or at least be strongly influenced and validated) by the management of the company. Therefore a Focus Area should not go too deep into the details of the operations, but should not be limited to a high-level strategic goal definition either.

How to define your Focus Areas?

You can take different perspectives to define Focus Areas. All of them are complementary, so explore each of them and see where the results overlap (or do not).

The UX perspective - Per WPO (Widget & Page Optimizer) Widget

Define what Data Science could help you do better in each of your important WPO Widgets: WELCOME (recommendations on start pages), PROMOTE (banner on the start pages), SEARCH (search and navigation product listings), UP/CROSS-SELLING recommendations, INDIVIDUALIZE (content of your sales & marketing communications), RE-TARGET (targeting of your sales & marketing communications), READ (content marketing inside your e-shop with blogs, magazines and other types of content), …

The Content Marketing perspective

Look at all the different content marketing activities you have and how they could be improved.

For instance:

  • How do you make sure your communication is addressing the right topics?

  • How do you reach the audience with the highest interest in these topics?

  • How much time do you spend in creating content versus simply copying the same content in different environments with slightly different formats (e-mail, landing page, ads, etc.)?

  • How do you judge afterwards if a campaign was worthwhile and on which aspects it could be improved?

The Supplier perspective

Each of your suppliers is more or less valuable to you depending on many factors:

  • sold products / turnover / profit

  • quality of product data

  • brand awareness

  • attractiveness of novelties, promotions and campaigns

  • speed of delivery and other logistic factors

  • etc.

Understanding better what makes your suppliers contribute (or not) to your success, as well as identifying the key opportunities to increase that contribution, can have a great effect on your business. Of course, it is also an opportunity to build even better relationships with them.

The Logistics perspective

Predicting what you will sell in the future can be of great value for optimizing your business, especially if you include not only the projection of averages and new trends, but also the likely effects of future promotions.

Running efficient and lean logistics operations is key both for the smoothness of your operations and for your customer satisfaction.

The Conversion Rate & AOV perspective

Be careful here: if increasing your Conversion Rate or your AOV (Average Order Value) is your goal, it doesn’t belong in this perspective. It should instead be the way to judge the value of projects defined by other perspectives.

However, Conversion Rate & AOV can be a perspective as well, if they are the source (or main dimensions) of a Data Science idea, for example:

  1. If I remove, hide or make less visible the 20% segment of my products with the lowest PDP Conversion Rate, could I increase my Session Conversion Rate?

  2. How can I change the incentives of reaching different levels of Basket values?

  3. Is it better to show products which are out of stock in the navigation (or search) product listings, or should they be hidden (totally, or by default)?
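The first question above can be explored with a simple what-if calculation before running any experiment. The sketch below is a minimal illustration in plain Python: the record layout (`pdp_views`, `orders`), the sample numbers and the function name are hypothetical, and in practice the data would come from your data warehouse. It only identifies the candidate segment; whether hiding it actually raises the Session Conversion Rate still has to be verified, for example with an A/B test.

```python
from dataclasses import dataclass


@dataclass
class ProductStats:
    product_id: str
    pdp_views: int  # number of product detail page (PDP) views
    orders: int     # number of orders containing the product

    @property
    def pdp_conversion_rate(self) -> float:
        # Guard against products that were never viewed.
        return self.orders / self.pdp_views if self.pdp_views else 0.0


def lowest_converting_segment(products, share=0.2):
    """Return the `share` of products with the lowest PDP conversion rate."""
    ranked = sorted(products, key=lambda p: p.pdp_conversion_rate)
    cutoff = max(1, round(len(ranked) * share))
    return ranked[:cutoff]


# Hypothetical sample data, for illustration only.
catalog = [
    ProductStats("A", pdp_views=1000, orders=50),
    ProductStats("B", pdp_views=800, orders=4),
    ProductStats("C", pdp_views=500, orders=30),
    ProductStats("D", pdp_views=1200, orders=90),
    ProductStats("E", pdp_views=300, orders=12),
]

candidates = lowest_converting_segment(catalog)
print([p.product_id for p in candidates])
```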

The Price (& Profit) perspective

Which products and deals should be offered at which price, at what time and to whom?

Data Science can greatly help you define the right price to achieve your revenue and profit goals, whether it is a fixed price for the future, a dynamic price adjusted to your competitors or a personalized price with deals (vouchers and bundles) made to attract specific segments of your clientele.

The Customer Acquisition & Retention perspective

Tracking how much you spend on which channel is something you can already do without a Data Scientist, but there are many aspects you might want and need to understand more deeply (not to mention that you might simply want more reliable aggregate attribution numbers).

Understanding the value of channels not only for getting sales, but also for turning buyers into returning customers, can greatly help you get a better return on your online marketing spending.

The Customer Journey perspective

Modelling and understanding the customer journey can bring many advantages to your business, not least a better understanding and activation of the value of your content marketing strategy.

Bringing a lot of visitors to read your blog and other editorial content might not bring many direct conversions, but might be critical nevertheless. Learning more about the pathways from content reader to loyal customer is key, and these pathways are not well covered by your typical web analytics.

The Customer Satisfaction perspective

What are the key factors which affect your customers’ satisfaction? Do you model them at all? Do you act on your findings?

While a long or problematic delivery can sometimes kill your customer satisfaction, simpler things, like a big discount that starts and is noticed one day after a large purchase, can also be a cause of frustration.

Integrating smart surveys with Big Behavioral Data can open the door to a better understanding of your customers’ real opinions and feelings, their impact and how they can evolve.

The Customer Lifetime Value (CLV) perspective

The value you get out of an individual (physical) customer over their entire time with you is one of the favorite areas of Data Science in E-Commerce, partially because it can enable alternative business models (e.g., subscription models) to become more controlled, planned and effective. While the CLV is a good and common place to focus Data Science, make sure that you orient your project to have actionable outcomes quickly, because understanding the CLV doesn’t necessarily give you insight into how to increase it.

 

Finally, compare what you get out of the different perspectives and see, through the overlaps and repetitions, what clearly comes out on top of your list.

What should be your action plan?

To prepare the start of your work with your Data Scientist, consider the following preparation points:

  1. Select a Data Scientist who is a good match for the middle profile: Data Science Analytics

  2. Make sure you are well set-up with Boxalino so that your Data Scientist can work in a good environment

  3. Plan regular meetings for your Data Scientist with management, operational team and Data Warehousing experts (typically from Boxalino)

  4. Define 5-10 Focus Areas yourself

  5. Prioritize these Focus Areas by giving each a score and ranking them according to this score

  6. Limit your projects to the top priorities
    If you are contracting an external Data Scientist:

    • Choose only 1-2 Focus Areas for a project of a maximum duration of 2 months (you can extend later)

    If you are hiring an internal Data Scientist:

    • Limit the projects in the first 2-3 months of your Data Scientist only to the top 2-3 Focus Areas

  7. Explain in some detail what is important for these 2-3 top Focus Areas in a document (consider 2-3 pages per Focus Area) and provide it as part of an initial briefing to your Data Scientist
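Steps 4 to 6 can be kept in a spreadsheet, but the scoring logic is simple enough to sketch in a few lines. In the example below, the criteria are taken from the columns suggested in Annex #1, while the weights, the 1-5 scale, the sample scores and all names are assumptions for illustration only; agree on your own values with your management.

```python
# Hypothetical weights; a negative weight means "higher is worse" (risk).
WEIGHTS = {
    "strategic_importance": 0.4,
    "actionability": 0.3,
    "probability_of_success": 0.2,
    "risk": -0.1,
}


def priority_score(scores: dict) -> float:
    """Weighted sum of the 1-5 criterion scores of one Focus Area."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_focus_areas(focus_areas: dict, top_n: int) -> list:
    """Return the names of the `top_n` Focus Areas, best score first."""
    ranked = sorted(
        focus_areas,
        key=lambda name: priority_score(focus_areas[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Illustrative scores only (1 = low, 5 = high).
focus_areas = {
    "Sales Trends": {
        "strategic_importance": 5, "actionability": 5,
        "probability_of_success": 4, "risk": 2,
    },
    "Product success DNA": {
        "strategic_importance": 4, "actionability": 5,
        "probability_of_success": 3, "risk": 2,
    },
    "Customer journey patterns": {
        "strategic_importance": 3, "actionability": 3,
        "probability_of_success": 3, "risk": 3,
    },
    "Customer Lifetime Value": {
        "strategic_importance": 4, "actionability": 2,
        "probability_of_success": 2, "risk": 4,
    },
}

# An external contract would typically keep only the top 1-2 Focus Areas.
print(rank_focus_areas(focus_areas, top_n=2))
```

A weighted sum is only one possible scoring scheme. Its main benefit is that it forces an explicit discussion of how much each criterion matters.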

ANNEX #1 - Example Focus Area

Here is a list of example Focus Areas.

This list is neither complete nor a recommendation from Boxalino. It is only an illustration to help you visualize what such a list could look like.

In addition to the Actionability, consider the following additional columns:

  • Strategic importance

  • Risk

  • Short-term vs Long-term impact

  • Expected effects

  • Probability of success

| # | Focus Area | Description | Actionability |
|---|---|---|---|
| 1 | Sales Trends: Best-sellers to Long-Tail | By understanding better the long tail of your products and its recent changes, you can identify the needles in the haystack which should make their way to the top of your communication and promotions. | high |
| 2 | Product demand/supply | Are there products people want (search, navigate to, select even if they are out of stock, …) which they can’t find at all or not with the characteristics they want? Are there products which are in your collection, but don’t bring anything and could be removed or at least hidden? | high |
| 3 | Product success DNA | What are the product characteristics which are the most correlated to high sales, click-through and conversion rate? (Does a long description help? What about 5-star ratings? …) | high |
| 4 | Marketing Campaigns success criteria | What are the common denominators of a successful marketing campaign? What is the right combination of timing, topic, visuals and targets? How can these findings help your marketing team define your next campaigns better? | high |
| 5 | Under/Over-communicated trends and topics | By detecting trends (and more specifically changes of trends) in your sales, you can identify what you are not focusing on enough in your marketing promotions and communications and what gets more coverage than it deserves. | medium |
| 6 | Visitors Dead-Ends and Sunny-Paths | Where do your visitors typically end their visit? Consider not only the type of page (PDP, search result page, …) but also the quality of the page (e.g., product pages with price > 100.-, or with a poor description, or pages without any recommendations to go to the next step, …). Conversely, which qualities bring people to high conversions? | medium |
| 7 | Differences between new and existing customers | Are there significant differences between the products and offerings first-time buyers purchase and what your loyal customers purchase? | medium |
| 8 | Customer profiling | Are there any new customer attributes you could create (such as clustering, calculated attributes, …) which could help you target and personalize your communication better? | medium |
| 9 | Multi-visit channels impact | Each traffic source gives you interesting (and partially reliable) attributions of the value of their visitors, but what about the mix of these traffic sources used by the same customers? | medium |
| 10 | Customer journey steps | What are the key customer journey steps and how can you monitor them? | medium |
| 11 | Customer journey patterns | What are the most frequent and important repeated patterns of behaviors in your customer journeys? | medium |
| 12 | Supply & Logistics Optimization | How can you better predict your demand and match it with optimal logistic supply and good promotional deals? | medium |
| 13 | Customer retention | How do you model and monitor your customer retention? And how can you follow it up over time (e.g., Cohort Analysis)? | medium-low |
| 14 | Customer re-activation | What are the key aspects making lapsed customers become active buyers again? | medium-low |
| 15 | Customer Lifetime Value | How do you model, monitor and optimize your Customer Lifetime Value? | medium-low |
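Focus Area 13 mentions Cohort Analysis as a way to follow retention over time. As a minimal sketch of the idea, the example below groups customers by the month of their first order and counts how many of them ordered again in each later month. The order records, the field layout and the function name are hypothetical; real data would come from the transactions in your data warehouse.

```python
from collections import defaultdict


def month_index(month: str) -> int:
    """Convert "YYYY-MM" into a running month number."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon)


def cohort_retention(orders: list) -> dict:
    """orders: list of (customer_id, "YYYY-MM") pairs.

    Returns {cohort_month: {months_since_first_order: active_customers}}.
    """
    # The cohort of a customer is the month of their first order.
    first_month = {}
    for customer, month in orders:
        known = first_month.get(customer)
        if known is None or month_index(month) < month_index(known):
            first_month[customer] = month

    # Collect distinct customers per (cohort, offset), so that several
    # orders by one customer in the same month are counted only once.
    active = defaultdict(lambda: defaultdict(set))
    for customer, month in orders:
        cohort = first_month[customer]
        offset = month_index(month) - month_index(cohort)
        active[cohort][offset].add(customer)

    return {
        cohort: {offset: len(customers) for offset, customers in sorted(offsets.items())}
        for cohort, offsets in sorted(active.items())
    }


# Hypothetical orders, for illustration only.
orders = [
    ("c1", "2023-01"), ("c2", "2023-01"),
    ("c1", "2023-02"), ("c3", "2023-02"),
    ("c1", "2023-03"), ("c1", "2023-03"), ("c2", "2023-03"), ("c3", "2023-03"),
]

for cohort, counts in cohort_retention(orders).items():
    print(cohort, counts)
```

Dividing each count by the cohort size at offset 0 turns the table into retention rates that can be compared across cohorts.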