Building Kapiva's Data Infrastructure
Introduction
Kapiva, a rapidly expanding e-commerce brand, faced a formidable data management challenge. Their data team was grappling with a complex web of spreadsheets that impeded efficient analysis, and manually extracting data from platforms such as Amazon, Shopify, and various marketplaces had become a significant bottleneck for Kapiva's analysts. To overcome these challenges, our team embarked on a comprehensive data transformation journey, broken down into four phases: Designing the Data Stack, Extract and Load, Data Modeling, and Metabase Deployment.
Designing the Data Stack
Data analytics is a dynamic process driven by the quest for insights within the data. Recognizing the iterative nature of this process, our team was deliberate in selecting the optimal architecture for this transformative project.
The first part involved extracting and loading data into the data warehouse. To expedite this process, we opted for an EL (Extract, Load) tool customized for the unique needs of the Indian market, a strategic decision that significantly reduced the time that building a bespoke EL solution from the ground up would have required. For sources that were not readily available in the tool, we built pipelines in Python and scheduled them with our scheduler tool.
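As a rough illustration of what one of these custom jobs looked like, the sketch below pulls records from a hypothetical marketplace API and appends them to a raw schema in the warehouse. The endpoint, credentials, and warehouse connection string are placeholders rather than Kapiva's actual configuration, and in production the script was triggered by the scheduler rather than run by hand.

```python
# Hypothetical sketch of a custom extract-and-load job for a source without
# a pre-built connector. The endpoint, credentials, and warehouse details
# are illustrative placeholders, not Kapiva's actual configuration.
import requests
import pandas as pd
from sqlalchemy import create_engine

WAREHOUSE_URI = "postgresql://user:password@warehouse-host:5432/analytics"
SOURCE_URL = "https://api.example-marketplace.com/v1/orders"

def extract_orders(since: str) -> pd.DataFrame:
    """Pull raw order records from the source API."""
    response = requests.get(SOURCE_URL, params={"updated_since": since}, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json()["orders"])

def load_to_raw(df: pd.DataFrame) -> None:
    """Append the extracted rows to the raw schema of the warehouse."""
    engine = create_engine(WAREHOUSE_URI)
    df.to_sql("orders", engine, schema="raw", if_exists="append", index=False)

if __name__ == "__main__":
    # In production this ran on a schedule; invoking it directly performs
    # a single extract-and-load cycle.
    load_to_raw(extract_orders(since="2023-01-01"))
```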
Our analysts delved deep into the schema of each data source, meticulously crafting Fact and Dimension tables to house the data in a standardized and unified format. Understanding the importance of managing deployments and environments effectively, our team chose to implement DBT (Data Build Tool) to streamline these processes.
To facilitate transparent communication of changes within the data warehouse, we deployed DBT Docs, an invaluable resource that empowered downstream analysts to comprehend the wealth of information residing in the Data Warehouse.
For visualising data we used Metabase, a user-friendly tool with a very low learning curve. Its easy-to-use aggregation, join, and filter interface let the Kapiva team build their own views over the Fact and Dim tables modelled for them.
Extract and Load Data
To extract data from various sources, we split the architecture into two parts. We used an EL tool for the majority of the connectors, which helped our team move faster towards data modelling, where the actual business value would be derived. A handful of sources did not have pre-built connectors, so for those we designed custom ETL pipelines. Designing these pipelines was a challenging task, as the downstream APIs were inconsistent with current industry formats.
Our analysts spent a lot of time with the engineering teams of these SaaS companies to correct API responses, ultimately making sure the data that entered the raw schemas of the data warehouse was clean and error-free.
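The snippet below sketches the kind of defensive normalisation these pipelines applied before writing to the raw schema. The field names, formats, and fallbacks are hypothetical examples of the inconsistencies we encountered, not the exact contract of any vendor's API.

```python
# Illustrative normalisation step for an API whose responses drifted from the
# documented format. Field names and fallbacks are hypothetical; the real
# pipelines handled each vendor's quirks individually.
from datetime import datetime
from typing import Any

def normalise_order(record: dict[str, Any]) -> dict[str, Any]:
    """Coerce an inconsistent order payload into the raw-schema contract."""
    # Some responses returned amounts as numbers, others as formatted strings.
    amount = record.get("order_amount") or record.get("amount") or 0
    amount = float(str(amount).replace(",", ""))

    # Dates arrived in more than one format; fall back to None rather than
    # letting a malformed value break the load.
    raw_date = str(record.get("order_date", ""))
    order_date = None
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            order_date = datetime.strptime(raw_date, fmt).date()
            break
        except ValueError:
            continue

    return {
        "order_id": str(record.get("order_id") or record.get("id")),
        "order_amount": amount,
        "order_date": order_date,
    }
```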
The output of this stage enabled the analytics engineers to write data models that collate data from the various sources.
Data Modeling
The most formidable challenge encountered during this deployment was adapting to rapidly changing requirements. Kapiva, as a startup, was onboarding new vendors and shipping partners at a rapid pace while simultaneously transitioning its storefront from WooCommerce to BigCommerce. This dynamic environment necessitated agile data models and transformations capable of seamlessly accommodating new connectors added on a monthly basis.
DBT once again proved its mettle by enabling us to swiftly spin up Dev (Development), UAT (User Acceptance Testing), and Prod (Production) instances. This agile setup empowered analysts to rigorously test models before deploying them into the production environment, ensuring data accuracy and reliability.
To address issues stemming from the idiosyncrasies of certain APIs, our team implemented a robust quality control mechanism. We leveraged DBT to deploy test cases that acted as vigilant gatekeepers. When the designed test cases failed, upstream pipelines automatically halted, preventing incorrect data from infiltrating the production tables.
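A minimal sketch of this gatekeeper pattern is shown below, assuming a standard dbt CLI setup: the staging layer is built and tested first, and the production-facing models are refreshed only if every test passes. The model selectors are placeholders, not the actual project structure.

```python
# Minimal sketch of the "gatekeeper" pattern: run dbt tests on the staging
# models and refuse to build the production models if any test fails.
# The commands are standard dbt CLI; the selectors are placeholders.
import subprocess
import sys

def run(cmd: list[str]) -> int:
    """Run a dbt command and return its exit code."""
    print(f"Running: {' '.join(cmd)}")
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # Build and test the staging layer first.
    if run(["dbt", "run", "--select", "staging"]) != 0:
        sys.exit("Staging build failed; halting pipeline.")
    if run(["dbt", "test", "--select", "staging"]) != 0:
        sys.exit("Data tests failed; production tables were not refreshed.")

    # Only reached when every test passes.
    run(["dbt", "run", "--select", "marts"])
```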
Data Visualization
For the crucial visualization layer, our team opted for Metabase, primarily due to its user-friendly interface. Metabase empowered Kapiva's business users, enabling them to swiftly extract actionable insights from the data warehouse.
However, managing the cloud capacity for Metabase, with over 100 analysts on board, presented a unique challenge. To ensure stability and cost-efficiency, we engineered a custom backup architecture that provided resilience in case of Metabase failures during runtime. This custom solution allowed Kapiva to maintain control over costs and capacity, particularly crucial during the early stages of their data transformation journey, when Elastic Beanstalk and auto-scaling functionalities proved cost-prohibitive.
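The backup layer itself is specific to Kapiva's setup, but the sketch below illustrates the general idea under two assumptions, a Postgres application database for Metabase and S3 as the backup target: periodically dump the application database, which holds questions, dashboards, and users, so a failed instance can be rebuilt quickly. Host names and the bucket name are placeholders.

```python
# Hypothetical sketch of the idea behind the backup layer: periodically dump
# Metabase's application database to object storage so a failed instance can
# be rebuilt quickly. The Postgres + S3 choice, host names, and bucket name
# are assumptions for illustration only.
import subprocess
from datetime import datetime, timezone

import boto3

APP_DB_URI = "postgresql://metabase:password@metabase-db-host:5432/metabase"
BUCKET = "example-metabase-backups"  # placeholder bucket name

def backup_metabase_app_db() -> str:
    """Dump the Metabase application database and upload it to S3."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/metabase_{timestamp}.sql"

    # pg_dump writes a plain-SQL dump that can later be restored with psql.
    subprocess.run(["pg_dump", "--dbname", APP_DB_URI, "--file", dump_path], check=True)

    boto3.client("s3").upload_file(dump_path, BUCKET, f"backups/metabase_{timestamp}.sql")
    return dump_path
```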
In conclusion, our holistic approach to transforming Kapiva's data infrastructure not only streamlined their data processes but also equipped them to adapt seamlessly to the ever-evolving demands of their dynamic business environment. This data-driven foundation now empowers Kapiva to make informed decisions, drive growth, and maintain a competitive edge in the fast-paced world of e-commerce.