Design a data lakehouse for retail inventory analytics

Grocery customers list out-of-stock merchandise as a key reason for a bad shopping experience. Beyond product availability, consumers also expect their time in-store to be brief and efficient. Now, more than ever, retailers must provide a seamless shopping experience and the key element to making that happen is data.

Retailers want to capture a wealth of data and turn to cloud-based big data solutions to aggregate and manage data for real-time stock visibility. A data lakehouse engineered on Oracle Cloud Infrastructure (OCI) can capture, manage and gain insight from data produced from point of sale, inventory, customer and operational systems to understand real time inventory management.

This reference architecture introduces a platform topology, component overview, and recommended best practices for implementing a successful data lakehouse on OCI.

This data lakehouse architecture scenario, applicable to retail business, involves these personas:
  • Customers, who interact with the merchant online (web or mobile), with pickup or delivery, or physically at the stores, whether it is by interaction with a store employee, or via self-service machines.
  • Store managers, who want to have visibility onto how products and product categories are selling, get predictive insights such as inventory consumption and drive automatic actions, e.g. automated procurement.
  • Upper management, who is interested in advanced real-time analytics with visualization, reporting and AI capabilities.
  • Data scientists, working on big data, with growing data quantity and number of sources, requiring fast processing and flexibility to easily deploy models .
  • Low-code developers, working on existing and new data-driven applications, with a focus on simplicity and with the least possible time spent managing security and operations.


One of the main complexities of the retail business is the multiplicity of systems and data models and types, as well as an ever-growing quantity of data. Such a challenge calls for simplification and consolidation, which is something an OCI data lakehouse architecture can help accomplish.

The following diagram introduces the conceptual retail business lakehouse reference architecture.

Description of retail-lakehouse-arch.png follows
Description of the illustration retail-lakehouse-arch.png

Autonomous Data Warehouse (ADW) is one of the central pieces of the OCI data lakehouse architecture. It automates provisioning, configuring, securing, tuning, scaling, and backing up of the data warehouse. It includes tools for self-service data loading, data transformations, business models, automatic insights, and built-in converged database capabilities that enable simpler queries across multiple data types and machine learning analysis. Machine learning on ADW brings the advantage of having algorithms right where the data is, for maximized performance. ADW is closely integrated with the OCI Object Storage, which here serves as a data lake, as an unlimited and low-cost storage for unstructured data.

Data science and machine learning initiatives can result in outcomes such as intelligent sale forecasts based on season, occurrence of marketing campaigns, characteristics of the customer population (e.g. age groups) and location, with Oracle Spatial & Graph providing the necessary location support. Such initiatives can be supported by ADW's OML notebooks (based on Apache Zeppelin) and accessible through OAC, by using Data Science (JupyterLab/Python-centric), and Oracle APEX comes into picture as the gold standard for low-code custom apps.

The different channels through which customers interact with the merchant, as visible in the diagram and mentioned above, often rely on tailor-made applications. Oracle Container Engine for Kubernetes is a robust platform that provides scalability and additional control over microservices and applications.

An example of a modern enterprise AI usage is the digital assistant. In this use case, AI-driven digital assistants are used, based on lakehouse data, for a conversational interface for apps and kiosks with actionable recommendations.

OCI Language is one of the most relevant AI services in this scenario, which can help businesses improve their customer experience while reducing the time and effort to analyze text data. The service has multiple use cases, including:
  • Marketing: Analyze social media, reviews, and news to see what customers and industry experts are saying about your product. See what they do and don’t like, what new features they want, and how you compare to your competitors.
  • Customer support: Classify support tickets by product and department, so that tickets get to the appropriate team faster. Use sentiment analysis to identify urgent pain points and prioritize tickets.
  • Human resources: Automate resume screening by using entity recognition to identify key skills and education. Classify employee feedback using sentiment analysis and entity recognition to identify the most common pain points among employees and the best next steps to take.
The flexible lakehouse architecture supports multiple scenarios across based on Oracle Data Science, AI services, combining Autonomous Data Warehouse and data lake capabilities in integration with other OCI services. In addition, this architecture also uses services such as Data Catalog and Oracle Analytics Cloud. This lakehouse architecture serves multiple purposes, including storing important data in secure reliable and quick retrieval storage, being the source for machine learning modules, and providing advanced reporting capabilities for internal and external usage.

With the data lakehouse, you can leverage data from anywhere, normalized data on the fly, run embedded AI/ML at Exadata scale, autoscale up/down at anytime (ADW), and rely on advanced security controls to greatly reduce risk.

In the representation above, we can find the following integrations:
  • Oracle ERP, CRM, POS and external platforms send data and events to the data lakehouse, real-time or through batch processing, with the help of Oracle GoldenGate and Oracle Data Integration.
  • Oracle Integration Cloud, in this example, plays the additional role of sending data from the data lakehouse to Oracle Procurement, through which suppliers can be notified of the automatically placed purchasing orders.
  • ADW uses a query accelerator for fast and seamless querying of the Object Storage data lake.
  • The scalable API Gateway exposes data at scale to apps
  • Oracle Analytics Cloud and Oracle Data Science integrate seamlessly with the services from the data lakehouse.
This architecture contains the following OCI components:
  • Autonomous Data Warehouse

    A fully managed Oracle and autoscaling autonomous database that includes Oracle Machine Learning. Data scientists can build, evaluate, score, and deploy machine learning models using in-database Oracle Machine Learning features and the related Notebooks interface.

  • Object Storage

    OCI Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Object Storage can store an unlimited amount of unstructured data of any content type, including analytic data. You can safely and securely store or retrieve data directly from the internet or from within the cloud platform. Multiple management interfaces let you easily start small and scale seamlessly, without experiencing any degradation in performance or service reliability.

    Object Storage can also be used as a cold storage layer for the data warehouse by storing data that is used infrequently and then joining it seamlessly with the most recent data by using hybrid tables in Oracle Autonomous Data Warehouse.

  • Data Catalog

    OCI Data Catalog is a fully managed, self-service data discovery and governance solution for your enterprise data. Data Catalog provides a single collaborative environment to manage technical, business, and operational metadata.

  • Oracle Analytics Cloud

    Oracle Analytics Cloud is a scalable and secure public cloud service that empowers business analysts with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing and generation. With Oracle Analytics Cloud, you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.

    Oracle Analytics Cloud is integrated with Oracle Machine Learning. This integration allows analysts to list available in-database models and use those models in Oracle Analytics Cloud analytics and dashboards. OAC Data Visualization allows users to apply pre-built machine learning models or own trained models while visualizing data.

  • Data Science

    OCI Data Science is a fully managed, serverless platform for data science teams to build, train, and manage machine learning (ML) models using OCI. It can easily integrate with other OCI services like Autonomous Data Warehouse, Object Storage, and more. You can build and evaluate high-quality machine learning models, Increase business flexibility by putting enterprise-trusted data to work quickly, and support data-driven business objectives with easier deployment of ML models.

    Data Science integrates with the rest of the OCI stack, including Functions, Data Flow, Autonomous Data Warehouse, and Object Storage. Oracle Accelerated Data Science (ADS) software developer kit (SDK) is a Python library that's included as part of the OCI Data Science service, which has many functions and objects that automate or simplify the steps in the Data Science workflow, including connecting to data, exploring and visualizing data, training a model with AutoML, evaluating models, and explaining models. ADS also provides a simple interface to access the Data Science service model catalog and other OCI services, including Object Storage.

  • Oracle Data Integration

    Use OCI Data Integration for optimal data flow between systems. It supports declarative and no-code or low-code ETL and data pipeline development.

  • GoldenGate

    Oracle Cloud Infrastructure GoldenGate is a managed service providing a real-time data mesh platform, which uses replication to keep data highly available, and enabling real-time analysis. Customers can design, execute, and monitor their data replication and stream data processing solutions without the need to allocate or manage compute environments.

  • API Gateway

    The API Gateway service enables you to publish APIs with private endpoints that are accessible from within your network, and which you can expose to the public internet if required. The endpoints support API validation, request and response transformation, CORS, authentication and authorization, and request limiting.

  • Virtual cloud network

    One of your first steps in OCI is to set up a virtual cloud network (VCN) for your cloud resources. A VCN is a software-defined network that you set up in an OCI region. VCNs can be segmented into subnets, which can be specific to a region or to an availability domain. Both region-specific and availability domain-specific subnets can coexist in the same VCN. A subnet can be public or private.

  • Container Engine for Kubernetes

    OCI Container Engine for Kubernetes is a fully managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud. You specify the compute resources that your applications require, and Container Engine for Kubernetes provisions them on Oracle Cloud Infrastructure in an existing tenancy. Container Engine for Kubernetes uses Kubernetes to automate the deployment, scaling, and management of containerized applications across clusters of hosts.

  • Registry

    OCI Registry is an Oracle-managed registry that enables you to simplify your development-to-production workflow. Registry makes it easy for you to store, share, and manage development artifacts, like Docker images. The highly available and scalable architecture of Oracle Cloud Infrastructure ensures that you can deploy and manage your applications reliably.


Use the following recommendations as a starting point to designing a data lakehouse for OCI. Your requirements might differ from the architecture described here.
  • VCN

    When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

    Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.

    After you create a VCN, you can change, add, and remove its CIDR blocks.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

  • Security

    Use policies to restrict who can access the OCI resources that your company has and how they can access them.

    Use Oracle Cloud Guard to monitor and maintain the security of your resources in OCI proactively. Cloud Guard uses detector recipes that you can define to examine your resources for security weaknesses and to monitor operators and users for risky activities. When any misconfiguration or insecure activity is detected, Cloud Guard recommends corrective actions and assists with those actions, based on responder recipes that you can define. For resources that require maximum security, Oracle recommends that you use security zones. A security zone is a compartment associated with an Oracle-defined recipe of security policies that are based on best practices. For example, the resources in a security zone must not be accessible from the public internet and they must be encrypted using customer-managed keys. When you create and update resources in a security zone, OCI validates the operations against the policies in the security-zone recipe, and denies operations that violate any of the policies.

  • Autonomous Data Warehouse

    Object Storage offers reliable and cost-efficient data durability, it provides quick access to large amounts of structured and unstructured data of any content type, including database data, analytic data, images, videos and more. We recommend using standard storage to ingest data from external sources and use it for further processing since it can access quickly and frequently. You can build a lifecycle policy to move the data to cold storage from standard when it's no longer required frequently.

  • Data Catalog

    Data Catalog To have a complete and holistic end-to-end view of the data stored and flowing on the platform, consider harvesting not only data stores supporting the data persistence layer but also the source data stores. Mapping this harvested technical metadata to the business glossary and enriching it with custom properties allows you to map business concepts and to document and govern security and access definitions.

    To facilitate the creation of Oracle Autonomous Data Warehouse external tables that virtualize data stored on Oracle Cloud Infrastructure Object Storage, leverage the metadata previously harvested by Oracle Cloud Infrastructure Data Catalog. This simplifies the creation of external tables, enforces consistency of metadata across data stores, and is less susceptible to human error.

Explore More

Learn more about designing retain inventory analytic solutions with a data lakehouse.

Review these additional resources: