Version: 1.3.1.0

ODP as the Foundation for AI Workloads

What ODP Is and Is Not

Let's be direct: ODP does not include AI models, large language models, or machine learning frameworks beyond Spark MLlib (covered in the next section). ODP is a data platform, not an AI engine.

What ODP provides is the governed data foundation that AI workloads require to be trustworthy, reproducible, and compliant. The quality, governance, and accessibility of training data matter as much as model architecture. ODP is where that data layer is built.

You bring your ML framework — PyTorch, TensorFlow, scikit-learn, Hugging Face, or any other tool. ODP ensures the data those frameworks consume is governed, audited, versioned, and sovereign.

What "AI-Ready" Means for ODP

An AI-ready data platform must answer the following questions reliably:

  • Where does this training data come from? (lineage)
  • Who can access it? (access control)
  • What did it look like at the time a model was trained? (versioning / time travel)
  • Has any sensitive data leaked into training sets? (governance)
  • Where is the data physically stored? (sovereignty)

ODP addresses all five through its integrated components:

  • Apache Iceberg: versioned, open table format for training datasets; time travel for ML reproducibility
  • Apache Atlas: data lineage for model traceability; know exactly what data trained your model
  • Apache Ranger: fine-grained access control for training data; tag-based policies on sensitive columns
  • Spark MLlib: classical ML algorithms running natively on YARN, reading Iceberg data
  • Polaris MCP Server: LLM access to catalog metadata without exposing raw data
  • On-premise deployment: sovereign infrastructure; training data never leaves your datacenter
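The tag-based policies mentioned above are defined against Ranger's policy model as JSON. The sketch below builds such a policy as a plain Python dict: it denies SELECT on any Hive column tagged "PII" except for one exempt group. The service, tag, and group names are illustrative, and the field layout approximates Ranger's policy JSON; verify it against the Ranger Admin REST API of your ODP version before use.

```python
import json

# Illustrative tag-based Ranger policy: deny SELECT on anything tagged "PII"
# for everyone, except members of "ml-engineers". Service/tag/group names are
# examples; field names approximate Ranger's policy JSON model.
pii_policy = {
    "service": "odp_tag",            # tag-based service name (assumed)
    "name": "deny-pii-in-training",
    "policyType": 0,                 # 0 = access policy
    "resources": {
        "tag": {"values": ["PII"], "isExcludes": False}
    },
    "denyPolicyItems": [
        {
            "groups": ["public"],
            "accesses": [{"type": "hive:select", "isAllowed": True}],
        }
    ],
    "denyExceptions": [
        {
            "groups": ["ml-engineers"],
            "accesses": [{"type": "hive:select", "isAllowed": True}],
        }
    ],
}

print(json.dumps(pii_policy, indent=2))
```

Because the policy is keyed on the tag rather than on individual tables, newly ingested columns classified as PII in Atlas are protected without editing the policy.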

The Governed Lakehouse Architecture for AI

ODP implements what is often called a governed lakehouse: the combination of open, scalable data lake storage (HDFS/Ozone + Iceberg) with enterprise governance (Ranger, Atlas, Kerberos). This architecture is well suited as the data layer for AI workloads.

┌─────────────────────────────────────────────────────┐
│                Your AI/ML Workloads                 │
│  (PyTorch, TensorFlow, scikit-learn, LLM APIs, …)   │
└───────────────────────┬─────────────────────────────┘
                        │ reads governed data
┌───────────────────────▼─────────────────────────────┐
│                 ODP Data Foundation                 │
│                                                     │
│  ┌───────────┐  ┌──────────┐  ┌──────────────────┐  │
│  │ Iceberg   │  │ Atlas    │  │ Ranger           │  │
│  │ (versioned│  │ (lineage,│  │ (access control, │  │
│  │  datasets)│  │  catalog)│  │  audit trails)   │  │
│  └────┬──────┘  └────┬─────┘  └────────┬─────────┘  │
│       │              │                 │            │
│  ┌────▼──────────────▼─────────────────▼─────────┐  │
│  │       HDFS / Ozone (sovereign storage)        │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

How ODP Complements Your AI Tools

Data Ingestion and Preparation

Your ML pipeline typically starts with raw data ingestion, cleaning, and feature engineering. ODP provides:

  • NiFi: for data ingestion from diverse sources (APIs, databases, IoT streams) with built-in governance
  • Spark: for large-scale data transformation and feature engineering
  • Hive / Iceberg: as the storage layer for cleaned datasets and feature tables

Once data is ingested and prepared, it is stored in Iceberg tables — an open format readable by any ML framework that supports Parquet (which is all of them).

Training Data Access

Your training scripts (Python, R, Scala) read training data from ODP storage. For Iceberg data:

# PySpark reading an Iceberg training dataset
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("model-training") \
    .config("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.hive_catalog.type", "hive") \
    .getOrCreate()

# Read the dataset as it existed at training time (time travel).
# Iceberg's "as-of-timestamp" option expects milliseconds since the Unix
# epoch, not an ISO-8601 string: 1759276800000 is 2025-10-01T00:00:00Z.
df = spark.read \
    .option("as-of-timestamp", "1759276800000") \
    .table("hive_catalog.ml_datasets.customer_features")

# Train your model
# (using PySpark MLlib, or converting to pandas/numpy for scikit-learn, etc.)
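
Note that Iceberg's `as-of-timestamp` read option takes milliseconds since the Unix epoch rather than an ISO-8601 string. A small stdlib helper keeps the conversion explicit in training scripts (the function name `iso_to_epoch_millis` is ours, not an ODP or Iceberg API):

```python
from datetime import datetime

def iso_to_epoch_millis(iso_ts: str) -> str:
    """Convert an ISO-8601 UTC timestamp to the epoch-milliseconds string
    expected by Iceberg's "as-of-timestamp" Spark read option."""
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return str(int(dt.timestamp() * 1000))

print(iso_to_epoch_millis("2025-10-01T00:00:00Z"))  # 1759276800000
```

Recording this value alongside each training run is what makes the run reproducible: re-reading the table with the same timestamp returns the exact snapshot the model saw.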

Ranger access control applies at the point of read: if a Spark job's service principal does not have SELECT permission on the table, the read fails with an authorization error — before any training occurs. This prevents accidental use of unauthorized data.

Model Artifacts

ODP does not manage ML model artifacts (weights files, ONNX models, etc.) natively. However, HDFS is a natural storage backend for large binary files, and the Iceberg format can be extended to store model metadata as table properties. Teams often store model artifacts in HDFS alongside their training data, with Atlas entities created to link the model to its training dataset.
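As a sketch of that linking pattern, the snippet below builds an Atlas-style entity payload that ties a model artifact stored in HDFS to the table it was trained on. The `ml_model` type is hypothetical (it would have to be registered in Atlas first), the attribute names are assumptions, and in practice the payload would be POSTed to Atlas's `/api/atlas/v2/entity` endpoint; check the payload shape against your Atlas version.

```python
import json

# Illustrative Atlas entity payload linking a model artifact in HDFS to its
# training dataset. "ml_model" is a hypothetical custom type that must be
# registered in Atlas first; attribute and relationship names are assumptions.
model_entity = {
    "entity": {
        "typeName": "ml_model",
        "attributes": {
            "qualifiedName": "hdfs://cluster/models/churn/v3@odp",
            "name": "churn-model-v3",
            # Iceberg snapshot timestamp used for training (reproducibility)
            "trainingSnapshotTimestamp": "1759276800000",
        },
        "relationshipAttributes": {
            "inputs": [
                {
                    "typeName": "hive_table",
                    "uniqueAttributes": {
                        "qualifiedName": "ml_datasets.customer_features@odp"
                    },
                }
            ]
        },
    }
}

print(json.dumps(model_entity, indent=2))
```

With such an entity in place, Atlas lineage queries can answer "which dataset snapshot trained this model?" the same way they answer lineage questions for tables.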

The EU AI Act and ODP

The EU AI Act (effective August 2024) imposes requirements on high-risk AI systems that directly concern data governance:

  • Training data must be documented (Art. 10): Atlas lineage captures where data came from and how it was transformed
  • Data quality and representativeness (Art. 10): Iceberg schema enforcement, Hive statistics, Impala profiling
  • Data access must be controlled (Art. 10): Ranger policies govern who can read/write training datasets
  • Audit trails must be maintained (Art. 9): Ranger audit log + HDFS audit log + Atlas operational metadata
  • Right to explanation (Art. 13): Atlas lineage shows what data contributed to model inputs

ODP does not automatically make your AI system compliant with the EU AI Act — compliance requires organizational processes and documentation. But ODP provides the technical infrastructure that makes compliance evidence available and audit trails machine-readable.
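As one example of machine-readable evidence, Ranger emits access audits as JSON records. The sketch below filters denied accesses out of such records; the field names shown (`reqUser`, `resource`, `access`, `result`) follow Ranger's common audit format but should be verified against the audit schema of your deployment.

```python
import json

# Sample Ranger-style audit records, inlined for illustration; in practice
# these would be read from the configured audit store. Field names follow
# Ranger's JSON audit format but should be checked against your version.
audit_lines = [
    '{"reqUser": "ml-svc", "resource": "ml_datasets/customer_features", '
    '"access": "SELECT", "result": 1, "evtTime": "2025-10-01 08:00:00"}',
    '{"reqUser": "intern", "resource": "ml_datasets/customer_features", '
    '"access": "SELECT", "result": 0, "evtTime": "2025-10-01 08:05:00"}',
]

def denied_accesses(lines):
    """Return (user, resource) pairs for denied accesses (result == 0)."""
    events = (json.loads(line) for line in lines)
    return [(e["reqUser"], e["resource"]) for e in events if e["result"] == 0]

print(denied_accesses(audit_lines))
# [('intern', 'ml_datasets/customer_features')]
```

Reports like this, generated directly from the audit trail, are the kind of evidence an Art. 9 audit asks for.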

The Sovereign Advantage

For organizations in healthcare, finance, government, and defense, the question is not just about governance — it is about where data physically resides.

Cloud-based AI services (training APIs, foundation model fine-tuning, embedding services) require uploading data to external infrastructure. For organizations subject to regulations or contractual obligations that prohibit data leaving their premises or jurisdiction, this is a hard blocker.

ODP solves this by running entirely on-premise (or in a sovereign private cloud). Your training data stays in your datacenter. Your models are trained on your infrastructure. No data is transmitted to external services.

This aligns with:

  • SecNumCloud qualification requirements (French ANSSI)
  • GDPR data residency requirements
  • NIS2 requirements for critical infrastructure operators
  • Healthcare data regulations (HDS in France, NHS data governance in the UK)

The ability to run advanced analytics and ML workloads on fully on-premise, governed infrastructure, without depending on external cloud services, is ODP's core sovereign advantage.

What's Coming

The Clemlab team is actively working on AI-related features for future ODP releases. While we do not disclose specific timelines here, the areas under active development include deeper integration between ODP's governance stack and AI/ML workflow orchestration tools.

For current status, follow the ODP release notes and the Clemlab GitHub repository.