Version: 1.3.1.0

ODP as the Foundation for AI Workloads

What ODP Is and Is Not

Let's be direct: ODP does not include AI models, large language models, or machine learning frameworks beyond Spark MLlib (covered in the next section). ODP is a data platform, not an AI engine.

What ODP provides is the governed data foundation that AI workloads require to be trustworthy, reproducible, and compliant. The quality, governance, and accessibility of training data matter as much as model architecture. ODP is where that data layer is built.

You bring your ML framework — PyTorch, TensorFlow, scikit-learn, Hugging Face, or any other tool. ODP ensures the data those frameworks consume is governed, audited, versioned, and sovereign.

What "AI-Ready" Means for ODP

An AI-ready data platform must answer the following questions reliably:

  • Where does this training data come from? (lineage)
  • Who can access it? (access control)
  • What did it look like at the time a model was trained? (versioning / time travel)
  • Has any sensitive data leaked into training sets? (governance)
  • Where is the data physically stored? (sovereignty)

ODP addresses all five through its integrated components:

  • Apache Iceberg: versioned, open table format for training datasets; time travel for ML reproducibility
  • Apache Atlas: data lineage for model traceability; know exactly what data trained your model
  • Apache Ranger: fine-grained access control for training data; tag-based policies on sensitive columns
  • Spark MLlib: classical ML algorithms running natively on YARN, reading Iceberg data
  • Polaris MCP Server: LLM access to catalog metadata without exposing raw data
  • On-premise deployment: sovereign infrastructure; training data never leaves your datacenter
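The tag-based policies mentioned above are defined against Ranger's policy model as JSON. The sketch below builds such a policy as a plain Python dict: it denies SELECT on any Hive column tagged "PII" except for one exempt group. The service, tag, and group names are illustrative, and the field layout approximates Ranger's policy JSON; verify it against the Ranger Admin REST API of your ODP version before use.

```python
import json

# Illustrative tag-based Ranger policy: deny SELECT on anything tagged "PII"
# for everyone, except members of "ml-engineers". Service/tag/group names are
# examples; field names approximate Ranger's policy JSON model.
pii_policy = {
    "service": "odp_tag",            # tag-based service name (assumed)
    "name": "deny-pii-in-training",
    "policyType": 0,                 # 0 = access policy
    "resources": {
        "tag": {"values": ["PII"], "isExcludes": False}
    },
    "denyPolicyItems": [
        {
            "groups": ["public"],
            "accesses": [{"type": "hive:select", "isAllowed": True}],
        }
    ],
    "denyExceptions": [
        {
            "groups": ["ml-engineers"],
            "accesses": [{"type": "hive:select", "isAllowed": True}],
        }
    ],
}

print(json.dumps(pii_policy, indent=2))
```

Because the policy is keyed on the tag rather than on individual tables, newly ingested columns classified as PII in Atlas are protected without editing the policy.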

The Governed Lakehouse Architecture for AI

ODP implements what is often called a governed lakehouse: the combination of open, scalable data lake storage (HDFS/Ozone + Iceberg) with enterprise governance (Ranger, Atlas, Kerberos). This architecture is well suited as the data layer for AI workloads.

┌─────────────────────────────────────────────────────┐
│                Your AI/ML Workloads                 │
│  (PyTorch, TensorFlow, scikit-learn, LLM APIs, …)   │
└───────────────────────┬─────────────────────────────┘
                        │ reads governed data
┌───────────────────────▼─────────────────────────────┐
│                 ODP Data Foundation                 │
│                                                     │
│  ┌───────────┐  ┌──────────┐  ┌──────────────────┐  │
│  │ Iceberg   │  │ Atlas    │  │ Ranger           │  │
│  │ (versioned│  │ (lineage,│  │ (access control, │  │
│  │  datasets)│  │  catalog)│  │  audit trails)   │  │
│  └────┬──────┘  └────┬─────┘  └────────┬─────────┘  │
│       │              │                 │            │
│  ┌────▼──────────────▼─────────────────▼─────────┐  │
│  │       HDFS / Ozone (sovereign storage)        │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

How ODP Complements Your AI Tools

Data Ingestion and Preparation

Your ML pipeline typically starts with raw data ingestion, cleaning, and feature engineering. ODP provides:

  • NiFi: for data ingestion from diverse sources (APIs, databases, IoT streams) with built-in governance
  • Spark: for large-scale data transformation and feature engineering
  • Hive / Iceberg: as the storage layer for cleaned datasets and feature tables

Once data is ingested and prepared, it is stored in Iceberg tables — an open format readable by any ML framework that supports Parquet (which is all of them).

Training Data Access

Your training scripts (Python, R, Scala) read training data from ODP storage. For Iceberg data:

# PySpark reading an Iceberg training dataset
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("model-training") \
    .config("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.hive_catalog.type", "hive") \
    .getOrCreate()

# Read the dataset as it existed at training time (time travel).
# Iceberg's "as-of-timestamp" option expects milliseconds since the Unix
# epoch, not an ISO-8601 string: 1759276800000 is 2025-10-01T00:00:00Z.
df = spark.read \
    .option("as-of-timestamp", "1759276800000") \
    .table("hive_catalog.ml_datasets.customer_features")

# Train your model
# (using PySpark MLlib, or converting to pandas/numpy for scikit-learn, etc.)
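
Note that Iceberg's `as-of-timestamp` read option takes milliseconds since the Unix epoch rather than an ISO-8601 string. A small stdlib helper keeps the conversion explicit in training scripts (the function name `iso_to_epoch_millis` is ours, not an ODP or Iceberg API):

```python
from datetime import datetime

def iso_to_epoch_millis(iso_ts: str) -> str:
    """Convert an ISO-8601 UTC timestamp to the epoch-milliseconds string
    expected by Iceberg's "as-of-timestamp" Spark read option."""
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return str(int(dt.timestamp() * 1000))

print(iso_to_epoch_millis("2025-10-01T00:00:00Z"))  # 1759276800000
```

Recording this value alongside each training run is what makes the run reproducible: re-reading the table with the same timestamp returns the exact snapshot the model saw.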

Ranger access control applies at the point of read: if a Spark job's service principal does not have SELECT permission on the table, the read fails with an authorization error — before any training occurs. This prevents accidental use of unauthorized data.

Model Artifacts

ODP does not manage ML model artifacts (weights files, ONNX models, etc.) natively. However, HDFS is a natural storage backend for large binary files, and the Iceberg format can be extended to store model metadata as table properties. Teams often store model artifacts in HDFS alongside their training data, with Atlas entities created to link the model to its training dataset.
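As a sketch of that linking pattern, the snippet below builds an Atlas-style entity payload that ties a model artifact stored in HDFS to the table it was trained on. The `ml_model` type is hypothetical (it would have to be registered in Atlas first), the attribute names are assumptions, and in practice the payload would be POSTed to Atlas's `/api/atlas/v2/entity` endpoint; check the payload shape against your Atlas version.

```python
import json

# Illustrative Atlas entity payload linking a model artifact in HDFS to its
# training dataset. "ml_model" is a hypothetical custom type that must be
# registered in Atlas first; attribute and relationship names are assumptions.
model_entity = {
    "entity": {
        "typeName": "ml_model",
        "attributes": {
            "qualifiedName": "hdfs://cluster/models/churn/v3@odp",
            "name": "churn-model-v3",
            # Iceberg snapshot timestamp used for training (reproducibility)
            "trainingSnapshotTimestamp": "1759276800000",
        },
        "relationshipAttributes": {
            "inputs": [
                {
                    "typeName": "hive_table",
                    "uniqueAttributes": {
                        "qualifiedName": "ml_datasets.customer_features@odp"
                    },
                }
            ]
        },
    }
}

print(json.dumps(model_entity, indent=2))
```

With such an entity in place, Atlas lineage queries can answer "which dataset snapshot trained this model?" the same way they answer lineage questions for tables.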

The EU AI Act and ODP

The EU AI Act (effective August 2024) imposes requirements on high-risk AI systems that directly concern data governance:

  • Training data must be documented (Art. 10): Atlas lineage captures where data came from and how it was transformed
  • Data quality and representativeness (Art. 10): Iceberg schema enforcement, Hive statistics, Impala profiling
  • Data access must be controlled (Art. 10): Ranger policies govern who can read/write training datasets
  • Audit trails must be maintained (Art. 9): Ranger audit log + HDFS audit log + Atlas operational metadata
  • Right to explanation (Art. 13): Atlas lineage shows what data contributed to model inputs

ODP does not automatically make your AI system compliant with the EU AI Act — compliance requires organizational processes and documentation. But ODP provides the technical infrastructure that makes compliance evidence available and audit trails machine-readable.
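As one example of machine-readable evidence, Ranger emits access audits as JSON records. The sketch below filters denied accesses out of such records; the field names shown (`reqUser`, `resource`, `access`, `result`) follow Ranger's common audit format but should be verified against the audit schema of your deployment.

```python
import json

# Sample Ranger-style audit records, inlined for illustration; in practice
# these would be read from the configured audit store. Field names follow
# Ranger's JSON audit format but should be checked against your version.
audit_lines = [
    '{"reqUser": "ml-svc", "resource": "ml_datasets/customer_features", '
    '"access": "SELECT", "result": 1, "evtTime": "2025-10-01 08:00:00"}',
    '{"reqUser": "intern", "resource": "ml_datasets/customer_features", '
    '"access": "SELECT", "result": 0, "evtTime": "2025-10-01 08:05:00"}',
]

def denied_accesses(lines):
    """Return (user, resource) pairs for denied accesses (result == 0)."""
    events = (json.loads(line) for line in lines)
    return [(e["reqUser"], e["resource"]) for e in events if e["result"] == 0]

print(denied_accesses(audit_lines))
# [('intern', 'ml_datasets/customer_features')]
```

Reports like this, generated directly from the audit trail, are the kind of evidence an Art. 9 audit asks for.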

The Sovereign Advantage

For organizations in healthcare, finance, government, and defense, the question is not just about governance — it is about where data physically resides.

Cloud-based AI services (training APIs, foundation model fine-tuning, embedding services) require uploading data to external infrastructure. For organizations subject to regulations or contractual obligations that prohibit data leaving their premises or jurisdiction, this is a hard blocker.

ODP solves this by running entirely on-premise (or in a sovereign private cloud). Your training data stays in your datacenter. Your models are trained on your infrastructure. No data is transmitted to external services.

This aligns with:

  • SecNumCloud qualification requirements (French ANSSI)
  • GDPR data residency requirements
  • NIS2 requirements for critical infrastructure operators
  • Healthcare data regulations (HDS in France, NHS data governance in the UK)

The ability to run advanced analytics and ML workloads on fully on-premise, governed infrastructure, without depending on external cloud services, is ODP's core sovereign advantage.

What's Coming

The Clemlab team is actively working on AI-related features for future ODP releases. While we do not disclose specific timelines here, the areas under active development include deeper integration between ODP's governance stack and AI/ML workflow orchestration tools.

For current status, follow the ODP release notes and the Clemlab GitHub repository.