Blog | JUN 22, 2026
Introducing the Data BOM: A Bill of Materials for Data Provenance
As AI and automation act on data with less and less human oversight, the trustworthiness of that data becomes the trustworthiness of the action. The Data BOM is a verifiable record of where data came from, built to sit alongside the SBOM and ML BOM. In this post we are introducing the concept behind the world's first Data Bill of Materials.
We have learned to document what our systems are made of. A Software Bill of Materials (SBOM) tells us which components and dependencies go into a piece of software. A Machine Learning Bill of Materials (ML BOM) tells us how a model is built, what its parameters are, and where to find the dataset it was trained on. Both have become reference points for transparency and trust.
Something is still missing. We document the software. We document the model. We do not yet have a structured, verifiable record of the underlying data itself. That data is what feeds the models, drives the analytics, triggers the automation, and creates the reports. Yet it travels with no portable account of where it came from, when it was produced, or whether it can be trusted.
This is the gap the Data BOM is built to close. The Data BOM, also called DataBOM or DBOM, is a structured manifest that describes data assets and makes their provenance verifiable.
What a Data BOM is
A Data BOM is a manifest for data. It describes data assets: it records what each data asset is, where it came from, the time it covers, and a cryptographic commitment to its exact content. The Data BOM as a whole also carries a verifiable attestation of who created it and that it has not been tampered with.
It is built on the same bill-of-materials thinking that produced the SBOM and the ML BOM, and it is designed to sit alongside them rather than replace anything. A data asset can be a curated dataset, a snapshot taken over a defined window, a feed from a sensor, a table, or a selection drawn from any of these. Whatever the shape of the data asset, the Data BOM answers the same questions: what is this, where did it originate, what period does it cover, and how do I know it has not been altered.
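To make the description above concrete, here is a minimal sketch of what a single data asset entry might look like. The technical proposal has not been published yet, so every field name here (`name`, `origin`, `period_start`, `period_end`, `content_sha256`), the device identifier, and the sample readings are illustrative assumptions, not the Data BOM format. SHA-256 stands in for whatever commitment scheme the standard eventually specifies.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DataAssetEntry:
    """One data asset in a hypothetical Data BOM manifest (field names assumed)."""
    name: str            # what the asset is
    origin: str          # where it came from
    period_start: str    # start of the covered time window (ISO 8601)
    period_end: str      # end of the covered time window (ISO 8601)
    content_sha256: str  # cryptographic commitment to the exact content


def commit(content: bytes) -> str:
    """Return a SHA-256 hex digest used here as the content commitment."""
    return hashlib.sha256(content).hexdigest()


def describe_asset(name, origin, period_start, period_end, content):
    """Build an entry that answers: what, from where, which period, unaltered?"""
    return DataAssetEntry(name, origin, period_start, period_end, commit(content))


# Made-up sample content standing in for a real snapshot.
readings = b"timestamp,oil_temp_c\n2026-06-01T00:00:00Z,61.4\n"

entry = describe_asset(
    name="oil-temperature-snapshot",
    origin="edge-device-17",            # hypothetical device identifier
    period_start="2026-06-01T00:00:00Z",
    period_end="2026-06-01T01:00:00Z",
    content=readings,
)

print(json.dumps(asdict(entry), indent=2))
```

Because the commitment is computed over the exact bytes, changing a single character of the content yields a different digest, which is what lets a reader detect alteration.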
Where it sits: SBOM, Data BOM, ML BOM
The three bills of materials describe three different things, and together they give a complete account of a system. The SBOM describes the software. The ML BOM describes the model. The Data BOM describes the data flowing between them.
Training data is one use case, and an important one, but it is not the boundary. The same manifest serves any situation where a result depends on data whose origin has to be trusted: analytical work that must be reproducible, automated decisions that act on live inputs, regulatory and operational reporting, and audit. Anywhere data feeds a decision, a Data BOM can describe and attest to the data assets behind it.
Data BOM use case: Industrial AI in the energy sector
A concrete example shows how the three fit together. Consider health forecasting for secondary substation transformers, where a machine learning model predicts the thermal behavior of transformers and supports load-balancing decisions across a cluster of them.
Edge devices on each transformer capture raw measurement streams: load currents, oil temperatures, ambient temperature, tap positions, and cooling status. Those streams are the data assets that everything downstream depends on. A Data BOM documents them at the point of capture, with their origin, the time window they cover, and an integrity commitment to their content.
When the model is trained, an ML BOM describes the resulting model and links to the Data BOM that documents its training data. When the model later produces a decision, such as a recommendation to shift load from a thermally constrained transformer to one with headroom, a second Data BOM documents exactly which data assets fed that specific decision. The software running on the edge device that produced the original streams is described by its own SBOM.
The result is a chain that can be followed in both directions. An operator or auditor reviewing a single control decision can trace it back through the inference data, to the model, to the training data, to the raw streams, and down to the software that produced them. Every link in that chain carries its own verifiable record. This is the picture the visual below describes.
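The chain in the transformer example can also be sketched as a small graph of records that point at what they depend on. This is an illustration under assumptions: the record identifiers and the `LINKS` table are invented for the example, and a real deployment would resolve references between actual SBOM, ML BOM, and Data BOM documents rather than strings in a dictionary.

```python
# Each record lists the records it depends on (hypothetical identifiers).
LINKS = {
    "decision:shift-load": ["databom:inference", "mlbom:thermal-model"],
    "databom:inference": ["streams:transformer-raw"],
    "mlbom:thermal-model": ["databom:training"],
    "databom:training": ["streams:transformer-raw"],
    "streams:transformer-raw": ["sbom:edge-device"],
    "sbom:edge-device": [],
}


def trace_back(start, links):
    """Walk from a record to everything it depends on, depth first."""
    visited = []
    stack = [start]
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # shared ancestors, such as the raw streams, appear once
        visited.append(current)
        # Reverse so dependencies are visited in the order they are listed.
        stack.extend(reversed(links.get(current, [])))
    return visited


def dependents_of(target, links):
    """Follow the chain in the other direction: which records rely on target?"""
    return sorted(rid for rid in links if rid != target and target in trace_back(rid, links))


for record_id in trace_back("decision:shift-load", LINKS):
    print(record_id)
```

Walking forward with `dependents_of("sbom:edge-device", LINKS)` answers the opposite question: if the edge software turns out to be compromised, which data, models, and decisions are affected.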
From blind trust to verifiable attestation
Today, trusting where data came from usually means placing blind trust in the system. A Data BOM removes the need to take that on faith.
This is the principle at the center of the Data BOM. A manifest that only asserts where data came from is just a label. The Data BOM goes further on two levels. Each data asset it lists carries a cryptographic commitment to its content and a reference to how that content can be verified. The manifest as a whole carries a verifiable attestation of who created it and that it has not been altered. The Data BOM holds these commitments and points to the provenance solutions where verification happens, rather than performing the full check itself.
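The two levels can be sketched as two separate checks. This is a rough illustration, not the Data BOM mechanism: the manifest layout, the `verification_ref` field, and the key are assumptions, and an HMAC over a shared key stands in for the attestation only because it is available in the Python standard library. A real attestation of who created a manifest would more plausibly use an asymmetric digital signature, with verification delegated to a provenance solution as described above.

```python
import hashlib
import hmac
import json


def asset_commitment(content: bytes) -> str:
    """Level 1: commitment to the exact content of one data asset."""
    return hashlib.sha256(content).hexdigest()


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically so the same manifest gives the same bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def attest(manifest: dict, key: bytes) -> str:
    """Level 2: attestation over the whole manifest (HMAC as a stand-in)."""
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, attestation: str, key: bytes) -> bool:
    return hmac.compare_digest(attest(manifest, key), attestation)


def verify_asset(entry: dict, content: bytes) -> bool:
    return hmac.compare_digest(entry["content_sha256"], asset_commitment(content))


key = b"demo-key-not-for-production"  # assumption: illustrative key only
content = b"load_current_a=412.7\n"   # made-up sample content

manifest = {
    "creator": "example-operator",     # hypothetical creator identity
    "assets": [{
        "name": "load-current-stream",
        "content_sha256": asset_commitment(content),
        "verification_ref": "provenance-service-placeholder",  # hypothetical pointer
    }],
}
attestation = attest(manifest, key)

print(verify_manifest(manifest, attestation, key))   # manifest untouched
print(verify_asset(manifest["assets"][0], content))  # content untouched
```

The checks are independent: altering an asset's bytes fails `verify_asset`, while editing any field of the manifest, including a stored commitment, fails `verify_manifest`.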
Why data provenance matters more than ever
The reason this matters now is the speed at which machine learning and automation are entering operational decisions, and the way agentic AI is accelerating that shift. Systems increasingly act on data without a human reviewing each input. When software decides and then acts, the trustworthiness of the data it consumed becomes the trustworthiness of the action itself.
Fake data is, in many ways, the same problem as fake news. Social platforms showed how convincingly fabricated content spreads and how hard it becomes to tell what is real. The same dynamic is now arriving in the data layer. With generative tools widely available, a malicious actor can fabricate data that looks entirely plausible. How do you know the readings driving a model are real measurements and not something manufactured to manipulate an outcome? How do you know a training set was not quietly poisoned? Without a verifiable record of origin, you do not.
Regulators are converging on the same concern. The EU AI Act sets out provenance, documentation and traceability expectations around the data used by high-risk AI systems. The NIST AI Risk Management Framework and related guidance call for provenance and for protection against data poisoning. Across these frameworks the direction is consistent: it is no longer enough to use data; you have to be able to show where it came from and demonstrate that it can be trusted. A Data BOM gives that demonstration a portable, verifiable form.
An invitation to contribute
We believe the Data BOM should be an open standard, not a proprietary format. The whole value of a bill of materials comes from everyone being able to read it, produce it, and verify it. A provenance record that only one vendor can interpret defeats its own purpose.
So we are building the Data BOM on top of established bill-of-materials practice rather than inventing a parallel approach. We are already working with relevant stakeholders in the SBOM and ML BOM domains to make sure the Data BOM fits into an integrated approach rather than standing apart from it.
This is an open initiative, and we are inviting others to take part. If you work on data provenance, AI and ML systems, industrial automation, or the standards that govern them, we would like you to contribute. We will share the technical proposal and the reference work behind it in the near future.
Data deserves a bill of materials of its own. Join us in defining the standard.