Data taxonomy: A base requirement for scalable digitalization across renewable assets
By Giuseppe Ferraro, Director of Innovation, GreenPowerMonitor, a DNV company
As renewable portfolios expand rapidly in size and complexity, the industry’s ability to apply artificial intelligence (AI) and machine learning (ML) effectively could be increasingly constrained by one underlying issue: the lack of a consistent and well-defined data taxonomy.
Although taxonomy challenges first emerged within the solar sector, the same structural problem now affects wind, BESS and the emerging class of assets linked to hydrogen production, ammonia synthesis or carbon capture systems. Before any advanced analytics can operate reliably, the essential task remains unchanged: understanding what data refers to and ensuring that it is classified consistently.
Why inconsistent taxonomy limits the use of AI at scale
The problem originates in the way operational data has historically been created. Each device, whether a solar inverter, a wind turbine converter or a BESS controller, produces its own collection of measurement points, events and semantic tags. Early deployment practices in the PV sector encouraged flexibility, allowing engineers, SCADA integrators and regional teams to define naming conventions according to their own preferences.
At the time, this approach offered adaptability. Today, however, it has led to a highly fragmented landscape where thousands of tags describing the same underlying concept differ in wording, structure and context. When multiplied across large portfolios, inconsistent taxonomies severely undermine the reliability of AI and ML models. The principle is straightforward: no model can extract insight from data whose meaning is unclear.

Separating semantic clarity from asset structure to offer control and flexibility
A further challenge lies in distinguishing taxonomy from the generation of asset models. Taxonomy provides semantic clarity (the meaning layer), but it does not automatically infer how components relate to one another, nor does it construct hierarchical trees. The creation of a complete asset model requires dedicated tools capable of interpreting diagrams, understanding equipment relationships and reconstructing the structure of the plant.
This distinction has guided recent feasibility work on how best to automate asset-model generation while ensuring alignment with taxonomy methodologies. One of the pathways under review is the use of domain-specific document parsing, computer vision and machine learning. The end goal is to extract equipment from engineering documentation, interpret diagram structures and generate technology-specific hierarchical trees automatically.
This approach offers the highest level of control and flexibility, ensuring that hierarchy extraction remains consistent with internal standards, supports multiple workflows and allows engineering teams to validate and refine structures.
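The separation between the meaning layer and the structural layer can be illustrated with a minimal Python sketch. All tag names, concept names and the AssetNode class below are hypothetical and exist only for illustration; the point is that the taxonomy is a flat mapping from raw tags to concepts, while the asset model is a separate tree that merely references those tags.

```python
from dataclasses import dataclass, field

# Semantic layer: what each raw tag means.
# Tag and concept names are hypothetical, for illustration only.
TAXONOMY = {
    "INV01_Vdc": "dc_voltage",
    "INV02.DCVolt": "dc_voltage",
    "WS1_Tmod": "module_temperature",
}


@dataclass
class AssetNode:
    """Structural layer: one piece of equipment in the plant hierarchy."""
    name: str
    kind: str
    tags: list[str] = field(default_factory=list)
    children: list["AssetNode"] = field(default_factory=list)

    def walk(self, path=()):
        """Yield (path, node) pairs for this node and all descendants."""
        here = path + (self.name,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


def find_concept(root: AssetNode, concept: str) -> list[str]:
    """List the hierarchy paths of every node carrying a tag with this meaning."""
    return [
        "/".join(path)
        for path, node in root.walk()
        if any(TAXONOMY.get(tag) == concept for tag in node.tags)
    ]


# A tiny hand-built hierarchy; an extraction tool would generate this instead.
plant = AssetNode("Plant A", "plant", children=[
    AssetNode("Inverter 01", "inverter", tags=["INV01_Vdc"]),
    AssetNode("Inverter 02", "inverter", tags=["INV02.DCVolt"]),
    AssetNode("Weather Station 1", "weather_station", tags=["WS1_Tmod"]),
])

if __name__ == "__main__":
    print(find_concept(plant, "dc_voltage"))
```

Because the two layers are kept apart, the hierarchy can be regenerated or corrected by engineers without touching the tag-to-concept mapping, and vice versa.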
Using AI and NLP to align taxonomy at scale and support management of assets and data points
Once this separation between taxonomy (semantic clarity) and hierarchical extraction (structural clarity) is recognised, the role of AI in taxonomy alignment becomes central. AI-enabled approaches, especially those using natural language processing (NLP), offer effective mechanisms for analysing large volumes of heterogeneous operational tags and assigning them to a standard taxonomy. Vectorization techniques convert human-defined labels into numerical vectors that can be compared algorithmically. Cosine similarity between those vectors then scores how closely two superficially different labels resemble each other, flagging potential equivalence in their meaning. However, lexical similarity alone is insufficient, as many inconsistencies arise from stylistic differences, regional conventions or legacy engineering practices rather than from terminology alone.
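As a minimal sketch of vectorization followed by cosine similarity, the Python below turns each label into a bag of character trigrams and scores a raw tag against a short list of standard terms. The trigram scheme, the function names and the STANDARD_TERMS list are assumptions made for illustration, not a description of any production pipeline.

```python
import math
import re
from collections import Counter


def char_ngrams(label: str, n: int = 3) -> Counter:
    """Vectorize a label as counts of character n-grams (a sparse vector)."""
    # Lower-case and treat common tag separators as spaces before slicing.
    cleaned = re.sub(r"[_\-.]+", " ", label.lower()).strip()
    text = f" {cleaned} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(raw_tag: str, standard_terms: list[str]) -> tuple[str, float]:
    """Return the standard term most similar to raw_tag, with its score."""
    raw_vec = char_ngrams(raw_tag)
    scored = [
        (term, cosine_similarity(raw_vec, char_ngrams(term)))
        for term in standard_terms
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical standard taxonomy terms, for illustration only.
STANDARD_TERMS = ["dc voltage", "ac active power", "module temperature"]

if __name__ == "__main__":
    for tag in ["DC_Voltage_Inv01", "ActivePower_AC", "string Vdc"]:
        term, score = best_match(tag, STANDARD_TERMS)
        print(f"{tag!r} -> {term!r} (cosine {score:.2f})")
```

In this sketch a tag such as 'string Vdc' shares almost no character sequences with 'dc voltage' and therefore scores poorly, even though an engineer would read the two as the same concept. That gap is the limitation of purely lexical comparison.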
This is where semantic measures such as Word Mover’s Distance (WMD) become valuable. WMD is a distance metric built on word embeddings that have been trained on broad linguistic datasets. It compares two pieces of text by how far the words of one must travel in the embedding space to match the words of the other, so it captures similarity of meaning rather than similarity of spelling. It can identify, for example, that ‘DC array voltage’, ‘inverter DC input V’ and ‘string Vdc’ may describe the same operational concept, despite differing wording. Early internal testing has shown that AI-assisted taxonomy alignment can achieve 70–80% accuracy for heterogeneous solar datasets, and similar levels of performance can be expected as models expand to encompass wind, BESS and future technologies.
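The idea behind WMD can be sketched in a few lines of Python. Two caveats apply. First, the code computes the relaxed variant of WMD, a cheaper lower bound in which each word simply moves to its nearest counterpart in the other label; exact WMD requires solving an optimal-transport problem. Second, the two-dimensional vectors in TOY_EMBEDDINGS are invented for illustration; a real system would take them from a pre-trained word-embedding model.

```python
import math
from collections import Counter

# Toy 2-D "embeddings", invented purely for illustration. In practice these
# vectors would come from a pre-trained word-embedding model.
TOY_EMBEDDINGS = {
    "dc": (1.0, 0.0),
    "vdc": (0.6, 0.8),
    "voltage": (0.0, 1.0),
    "v": (0.1, 1.0),
    "array": (0.8, 0.3),
    "string": (0.9, 0.3),
    "input": (0.7, 0.2),
    "inverter": (0.6, 0.1),
    "module": (-1.0, 0.2),
    "temperature": (-0.9, -0.8),
}


def _one_way_cost(src: list[str], dst: list[str]) -> float:
    """Weighted cost of moving every word in src to its nearest word in dst."""
    weights = Counter(src)
    targets = set(dst)
    return sum(
        (count / len(src))
        * min(math.dist(TOY_EMBEDDINGS[w], TOY_EMBEDDINGS[t]) for t in targets)
        for w, count in weights.items()
    )


def relaxed_wmd(label_a: str, label_b: str) -> float:
    """Relaxed Word Mover's Distance: a lower bound on exact WMD.

    Assumes both labels are non-empty and every word has an embedding;
    an unknown word raises KeyError in this sketch.
    """
    a, b = label_a.lower().split(), label_b.lower().split()
    return max(_one_way_cost(a, b), _one_way_cost(b, a))


if __name__ == "__main__":
    reference = "DC array voltage"
    for other in ["inverter DC input V", "string Vdc", "module temperature"]:
        print(f"{reference!r} vs {other!r}: {relaxed_wmd(reference, other):.3f}")
```

A smaller distance means closer meaning. With these toy vectors the two DC-voltage labels land much nearer to the reference than 'module temperature' does, although they share few or no words with it.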
This approach offers an important advantage: scalability. Manual taxonomy alignment is no longer realistic for operators managing thousands of assets or millions of data points. AI reduces the burden substantially, allowing engineers to focus on validation rather than raw classification, and providing a stable semantic framework on which digital tools can depend.
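One way to picture the shift from raw classification to validation is a simple triage step, sketched below in Python. The Suggestion class, the 0.85 threshold and the sample scores are all hypothetical; the similarity scores are assumed to come from whichever matching model is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """A model's proposed mapping from a raw tag to a standard term."""
    raw_tag: str
    standard_term: str
    score: float  # similarity in [0, 1], produced by the matching model


def triage(suggestions, auto_accept: float = 0.85):
    """Split suggestions into auto-accepted mappings and a review queue.

    The 0.85 default is an arbitrary illustrative threshold, not a
    recommended value.
    """
    accepted, needs_review = [], []
    for s in suggestions:
        (accepted if s.score >= auto_accept else needs_review).append(s)
    # Show engineers the least certain suggestions first.
    needs_review.sort(key=lambda s: s.score)
    return accepted, needs_review


if __name__ == "__main__":
    # Scores below are made up for the example.
    batch = [
        Suggestion("DC_Voltage_Inv01", "dc voltage", 0.93),
        Suggestion("string Vdc", "dc voltage", 0.61),
        Suggestion("Tmod_3", "module temperature", 0.40),
    ]
    accepted, review = triage(batch)
    print("accepted:", [s.raw_tag for s in accepted])
    print("review:  ", [s.raw_tag for s in review])
```

Only the low-confidence remainder reaches an engineer, which is what keeps the manual effort roughly proportional to the ambiguity in the data rather than to the size of the portfolio.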

Building a future-proof data foundation and the benefits of this approach
Taken together, AI-enabled taxonomy classification and automated hierarchy extraction offer a pathway for the renewable industry to manage the growing diversity of asset types. As new technologies emerge, such as green hydrogen systems, ammonia synthesis units and carbon capture, utilisation and storage (CCUS) facilities, the methodology remains the same: establish naming principles, train NLP models to classify tags and extend hierarchical extraction tools to new equipment classes. This creates a data foundation capable of accommodating both existing and future technologies.
The benefits of this approach extend well beyond simplifying data processing. A consistent taxonomy enhances interoperability across digital systems, ensures that asset models can be constructed accurately, strengthens the reliability of AI-based analytics and supports long-term comparability of operational data. It allows historical trends and cross-technology insights to become accessible in a way that would be impossible with fragmented naming conventions. Taxonomy then becomes an enabler of digital maturity, operational optimization, and investment confidence.
In conclusion, data taxonomy remains one of the most critical prerequisites for applying AI and ML effectively across renewable assets. It provides the semantic foundation required for analytics, forecasting and automation. Complementary tools, such as AI-driven hierarchical extraction systems using document parsing, computer vision and machine learning, address the structural dimension of asset-model construction.
Together, these capabilities create a scalable, extensible framework that enables operators to interpret data consistently, maintain digital accuracy and extract meaningful value across the diverse technologies that define the future energy landscape.