What Does 'Good Data' Mean for Machine Learning in Industry?

When implementing machine learning solutions for industrial manufacturing, the quality of your data fundamentally determines what's possible. But what exactly constitutes "good data" in a factory setting? Let's explore this critical question.

Understanding the Context of Industrial Data

In manufacturing environments, data is generated from numerous sources - sensors, control systems, quality inspections, laboratory measurements, and production logs. However, not all data is equally valuable for machine learning applications. The definition of "good data" is deeply contextual and depends on your specific objectives.

 

How Your Goal Determines Data Requirements

Whether the goal is improving efficiency, optimizing resources, reducing emissions, or enhancing production quality, machine learning offers solutions that can support and accelerate the process. Anomaly detection - to identify deviations from expected behavior - and process forecasting - to anticipate and mitigate potential issues before they arise - represent two meaningful applications of machine learning in manufacturing environments. Although these approaches can significantly improve operational efficiency or product quality, each demands different data characteristics to succeed.

There are several factors that determine data suitability across applications:

  • Time and data resolution - How frequently measurements are taken, considering aggregation methods and deadband
  • Historical range - How far back in time your data extends
  • Contextual richness - What additional variables provide relevant context
  • Data accuracy - How trustworthy the values are, considering:
    • sensor calibration
    • connection stability
    • presence of outliers

Let's examine how these factors play out in the anomaly detection and process forecasting applications.

Anomaly Detection

Example Intelecy Anomaly Detection Model

When your goal is to identify unexpected behavior or potential failures, you typically need:

  • High-frequency data: Capturing subtle deviations often requires second-level sampling rates
  • Multi-dimensional readings: Combining temperature, pressure, vibration, and other parameters provides a comprehensive view
  • Historical examples of failures: Although models learn best from normal process behavior, examples of abnormal operations can help validate model performance
 

Process Forecasting

Example Intelecy Forecasting Model

For predicting future behavior or outcomes, the requirements usually differ:

  • Lower-frequency data may suffice (minute-level aggregations)
  • Longer historical time frames become essential to capture seasonal patterns and rare events (12 months, 24 months, or even multiple years of historical data can be very helpful)
  • Contextual variables like ambient conditions, material properties, or upstream process states
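The minute-level aggregation mentioned above can be sketched in a few lines. This is an illustrative example only: the function name, the (timestamp, value) tuple layout, and the choice of the mean as the aggregate are assumptions, not a fixed API.

```python
from datetime import datetime, timedelta

def aggregate_to_minutes(samples):
    """Average second-level (timestamp, value) samples into per-minute means.

    `samples` is a list of (datetime, float) tuples; the mean is just one
    possible aggregate (see the discussion of aggregation methods below).
    """
    buckets = {}
    for ts, value in samples:
        minute = ts.replace(second=0, microsecond=0)
        buckets.setdefault(minute, []).append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}

# Example: 120 one-second readings collapse into 2 minute-level means
start = datetime(2024, 1, 1, 8, 0, 0)
raw = [(start + timedelta(seconds=i), 20.0 + (i % 60) * 0.01) for i in range(120)]
per_minute = aggregate_to_minutes(raw)
```

Note what is lost here: any excursion shorter than a minute is smoothed into the mean, which is exactly why forecasting can tolerate this resolution while anomaly detection often cannot.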

 

Understanding Process Variation

One of the most challenging aspects of industrial machine learning is accounting for inherent process variation:

  • Different products through shared equipment: A single production line might handle multiple product variants, each with unique characteristics
  • Process phases: Even for a single product, different manufacturing stages (heating, cooling, curing) create distinct data patterns
  • Planned vs. unplanned variations: Distinguishing intentional process adjustments from unexpected deviations

Understanding the process deeply is crucial. For instance, knowing that a cleaning phase naturally involves temperature spikes prevents false anomaly alerts. This is usually identified by a process step number or product state code.

Similarly, recognizing when a measurement reflects a material change versus a sensor failure requires process knowledge.
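The step-code idea above can be sketched as a simple alert filter: temperature spikes during a known cleaning phase are expected and suppressed, while the same spike during production raises an alert. The step names, field names, and temperature limit are hypothetical.

```python
# Hypothetical cleaning-phase step codes; real systems expose their own identifiers.
CLEANING_STEPS = {"CIP_RINSE", "CIP_CAUSTIC"}

def filter_alerts(readings, temp_limit=80.0):
    """Alert only when temperature exceeds the limit outside a cleaning
    phase, where hot spikes are part of normal operation.

    `readings` is a list of dicts with 'temp' and 'step' keys
    (illustrative field names).
    """
    alerts = []
    for r in readings:
        if r["temp"] > temp_limit and r["step"] not in CLEANING_STEPS:
            alerts.append(r)
    return alerts

readings = [
    {"temp": 92.0, "step": "CIP_CAUSTIC"},  # hot, but cleaning: expected
    {"temp": 91.5, "step": "PRODUCTION"},   # hot during production: alert
    {"temp": 72.0, "step": "PRODUCTION"},   # normal operation
]
alerts = filter_alerts(readings)
```

Without the step code as an input, both hot readings look identical to a model - which is exactly why contextual tags belong in the dataset.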

Intelecy example - maintenance affecting data
Data spikes due to maintenance breaks result in poor model training and accuracy. These periods should be excluded from training for optimal model performance.

 


Real-Time Data vs. Historian Systems

The timeliness of data creates another important distinction: real-time streaming data vs. historical data from historians.

The ideal protocol for exposing both real-time and historical data to AI platforms is OPC-UA (Open Platform Communications Unified Architecture) due to its standardized, platform-independent architecture and security features. It enables seamless interoperability between devices from different vendors, ensuring consistent and reliable data exchange.

 

Real-time streaming data

Real-time streaming data is essential when:

  • Immediate responses are required (e.g. safety shutdowns, quality interventions)
  • Deploying systems that operate in production environments
  • Implementing closed-loop control or feedback for optimization

Intelecy Use Case example
Example forecast model that emits 12 evenly spaced predictions at 25-second intervals. To reliably enable such frequent predictions, real-time streaming data is required for all input tags.

 

Historian data

Historical data stored in historians is suitable for:

  • Retrospective analysis of trends and patterns using Intelecy Data Explorer
  • Building models that don't require instantaneous decisions
  • Testing hypotheses about process relationships

Trend view - how the sensor's tag data changes over time

The decision about what data to collect in your Supervisory Control and Data Acquisition (SCADA) or Distributed Control System (DCS) has long-lasting consequences. Data not captured in real-time is often lost forever, so thoughtful planning of data collection is crucial when setting up these systems. 

15 years of data - invaluable for identifying long-term trends and supporting data-driven decision-making

Intelecy Use Case - ProOptima analysis
ProOptima is a tool that combines advanced machine learning with detailed statistical analysis for a complete view of manufacturing operations. The availability of many years of historical data can reveal hidden insights and patterns.

Result tabs after running the analysis, with the Importance Ranking tab selected.

Data Distribution analysis showing the distribution of the analysis data in each of your configured target categories.

Tag Selection in SCADA/DCS

Storage costs and system performance matter when deciding which tags to store. For AI applications, focus on storing signals that:

  • Represent product quality (e.g., temperature, pH, pressure) or track operational conditions (e.g., valve states, pump speeds). Make sure the following measurement values are captured: .PV, .SPT (.SP), .OUT (.OP), and .Mode (.MD)
  • Provide context (e.g., batch ID, operator inputs, status codes, upper limits)

When determining the number of tags to store, it's preferable to begin with a broader set. Reducing the number later is easier than recovering data that was never collected.

 

The Hidden Impact of Data Aggregation and Deadbanding

Industrial data rarely arrives in its raw form. Before it ever reaches any analysis tools, it typically undergoes transformations that profoundly affect what insights remain accessible.

Deadbanding: The Invisible Filter

Deadbanding occurs when small variations in measurements are deliberately ignored to prevent system overreaction. While essential for operational stability, this practice creates blind spots in the data.

 

Intelecy Use Case example - deadband impact

A biorefinery discovered that their anomaly detection model's performance significantly improved after gaining access to raw ISO brightness data (%), which revealed subtle oscillation patterns in a pulp stream that preceded quality issues.


On one hand, applying deadbanding helps to reduce unnecessary data, allowing the system to focus on more significant changes. On the other hand, if the deadband is set too broadly, it could filter out important, subtle patterns. Understanding whether deadbanding is in use, the specific thresholds set, and whether the raw data is still accessible is necessary for tailoring machine learning models effectively.
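A minimal sketch of how deadband compression works, and how a too-broad threshold erases exactly the kind of subtle oscillation seen in the biorefinery example. The thresholds and values are illustrative, not taken from any real historian configuration.

```python
def apply_deadband(values, deadband):
    """Record a value only when it moves more than `deadband` away from
    the last recorded value - the compression many historians apply."""
    recorded = [values[0]]
    for v in values[1:]:
        if abs(v - recorded[-1]) > deadband:
            recorded.append(v)
    return recorded

# A subtle +/-0.3 oscillation around 50.0...
raw = [50.0, 50.3, 49.7, 50.3, 49.7, 50.3, 49.7]
# ...vanishes entirely with a 0.5 deadband (only the first point survives),
flat = apply_deadband(raw, deadband=0.5)
# ...but is fully preserved with a 0.1 deadband.
detailed = apply_deadband(raw, deadband=0.1)
```

To a model trained on the deadbanded series, the process looks perfectly steady - the oscillation that preceded the quality issues simply is not in the data.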

 

Aggregation Methods Matter

How data is summarized dramatically impacts what questions it can answer:

  • Pre-storage aggregation (summarizing before storage) makes data management more efficient but permanently removes detail
  • Post-storage aggregation preserves access to raw data but requires additional storage capacity

The choice of aggregation method also matters greatly:

  • Mean values can obscure brief but critical excursions
  • Min/max values capture extremes but miss distribution patterns
  • Modal values highlight common states but may miss important transitions
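A small worked example of why the aggregation method matters: a brief two-second pressure excursion barely moves the one-minute mean but is obvious in the max. The readings are illustrative.

```python
def summarize(window):
    """Summarize one aggregation window with the three common statistics."""
    return {
        "mean": sum(window) / len(window),
        "min": min(window),
        "max": max(window),
    }

# One minute of 1-second pressure readings with a 2-second excursion to 9.5 bar
window = [4.0] * 58 + [9.5, 9.5]
stats = summarize(window)
# The mean lands near 4.18 bar - almost indistinguishable from a quiet
# minute - while the max clearly exposes the 9.5 bar excursion.
```

If only minute means are stored, this excursion is unrecoverable; storing min/max alongside the mean is a cheap way to keep such events visible.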

Intelecy Use Case example - fermentation monitoring for yogurt and cheese

A dairy plant relied on batch-level pH averages to monitor fermentation. However, this approach missed rapid pH drops in the early stages, resulting in inconsistent flavor and texture across batches. By switching to 1-minute interval data and monitoring how long it took to reach specific pH thresholds, operators could better control starter culture activity. This led to more consistent fermentation times and higher product quality.

Understanding exactly how data has been filtered and aggregated is essential for determining what questions it can reliably answer.

 

Managing Outliers and Missing Data

Ensuring reliable machine learning model performance requires that the majority of input data falls within a "normal" range, meaning it should be free from excessive outliers or missing values. While models can handle some irregularities, high levels of noise or missing information can negatively impact accuracy and reliability. Crucially, the ability to effectively handle such issues depends on understanding the context of the data.

Context-Aware Data Handling

Knowing the operational context allows one to distinguish between truly anomalous behavior and explainable deviations. For example, sensor spikes caused by maintenance events or hardware resets may appear as outliers but are predictable and can be safely excluded from training datasets. Similarly, missing data during planned maintenance windows or network outages can be filtered out without harming model quality. By embedding domain knowledge into preprocessing, the impact of noisy or incomplete data can be significantly reduced.
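The maintenance-window exclusion described above can be sketched as a preprocessing step. The maintenance log, timestamps, and values below are hypothetical; in practice the windows would come from a CMMS or operator records.

```python
from datetime import datetime

# Hypothetical maintenance log: (start, end) windows to exclude from training
MAINTENANCE_WINDOWS = [
    (datetime(2024, 3, 1, 8, 0), datetime(2024, 3, 1, 12, 0)),
]

def exclude_maintenance(samples, windows=MAINTENANCE_WINDOWS):
    """Drop (timestamp, value) samples that fall inside known maintenance
    windows, so predictable spikes never reach the training set."""
    def in_window(ts):
        return any(start <= ts <= end for start, end in windows)
    return [(ts, v) for ts, v in samples if not in_window(ts)]

samples = [
    (datetime(2024, 3, 1, 7, 59), 21.3),   # before maintenance: keep
    (datetime(2024, 3, 1, 9, 30), 180.0),  # spike during maintenance: drop
    (datetime(2024, 3, 1, 12, 1), 21.5),   # after maintenance: keep
]
clean = exclude_maintenance(samples)
```

The 180-degree spike is not an anomaly worth learning - it is an explainable deviation, and domain knowledge is what lets the preprocessing tell the difference.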

Outlier Impact

A dataset containing too many extreme values can interfere with the model training process and result in poor model precision. For example, in a temperature monitoring system, occasional sensor failures could record unrealistic spikes that should not influence model predictions.

Sensor tag

The same sensor tag with extreme outliers

Intelecy example - outlier impact
Random data spikes affecting forecast model training result in poor model accuracy

 

Missing Data Impact

If a significant portion of the dataset is missing, the model may struggle to learn patterns or make accurate predictions. The impact depends on whether the missing data is random or follows a systematic pattern. For instance, in a predictive maintenance system for industrial equipment, if sensor readings are consistently missing during high-load operations, the model may fail to predict failures accurately in those conditions.

2 weeks of missing data - this period should be excluded from training

However, identifying and excluding these periods - such as known maintenance windows or sensor downtimes - can improve data quality and help the model focus on valid patterns.

Intelecy example - partial outages


One day of missing data for one of the key input tags while others remain intact. This can be particularly problematic because the model may be penalized for making a "correct" prediction based on a normally strong correlation that temporarily breaks down due to the missing input. Identifying and excluding such partial outages - where critical signals are absent - can be crucial to maintaining model integrity.
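One way to flag such partial outages is to scan for rows where a required tag is absent while others are still reporting. This sketch assumes rows of tag readings where a missing value is represented as None; the tag names are illustrative.

```python
def find_partial_outages(rows, required_tags):
    """Return indices of rows where at least one required tag is missing
    (None) while at least one other required tag is still reporting -
    the partial outages worth excluding from training."""
    flagged = []
    for i, row in enumerate(rows):
        missing = [t for t in required_tags if row.get(t) is None]
        present = [t for t in required_tags if row.get(t) is not None]
        if missing and present:
            flagged.append(i)
    return flagged

rows = [
    {"flow": 12.1, "temp": 64.0},  # all tags present: fine
    {"flow": None, "temp": 63.8},  # flow down, temp intact: partial outage
    {"flow": None, "temp": None},  # full outage: easier to detect separately
]
bad_rows = find_partial_outages(rows, required_tags=["flow", "temp"])
```

Full outages are usually easy to spot; it is these partial ones - where the remaining tags make the row look superficially healthy - that quietly corrupt training.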

 

Process-Specific Tolerance Levels

The acceptable level of missing data and outliers depends on the process being analyzed. Some industrial systems naturally experience fluctuations, while others require high precision. If the percentage of missing or anomalous data is too high, the model's predictions may become unreliable.

 

The Reality: Nobody Has Perfect Data

After working with dozens of manufacturers, we've encountered a universal truth: nobody starts with perfect data. And that's okay.

Industrial data is messy by nature - sensors fail, networks drop packets, timestamps drift, and calibrations become outdated. The path to effective machine learning is rarely linear:

  • It's difficult to know exactly what data you'll need before you start experimenting
  • Initial models often reveal gaps in data collection
  • Each iteration provides insights about which additional data points could improve performance

The journey toward better data and better models is continuous. Organizations that succeed don't wait for perfect conditions but start with what they have and improve iteratively.

 

Conclusion

Good data for industrial machine learning isn't defined by volume alone but by relevance, completeness, timeliness, and context. Understanding your specific goals, process dynamics, and operational constraints is essential to defining what "good data" means for your application.

While perfect data may be unattainable, the pursuit of better data is a journey worth taking. Each improvement in data quality unlocks new capabilities and insights that can transform manufacturing operations. The most successful organizations embrace this journey, recognizing that the path to advanced analytics begins with a thoughtful approach to data.


 

Summary: Know What To Aim For

In the context of industrial machine learning, there is no one-size-fits-all definition of "good data" - but there is an ideal to aim for. Ultimately, it all comes back to the fundamentals discussed earlier in this article - data suitability depends on resolution, context, completeness and accuracy. Let’s revisit the key characteristics that the most effective data for AI applications typically share:

  1. High-resolution data - captured at the source, with second-level resolution and minimal to no aggregation or deadbanding to retain fine-grained process signals; exposed to AI platforms via OPC-UA for smooth integration
  2. Rich process context - includes additional metadata and indicators such as operational states (e.g., producing, idle, cleaning), mode values, batch IDs, and other contextual tags that help interpret the sensor data accurately
  3. Historical and real-time data - 12+ months of historical data, with all necessary tags that describe the underlying manufacturing process
  4. Accurate measurements with minimal loss or gaps - properly calibrated sensors to avoid reporting unrealistic values during outages or off-states; reduced number of outages, disconnections, or systematic periods of missing data or outliers

Intelecy brings deep expertise in helping industrial companies meet the "good data" standards described above. We can be an invaluable partner, by identifying gaps in data collection and building a strong foundation for reliable, insightful machine learning models. We offer expert guidance on what data to capture, store, and prioritize, ensuring your AI initiatives grow effectively and sustainably. Together, we'll transform your data challenges into competitive advantages that drive real business value.

Thank you for reading - and if any of this resonates, we’d love to hear from you.
