Identifying cause-and-effect relationships of manufacturing errors using sequence-to-sequence learning
In this section we introduce our vehicle manufacturing analysis system (VMAS), which we developed according to the cross-industry standard process for data mining (CRISP-DM)^{23}. Our use case has two separate databases that store cycle times and error report data, respectively. The PDA system in our use case registers and stores action duration tuples \(u\) in the cycle times database. The data are processed by our VMAS, which consists of two main components: 1.) an error classification module for identifying source and knock-on errors within our dataset; and 2.) a duration prediction module, trained to predict the time required for \(n\) future actions. We describe each component in detail below; a flowchart can be found in Fig. 3.
Module 1: error classification
We begin with an actions dataset \(\mathcal{D}_a\) and an error reports database that stores timestamped error logs as well as the duration of the logged errors. Each sample \(x \in \mathcal{D}_a\) is a sequence of action duration tuples \(x = (u_0, u_{1}, u_{2}, \ldots, u_n)\), where \(n\) is the number of actions executed during a complete sequence. The error classification module of our workflow allows us to identify the most significant errors within our dataset, and distinguishes source from knock-on errors. More specifically, this module allows us to split samples from our dataset into four subsets: normal \(\mathcal{D}_{n}\), source errors \(\mathcal{D}_{s}\), knock-on errors \(\mathcal{D}_{k}\), and misc \(\mathcal{D}_{m}\). Splitting the dataset into subsets serves two purposes: i.) the classification into \(\mathcal{D}_{s}\) and \(\mathcal{D}_{k}\) helps the stakeholder to conduct an automated analysis of all actions and eliminates the need for manual and often time-consuming inspection of actions; ii.) during preliminary trials we found that samples from \(\mathcal{D}_{m}\) are exceedingly rare and disturb the training of the seq2seq models. Therefore, the error classification module also provides a valuable preprocessing step prior to training our seq2seq models to predict future delays. Below we first discuss our approach for labeling our samples, and then formally define the conditions for a sequence \(x\) to belong to one of the four subsets. We note that our VMAS assumes that all source errors are logged errors.
Labeling. We use maximum likelihood estimation (MLE) to label anomalous behavior. For each action \(a\), a normal (Gaussian) distribution is sought that fits the existing data distribution with respect to the frequency of each duration (for an example see Fig. 2).
The density function of the normal distribution contains two parameters: the expected value μ and the standard deviation σ, which determine the shape of the density function and the probability density corresponding to a point in the distribution. The MLE method is a parametric estimation procedure that finds the values of μ and σ that are most plausible given the observations \(z\)^{24}:
$$\begin{aligned} f(z \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z - \mu)^2}{2\sigma^2}\right). \end{aligned}$$
(1)
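The density function describes the probability density of an observation \(z\) coming from a normal distribution with parameters μ and σ. For independent observations \(z_1, z_2, \ldots, z_n\), and writing \(\vartheta\) for the parameters μ and σ, the joint density function can be factored as follows: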
$$\begin{aligned} f(z_1, z_2, \ldots, z_n \mid \vartheta) = \prod_{i=1}^{n} f(z_i \mid \vartheta) \end{aligned}$$
(2)
For fixed observed values, the joint density function can be interpreted as a function of the parameter \(\vartheta\). This leads to the likelihood function:
$$\begin{aligned} L(\vartheta) = \prod_{i=1}^{n} f_\vartheta(z_i) \end{aligned}$$
(3)
The value of \(\vartheta\) is sought for which the sample values \(z_1, z_2, \ldots, z_n\) have the largest joint density. The higher the likelihood, the more plausible a parameter value \(\vartheta\) is. As long as the likelihood function is differentiable, its maximum can be determined, for example by setting its derivatives to zero. Thus, the parameters μ and σ can be obtained.
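For the Gaussian density in Eq. (1), the maximization has a well-known closed form: the estimate of μ is the sample mean, and the estimate of σ² is the mean squared deviation from that mean (dividing by \(n\), not \(n - 1\)). The following is a minimal sketch of this fit for a single action, not the implementation used in our VMAS; the function names and the example durations are hypothetical and chosen for illustration only.

```python
import math


def fit_gaussian_mle(durations):
    """Return (mu, sigma) that maximize the Gaussian likelihood of the sample."""
    n = len(durations)
    if n == 0:
        raise ValueError("need at least one observation")
    mu = sum(durations) / n
    # The MLE of the variance divides by n (not n - 1).
    variance = sum((z - mu) ** 2 for z in durations) / n
    return mu, math.sqrt(variance)


def log_likelihood(durations, mu, sigma):
    """Logarithm of the likelihood in Eq. (3) for the density in Eq. (1).

    Requires sigma > 0.
    """
    n = len(durations)
    squared_error = sum((z - mu) ** 2 for z in durations)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - squared_error / (2 * sigma ** 2))


# Hypothetical durations (in seconds) of one action, for illustration only.
sample = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0]
mu_hat, sigma_hat = fit_gaussian_mle(sample)
ll_hat = log_likelihood(sample, mu_hat, sigma_hat)
```

Evaluating `log_likelihood` at parameter values slightly away from `mu_hat` and `sigma_hat` yields a smaller value, which is a quick sanity check that the closed-form estimates sit at the maximum.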
Next, we seek to identify high-frequency peaks among the durations \(d^y\) of an action \(a\) that exceed the nominal duration \(d^a_{norm}\). We are interested in significant errors, and we use the MLE threshold to determine whether an error is significant. We denote significant errors as \(d^a_{next}\). These abnormal and distinct durations indicate a recurring behaviour. We formally define the criteria for each subset below:

Source errors are samples where, for each complete sequence \(x\), we have at least one action duration that is considered critical, is of statistical significance, and is accompanied by an error message. More formally: a complete action sequence \(x\) is considered a source error sequence \(x \in \mathcal{D}_{s}\) if there exists an action duration tuple \(u \in x\) where the duration is \(d^a_{next}\) and there is a corresponding error message in the error reports database.

Knock-on errors meet the same criteria as source errors, but lack an accompanying error message for \(d^a_{next}\). Therefore, a complete action sequence \(x\) is considered a knock-on error sequence \(x \in \mathcal{D}_{k}\) if there exists an action duration tuple \(u \in x\) where the duration is \(d^a_{next}\) and there is no corresponding error message in the error reports database.

Normal samples do not include \(d^a_{next}\). Therefore, a complete sequence \(x\) is considered a normal sequence \(x \in \mathcal{D}_{n}\) if for all \(u \in x\) there does not exist a duration \(d^a_{next}\).

Misc contains two types of complete action sequences: i.) where, for an action duration tuple \(u\), there is a duration \(d^a_{next}\) that is above a defined global threshold \(d^a_{globalmax}\), meaning the duration is either intended (e.g., the production line is paused) or staff are already handling it; and ii.) where \(x\) consists only of durations \(d\) that exceed the nominal duration but are each of low significance, i.e., do not exceed the corresponding MLE threshold.
It is worth noting that sequences in \(\mathcal{D}_{n} \cup \mathcal{D}_{s} \cup \mathcal{D}_{k}\) may contain individual durations \(d^y\) that are above the nominal duration but below the threshold determined by the MLE, and are therefore errors of low significance. A sequence can also belong to both the source and the knock-on subsets, i.e., the two subsets can intersect. Furthermore, the labeling of knock-on errors is deliberately modular, as different methods can be applied here based on the stakeholder's requirements. Naturally this will impact the subsequent training of our seq2seq models, and therefore their predictions.
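The four criteria above can be sketched as a small rule-based classifier. This is an illustrative sketch under stated assumptions rather than our production code: the three per-action thresholds (nominal duration, MLE threshold, global maximum) are taken as given inputs with nominal < MLE threshold < global maximum; the lookup of error messages is abstracted into a hypothetical callback `has_error_report`; and the misc criteria are checked before the other subsets, which is our assumption about precedence. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    nominal: float      # nominal duration d^a_norm
    significant: float  # MLE threshold; durations above it are significant
    global_max: float   # global threshold d^a_globalmax


def classify_sequence(sequence, thresholds, has_error_report):
    """Return the set of subset labels for one complete action sequence.

    sequence: list of (action, duration) tuples.
    thresholds: dict mapping each action to its Thresholds.
    has_error_report: callable (index, action, duration) -> bool, standing in
        for a lookup in the error reports database (hypothetical interface).
    """
    significant = []
    for index, (action, duration) in enumerate(sequence):
        limits = thresholds[action]
        if duration > limits.global_max:
            return {"misc"}  # misc case i: above the global threshold
        if duration > limits.significant:
            significant.append((index, action, duration))

    if not significant:
        # Misc case ii: every duration exceeds its nominal value,
        # but none is significant.
        all_slow = all(d > thresholds[a].nominal for a, d in sequence)
        return {"misc"} if sequence and all_slow else {"normal"}

    labels = set()
    for index, action, duration in significant:
        logged = has_error_report(index, action, duration)
        labels.add("source" if logged else "knock-on")
    return labels  # may contain both "source" and "knock-on"


# Hypothetical thresholds and error log, for illustration only.
limits = {"clamp": Thresholds(5.0, 8.0, 60.0), "weld": Thresholds(10.0, 14.0, 60.0)}
logged_indices = {1}


def report_lookup(index, action, duration):
    return index in logged_indices


both = classify_sequence([("clamp", 9.0), ("weld", 20.0)], limits, report_lookup)
```

In this example the slow `clamp` step has no logged error and the slow `weld` step does, so the sequence receives both labels, reflecting the possible intersection between source and knock-on errors noted above.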
Module 2: action duration prediction
While our error classification module assigns labels to past errors, our second module focuses on the prediction of future errors. Upon removing misc samples, we use our dataset to train seq2seq models to predict knock-on errors. Given a sequence of action duration tuples, our objective is to predict the time required by each of the next \(n\) steps. We therefore convert the data received from the error classification module into a dataset containing pairs \((x, y) \in \mathcal{D}\), where each \(x\) is a sequence of action duration tuples \(x = (u_{t-n}, u_{t-n+1}, u_{t-n+2}, \ldots, u_t)\) and \(y\) is the duration of the \(n\) actions that follow, \(y = (d^a_{t}, d^a_{t+1}, d^a_{t+2}, \ldots, d^a_{t+n})\). Using these data, we train and evaluate popular seq2seq models, including LSTM^{25}, GRU^{14} and the Transformer^{17}. The latter is of particular interest, as it represents the current state of the art for a number of seq2seq tasks. Vaswani et al.^{17} presented the Transformer architecture for natural language processing (NLP) and sequence transduction tasks. Previous RNN/CNN architectures pose a natural obstacle to parallelization over a sequence. The Transformer replaces the recurrent architecture with an attention mechanism and encodes the position of each symbol in the sequence. This allows distant positions in the input and output sequences to be related directly, and the computation can take place in parallel. The time for training is therefore significantly shortened. At the same time, sequential computation is reduced, and the number of operations needed to relate two symbols remains constant, \(O(1)\), regardless of their distance from each other in the sequence^{17}. Next we consider a novel metric for fairly evaluating models of different architectures, in particular with respect to the number of steps \(n\), using a single scalar (Fig. 3).
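The conversion of sequences into \((x, y)\) training pairs described in this section can be sketched as a sliding window. This is a minimal illustration, not our exact preprocessing: the function name, the separate `n_in` and `n_out` window lengths, and the choice that the targets are the durations of the tuples immediately following the input window are assumptions made for the example.

```python
def make_pairs(sequence, n_in, n_out):
    """Build (x, y) pairs from one sequence of (action, duration) tuples.

    x is a window of n_in consecutive tuples; y holds the durations of the
    n_out tuples that immediately follow that window (assumed alignment).
    """
    if n_in < 1 or n_out < 1:
        raise ValueError("n_in and n_out must be positive")
    pairs = []
    # A window is only valid if n_out tuples remain after it.
    for start in range(len(sequence) - n_in - n_out + 1):
        split = start + n_in
        x = sequence[start:split]
        y = [duration for _, duration in sequence[split:split + n_out]]
        pairs.append((x, y))
    return pairs


# Hypothetical sequence of (action, duration) tuples, for illustration only.
steps = [("a0", 4.0), ("a1", 6.0), ("a2", 5.0), ("a3", 9.0), ("a4", 7.0)]
pairs = make_pairs(steps, n_in=2, n_out=2)
```

Sequences shorter than `n_in + n_out` yield no pairs, so they drop out of the training set without special handling.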