Apriori analysis for quality assurance

Apriori analysis

Priori Patterns for Posterior Perfection

Author

Carmine Minichini

Published

August 9, 2024

The Apriori algorithm is a data mining technique that can help identify frequent patterns and associations among defects or issues encountered during process management activities. Instead of analyzing individual defects in isolation, the Apriori ¹ algorithm allows us to uncover relationships between different types of defects or issues that often occur together.

\[ \begin{aligned} \begin{array}{c|c} \hline \text { Case } & \text { Defects } \\ \hline Case \ 1 & [Defect \ 1, Defect \ 2,Defect \ 3] \\ Case \ 2 & [Defect \ 1, Defect \ 2] \\ Case \ 3 & [Defect \ 1, Defect \ 4] \\ Case \ 4 & [Defect \ 1, Defect \ 4,Defect \ 5,Defect \ 6] \\ \end{array} \end{aligned} \]

In the context of process management quality assurance, the algorithm works as follows:

It analyzes the defect or issue data to identify the most frequently occurring individual defects or issues across multiple process instances or scenarios.
It then looks for combinations of defects or issues that frequently co-occur, starting with pairs, then triplets, and so on.
The algorithm calculates two key metrics for each defect or issue combination:
- Support: The proportion of process instances or scenarios in which a particular combination of defects or issues was observed.
- Confidence: The likelihood that if one defect or issue is present, the other defect(s) or issue(s) in the combination will also be found.

By analyzing these metrics, the Apriori algorithm can reveal meaningful associations between different types of defects or issues in the process management context. For example, it may uncover that a particular defect X and issue Y often occur together with high confidence, suggesting a potential root cause or shared underlying problem in the process.

These insights can be valuable for the quality assurance team in several ways:

Prioritizing process improvement efforts: Defect and issue associations can help prioritize process areas or scenarios that are more likely to encounter multiple defects or issues simultaneously, enabling targeted improvement initiatives.
Root cause analysis: Identifying strong associations between defects and issues can provide clues about potential root causes or shared underlying problems in the processes, leading to more effective defect prevention and remediation strategies.
Process optimization: By understanding defect and issue co-occurrence patterns, the team can optimize processes to address multiple defects or issues more efficiently, reducing redundant efforts.
Defect and issue prediction: The identified associations can potentially be used to predict the likelihood of encountering certain defects or issues based on the presence of others, enabling proactive mitigation measures.

rules <- apriori(trans,
                 parameter = list(supp=3/length(items_list),
                                  conf=0.1,
                                  target= "rules"),
                 control = list(verbose=FALSE))

1: 3/length(df) means minimum 3 out of total transactions must contain an itemset for it to be considered frequent
2: conf=0.1 means the minimum confidence threshold is set to 0.1 or 10%. This filters out association rules where the consequent (right-hand side) occurs less than 10% of the times when the antecedent (left-hand side) occurs

Lift

The lift of a rule {X} -> {Y} is defined as

\[\text{lift}({X} \rightarrow {Y}) = \frac{P({Y} | {X})}{P({Y})}\]

where \(P({Y} | {X})\) is the conditional probability of observing \(Y\) given \(X\), and \(P({Y})\) is the probability of observing \(Y\). A lift value greater than 1 indicates that the co-occurrence of \(X\) and \(Y\) is higher than expected if they were statistically independent, while a lift value less than 1 indicates that the co-occurrence is lower than expected.

Lift: interpretation

Let’s define the following:

\(X\) = \(D_{18}\)

\(Y\) = \(D_7\)

The probability of the defect \(D_7\) occurring, without considering any other factors, is 5%. In other words, in 5% of all cases, the defect \(D_7\) is present.

However, when you look at cases where the defect \(D_{18}\) is present, you find that the probability of the defect \(D_7\) occurring is ~41.9%.

Using the lift formula, we can calculate the lift of the rule “\(D_{18}\) → \(D_7\)” as follows:

\[\text{lift}({X} \rightarrow {Y}) = \frac{P({Y} | {X})}{P({Y})} = \frac{0.419}{0.05} = 8.39\]

The lift value of 8.39 indicates that when the defect \(D_{18}\) is present, the probability of the defect \(D_7\) occurring is 8.39 times higher than the probability of the defect \(D_7\) occurring without considering the presence of \(D_{18}\).

In other words, if the defect \(D_{18}\) is present, it is a strong indicator or warning sign that the defect \(D_7\) is likely to occur as well. This positive correlation between the two defects could be valuable for identifying root causes, implementing preventive measures, or prioritizing process improvements in the system or process where these defects are observed.

The lift value greater than 1 suggests that the occurrence of \(D_{18}\) and \(D_7\) together is not independent or random, but rather there is a strong positive association between the two defects. This information can be used to further investigate the relationship between \(D_{18}\) and \(D_7\) and potentially uncover underlying factors or dependencies that contribute to their co-occurrence.

Confidence

The confidence of a rule {X} -> {Y} is defined as

\[\text{confidence}({X} \rightarrow {Y}) = P({Y} | {X})\]

where \(P({Y} | {X})\) is the conditional probability of observing \(Y\) given \(X\).

The confidence metric measures how reliable or trustworthy the association rule {X} -> {Y} is. It represents the proportion of transactions containing \(X\) that also contain \(Y\). In other words, it answers the question: “When \(X\) occurs, how often does \(Y\) also occur?”

A confidence value closer to 1 (or 100%) indicates a stronger association between \(X\) and \(Y\), meaning that if \(X\) is present in a transaction, we can be more confident that \(Y\) will also be present. Conversely, a confidence value closer to 0 suggests a weaker association, implying that the presence of \(X\) does not reliably indicate the presence of \(Y\).

plot(rules,
     method = "graph",
     engine = "htmlwidget",
     shading = "confidence")

1: Nodes with higher confidence rules will be shaded darker, allowing you to visually identify the most confident association rules at a glance

Confidence: interpretation

Based on the confidence value of 0.75 for the association rule X → Y, where:

\(X\) = \(D_{18}\)

\(Y\) = \(D_7\)

We can draw the following conclusions:

Strong association: A confidence value of 0.75 indicates a relatively strong association between the two variables. This means that when \(D_{18}\) error occurs, there is a 75% probability that the \(D_7\) will follow as error.
Process improvement opportunity: The strong association between X and Y could indicate potential issues or inefficiencies in the process. It may be beneficial to review and improve the process to prevent errors.

Footnotes

Apriori wikipedia page ↩︎