Apriori analysis for quality assurance

R
Apriori analysis
Priori Patterns for Posterior Perfection
Author

Carmine Minichini

Published

August 9, 2024

The Apriori algorithm

The Apriori algorithm is a foundational data mining technique designed to discover frequent patterns and associations within datasets. Originally developed for market basket analysis (identifying which products customers commonly purchase together), it has since proven valuable across diverse domains, including healthcare and web usage analysis.

The algorithm operates on a simple principle: it systematically examines data to identify which items frequently appear together, building from individual items to larger combinations based on their occurrence patterns.

Table 1: Example defect data structure
Case Defects
Case 1 [Defect 1, Defect 2, Defect 3]
Case 2 [Defect 1, Defect 2]
Case 3 [Defect 1, Defect 4]
Case 4 [Defect 1, Defect 4, Defect 5, Defect 6]
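
To make the "individual items to larger combinations" idea concrete, here is a minimal base R sketch run on the four cases of Table 1. It is a simplification, not the full algorithm: candidates of size k + 1 are built from every combination of the items that survived level k, rather than with Apriori's join-and-prune step. The names `cases`, `count_cases`, `frequent_itemsets`, and `min_count` are chosen for this illustration.

```r
# Toy transactions copied from Table 1 (one character vector per case).
cases <- list(
  c("Defect 1", "Defect 2", "Defect 3"),
  c("Defect 1", "Defect 2"),
  c("Defect 1", "Defect 4"),
  c("Defect 1", "Defect 4", "Defect 5", "Defect 6")
)

# Number of cases that contain every item of `itemset`.
count_cases <- function(itemset, cases) {
  sum(vapply(cases, function(case) all(itemset %in% case), logical(1)))
}

# Level-wise search: keep the itemsets of size k that reach `min_count`,
# then build size k + 1 candidates only from items that are still frequent.
frequent_itemsets <- function(cases, min_count) {
  items <- sort(unique(unlist(cases)))
  level <- as.list(items)  # candidates of size 1
  found <- list()
  while (length(level) > 0) {
    keep <- Filter(function(s) count_cases(s, cases) >= min_count, level)
    found <- c(found, keep)
    if (length(keep) == 0) break
    k <- length(keep[[1]])
    survivors <- sort(unique(unlist(keep)))
    if (length(survivors) <= k) break
    level <- combn(survivors, k + 1, simplify = FALSE)
  }
  found
}

frequent <- frequent_itemsets(cases, min_count = 2)
```

With `min_count = 2`, the search keeps three single defects (Defect 1, Defect 2, Defect 4) and two pairs ({Defect 1, Defect 2} and {Defect 1, Defect 4}). Defects 3, 5, and 6 are dropped at the first level, so no larger combination containing them is ever counted.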

Application in quality assurance

In quality assurance, the Apriori algorithm transforms how we analyze defect patterns. Rather than treating each defect as an isolated incident, it reveals the interconnected nature of quality issues across process instances. For instance, discovering that interface errors and data validation failures frequently co-occur suggests these aren’t isolated problems but symptoms of a deeper integration issue.

The practical applications span multiple quality management areas:

  • Strategic Process Improvement: Instead of addressing defects randomly, teams can prioritize areas where multiple quality issues cluster together, maximizing the impact of improvement efforts.

  • Enhanced Root Cause Analysis: When certain defects consistently appear together, it signals shared underlying causes. This pattern recognition accelerates problem-solving by directing investigation toward common sources rather than treating symptoms separately.

  • Operational Efficiency: Understanding defect relationships allows teams to design remediation strategies that address multiple issues simultaneously, reducing redundant work and accelerating resolution times.

  • Predictive Quality Management: Once patterns are established, the presence of one defect type can serve as an early warning system for related issues, enabling proactive intervention before problems cascade.

About the data

This dataset represents an anonymized subset of real-world quality assurance data, providing an authentic foundation for demonstrating the Apriori algorithm’s practical applications. While the specific defect identifiers have been obscured, the underlying patterns and relationships remain intact, offering genuine insights into how defect associations manifest in actual production environments.

The three core metrics

The Apriori algorithm identifies meaningful relationships through three key metrics that work together to reveal patterns:

  • Support measures how frequently an item combination appears across all transactions. It answers: “How common is this pattern?” High support indicates widespread occurrence. In quality management, for example, this helps prioritise the most prevalent issues.

  • Confidence quantifies the reliability of a relationship by measuring how often item Y occurs when item X is present. It answers: “If I see item X, how likely am I to also find item Y?”

  • Lift determines whether items occur together more often than random chance would suggest.

These metrics work in combination to separate meaningful patterns from noise: high support ensures a rule’s pattern is frequent enough to matter, high confidence makes it reliable for prediction, and high lift confirms the relationship isn’t just coincidental.

Support

Mathematically defined as the proportion of transactions containing a specific itemset, support answers the fundamental question: “How often does this pattern appear?”

\text{support}({X}) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}

In quality assurance contexts, high support values indicate widespread defect patterns that warrant immediate attention due to their prevalence. For instance, if a defect combination has 20% support, it means this pattern occurs in one out of every five cases, suggesting a systematic issue requiring priority investigation. Support serves as the foundation for identifying frequent patterns before analyzing their relationships through confidence and lift metrics.
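
As a small worked check, the four cases in Table 1 give the supports below (the itemsets are picked purely for illustration). Adding an item to an itemset can never raise its support, which is what allows Apriori to stop extending a combination once it falls below the chosen threshold.

```latex
\begin{aligned}
\text{support}(\{\text{Defect 1}\}) &= \frac{4}{4} = 1 \\
\text{support}(\{\text{Defect 1}, \text{Defect 2}\}) &= \frac{2}{4} = 0.5 \\
\text{support}(\{\text{Defect 1}, \text{Defect 2}, \text{Defect 3}\}) &= \frac{1}{4} = 0.25
\end{aligned}
```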

Confidence

The confidence of a rule {X} -> {Y} is defined as

\text{confidence}({X} \rightarrow {Y}) = P({Y} | {X})

where P({Y} | {X}) is the conditional probability of observing Y given that X has occurred. Mathematically, this is expressed as:

P({Y} | {X}) = \frac{P(X \cap Y)}{P(X)}

where P(X \cap Y) represents the joint probability of both events occurring together, and P(X) is the marginal probability of event X. The confidence metric measures how reliable or trustworthy the association rule {X} -> {Y} is. It represents the proportion of transactions containing X that also contain Y. In other words, it answers the question: “When X occurs, how often does Y also occur?”

A confidence value closer to 1 (or 100%) indicates a stronger association between X and Y, meaning that if X is present in a transaction, we can be more confident that Y will also be present. Conversely, a confidence value closer to 0 suggests a weaker association, implying that the presence of X does not reliably indicate the presence of Y.
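
Confidence depends on the direction of the rule. A worked example on the four cases of Table 1 (an illustration, not taken from the real dataset) shows this: Defect 1 appears in every case and Defect 2 in two of them, always together with Defect 1.

```latex
\begin{aligned}
\text{confidence}(\{\text{Defect 2}\} \rightarrow \{\text{Defect 1}\}) &= \frac{2/4}{2/4} = 1 \\
\text{confidence}(\{\text{Defect 1}\} \rightarrow \{\text{Defect 2}\}) &= \frac{2/4}{4/4} = 0.5
\end{aligned}
```

Seeing Defect 2 guarantees Defect 1 in this toy data, while seeing Defect 1 predicts Defect 2 only half the time.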

Confidence: interpretation

Let’s examine rule #3 where:

  • X = D_{18}
  • Y = D_7

With \text{confidence}({X} \rightarrow {Y}) = 0.75, we can draw the following conclusions:

  1. Predictive Reliability: The 75% confidence indicates that when defect D_{18} occurs, there’s a three-in-four chance that defect D_7 will also be present. This makes D_{18} a reliable predictor for D_7.

  2. Shared Root Cause: Such a strong association suggests these defects likely stem from a common underlying issue rather than independent failures. This points to systemic problems in related process components.

Lift

The lift of a rule {X} -> {Y} is defined as

\text{lift}({X} \rightarrow {Y}) = \frac{P({Y} | {X})}{P({Y})}

where P({Y} | {X}) is the conditional probability of observing Y given that X has occurred, and P({Y}) is the marginal probability of observing Y. A lift value greater than 1 indicates that the co-occurrence of X and Y is higher than expected if they were statistically independent, while a lift value less than 1 indicates that the co-occurrence is lower than expected.
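
The three formulas can be checked by hand on a small dataset. The base R sketch below computes them for one rule on the four cases of Table 1; the helper names `support` and `rule_metrics` are invented for this illustration and are not part of any package.

```r
# Toy transactions copied from Table 1 (one character vector per case).
cases <- list(
  c("Defect 1", "Defect 2", "Defect 3"),
  c("Defect 1", "Defect 2"),
  c("Defect 1", "Defect 4"),
  c("Defect 1", "Defect 4", "Defect 5", "Defect 6")
)

# Proportion of cases that contain every item of `itemset`.
support <- function(itemset, cases) {
  mean(vapply(cases, function(case) all(itemset %in% case), logical(1)))
}

# Support, confidence, and lift of the rule x -> y.
rule_metrics <- function(x, y, cases) {
  supp_xy <- support(c(x, y), cases)       # P(X and Y)
  conf <- supp_xy / support(x, cases)      # P(Y | X)
  c(support = supp_xy,
    confidence = conf,
    lift = conf / support(y, cases))       # P(Y | X) / P(Y)
}

forward <- rule_metrics("Defect 4", "Defect 5", cases)
reverse <- rule_metrics("Defect 5", "Defect 4", cases)
```

Comparing `forward` and `reverse` shows that confidence changes with the direction of the rule while lift does not, because lift can equivalently be written as P(X \cap Y) / (P(X) P(Y)).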

Lift: interpretation

Continuing with rule #3 where:

  • X = D_{18}
  • Y = D_7

The probability of defect D_7 occurring, without considering any other factors, is 5\%. In other words, in 5\% of all cases, defect D_7 is present.

However, among the cases where defect D_{18} is present, the probability of defect D_7 occurring is \sim 41.9\%.

Using the lift formula, we can calculate the lift of the rule D_{18} \rightarrow D_7 as follows:

\text{lift}({X} \rightarrow {Y}) = \frac{P({Y} | {X})}{P({Y})} = \frac{0.419}{0.05} \approx 8.39

The lift value of 8.39 indicates that when defect D_{18} is present, defect D_7 is about 8.39 times as likely to occur as it is across all cases. Dividing the rounded inputs shown above gives 8.38; the reported 8.39 presumably comes from the unrounded values.

This suggests that the occurrence of D_{18} and D_7 together is not independent or random, but rather there is a strong positive association between the two defects. This information can be used to further investigate the relationship between D_{18} and D_7 and potentially uncover underlying factors or dependencies that contribute to their co-occurrence.

An example in R

The following implementation demonstrates how to apply Apriori analysis to defect data using R. The interactive table below shows the discovered association rules, with color-coded metrics to highlight the strongest patterns. The network visualization reveals the relationships between defects.

library(arules)     # apriori()
library(arulesViz)  # plot() method for association rules

# trans is the transactions object; items_list holds one entry per transaction.
rules <- apriori(trans,
                 parameter = list(
                   # Minimum support: an itemset must appear in at least
                   # 3 of the transactions to be considered frequent.
                   supp = 3 / length(items_list),
                   # Minimum confidence of 0.1 (10%): drops rules whose
                   # consequent (right-hand side) occurs in fewer than 10% of
                   # the transactions containing the antecedent (left-hand side).
                   conf = 0.1,
                   target = "rules"),
                 control = list(verbose = FALSE))

# Rules with higher confidence are shaded darker, so the most confident
# association rules can be identified at a glance.
plot(rules,
     method = "graph",
     engine = "htmlwidget",
     shading = "confidence")
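
The call above assumes that `trans` and `items_list` already exist. As a rough sketch of how they might be prepared, the block below builds both from the four cases of Table 1 (a stand-in for the real, anonymized data) and lists the resulting rules. The support threshold of 2 cases and `minlen = 2` are assumptions made for this toy input, not the settings used in the article.

```r
library(arules)

# Hypothetical input: the four cases from Table 1, one vector per case.
items_list <- list(
  c("Defect 1", "Defect 2", "Defect 3"),
  c("Defect 1", "Defect 2"),
  c("Defect 1", "Defect 4"),
  c("Defect 1", "Defect 4", "Defect 5", "Defect 6")
)

# Coerce the list into the sparse transactions format that apriori() expects.
trans <- as(items_list, "transactions")

# Require an itemset to appear in at least 2 of the 4 cases;
# minlen = 2 skips rules with an empty left-hand side.
rules <- apriori(trans,
                 parameter = list(supp = 2 / length(items_list),
                                  conf = 0.1,
                                  minlen = 2,
                                  target = "rules"),
                 control = list(verbose = FALSE))

# List the rules with their support, confidence, and lift, strongest lift first.
inspect(sort(rules, by = "lift"))
```

On real data, `items_list` would hold one vector of defect identifiers per process instance, and the thresholds would need tuning to the number of cases available.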