Chemistry Behind Wine And Food Pairings¶
Group Members: Hrishi Kabra and Kiet Huynh
Project Website: https://akjieettt.github.io/data-science-final-project/
Github Repository: https://github.com/akjieettt/data-science-final-project
Information & Motivation¶
Project Overview¶
Our project investigates the relationship between the chemical composition of wines and the foods they pair with. We focus on identifying the physicochemical properties that most strongly influence pairing success and how these relationships differ between red and white wines. Our project aims to provide insights into wine production that could benefit winemakers and consumers.
Research Questions¶
Primary Question: Can the chemical properties of a wine predict optimal food pairings, and do these predictions align with expert pairing assessments?
Secondary Questions:
- How do specific chemical properties correlate with pairing quality scores across food categories?
- Can we build a classification model to predict food categories from chemical profiles?
- Do high-quality wines (≥7) demonstrate different pairing patterns than lower-quality wines?
Background¶
Wine and food pairing is an art form that relies on centuries of culinary wisdom and sommelier expertise. Some traditional pairing patterns have held true: red wine with red meats, white wine with seafood, sweeter wine with desserts. Are these pairings based purely on tradition, or is there chemistry behind them that makes the match better?
Acidity, sugar, alcohol, and sulphur levels shape how a wine feels and tastes. For example, alcohol adds warmth and body, while acidity changes the freshness and balance of a wine.
By studying these chemical properties, we can better understand what makes good wine and food pairings instead of solely relying on human taste tests.
Motivation For This Project¶
Wine production is an art and science where chemical composition determines quality. Understanding these relationships through data science can provide valuable insights for:
- Consumers: Choose wines for meals based on chemistry, not just color or region
- Restaurants: Optimize wine lists to complement menu offerings using data-driven principles
- Winemakers: Understand how production decisions affect both quality and food compatibility
- Sommeliers: Validate traditional pairing wisdom with quantitative evidence
Collaboration Plan¶
Team Coordination:
- Set up a private GitHub repository to coordinate all code, share datasets, and track progress
- Each member works on separate branches to implement features, which are merged via pull requests after code review to ensure consistency
Technologies Used:
- Version Control: Git and GitHub for source code management and collaboration
- Development Environment: Visual Studio Code Live Share, Google Colab, and Jupyter Notebooks for data analysis and prototyping
- Communication Tools: Small Family Collaboration Hub for offline discussions, FaceTime for online discussions and Google Docs for shared notes
Meeting Schedule:
- Consistently meet offline 2 - 3 times per week for 1 - 3 hours per session to discuss progress, solve problems, and coordinate tasks
- Outside of scheduled meetings, we communicate asynchronously via iMessage to stay aligned and share updates
Task Management:
- Tasks are divided based on expertise and interest
- Progress is tracked via a shared progress table (in a spreadsheet) to ensure deadlines are met and responsibilities are clear
Data Sources¶
Our first dataset is the Wine Quality dataset from UC Irvine's Machine Learning Repository. The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine:
winequality-red: Data About Red Wines
- Source: UC Irvine Machine Learning Repository
- Coverage: Data about 1,599 different Red Wines
- Output: quality rating (0–10) assigned by tasters
winequality-white: Data About White Wines
- Source: UC Irvine Machine Learning Repository
- Coverage: Data about 4,898 different White Wines
- Output: quality rating (0–10) assigned by tasters
Our next dataset is a Wine and Food Pairing dataset from Kaggle. This contains Wine and Food pairings scored from 1 (Terrible) to 5 (Excellent) based on compatibility of wine style, food flavor profile, and more:
- wine_food_pairings: Data about Wine and Food pairings
- Source: Kaggle
- Coverage: Data about 34,933 different Wine and Food Pairings
- Output: Pairing quality from 1-5 (Terrible to Excellent)
Integration Strategy¶
We wish to connect the two datasets by matching wine categories, red and white, and creating chemical profiles based on acidity, sweetness, and body levels. After this we could analyze how these chemical properties correlate with pairing success and test if quality score affects pairing versatility.
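As a rough sketch of this matching step (using toy stand-in frames; the real `type` and `wine_category` columns appear later in the notebook), a simple key-normalizing join could look like:

```python
import pandas as pd

# Toy stand-ins for the two datasets (column names assumed from the plan above)
df_wines = pd.DataFrame({'type': ['red', 'white'],
                         'fixed acidity': [8.3, 6.9]})
df_pairing = pd.DataFrame({'wine_category': ['Red', 'White', 'Sparkling'],
                           'food_category': ['Red Meat', 'Seafood', 'Dessert']})

# Normalize the category key so the two frames agree, then join on it
df_pairing['type'] = df_pairing['wine_category'].str.lower()
merged = df_wines.merge(df_pairing, on='type', how='inner')
print(merged[['type', 'food_category']])  # Sparkling is dropped by the inner join
```

The inner join naturally discards categories (like Sparkling) that have no chemical data, which mirrors the filtering we perform later.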
Imports and Loading the Data¶
!git clone https://github.com/akjieettt/data-science-final-project.git
%cd data-science-final-project/
Cloning into 'data-science-final-project'... remote: Enumerating objects: 177, done. remote: Counting objects: 100% (68/68), done. remote: Compressing objects: 100% (64/64), done. remote: Total 177 (delta 41), reused 4 (delta 4), pack-reused 109 (from 1) Receiving objects: 100% (177/177), 4.93 MiB | 14.27 MiB/s, done. Resolving deltas: 100% (81/81), done. /content/data-science-final-project
Loading the Wine Quality Data¶
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Load the wine quality datasets
df_reds = pd.read_csv("data/winequality-red.csv", sep=";")
df_whites = pd.read_csv("data/winequality-white.csv", sep=";")
# Add wine type identifiers for red and white
df_reds['type'] = 'red'
df_whites['type'] = 'white'
# Combine the red and white wine datasets into a single df
df_wines = pd.concat([df_reds, df_whites], ignore_index=True)
# Printing the number of wines in each dataset
print(f"Total Number of Red Wines: {len(df_reds)}")
print(f"Total Number of White Wines: {len(df_whites)}")
print(f"Total Number of Wines: {len(df_wines)}")
print("Total Number of NaNs:")
display(df_wines.isna().sum())
print("\nThe Dataset:")
display(df_wines.head())
Total Number of Red Wines: 1599 Total Number of White Wines: 4898 Total Number of Wines: 6497 Total Number of NaNs:
| 0 | |
|---|---|
| fixed acidity | 0 |
| volatile acidity | 0 |
| citric acid | 0 |
| residual sugar | 0 |
| chlorides | 0 |
| free sulfur dioxide | 0 |
| total sulfur dioxide | 0 |
| density | 0 |
| pH | 0 |
| sulphates | 0 |
| alcohol | 0 |
| quality | 0 |
| type | 0 |
The Dataset:
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
Wine Quality Dataset Overview¶
We can see above that the dataset has roughly 6,500 different wines. It also has 13 different columns, shown below, each with its corresponding data type:
- fixed acidity: Ratio
- volatile acidity: Ratio
- citric acid: Ratio
- residual sugar: Ratio
- chlorides: Ratio
- free sulfur dioxide: Ratio
- total sulfur dioxide: Ratio
- density: Ratio
- pH: Interval (logarithmic scale)
- sulphates: Ratio
- alcohol: Ratio
- quality: Ordinal
- type: Nominal
Some More Information About The Dataset
- Total Samples: 6,497 wines
- Red Wines: 1,599 samples
- White Wines: 4,898 samples
- NaNs: Since there are no NaNs in this dataset, there is no need to modify it further for now.
Loading the Wine and Food Pairing Data¶
# Loading the wine and food pairing data
df_pairing = pd.read_csv("data/wine_food_pairings.csv")
print(f"Total Number of Wine and Food Pairings: {len(df_pairing)}\n")
print("Total Number of NaNs:")
display(df_pairing.isna().sum())
print("\nThe Dataset:")
display(df_pairing.head())
Total Number of Wine and Food Pairings: 34933 Total Number of NaNs:
| 0 | |
|---|---|
| wine_type | 0 |
| wine_category | 0 |
| food_item | 0 |
| food_category | 0 |
| cuisine | 0 |
| pairing_quality | 0 |
| quality_label | 0 |
| description | 0 |
The Dataset:
| wine_type | wine_category | food_item | food_category | cuisine | pairing_quality | quality_label | description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Syrah/Shiraz | Red | smoked sausage | Smoky BBQ | Spanish | 2 | Poor | Heuristic pairing assessment. |
| 1 | Grenache | Red | charcuterie board | Salty Snack | French | 3 | Neutral | Heuristic pairing assessment. |
| 2 | Madeira | Fortified | lemon tart | Dessert | French | 4 | Good | Acidic wine balances acidic food. |
| 3 | Cabernet Sauvignon | Red | roast lamb | Red Meat | Mexican | 5 | Excellent | Tannic red complements red meat fat. |
| 4 | Viognier | White | duck à l’orange | Poultry | Vietnamese | 2 | Poor | Heuristic pairing assessment. |
Wine and Food Pairing Dataset Overview¶
We can see above that the dataset has roughly 35,000 different wine and food pairings. It also has 8 different columns, shown below, each with its corresponding data type:
- wine type: Nominal
- wine category: Nominal
- food item: Nominal
- food category: Nominal
- cuisine: Nominal
- pairing quality: Ordinal
- quality label: Ordinal
- description: Nominal
Some More Information About The Dataset
- Total Wine and Food Pairings: 34,933 pairings
- Wine Categories: The different wine categories are Dessert, Fortified, Red, Rosé, Sparkling, and White
- Food Categories: Acidic, Cheese, Creamy, Dessert, Pork, Poultry, Red Meat, Salty Snack, Seafood, Smoky BBQ, Spicy, Vegetarian
- NaNs: Since there are no NaNs in this dataset, there is no need to modify it further for now.
ETL and Integrating the data¶
Since our Wine Quality dataset only has information about Red and White wines, we are going to drop all the Wine and Food Pairings that are not Red or White wines. We are also going to change the quality_label to be only good, neutral, or bad.
# Drop the wine and food pairings that don't have a red or white wine
df_pairing = df_pairing[df_pairing['wine_category'].isin(['Red', 'White'])]
df_pairing.reset_index(drop=True, inplace=True)
display(df_pairing.groupby('wine_category').size())
# Make sure pairing_quality is numeric (int or float)
df_pairing['pairing_quality'] = pd.to_numeric(df_pairing['pairing_quality'])
quality_to_label = {
1: 'bad',
2: 'bad',
3: 'neutral',
4: 'good',
5: 'good'
}
df_pairing['quality_label'] = df_pairing['pairing_quality'].map(quality_to_label)
display(df_pairing.head())
| 0 | |
|---|---|
| wine_category | |
| Red | 12908 |
| White | 11193 |
| wine_type | wine_category | food_item | food_category | cuisine | pairing_quality | quality_label | description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Syrah/Shiraz | Red | smoked sausage | Smoky BBQ | Spanish | 2 | bad | Heuristic pairing assessment. |
| 1 | Grenache | Red | charcuterie board | Salty Snack | French | 3 | neutral | Heuristic pairing assessment. |
| 2 | Cabernet Sauvignon | Red | roast lamb | Red Meat | Mexican | 5 | good | Tannic red complements red meat fat. |
| 3 | Viognier | White | duck à l’orange | Poultry | Vietnamese | 2 | bad | Heuristic pairing assessment. |
| 4 | Pinot Noir | Red | citrus salad | Acidic | Argentinian | 4 | good | Acidic wine balances acidic food. |
From this we can see that we have:
- 12,908 Red Wine and Food Pairings
- 11,193 White Wine and Food Pairings
All wines other than red and white have been dropped, and the values in quality_label have changed from Terrible, Poor, Neutral, Good, Excellent to only bad, neutral, and good. Now we are going to create our chemical categories to connect this to the chemical properties dataset.
We can examine the description column from df_pairing to explore the relationship between the chemical properties of wine and the key characteristics of food.
The description column provides brief explanations that justify why certain wine–food combinations succeed or fail. These descriptions capture common pairing heuristics (such as acidity balance, tannin–fat interactions, and flavor intensity matching) and also include a few deliberately bad examples for contrast.
The table acts as a reference for understanding the logic behind pairing ratings.
# The different descriptions
df_pairing.groupby('description').size()
| 0 | |
|---|---|
| description | |
| Acidic wine balances acidic food. | 1850 |
| Acidic wine balances acidic food.; Dry table wine clashes with dessert sweetness. | 153 |
| Crisp acidity suits seafood. | 587 |
| Deliberately bad pairing example for contrast. | 3845 |
| Delicate wine overwhelmed by red meat. | 1254 |
| Dry table wine clashes with dessert sweetness. | 441 |
| Heavy wine can dominate poultry. | 97 |
| Heuristic pairing assessment. | 9035 |
| High tannin intensifies spice heat. | 48 |
| Idealized perfect pairing example for contrast. | 3694 |
| Light red (Pinot) with salmon works. | 48 |
| Light/medium red suits richer poultry/pork. | 238 |
| Lighter wines fit poultry. | 287 |
| Low-acid wine seems flabby vs acids. | 190 |
| Off-dry sweetness calms spice. | 50 |
| Richer body matches creamy textures. | 294 |
| Tannic red complements red meat fat. | 780 |
| Tannic reds clash with delicate seafood. | 340 |
| Too lean for creamy dish. | 870 |
Extracting Chemical Profiles from Expert Descriptions¶
Our pairing dataset includes a limited set of standardized descriptions. Each one highlights a specific chemical characteristic relevant to food pairing. By analyzing these descriptions, we can infer the wine's chemical profile.
- "Acidic wine balances acidic food" - High acidity wines
- "Crisp acidity suits seafood" - High acidity, lighter body
- "Tannic red complements red meat fat" - High tannins (sulphates)
- "Richer body matches creamy textures" - Full-bodied wines
- "Off-dry sweetness calms spice" - Wines with residual sugar
- "Delicate wine overwhelmed by red meat" - Light body, low tannins
We are creating four inference functions to extract a certain chemical property from each description. When descriptions don't explicitly mention something, we default to a middle value to avoid extreme assumptions.
# Function to infer the acidity of the wine
def infer_acidity(description):
d = description.lower()
if "low-acid" in d:
return "Low"
if "acidic" in d or "crisp acidity" in d:
return "High"
return "Medium"
# Function to infer the sweetness of the wine
def infer_sweetness(description):
d = description.lower()
if "off-dry" in d:
return "Off-Dry"
if "dry" in d:
return "Dry"
return "Off-Dry"
# Function to infer the body of the wine
def infer_body(description):
d = description.lower()
if "light" in d or "delicate" in d or "lean" in d:
return "Light"
if "heavy" in d or "rich" in d or "full" in d:
return "Full"
return "Medium"
# Function to infer the Tannin Proxy of the wine
def infer_tannin(description):
d = description.lower()
if "tannic" in d or "high tannin" in d:
return "High"
if "delicate wine overwhelmed" in d:
return "Low"
return "Medium"
# Applying these functions to get values for acidity, sweetness, body, and tannin proxy for the wines
df_pairing["acidity_level"] = df_pairing["description"].apply(infer_acidity)
df_pairing["sweetness_level"] = df_pairing["description"].apply(infer_sweetness)
df_pairing["body_level"] = df_pairing["description"].apply(infer_body)
df_pairing["tannin_proxy"] = df_pairing["description"].apply(infer_tannin)
df_pairing.head()
| wine_type | wine_category | food_item | food_category | cuisine | pairing_quality | quality_label | description | acidity_level | sweetness_level | body_level | tannin_proxy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Syrah/Shiraz | Red | smoked sausage | Smoky BBQ | Spanish | 2 | bad | Heuristic pairing assessment. | Medium | Off-Dry | Medium | Medium |
| 1 | Grenache | Red | charcuterie board | Salty Snack | French | 3 | neutral | Heuristic pairing assessment. | Medium | Off-Dry | Medium | Medium |
| 2 | Cabernet Sauvignon | Red | roast lamb | Red Meat | Mexican | 5 | good | Tannic red complements red meat fat. | Medium | Off-Dry | Medium | High |
| 3 | Viognier | White | duck à l’orange | Poultry | Vietnamese | 2 | bad | Heuristic pairing assessment. | Medium | Off-Dry | Medium | Medium |
| 4 | Pinot Noir | Red | citrus salad | Acidic | Argentinian | 4 | good | Acidic wine balances acidic food. | High | Off-Dry | Medium | Medium |
These inferred categories allow us to match the pairing dataset with our chemical dataset.
Creating Matching Chemical Categories in the Wine Quality Dataset¶
We need to create equivalent categories in the chemical properties dataset. This requires examining the actual chemical properties and determining appropriate thresholds.
We first visualize how the key chemical properties are distributed across red and white wines.
# Chemical property distributions by wine type
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors = {
'red': 'darkred',
'white': 'gold'
}
# Fixed Acidity
red_data = df_wines[df_wines['type'] == 'red']['fixed acidity']
white_data = df_wines[df_wines['type'] == 'white']['fixed acidity']
axes[0, 0].hist(red_data, bins=30, alpha=0.5, label='Red',
color=colors['red'], linewidth=0.5)
axes[0, 0].hist(white_data, bins=30, alpha=0.5, label='White',
color=colors['white'], linewidth=0.5)
axes[0, 0].set_title('Fixed Acidity Distribution', fontweight='bold', fontsize=12)
axes[0, 0].set_xlabel('Fixed Acidity', fontweight='bold')
axes[0, 0].set_ylabel('Count', fontweight='bold')
axes[0, 0].legend(loc='upper right')
axes[0, 0].grid(axis='y', alpha=0.3)
# Residual Sugar
red_data = df_wines[df_wines['type'] == 'red']['residual sugar']
white_data = df_wines[df_wines['type'] == 'white']['residual sugar']
axes[0, 1].hist(red_data, bins=30, alpha=0.5, label='Red',
color=colors['red'], linewidth=0.5)
axes[0, 1].hist(white_data, bins=30, alpha=0.5, label='White',
color=colors['white'], linewidth=0.5)
axes[0, 1].set_title('Residual Sugar Distribution', fontweight='bold', fontsize=12)
axes[0, 1].set_xlabel('Residual Sugar', fontweight='bold')
axes[0, 1].set_ylabel('Count', fontweight='bold')
axes[0, 1].legend(loc='upper right')
axes[0, 1].grid(axis='y', alpha=0.3)
# Alcohol %
red_data = df_wines[df_wines['type'] == 'red']['alcohol']
white_data = df_wines[df_wines['type'] == 'white']['alcohol']
axes[1, 0].hist(red_data, bins=30, alpha=0.5, label='Red',
color=colors['red'], linewidth=0.5)
axes[1, 0].hist(white_data, bins=30, alpha=0.5, label='White',
color=colors['white'], linewidth=0.5)
axes[1, 0].set_title('Alcohol % Distribution', fontweight='bold', fontsize=12)
axes[1, 0].set_xlabel('Alcohol', fontweight='bold')
axes[1, 0].set_ylabel('Count', fontweight='bold')
axes[1, 0].legend(loc='upper right')
axes[1, 0].grid(axis='y', alpha=0.3)
# Sulphates
red_data = df_wines[df_wines['type'] == 'red']['sulphates']
white_data = df_wines[df_wines['type'] == 'white']['sulphates']
axes[1, 1].hist(red_data, bins=30, alpha=0.5, label='Red',
color=colors['red'], linewidth=0.5)
axes[1, 1].hist(white_data, bins=30, alpha=0.5, label='White',
color=colors['white'], linewidth=0.5)
axes[1, 1].set_title('Sulphates Distribution', fontweight='bold', fontsize=12)
axes[1, 1].set_xlabel('Sulphates', fontweight='bold')
axes[1, 1].set_ylabel('Count', fontweight='bold')
axes[1, 1].legend(loc='upper right')
axes[1, 1].grid(axis='y', alpha=0.3)
plt.suptitle('Chemical Property Distributions by Wine Type',
fontweight='bold', fontsize=14, y=1.00)
plt.tight_layout()
plt.show()
From these distributions we can see that:
Fixed Acidity (to measure Acidity Level):
- White wines sit in the lower range, with a sharp peak around 6.5-7.0 g/L
- Red wines have a higher and flatter distribution centered around 7.5-8.5 g/L
- There is a clear separation between the two: red wines have higher fixed acidity than whites.
- Selecting the bins:
- Low (<7): Primarily whites, some softer red wines
- Medium (7-9): Mixed selection, but more reds than whites.
- High (>9): Almost exclusively red wines.
Residual Sugar (to measure Sweetness Level):
- The distribution is extremely right-skewed; most wines fall around 0-5 g/L.
- White wines show more variation.
- White wines generally have higher residual sugar, while red wines are nearly all dry.
- Selecting the bins:
- Dry (<2): Majority of all wine types
- Off-Dry (2-10): Mostly white wines
- Sweet (>10): More of the dessert wines
Alcohol Content (to measure Body Level):
- Red and white wines have their peak alcohol percentage around 9-10%.
- Higher alcohol percentage suggests a fuller body.
- Selecting the bins:
- Light (<10%): The more delicate wines
- Medium (10-11.5%): Pretty mixed
- Full (>11.5%): Predominantly red wines with a richer body
Sulphates (to measure Tannin Level):
- There is a clear separation between the two. Red wines typically have a higher tannin proxy than white wines
- White wines cluster around 0.4-0.5 g/L
- Red wines have a broader distribution, generally from 0.4 to 0.8 g/L.
- Selecting the bins:
- Low (<0.5): Almost all white wines
- Medium (0.5-0.7): Mix of both
- High (>0.7): Exclusively reds with higher tannins
Creating Chemical Categories¶
Based on these distributions, we can categorize each wine:
# Create chemical categories for pairing analysis
df_wines['acidity_level'] = pd.cut(df_wines['fixed acidity'],
bins=[0, 7, 9, 15],
labels=['Low', 'Medium', 'High'])
df_wines['sweetness_level'] = pd.cut(df_wines['residual sugar'],
bins=[0, 2, 10, 100],
labels=['Dry', 'Off-Dry', 'Sweet'])
df_wines['body_level'] = pd.cut(df_wines['alcohol'],
bins=[0, 10, 11.5, 15],
labels=['Light', 'Medium', 'Full'])
df_wines['tannin_proxy'] = pd.cut(df_wines['sulphates'],
bins=[0, 0.5, 0.7, 2],
labels=['Low', 'Medium', 'High'])
df_wines.head()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type | acidity_level | sweetness_level | body_level | tannin_proxy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red | Medium | Dry | Light | Medium |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | red | Medium | Off-Dry | Light | Medium |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | red | Medium | Off-Dry | Light | Medium |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | red | High | Dry | Light | Medium |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red | Medium | Dry | Light | Medium |
These categories now match the pairing dataset, allowing us to connect chemical measurements to food pairing recommendations. Each wine in the Chemical Properties dataset now has the precise chemical values and the general categories.
# Define ordinal encodings for style categories
acidity_map = {'Low': 0, 'Medium': 1, 'High': 2}
sweet_map = {'Dry': 0, 'Off-Dry': 1, 'Sweet': 2}
body_map = {'Light': 0, 'Medium': 1, 'Full': 2}
tannin_map = {'Low': 0, 'Medium': 1, 'High': 2}
def encode_style(df): # Convert columns into numeric codes to compute distance
encoded = {'acid_num': df['acidity_level'].map(acidity_map),
'sweet_num': df['sweetness_level'].map(sweet_map),
'body_num': df['body_level'].map(body_map),
'tannin_num': df['tannin_proxy'].map(tannin_map)}
encoded_df = pd.DataFrame(encoded)
return encoded_df
style_cols = ['acidity_level', 'sweetness_level', 'body_level', 'tannin_proxy']
# Aggregate df_pairing into style-level food lists per pairing_label
agg = (df_pairing
.groupby(style_cols + ['quality_label'])['food_category']
.apply(lambda s: sorted(set(s.dropna()))) # set to ensure no duplicates
.reset_index())
# Pivot table so that each col has good, neutral, bad
pivot = agg.pivot_table(index=style_cols,
                        columns='quality_label',
                        values='food_category',
                        aggfunc=lambda x: x.iloc[0]).reset_index()  # each group holds exactly one list
# Making sure that all exist
for col in ['good', 'neutral', 'bad']:
if col not in pivot.columns:
pivot[col] = None
# Clean rules table
pivot = pivot[['acidity_level', 'sweetness_level', 'body_level', 'tannin_proxy',
'good', 'neutral', 'bad']]
rules = pivot.copy()
rules.head()
| quality_label | acidity_level | sweetness_level | body_level | tannin_proxy | good | neutral | bad |
|---|---|---|---|---|---|---|---|
| 0 | High | Dry | Medium | Medium | [Dessert] | [Dessert] | [Dessert] |
| 1 | High | Off-Dry | Medium | Medium | [Acidic, Seafood] | [Acidic, Seafood] | NaN |
| 2 | Low | Off-Dry | Medium | Medium | NaN | [Acidic] | [Acidic] |
| 3 | Medium | Dry | Medium | Medium | NaN | [Dessert] | [Dessert] |
| 4 | Medium | Off-Dry | Full | Medium | [Creamy, Pork] | [Creamy, Pork, Poultry] | [Poultry] |
This creates a rules table where each row represents a unique chemical profile and the columns list the foods that pair well, neutrally, or poorly with that specific profile.
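Not every possible style profile appears as a row in this table. A quick membership check (on toy data mirroring the four style columns) illustrates the gap that the next step has to bridge:

```python
import pandas as pd

# Toy rules table with the same four style columns used above
rules = pd.DataFrame({'acidity_level': ['High', 'Medium'],
                      'sweetness_level': ['Dry', 'Off-Dry'],
                      'body_level': ['Medium', 'Full'],
                      'tannin_proxy': ['Medium', 'Medium']})

# A wine profile that no rule covers exactly
profile = ('Low', 'Dry', 'Light', 'High')
known = set(zip(rules['acidity_level'], rules['sweetness_level'],
                rules['body_level'], rules['tannin_proxy']))
print(profile in known)  # → False
```

Because exact lookups can miss, we need a way to find the most similar covered profile instead of failing outright.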
Matching Wines to Rules via Nearest Neighbor¶
Since not every style combination appears in the pairing data, we use Euclidean distance to find the most similar style profile.
# Numerical encodings for rules and wines
rules_style_num = encode_style(rules)
wines_style_num = encode_style(df_wines)
# Convert to numpy to calculate distance
rules_X = rules_style_num.to_numpy()
wines_X = wines_style_num.to_numpy()
# Find index of closest style rule by euclidean distance
nearest_idx = []
for i in range(len(df_wines)):
diff = rules_X - wines_X[i]
dist = (diff ** 2).sum(axis=1)
nearest_idx.append(dist.argmin())
nearest_idx = np.array(nearest_idx)
This guarantees every wine receives a recommendation: when a wine's exact style profile is missing from the rules table, we fall back to the closest one.
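The per-wine loop above can also be written with NumPy broadcasting, which avoids Python-level iteration (a sketch with made-up 4-feature encodings, not the notebook's actual arrays):

```python
import numpy as np

rules_X = np.array([[0, 0, 1, 1],   # toy rule profiles (rows = rules)
                    [2, 1, 1, 1],
                    [1, 0, 2, 2]])
wines_X = np.array([[2, 1, 1, 0],   # toy wine profiles (rows = wines)
                    [0, 0, 1, 1]])

# Broadcast to shape (n_wines, n_rules, n_features), then reduce over features
diff = wines_X[:, None, :] - rules_X[None, :, :]
dist = (diff ** 2).sum(axis=2)      # squared Euclidean distance matrix
nearest_idx = dist.argmin(axis=1)   # index of the closest rule for each wine
print(nearest_idx)  # → [1 0]
```

Squared distance suffices for `argmin`, since the square root is monotonic and would not change which rule is closest.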
Assigning Pairing Recommendations¶
Now we can assign each wine its pairing recommendations based on the matched rule:
# Convert rule columns to numpy arrays
good_arr = rules['good'].to_numpy()
neutral_arr = rules['neutral'].to_numpy()
bad_arr = rules['bad'].to_numpy()
# Use nearest_idx to pick the rule for each wine
good_vals = [g if isinstance(g, list) else []
for g in good_arr[nearest_idx]]
neutral_vals = [g if isinstance(g, list) else []
for g in neutral_arr[nearest_idx]]
bad_vals = [g if isinstance(g, list) else []
for g in bad_arr[nearest_idx]]
# Assign these cols to df_wines
df_wines['good_foods_to_pair_with'] = good_vals
df_wines['neutral_foods_to_pair_with'] = neutral_vals
df_wines['bad_foods_to_pair_with'] = bad_vals
df_wines[['good_foods_to_pair_with',
'neutral_foods_to_pair_with',
'bad_foods_to_pair_with']].head()
| good_foods_to_pair_with | neutral_foods_to_pair_with | bad_foods_to_pair_with | |
|---|---|---|---|
| 0 | [] | [Dessert] | [Dessert] |
| 1 | [Pork, Poultry, Seafood] | [Creamy, Pork, Poultry, Seafood] | [Creamy, Pork] |
| 2 | [Pork, Poultry, Seafood] | [Creamy, Pork, Poultry, Seafood] | [Creamy, Pork] |
| 3 | [Dessert] | [Dessert] | [Dessert] |
| 4 | [] | [Dessert] | [Dessert] |
Each wine now has three lists of food categories representing pairing quality.
Removing Duplicates¶
Now we need to clean these lists so that each food category appears in only one column per wine.
def clean_lists(row):
g = set(row['good_foods_to_pair_with'] or [])
n = set(row['neutral_foods_to_pair_with'] or []) - g
b = set(row['bad_foods_to_pair_with'] or []) - g - n
return pd.Series([list(g), list(n), list(b)],
index=['good_foods_to_pair_with', 'neutral_foods_to_pair_with', 'bad_foods_to_pair_with'])
df_wines[['good_foods_to_pair_with',
'neutral_foods_to_pair_with',
'bad_foods_to_pair_with']] = df_wines.apply(clean_lists, axis=1)
df_wines.head()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type | acidity_level | sweetness_level | body_level | tannin_proxy | good_foods_to_pair_with | neutral_foods_to_pair_with | bad_foods_to_pair_with | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red | Medium | Dry | Light | Medium | [] | [Dessert] | [] |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | red | Medium | Off-Dry | Light | Medium | [Seafood, Pork, Poultry] | [Creamy] | [] |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | red | Medium | Off-Dry | Light | Medium | [Seafood, Pork, Poultry] | [Creamy] | [] |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | red | High | Dry | Light | Medium | [Dessert] | [] | [] |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red | Medium | Dry | Light | Medium | [] | [Dessert] | [] |
Now each wine in the dataset has a personalized food pairing recommendation based on its chemical profile.
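As a hypothetical usage sketch (the helper below is not part of the pipeline; only the recommendation column names from above are assumed, on a toy frame), a single wine's lists can be summarized like this:

```python
import pandas as pd

# Toy frame mimicking the final recommendation columns built above
df_wines = pd.DataFrame({
    'type': ['red', 'white'],
    'quality': [5, 6],
    'good_foods_to_pair_with': [['Red Meat'], ['Seafood', 'Poultry']],
    'bad_foods_to_pair_with': [['Seafood'], ['Red Meat']],
})

def recommend(df, idx):
    """Summarize the pairing recommendation for the wine at positional index idx."""
    row = df.iloc[idx]
    return (f"{row['type']} wine (quality {row['quality']}): "
            f"pair with {', '.join(row['good_foods_to_pair_with'])}; "
            f"avoid {', '.join(row['bad_foods_to_pair_with'])}")

print(recommend(df_wines, 1))
# → white wine (quality 6): pair with Seafood, Poultry; avoid Red Meat
```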
Exploratory Data Analysis¶
Our goal is to understand how chemical properties differ between wines, how those properties affect quality, how they influence the food categories a wine pairs with, and how pairing versatility varies across wine qualities.
Chemical Properties by Wine Type¶
We begin by examining how the four key chemical categories used in pairing analysis (acidity, sweetness, body, and tannins) differ between red and white wines. These differences reflect the sensory structure of wines and influence traditional pairing rules.
# Chemical categories visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 9))
# Acidity levels by wine type
pd.crosstab(df_wines['type'], df_wines['acidity_level']).plot(
kind='bar', ax=axes[0, 0],
color=['lightblue', 'skyblue', 'darkblue'],
alpha=0.7, edgecolor='black'
)
axes[0, 0].set_title('Acidity Levels by Wine Type', fontweight='bold')
axes[0, 0].set_xlabel('Wine Type')
axes[0, 0].set_ylabel('Count')
axes[0, 0].tick_params(axis='x', rotation=0)
axes[0, 0].legend(title='Acidity', loc='upper right')
axes[0, 0].grid(axis='y', alpha=0.3)
# Sweetness levels by wine type
pd.crosstab(df_wines['type'], df_wines['sweetness_level']).plot(
kind='bar', ax=axes[0, 1],
color=['tan', 'wheat', 'gold'],
alpha=0.7, edgecolor='black'
)
axes[0, 1].set_title('Sweetness Levels by Wine Type', fontweight='bold')
axes[0, 1].set_xlabel('Wine Type')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=0)
axes[0, 1].legend(title='Sweetness', loc='upper right')
axes[0, 1].grid(axis='y', alpha=0.3)
# Body levels by wine type
pd.crosstab(df_wines['type'], df_wines['body_level']).plot(
kind='bar', ax=axes[1, 0],
color=['lightcoral', 'coral', 'darkred'],
alpha=0.7, edgecolor='black'
)
axes[1, 0].set_title('Body Levels by Wine Type', fontweight='bold')
axes[1, 0].set_xlabel('Wine Type')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=0)
axes[1, 0].legend(title='Body', loc='upper right')
axes[1, 0].grid(axis='y', alpha=0.3)
# Tannin levels by wine type
pd.crosstab(df_wines['type'], df_wines['tannin_proxy']).plot(
kind='bar', ax=axes[1, 1],
color=['lightgreen', 'mediumseagreen', 'darkgreen'],
alpha=0.7, edgecolor='black'
)
axes[1, 1].set_title('Tannin Levels by Wine Type', fontweight='bold')
axes[1, 1].set_xlabel('Wine Type')
axes[1, 1].set_ylabel('Count')
axes[1, 1].tick_params(axis='x', rotation=0)
axes[1, 1].legend(title='Tannin', loc='upper right')
axes[1, 1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Summary

- White wines exhibit higher counts in high acidity and lighter body levels
- Red wines exhibit much higher tannin levels and fuller bodies
- Sweetness is distributed differently across wine types, with whites more commonly Off-Dry or Sweet

Interpretation

These chemical differences align with established pairing logic:

- Red wines’ higher tannins and fuller bodies make them ideal for red meat and hearty dishes
- White wines’ higher acidity and lighter texture make them better suited for seafood, poultry, and creamy foods
Chemical Properties by Wine Quality¶
Wine quality is a high-level indicator of balance, complexity, and structure. Here, we visualize how chemical variables shift across quality scores (3–9) to see whether higher-quality wines exhibit distinct chemical signatures.
# Chemical properties vs Quality visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 9))
# Acidity vs Quality
df_wines.boxplot(column='fixed acidity', by='quality', ax=axes[0, 0])
axes[0, 0].set_title('Acidity by Quality Category', fontweight='bold')
axes[0, 0].set_xlabel('Quality Category')
axes[0, 0].set_ylabel('Fixed Acidity (g/L)')
plt.sca(axes[0, 0])
# Residual Sugar vs Quality
df_wines.boxplot(column='residual sugar', by='quality', ax=axes[0, 1])
axes[0, 1].set_title('Residual Sugar by Quality Category', fontweight='bold')
axes[0, 1].set_xlabel('Quality Category')
axes[0, 1].set_ylabel('Residual Sugar (g/L)')
axes[0, 1].set_ylim(0, 20) # Focus on main distribution
plt.sca(axes[0, 1])
# Alcohol vs Quality
df_wines.boxplot(column='alcohol', by='quality', ax=axes[1, 0])
axes[1, 0].set_title('Alcohol Content by Quality Category', fontweight='bold')
axes[1, 0].set_xlabel('Quality Category')
axes[1, 0].set_ylabel('Alcohol (%)')
plt.sca(axes[1, 0])
# Tannins vs Quality
df_wines.boxplot(column='sulphates', by='quality', ax=axes[1, 1])
axes[1, 1].set_title('Tannins (Sulphates Proxy) by Quality Category', fontweight='bold')
axes[1, 1].set_xlabel('Quality Category')
axes[1, 1].set_ylabel('Sulphates (g/L)')
plt.sca(axes[1, 1])
plt.tight_layout()
plt.show()
Summary

- Alcohol content rises slightly with quality
- Acidity, residual sugar, and tannin show wide internal variation
- There is substantial overlap between quality levels

Interpretation

- Quality influences chemistry, but not strongly enough to cleanly separate wines
- This suggests that quality alone cannot explain food pairing behavior: a more detailed view of chemistry is required
Wine Type vs Good Food Pairings¶
This heatmap visualizes how red and white wines distribute their GOOD food pairings across the food categories. It gives a clear view of each wine category's pairing tendencies and helps reveal whether certain foods are consistently better matches for reds, whites, or both.
# Explode good pairings: one row per (wine, food)
df_good = (
    df_wines[["type", "good_foods_to_pair_with"]]
    .explode("good_foods_to_pair_with")
    .dropna(subset=["good_foods_to_pair_with"])
    .rename(columns={
        "type": "wine_category",
        "good_foods_to_pair_with": "food_category"
    })
)
# Crosstab counts: wine_category x food_category
cross = pd.crosstab(df_good["wine_category"], df_good["food_category"])
# Normalize rows => convert to proportions
cross_norm = cross.div(cross.sum(axis=1), axis=0)
# Heatmap (Normalized)
plt.figure(figsize=(12, 8))
sns.heatmap(
cross_norm,
cmap="Spectral",
annot=True,
fmt=".2f",
linewidths=0.5,
linecolor="white"
)
plt.xlabel("Food Category")
plt.ylabel("Wine Category")
plt.title("Wine Category vs Good Food Pairings")
plt.tight_layout()
plt.show()
Summary

- Both red and white wines pair most often with Pork and Seafood
- White wines show stronger associations with Creamy and Poultry dishes
- Red wines show higher proportions for Acidic dishes
- Very low proportions appear for Spicy, Smoky BBQ, Salty Snack, and Vegetarian for both wine types

Interpretation

- Wine category meaningfully shapes good pairing patterns, but not in strictly traditional ways (e.g., red wines pairing well with seafood)
- The overlap across categories suggests that wine type alone cannot predict perfect pairings
- This reinforces the need to consider wine chemistry (acidity, sweetness, body, tannins) to understand and model pairing behavior more accurately
Chemical Properties vs Good Food Pairings¶
To understand how specific wine characteristics influence pairing success, we examine how four key chemical properties (acidity, sweetness, body, and tannin level) relate to the distribution of good food pairings in the dataset. Each heatmap visualizes how often wines with a given chemical trait successfully pair with each food category.
# Explode good foods so every food pairing is a row
df_good = (
    df_wines[["acidity_level", "sweetness_level", "body_level", "tannin_proxy", "good_foods_to_pair_with"]]
    .explode("good_foods_to_pair_with")
    .dropna(subset=["good_foods_to_pair_with"])
    .rename(columns={"good_foods_to_pair_with": "food_category"})
)
# Chemical properties and matching color palettes
chem_props = {
    "acidity_level": {
        "title": "Acidity Level",
        "colors": ["lightblue", "skyblue", "darkblue"]
    },
    "sweetness_level": {
        "title": "Sweetness Level",
        "colors": ["tan", "wheat", "gold"]
    },
    "body_level": {
        "title": "Body Level",
        "colors": ["lightcoral", "coral", "darkred"]
    },
    "tannin_proxy": {
        "title": "Tannin Level",
        "colors": ["lightgreen", "mediumseagreen", "darkgreen"]
    }
}
# Subplot figure in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
for ax, (col, cfg) in zip(axes.flat, chem_props.items()):
    # Build crosstab of chemical level vs food category
    ct = pd.crosstab(df_good[col], df_good["food_category"])
    # Heatmap
    sns.heatmap(
        ct,
        cmap=sns.color_palette(cfg["colors"], as_cmap=True),
        annot=True,
        fmt="d",
        linewidths=0.5,
        linecolor="white",
        cbar=False,
        ax=ax
    )
    # Titles + labels
    ax.set_title(f"{cfg['title']} vs Food Category\n(Good Pairings)", fontweight="bold")
    ax.set_xlabel("Food Category")
    ax.set_ylabel(cfg["title"])
plt.tight_layout()
plt.show()
Summary

- Acidity: Medium-acidity wines produce the most good pairings (especially with Pork and Seafood), while high acidity is more selective (Acidic dishes, Seafood)
- Sweetness: Off-Dry wines dominate good pairings across all food types, whereas Dry and Sweet wines appear far less frequently
- Body: Medium-bodied wines show the strongest and broadest pairing success, with Light body favoring Seafood and Full body aligning with Creamy and Pork dishes
- Tannin: Medium tannin levels generate the most good pairings, while high- and low-tannin wines contribute minimally
- Across all four traits, Pork and Seafood repeatedly stand out as the most common good-pairing foods, while categories like Spicy, Smoky BBQ, Salty Snack, and Vegetarian remain consistently low

Interpretation

- Acidity: Balanced acidity enhances versatility, while high or low acidity narrows pairing effectiveness
- Sweetness: A slight amount of sweetness significantly increases compatibility, making Off-Dry wines the most flexible across foods
- Body: Moderate body matches a wide range of dish intensities, demonstrating the importance of weight-matching in food–wine pairing
- Tannin: Moderate tannins are the most food-friendly; extreme tannin levels restrict which foods pair well
- Overall, these patterns show that balanced chemical properties, not wine category alone, shape pairing success, reinforcing the need to incorporate acidity, sweetness, body, and tannin into the predictive model
Frequency of Food Categories¶
Before modeling pairings, we must understand how frequently each food category appears, since class imbalance can strongly influence model performance.
plt.figure(figsize=(12, 6))
sns.countplot(data=df_pairing, x="food_category", hue="food_category", palette="tab20c")
plt.xticks(rotation=45, ha="right")
plt.xlabel("Food Category")
plt.ylabel("Count")
plt.title("Frequency of Food Categories in Pairing Dataset")
plt.tight_layout()
plt.show()
Summary

- Red Meat, Cheese, and Acidic dishes are the most frequent food categories in the dataset
- Categories like Vegetarian, Creamy, Poultry, and Seafood appear moderately often
- Smoky BBQ, Salty Snack, Spicy, and Dessert show relatively low frequencies compared to the others

Interpretation

- The dataset is imbalanced, with certain food categories (especially Red Meat and Cheese) appearing far more often than others
- This imbalance may influence model training, making some food types easier to predict and others harder
- The uneven distribution also reflects real-world pairing biases: common foods like red meat and cheese naturally appear more frequently in pairing records, while categories like dessert or spicy foods are less frequently paired with wine
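As a quick sketch of how this imbalance can be quantified (the counts below are invented for illustration; the real analysis would use df_pairing["food_category"] instead):

```python
import pandas as pd

# Invented counts for illustration; the real analysis would use
# df_pairing["food_category"] here instead.
toy = pd.DataFrame({
    "food_category": ["Red Meat"] * 5 + ["Cheese"] * 3 + ["Dessert"] + ["Spicy"]
})

# Normalized frequencies expose the imbalance directly
freq = toy["food_category"].value_counts(normalize=True)
print(freq)  # Red Meat 0.5, Cheese 0.3, Dessert 0.1, Spicy 0.1

# Imbalance ratio between the most and least common classes
ratio = freq.iloc[0] / freq.iloc[-1]
print(f"Imbalance ratio: {ratio:.1f}x")  # → Imbalance ratio: 5.0x
```

A ratio like this means a distance-based classifier sees five Red Meat examples for every Dessert example, biasing neighbourhoods toward the common class.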
Hypothesis¶
The chemical properties of a wine can be used to reliably classify the best matching food category for a given wine.
Why This Hypothesis Is Important¶
Choosing the right food pairing has a considerable impact on consumer experience, restaurant recommendations, and retail wine sales
A model that predicts food pairing based solely on chemistry would allow pairing suggestions even when wine type, region, or tasting notes are unavailable
Understanding which chemical traits drive pairing success provides insight into why certain wines pair well, not just that they do
Why This Hypothesis Is Reasonable (Based on our EDA)¶
Our EDA strongly supports this hypothesis:

- Specific chemical traits show predictable food associations. The Chemical Properties vs Good Food Pairings heatmaps reveal clear relationships:
  - Higher Acidity => Acidic dishes and Seafood
  - Higher Sweetness => Pork, Creamy, and Spicy dishes
  - Full Body => Creamy and heavy dishes
  - Higher Tannin => Richer, protein-heavy foods
  These patterns reflect real-world pairing logic, reinforcing that chemistry determines food compatibility.
- Wine category alone is insufficient. The Wine Category vs Good Food Pairings heatmap shows:
  - Red and white wines share many overlapping pairing categories
  - Some unexpected associations appear (e.g., red wines with seafood)
  This indicates that wine “type” is too broad, and more granular chemical traits must be used to predict pairings.
- Food frequencies and versatility patterns support a learnable structure:
  - Pork and Seafood consistently appear as major good-pairing foods across all chemical levels
  - Extreme categories (Smoky BBQ, Spicy, Dessert) remain consistently low
  This means the dataset has stable patterns a model can learn from.
Model: Classification - Predicting Food Category From Chemical Properties¶
Based on our EDA, we are creating a Multi-Label K-Nearest Neighbours classification model that leverages the chemical composition and food pairing datasets.
After completing our ETL, each wine in our dataset includes a list of foods it pairs well with (good_foods_to_pair_with). This will be the multi-label target for our model. Our goal is to predict these food pairings directly from the wine's chemical composition.
Model Type: Multi-Label K-Nearest Neighbours Classifier
Independent Variables:
These are from the wine chemistry dataset (df_wines) and represent measurable physicochemical properties:
- Fixed acidity — relates to sourness and pairing with acidic foods
- Volatile acidity — aroma sharpness, influences light vs. rich food pairings
- Citric acid — enhances freshness, often linked to seafood pairings
- Residual sugar — sweetness level, important for dessert/spicy pairings
- Chlorides — minor impact but relates to saltiness perception
- Free sulfur dioxide — freshness/preservation
- Total sulfur dioxide — stability and flavor
- Density — proxy for sugar and alcohol balance
- pH — overall acidity, critical in many food interactions
- Sulphates — proxy for tannin structure
- Alcohol % — determines body/weight and richness
Dependent Variable:
good_foods_to_pair_with: the list of food categories, drawn from the 12 possible categories (Red Meat, Seafood, Poultry, Cheese, Dessert, Spicy, Vegetarian, Acidic, Smoky BBQ, Salty Snack, Creamy, Pork).
A single wine can pair well with multiple foods, making this a multi-label prediction problem.
Why this model?
- Multi-label output: wines naturally pair with multiple foods.
- Chemically similar wines give similar food pairings: KNN directly encodes this intuition by finding wines with similar chemical properties.
- Food pairing patterns are complex and non-linear, and KNN can capture such relationships without assuming a functional form.
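This intuition can be sketched on toy data; the wines, feature values, and pairings below are invented for illustration only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented wines described by (residual sugar g/L, alcohol %)
X_toy = np.array([[1.5, 13.5],   # dry, full
                  [2.0, 13.0],   # dry, full
                  [9.0, 9.5],    # off-dry, light
                  [10.0, 9.0]])  # off-dry, light
pairings = [["Red Meat", "Cheese"], ["Red Meat", "Cheese"],
            ["Seafood", "Spicy"], ["Seafood", "Spicy"]]

# Encode the list-of-lists target as a binary label matrix
mlb = MultiLabelBinarizer()
Y_toy = mlb.fit_transform(pairings)

# KNeighborsClassifier handles multi-label targets natively
knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy, Y_toy)

# A new dry, full-bodied wine inherits its nearest neighbour's pairings
pred = knn.predict([[1.8, 13.2]])
print(mlb.inverse_transform(pred))  # → [('Cheese', 'Red Meat')]
```

In the real model, the features are the eleven standardized chemical properties rather than this two-dimensional toy.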
Creating the K-Nearest Neighbours Classifier¶
Our dataset for the KNN model has:
- Features (X): the chemical properties of each wine
- Target (Y): the list of food categories that each wine pairs well with (good_foods_to_pair_with)
We only keep wines with at least one good pairing, ensuring every training example has a non-empty target.
# Imports
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, f1_score, hamming_loss, jaccard_score,
                             precision_score, recall_score, classification_report)
# Keep only wines with at least one good food pairing
df_ml = df_wines[df_wines["good_foods_to_pair_with"].apply(lambda x: len(x) > 0)].copy()
print("Number of wines with at least one good pairing:", len(df_ml))
# Chemical feature columns
feature_cols = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"]
# Features and Target
X = df_ml[feature_cols].values
y_raw = df_ml["good_foods_to_pair_with"] # list-of-lists
# Multilabel binarizer
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_raw)
print("Food label classes:", mlb.classes_)
Number of wines with at least one good pairing: 1359
Food label classes: ['Acidic' 'Cheese' 'Creamy' 'Dessert' 'Pork' 'Poultry' 'Red Meat' 'Salty Snack' 'Seafood' 'Smoky BBQ' 'Spicy' 'Vegetarian']
Multi-Label Binarization¶
The MultiLabelBinarizer converts our list-of-foods format into a binary matrix that we can use to train the model. It one-hot encodes the labels: each column represents a food category, and a 1 indicates that the wine pairs well with that food.
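A minimal illustration of the encoding, using invented pairing lists (not real rows from df_wines):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented pairing lists (not real rows from df_wines)
pairings = [["Seafood", "Poultry"],
            ["Red Meat"],
            ["Cheese", "Red Meat", "Seafood"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(pairings)

# Columns are the sorted label names
print(mlb.classes_)  # → ['Cheese' 'Poultry' 'Red Meat' 'Seafood']
print(Y)
# → [[0 1 0 1]
#    [0 0 1 0]
#    [1 0 1 1]]
```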
Train–test split¶
We split the data into training (80%) and test (20%) sets. The model is trained on the training set and evaluated on the test set, which is held out so the model never sees it during training.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2, random_state=42)
print(f"Training Set Size: {X_train.shape[0]} wines ({X_train.shape[0] / len(X) * 100:.1f}%)")
print(f"Test Set Size: {X_test.shape[0]} wines ({X_test.shape[0] / len(X) * 100:.1f}%)")
Training Set Size: 1087 wines (80.0%)
Test Set Size: 272 wines (20.0%)
Standardizing Chemical Properties¶
Since KNN is a distance-based algorithm, it calculates how 'close' wines are to each other based on their chemical properties. However, the chemical properties have different scales: for example, alcohol is a percentage, chlorides are measured in g/L, and total sulfur dioxide in mg/L.
Without standardization, the properties with larger numeric ranges would dominate the distance calculation, even if they were less important. Standardizing weighs all the chemical properties equally, allowing the model to learn which ones are truly important for pairing.
We fit the scaler on the training data only, then apply the same transformation to both sets, to avoid any data leakage from the test set.
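A minimal sketch of why scaling matters, using made-up values on roughly realistic scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up wines: (chlorides in g/L, total sulfur dioxide in mg/L)
a = np.array([0.05, 100.0])
b = np.array([0.10, 140.0])

# Raw Euclidean distance is dominated by the large-scale feature
raw_dist = np.linalg.norm(a - b)
print(f"{raw_dist:.4f}")  # → 40.0000 (chlorides contribute almost nothing)

# After z-scoring each feature, both contribute comparably
X_demo = np.array([[0.04, 90.0], [0.05, 100.0], [0.10, 140.0], [0.08, 120.0]])
scaler = StandardScaler().fit(X_demo)
a_sc, b_sc = scaler.transform(np.array([a, b]))
print(f"{np.linalg.norm(a_sc - b_sc):.4f}")
```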
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
Finding the Optimal Value of $k$¶
The most important hyperparameter in KNN is $k$. This is the number of nearest neighbours to consider when making predictions. We will plot a graph of Model Accuracy and $k$, as well as F1 Score and $k$ to determine what the optimal value is.
We will plot a curve to see how k affects model performance.
Cross-Validation Approach¶
We also use 5-fold cross-validation while testing $k$ values from 1 to 40. We first split the data into 5 folds; then, for each $k$, we train on 4 folds and validate on the 5th, repeating 5 times with a different validation fold each time. This gives a good estimate of how each $k$ would perform on unseen data.
# 5-fold cross validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
k_values = range(1, 41)
cv_acc = [] # subset accuracy across all folds
cv_f1 = [] # macro F1 across all folds
for k in k_values:
    # Pipeline: scale features inside each fold, then apply KNN
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    # Out-of-fold predictions for the entire dataset
    Y_pred_cv = cross_val_predict(pipe, X, Y, cv=cv)
    # Metrics using all out-of-fold predictions
    acc = accuracy_score(Y, Y_pred_cv)  # subset accuracy
    f1 = f1_score(Y, Y_pred_cv, average='macro', zero_division=0)
    cv_acc.append(acc)
    cv_f1.append(f1)
# Best k's according to CV
best_k_acc = k_values[int(np.argmax(cv_acc))]
best_k_f1 = k_values[int(np.argmax(cv_f1))]
# Plot
plt.figure(figsize=(10, 8))
plt.plot(k_values, cv_acc, marker='.', label='Cross Validation Accuracy', color='crimson')
plt.plot(k_values, cv_f1, marker='.', label='Macro F1 Score', color='gold')
# Vertical line at best k
plt.axvline(best_k_f1, linestyle='--', color='green', linewidth=2,
label=f'Best k (F1) = {best_k_f1}', alpha=0.2)
plt.title("Accuracy and Macro F1 vs k for KNN Multi-Label Classification", fontsize=14)
plt.xlabel("k (Number of Neighbors)")
plt.ylabel("Score")
plt.xticks(k_values)
plt.grid(alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
print(f"Best k by subset accuracy: {best_k_acc} (accuracy = {max(cv_acc):.4f})")
print(f"Best k by macro F1: {best_k_f1} (F1 = {max(cv_f1):.4f})")
Best k by subset accuracy: 1 (accuracy = 0.8337)
Best k by macro F1: 1 (F1 = 0.8465)
Understanding Cross Validation Results & Optimal $k$¶
The graph above shows how the performance of the model changes as $k$ increases from 1 to 40, evaluated using 5-fold cross validation.
The metrics shown in the graph are Cross-Validation Accuracy (red) and Macro F1 Score (yellow). Cross-Validation Accuracy here is subset accuracy: the percentage of wines where all predicted food labels exactly match the true labels. This is a strict metric: for example, if a wine pairs well with Red Meat, Cheese, and Acidic foods, the model only gets credit if it predicts all three correctly.
The Macro F1 Score averages the F1 score over all food categories, treating each category equally. This prevents common pairings from dominating the metric.
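The difference between micro and macro averaging can be made concrete with a small sketch (labels invented): when a model ignores a rare label, the macro score drops sharply while the micro score barely notices.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented multi-label results: label 0 is common, label 1 is rare,
# and the model never predicts the rare label
Y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
Y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

micro = f1_score(Y_true, Y_pred, average='micro', zero_division=0)
macro = f1_score(Y_true, Y_pred, average='macro', zero_division=0)

print(f"micro = {micro:.2f}")  # → micro = 0.75
print(f"macro = {macro:.2f}")  # → macro = 0.43
```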
Both of these metrics peak at $k$ = 1:
- Subset Accuracy: 83.37%
- Macro F1 Score: 84.65%
The Issue With $k$ = 1¶
A value of $k$ = 1 is usually concerning because it often means the model memorizes training examples rather than learning patterns. This causes overfitting: the model works well on training data but fails on unseen examples.
However, $k$ = 1 is genuinely the optimal value for this problem. Because we used cross-validation, every score was computed on held-out folds, not training data; if $k$ = 1 were causing overfitting, we would see poor CV accuracy, and this is not the case. Averaging over $k > 1$ neighbours instead dilutes the specific wine-and-food pairing recommendations.
Additionally, we can see these trends in the graph. As $k$ increases, both metrics steadily decrease.
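An exaggerated toy example (features and labels invented) shows the dilution effect directly: when each neighbour carries different labels, the per-label majority vote used for larger $k$ can predict nothing at all.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Five invented training wines, each carrying a *different* single label
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
Y_toy = np.eye(5, dtype=int)  # one distinct label per wine

query = np.array([[0.0]])  # identical to the first training wine

# k=1 returns the matching wine's exact label set
pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, Y_toy).predict(query)
print(int(pred_k1.sum()))  # → 1 (one label predicted)

# k=5 majority-votes each label: every label loses 1-vs-4, so none survive
pred_k5 = KNeighborsClassifier(n_neighbors=5).fit(X_toy, Y_toy).predict(query)
print(int(pred_k5.sum()))  # → 0 (empty prediction)
```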
Creating a model using $k$ = 1¶
# Choose k
k = 1
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train_sc, Y_train)
KNeighborsClassifier(n_neighbors=1)
Using the model to predict food pairings on the test set¶
Y_pred = model.predict(X_test_sc)
Y_pred.shape
(272, 12)
Using The Model To Predict Foods For A Single Wine¶
Finally, we can use the trained model to predict the list of foods that a given wine pairs well with. (Note that this example wine may also appear in the training set, in which case $k$ = 1 simply returns its stored pairings.)
# Take one example wine from df_wines
x_new = df_wines.loc[0, feature_cols].values.reshape(1, -1)
# Standardize using the same scaler
x_new_sc = scaler.transform(x_new)
# Predict the multi-label output
Y_new_pred = model.predict(x_new_sc)
# Decode back to food category names
pred_foods = mlb.inverse_transform(Y_new_pred)[0]
print("Chemical properties of example wine:")
display(df_wines.loc[0, feature_cols])
print("\nPredicted good food pairings for this wine:")
print(pred_foods)
Chemical properties of example wine:
| Property | Value |
|---|---|
| fixed acidity | 7.4 |
| volatile acidity | 0.7 |
| citric acid | 0.0 |
| residual sugar | 1.9 |
| chlorides | 0.076 |
| free sulfur dioxide | 11.0 |
| total sulfur dioxide | 34.0 |
| density | 0.9978 |
| pH | 3.51 |
| sulphates | 0.56 |
| alcohol | 9.4 |
Predicted good food pairings for this wine:
('Pork', 'Poultry', 'Seafood')
Evaluating the KNN model¶
We will be using different metrics to measure how well the classification model performs.
- Subset accuracy: proportion of wines where all predicted food labels exactly match the true labels.
- Micro-averaged precision, recall, and F1: aggregate over all food labels, giving more weight to frequent foods.
- Macro-averaged precision, recall, and F1: average metric per food label, treating all food categories equally.
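To make the gap between subset accuracy and the label-level metrics concrete, a small sketch with invented labels: missing a single label makes a wine count as entirely wrong under subset accuracy, while micro recall still credits the labels that were found.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

Y_true_demo = np.array([[1, 1, 0],
                        [1, 0, 1]])
Y_pred_demo = np.array([[1, 1, 0],   # perfect row
                        [1, 0, 0]])  # one missed label

# Subset accuracy: the second wine counts as entirely wrong
subset = accuracy_score(Y_true_demo, Y_pred_demo)
print(subset)  # → 0.5

# Micro recall: 3 of the 4 true labels were still found
micro_rec = recall_score(Y_true_demo, Y_pred_demo, average="micro")
print(micro_rec)  # → 0.75
```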
# Global metrics
subset_acc = accuracy_score(Y_test, Y_pred)
micro_precision = precision_score(Y_test, Y_pred, average="micro", zero_division=0)
micro_recall = recall_score(Y_test, Y_pred, average="micro", zero_division=0)
micro_f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
macro_precision = precision_score(Y_test, Y_pred, average="macro", zero_division=0)
macro_recall = recall_score(Y_test, Y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(Y_test, Y_pred, average="macro", zero_division=0)
print("Metrics:")
print(f"Subset Accuracy: {subset_acc:.4f}\n")
print(f"Micro Precision: {micro_precision:.4f}")
print(f"Micro Recall: {micro_recall:.4f}")
print(f"Micro F1: {micro_f1:.4f}\n")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
Metrics:
Subset Accuracy: 0.8529

Micro Precision: 0.8936
Micro Recall: 0.8944
Micro F1: 0.8940

Macro Precision: 0.8692
Macro Recall: 0.8704
Macro F1: 0.8697
Subset Accuracy: 85.29%
- For more than 85% of the wines, our model predicts every good food category correctly.
Micro Precision: 89.36%
- Of all the food pairings we predicted, 89.36% were actually good pairings
- High precision matters because it means users won't waste money on bad wine choices
Micro Recall: 89.44%
- Of all the true good pairings in the test set, we successfully identified 89.44%
Macro Precision: 86.92%
- Average precision across all 12 food categories
Macro Recall: 87.04%
- Average recall across all 12 food categories
We can also look at the performance for each food category using classification_report.
print("Performance For Each Food Category:\n")
print(classification_report(Y_test, Y_pred, target_names=mlb.classes_, zero_division=0))
Performance For Each Food Category:
precision recall f1-score support
Acidic 0.83 0.81 0.82 98
Cheese 0.84 0.84 0.84 49
Creamy 0.90 0.92 0.91 97
Dessert 0.82 0.80 0.81 80
Pork 0.96 0.97 0.96 183
Poultry 0.95 0.96 0.95 135
Red Meat 0.85 0.88 0.86 58
Salty Snack 0.84 0.84 0.84 49
Seafood 0.94 0.93 0.93 184
Smoky BBQ 0.84 0.84 0.84 49
Spicy 0.84 0.84 0.84 49
Vegetarian 0.84 0.84 0.84 49
micro avg 0.89 0.89 0.89 1080
macro avg 0.87 0.87 0.87 1080
weighted avg 0.89 0.89 0.89 1080
samples avg 0.90 0.90 0.88 1080
This looks at the 12 food categories and provides individual performances for each one.
Some food categories with the highest metrics are Pork, Poultry, Seafood, Creamy, Red Meat, and Cheese.
Seafood's 0.93 F1 score and Red Meat's 0.86 F1 score suggest that the traditional pairing rules are supported by chemistry.
Confusion Matrix¶
Below we plot an aggregated confusion matrix, summing the per-category confusion matrices across all 12 food categories.
# Calculate aggregate statistics across all categories
from sklearn.metrics import multilabel_confusion_matrix
mcm = multilabel_confusion_matrix(Y_test, Y_pred)
# Sum across all categories
aggregate_cm = mcm.sum(axis=0)
# Create visualization
plt.figure(figsize=(8, 6))
sns.heatmap(aggregate_cm, annot=True, fmt='d', cmap='Spectral',
xticklabels=['Not Paired', 'Paired'],
yticklabels=['Not Paired', 'Paired'],
cbar_kws={'label': 'Count'})
plt.title('Aggregated Confusion Matrix Across All Food Categories',
fontweight='bold', fontsize=14)
plt.xlabel('Predicted', fontweight='bold', fontsize=12)
plt.ylabel('Actual', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()
# Print interpretation
tn, fp, fn, tp = aggregate_cm.ravel()
print("Aggregated Confusion Matrix:")
print(f"\nTrue Negatives (TN): {tn} - Correctly identified non-pairings")
print(f"False Positives (FP): {fp} - Falsely recommended pairings")
print(f"False Negatives (FN): {fn} - Missed good pairings")
print(f"True Positives (TP): {tp} - Correctly identified good pairings")
total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"\nMetrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
Aggregated Confusion Matrix:

True Negatives (TN): 2069 - Correctly identified non-pairings
False Positives (FP): 115 - Falsely recommended pairings
False Negatives (FN): 114 - Missed good pairings
True Positives (TP): 966 - Correctly identified good pairings

Metrics:
Accuracy: 0.9298
Precision: 0.8936
Recall: 0.8944
F1-Score: 0.8940
Across 3,264 wine-food pairing predictions (272 wines x 12 categories):
- 2,069 True Negatives: Correctly identified non-pairings
- 966 True Positives: Correctly identified good pairings
- 115 False Positives: Bad recommendations
- 114 False Negatives: Missed good pairings
Implications¶
Based on these metrics, our hypothesis is supported: the chemical properties of wine can predict complex pairing patterns.
This has several practical implications. Consumers can trust the model's recommendations and make better-informed purchasing decisions. Restaurants can build wine lists more effectively. Winemakers can adjust wine chemistry to target the food categories they wish to pair with.
Conclusion & Future Work¶
Conclusion¶
Overall, the project supported our central hypothesis: chemical properties play the most meaningful role in determining which foods pair best with a wine.
Through extensive EDA and model evaluation, we were able to establish the significant links between acidity, sweetness, body, tannin levels, and the successfully paired food categories.
Our KNN classification model demonstrated that wine chemistry does contain predictive structure. While performance was not uniform across categories, reflecting the complexity and imbalance of the dataset, it confirmed that chemical features influence pairing outcomes and can be used to classify food categories.
Strengths¶
Strong EDA Foundation: The project thoroughly explored the datasets before modeling, uncovering wine-type differences, chemical-property distributions, food category imbalances, and chemically driven pairing patterns. This provided a clear, evidence-based justification for the modeling approach.
Clear and Interpretable Features: The discretized chemistry features (acidity level, sweetness level, body level, and tannin proxy) aligned naturally with real-world pairing rules, making the model interpretable and intuitive.
Consistency between EDA and Model Insights: Patterns seen in the heatmaps (e.g., medium traits being most versatile, off-dry sweetness pairing broadly) were reflected in model predictions, reinforcing the dataset’s internal structure.
Practical and Scalable Prediction Goal: Predicting food pairings from chemistry alone mimics real recommendation systems and sommelier logic—making the work relevant beyond the academic setting.
Limitations¶
Dataset Imbalance: Some food categories (e.g., Red Meat, Cheese, Acidic dishes) were heavily overrepresented while others (Dessert, Smoky BBQ, Spicy) had very few samples, which reduces model fairness, causes the KNN to overpredict common classes, and limits evaluation of rare pairing categories.
Limited Feature Richness: The model used categorical chemical levels, not the raw numeric chemistry. This removed nuance, such as differences between moderate vs slightly-high acidity, sugar–acid balance, or alcohol effects.
KNN Model Constraints: KNN struggles when classes overlap (as they do here) and many categories exist. More advanced models would likely handle this structure better.
Incomplete Pairing Labels Due to Limited Data: We originally attempted to generate good, neutral, and bad food pairings from the dataset, which resulted in highly unstable labels. Many wine–style combinations contained no reliable negative examples, and the cleaning process (which prioritized good over neutral and bad) removed most remaining “bad” entries entirely. This limits the model’s ability to learn full-spectrum pairing behavior.
Future Work¶
Expand and Balance the Pairing Dataset: Obtaining a larger, real-world dataset with more evenly represented food categories would address the strong class imbalance and allow the model to learn more robust pairing patterns across all cuisines.
Incorporate Raw Numerical Chemistry and Feature Engineering: Using continuous chemical measurements, such as exact pH, residual sugar, acidity ratios, alcohol levels, and sulphates, would capture nuance lost in categorical labels and provide models with richer information about wine structure.
Adopt More Advanced Modeling Techniques: Tree-based models (Random Forest, XGBoost) or neural networks could better handle overlapping classes and nonlinear relationships, offering significant performance improvements over KNN.
Develop Stable Multi-Label Pairing Predictions: Future work should include reliable data on neutral and bad pairings, enabling models to recommend multiple food categories per wine and capture the full spectrum of pairing outcomes rather than only “good” matches.
Relevant Resources¶
Puckette, M. (n.d.). Food and Wine Pairing Basics (Start Here!). Wine Folly. Retrieved December 8, 2025, from https://winefolly.com/wine-pairing/getting-started-with-food-and-wine-pairing/
How Wine Folly Supports Our Findings: Wine Folly identifies acidity, sweetness, bitterness (tannins), and intensity (body) as the core drivers of food–wine interactions. This directly reinforces our conclusion that chemical composition, not wine type alone, determines pairing success.
Alignment With Project Strengths: Wine Folly’s framework validates our chemistry-based approach. Our discretized features (acidity, sweetness, body, and tannin level) parallel the exact taste dimensions the article highlights. The consistency between our heatmaps and Wine Folly’s pairing rules strengthens the credibility and interpretability of our EDA and KNN model.
Reinforcing Our Limitations: The article stresses nuance in acidity, sweetness, and intensity that we lose with categorical labels. Wine Folly also discusses clashing interactions, illuminating why sparse negative labels limited our model. Finally, the article’s diverse pairing examples highlight how imbalanced food categories can distort predictive modeling.
Implications for Future Work: Wine Folly’s principles support our recommendations: use raw chemical values, incorporate richer pairing data (including negative matches), and adopt more expressive models to capture nonlinear taste interactions. This would better reflect the multidimensional pairing logic described in Wine Folly.