
Chemistry Behind Wine And Food Pairings¶

Group Members: Hrishi Kabra and Kiet Huynh

Project Website: https://akjieettt.github.io/data-science-final-project/

Github Repository: https://github.com/akjieettt/data-science-final-project

Information & Motivation¶

Project Overview¶

Our project will investigate the relationship between the chemical composition of wines and the foods they pair with, focusing on identifying which physicochemical properties most strongly influence pairing quality and how these relationships differ between red and white wines. Our project aims to provide insights into wine production and selection that could benefit winemakers and consumers.

Research Questions¶

Primary Question: Can the chemical properties of a wine predict optimal food pairings, and do these predictions align with expert pairing assessments?

Secondary Questions:

  1. How do specific chemical properties correlate with pairing quality scores across food categories?
  2. Can we build a classification model to predict food categories from chemical profiles?
  3. Do high-quality wines (≥7) demonstrate different pairing patterns than lower-quality wines?

Background¶

Wine and food pairing is an art form that relies on centuries of culinary wisdom and sommelier expertise. Some traditional pairing patterns have held true: red wine with red meats, white wine with seafood, sweeter wine with desserts. Are these pairings based purely on tradition, or is there chemistry behind them that makes the match better?

Acidity, sugar, alcohol, and sulphur levels shape how a wine feels and tastes. For example, alcohol adds warmth and body, while acidity changes the freshness and balance of a wine.

By studying these chemical properties, we can better understand what makes good wine and food pairings instead of solely relying on human taste tests.

Motivation For This Project¶

Wine production is an art and science where chemical composition determines quality. Understanding these relationships through data science can provide valuable insights for:

  • Consumers: Choose wines for meals based on chemistry, not just color or region
  • Restaurants: Optimize wine lists to complement menu offerings using data-driven principles
  • Winemakers: Understand how production decisions affect both quality and food compatibility
  • Sommeliers: Validate traditional pairing wisdom with quantitative evidence

Collaboration Plan¶

Team Coordination:

  • Set up a private GitHub repository to coordinate all code, share datasets, and track progress
  • Each member works on separate branches to implement features, which are merged via pull requests after code review to ensure consistency

Technologies Used:

  • Version Control: Git and GitHub for source code management and collaboration
  • Development Environment: Visual Studio Code Live Share, Google Colab, and Jupyter Notebooks for data analysis and prototyping
  • Communication Tools: Small Family Collaboration Hub for offline discussions, FaceTime for online discussions and Google Docs for shared notes

Meeting Schedule:

  • Consistently meet offline 2 - 3 times per week for 1 - 3 hours per session to discuss progress, solve problems, and coordinate tasks
  • Outside of scheduled meetings, we communicate asynchronously via iMessage to stay aligned and share updates

Task Management:

  • Tasks are divided based on expertise and interest
  • Progress is tracked via a shared progress table (in a spreadsheet) to ensure deadlines are met and responsibilities are clear

Data Sources¶

Our first dataset is the Wine Quality dataset from UC Irvine's Machine Learning Repository. The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine:

  1. winequality-red: Data About Red Wines

    • Source: UC Irvine Machine Learning Repository
    • Coverage: Data about 1,599 different Red Wines
    • Output: quality rating (0–10) assigned by tasters
  2. winequality-white: Data About White Wines

    • Source: UC Irvine Machine Learning Repository
    • Coverage: Data about 4,898 different White Wines
    • Output: quality rating (0–10) assigned by tasters

Our next dataset is a Wine and Food Pairing dataset from Kaggle. This contains Wine and Food pairings scored from 1 (Terrible) to 5 (Excellent) based on compatibility of wine style, food flavor profile, and more:

  1. wine_food_pairings: Data about Wine and Food pairings
    • Source: Kaggle
    • Coverage: Data about 34,933 different wine and food pairings
    • Output: Pairing quality from 1-5 (Terrible to Excellent)

Integration Strategy¶

We wish to connect the two datasets by matching wine categories, red and white, and creating chemical profiles based on acidity, sweetness, and body levels. After this we could analyze how these chemical properties correlate with pairing success and test if quality score affects pairing versatility.
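As a minimal sketch of this connection strategy, the tables below are toy stand-ins with made-up values; they only assume the shared category columns we plan to build (`acidity_level`, `body_level`) plus a lower-cased wine type. An exact merge on the shared profile columns then attaches food categories to wines:

```python
import pandas as pd

# Toy stand-ins for the two datasets (values invented for illustration)
wines = pd.DataFrame({
    "type": ["red", "white"],
    "acidity_level": ["Medium", "High"],
    "body_level": ["Full", "Light"],
})
pairings = pd.DataFrame({
    "wine_category": ["Red", "White", "White"],
    "acidity_level": ["Medium", "High", "High"],
    "body_level": ["Full", "Light", "Light"],
    "food_category": ["Red Meat", "Seafood", "Acidic"],
})

# Align the category labels, then join on the shared profile columns
pairings["type"] = pairings["wine_category"].str.lower()
matched = wines.merge(pairings, on=["type", "acidity_level", "body_level"], how="left")
print(matched[["type", "food_category"]])
```

An exact merge like this only works when a wine's profile appears verbatim in the pairing data; for profiles with no exact match we would need a fallback, which motivates the nearest-profile matching used later.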

Imports and Loading the Data¶

In [ ]:
!git clone https://github.com/akjieettt/data-science-final-project.git
%cd data-science-final-project/
Cloning into 'data-science-final-project'...
remote: Enumerating objects: 177, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 177 (delta 41), reused 4 (delta 4), pack-reused 109 (from 1)
Receiving objects: 100% (177/177), 4.93 MiB | 14.27 MiB/s, done.
Resolving deltas: 100% (81/81), done.
/content/data-science-final-project

Loading the Wine Quality Data¶

In [ ]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the wine quality datasets
df_reds = pd.read_csv("data/winequality-red.csv", sep=";")
df_whites = pd.read_csv("data/winequality-white.csv", sep=";")

# Add wine type identifiers for red and white
df_reds['type'] = 'red'
df_whites['type'] = 'white'

# Combine the red and white wine datasets into a single df
df_wines = pd.concat([df_reds, df_whites], ignore_index=True)

# Printing the number of wines in each dataset
print(f"Total Number of Red Wines: {len(df_reds)}")
print(f"Total Number of White Wines: {len(df_whites)}")
print(f"Total Number of Wines: {len(df_wines)}")
print(f"Total Number of NaNs:")
display(df_wines.isna().sum())

print("\nThe Dataset:")
display(df_wines.head())
Total Number of Red Wines: 1599
Total Number of White Wines: 4898
Total Number of Wines: 6497
Total Number of NaNs:
0
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
type 0

The Dataset:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality type
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 red
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 red
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 red
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red

Wine Quality Dataset Overview¶

We can see above that the dataset has roughly 6,500 different wines. It also has 13 different columns, shown below with their corresponding datatypes:

  1. fixed acidity: Ratio
  2. volatile acidity: Ratio
  3. citric acid: Ratio
  4. residual sugar: Ratio
  5. chlorides: Ratio
  6. free sulfur dioxide: Ratio
  7. total sulfur dioxide: Ratio
  8. density: Ratio
  9. pH: Interval (logarithmic scale)
  10. sulphates: Ratio
  11. alcohol: Ratio
  12. quality: Ordinal
  13. type: Nominal

Some More Information About The Dataset

  • Total Samples: 6,497 wines
  • Red Wines: 1,599 samples
  • White Wines: 4,898 samples
  • NaNs: Since there are no NaNs in this dataset, there is no need to modify it further for now.

Loading the Wine and Food Pairing Data¶

In [ ]:
# Loading the wine and food pairing data
df_pairing = pd.read_csv("data/wine_food_pairings.csv")

print(f"Total Number of Wine and Food Pairings: {len(df_pairing)}\n")
print(f"Total Number of NaNs:")
display(df_pairing.isna().sum())

print("\nThe Dataset:")
display(df_pairing.head())
Total Number of Wine and Food Pairings: 34933

Total Number of NaNs:
0
wine_type 0
wine_category 0
food_item 0
food_category 0
cuisine 0
pairing_quality 0
quality_label 0
description 0

The Dataset:
wine_type wine_category food_item food_category cuisine pairing_quality quality_label description
0 Syrah/Shiraz Red smoked sausage Smoky BBQ Spanish 2 Poor Heuristic pairing assessment.
1 Grenache Red charcuterie board Salty Snack French 3 Neutral Heuristic pairing assessment.
2 Madeira Fortified lemon tart Dessert French 4 Good Acidic wine balances acidic food.
3 Cabernet Sauvignon Red roast lamb Red Meat Mexican 5 Excellent Tannic red complements red meat fat.
4 Viognier White duck à l’orange Poultry Vietnamese 2 Poor Heuristic pairing assessment.

Wine and Food Pairing Dataset Overview¶

We can see above that the dataset has roughly 35,000 different wine and food pairings. It also has 8 different columns, shown below with their corresponding datatypes:

  1. wine type: Nominal
  2. wine category: Nominal
  3. food item: Nominal
  4. food category: Nominal
  5. cuisine: Nominal
  6. pairing quality: Ordinal
  7. quality label: Ordinal
  8. description: Nominal

Some More Information About The Dataset

  • Total Wine and Food Pairings: 34,933 pairings
  • Wine Categories: The different wine categories are Dessert, Fortified, Red, Rosé, Sparkling, and White
  • Food Categories: Acidic, Cheese, Creamy, Dessert, Pork, Poultry, Red Meat, Salty Snack, Seafood, Smoky BBQ, Spicy, Vegetarian
  • NaNs: Since there are no NaNs in this dataset, there is no need to modify it further for now.

ETL and Integrating the data¶

Since our Wine Quality dataset only has information about Red and White wines, we are going to drop all the Wine and Food Pairings that are not Red or White wines. We are also going to change the quality_label to be only good, neutral, or bad.

In [ ]:
# Drop the wine and food pairings that don't have a red or white wine
df_pairing = df_pairing[df_pairing['wine_category'].isin(['Red', 'White'])]
df_pairing.reset_index(drop=True, inplace=True)

display(df_pairing.groupby('wine_category').size())

# Make sure pairing_quality is numeric (int or float)
df_pairing['pairing_quality'] = pd.to_numeric(df_pairing['pairing_quality'])

quality_to_label = {
    1: 'bad',
    2: 'bad',
    3: 'neutral',
    4: 'good',
    5: 'good'
}

df_pairing['quality_label'] = df_pairing['pairing_quality'].map(quality_to_label)

display(df_pairing.head())
0
wine_category
Red 12908
White 11193

wine_type wine_category food_item food_category cuisine pairing_quality quality_label description
0 Syrah/Shiraz Red smoked sausage Smoky BBQ Spanish 2 bad Heuristic pairing assessment.
1 Grenache Red charcuterie board Salty Snack French 3 neutral Heuristic pairing assessment.
2 Cabernet Sauvignon Red roast lamb Red Meat Mexican 5 good Tannic red complements red meat fat.
3 Viognier White duck à l’orange Poultry Vietnamese 2 bad Heuristic pairing assessment.
4 Pinot Noir Red citrus salad Acidic Argentinian 4 good Acidic wine balances acidic food.

From this we can see that we have:

  • 12,908 Red Wine and Food Pairings
  • 11,193 White Wine and Food Pairings

All wines other than red and white have been excluded, and the values in quality_label have changed from Terrible, Poor, Neutral, Good, Excellent to only bad, neutral, and good. Now we are going to create our chemical categories to connect this to the chemical properties dataset.

We can examine the description column from df_pairing to explore the relationship between the chemical properties of wine and the key characteristics of food.

The description column provides brief explanations that justify why certain wine–food combinations succeed or fail. These descriptions capture common pairing heuristics (such as acidity balance, tannin–fat interactions, and flavor intensity matching) and also include a few deliberately bad examples for contrast.

The table acts as a reference for understanding the logic behind pairing ratings.

In [ ]:
# The different descriptions
df_pairing.groupby('description').size()
Out[ ]:
0
description
Acidic wine balances acidic food. 1850
Acidic wine balances acidic food.; Dry table wine clashes with dessert sweetness. 153
Crisp acidity suits seafood. 587
Deliberately bad pairing example for contrast. 3845
Delicate wine overwhelmed by red meat. 1254
Dry table wine clashes with dessert sweetness. 441
Heavy wine can dominate poultry. 97
Heuristic pairing assessment. 9035
High tannin intensifies spice heat. 48
Idealized perfect pairing example for contrast. 3694
Light red (Pinot) with salmon works. 48
Light/medium red suits richer poultry/pork. 238
Lighter wines fit poultry. 287
Low-acid wine seems flabby vs acids. 190
Off-dry sweetness calms spice. 50
Richer body matches creamy textures. 294
Tannic red complements red meat fat. 780
Tannic reds clash with delicate seafood. 340
Too lean for creamy dish. 870

Extracting Chemical Profiles from Expert Descriptions¶

Our pairing dataset includes a limited set of standardized descriptions. Each one highlights a specific chemical characteristic relevant to food pairing. By analyzing these descriptions, we can infer the wine's chemical profile.

  • "Acidic wine balances acidic food" - High acidity wines
  • "Crisp acidity suits seafood" - High acidity, lighter body
  • "Tannic red complements red meat fat" - High tannins (sulphates)
  • "Richer body matches creamy textures" - Full-bodied wines
  • "Off-dry sweetness calms spice" - Wines with residual sugar
  • "Delicate wine overwhelmed by red meat" - Light body, low tannins

We are creating four inference functions to extract a certain chemical property from each description. When descriptions don't explicitly mention something, we default to a middle value to avoid extreme assumptions.

In [ ]:
# Function to infer the acidity of the wine
def infer_acidity(description):
    d = description.lower()
    if "low-acid" in d:
        return "Low"
    if "acidic" in d or "crisp acidity" in d:
        return "High"
    return "Medium"

# Function to infer the sweetness of the wine
def infer_sweetness(description):
    d = description.lower()
    if "off-dry" in d:
        return "Off-Dry"
    if "dry" in d:
        return "Dry"
    return "Off-Dry"

# Function to infer the body of the wine
def infer_body(description):
    d = description.lower()
    if "light" in d or "delicate" in d or "lean" in d:
        return "Light"
    if "heavy" in d or "rich" in d or "full" in d:
        return "Full"
    return "Medium"

# Function to infer the Tannin Proxy of the wine
def infer_tannin(description):
    d = description.lower()
    if "tannic" in d or "high tannin" in d:
        return "High"
    if "delicate wine overwhelmed" in d:
        return "Low"
    return "Medium"

# Applying these functions to get values for acidity, sweetness, body, and tannin proxy for the wines
df_pairing["acidity_level"] = df_pairing["description"].apply(infer_acidity)
df_pairing["sweetness_level"] = df_pairing["description"].apply(infer_sweetness)
df_pairing["body_level"] = df_pairing["description"].apply(infer_body)
df_pairing["tannin_proxy"] = df_pairing["description"].apply(infer_tannin)

df_pairing.head()
Out[ ]:
wine_type wine_category food_item food_category cuisine pairing_quality quality_label description acidity_level sweetness_level body_level tannin_proxy
0 Syrah/Shiraz Red smoked sausage Smoky BBQ Spanish 2 bad Heuristic pairing assessment. Medium Off-Dry Medium Medium
1 Grenache Red charcuterie board Salty Snack French 3 neutral Heuristic pairing assessment. Medium Off-Dry Medium Medium
2 Cabernet Sauvignon Red roast lamb Red Meat Mexican 5 good Tannic red complements red meat fat. Medium Off-Dry Medium High
3 Viognier White duck à l’orange Poultry Vietnamese 2 bad Heuristic pairing assessment. Medium Off-Dry Medium Medium
4 Pinot Noir Red citrus salad Acidic Argentinian 4 good Acidic wine balances acidic food. High Off-Dry Medium Medium

These inferred categories allow us to match the pairing dataset with our chemical dataset.

Creating Matching Chemical Categories in the Wine Quality Dataset¶

We need to create equivalent categories in the chemical properties dataset. This requires examining the actual chemical properties and determining appropriate thresholds.

We are first visualizing how the key chemical properties are distributed across red and white wines.

In [ ]:
# Chemical property distributions by wine type
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

colors = {
    'red': 'darkred',
    'white': 'gold'
}

# Fixed Acidity
red_data = df_wines[df_wines['type'] == 'red']['fixed acidity']
white_data = df_wines[df_wines['type'] == 'white']['fixed acidity']
axes[0, 0].hist(red_data, bins=30, alpha=0.5, label='Red',
                color=colors['red'], linewidth=0.5)
axes[0, 0].hist(white_data, bins=30, alpha=0.5, label='White',
                color=colors['white'], linewidth=0.5)
axes[0, 0].set_title('Fixed Acidity Distribution', fontweight='bold', fontsize=12)
axes[0, 0].set_xlabel('Fixed Acidity', fontweight='bold')
axes[0, 0].set_ylabel('Count', fontweight='bold')
axes[0, 0].legend(loc='upper right')
axes[0, 0].grid(axis='y', alpha=0.3)

# Residual Sugar
red_data = df_wines[df_wines['type'] == 'red']['residual sugar']
white_data = df_wines[df_wines['type'] == 'white']['residual sugar']
axes[0, 1].hist(red_data, bins=30, alpha=0.5, label='Red',
                color=colors['red'], linewidth=0.5)
axes[0, 1].hist(white_data, bins=30, alpha=0.5, label='White',
                color=colors['white'], linewidth=0.5)
axes[0, 1].set_title('Residual Sugar Distribution', fontweight='bold', fontsize=12)
axes[0, 1].set_xlabel('Residual Sugar', fontweight='bold')
axes[0, 1].set_ylabel('Count', fontweight='bold')
axes[0, 1].legend(loc='upper right')
axes[0, 1].grid(axis='y', alpha=0.3)

# Alcohol %
red_data = df_wines[df_wines['type'] == 'red']['alcohol']
white_data = df_wines[df_wines['type'] == 'white']['alcohol']
axes[1, 0].hist(red_data, bins=30, alpha=0.5, label='Red',
                color=colors['red'], linewidth=0.5)
axes[1, 0].hist(white_data, bins=30, alpha=0.5, label='White',
                color=colors['white'], linewidth=0.5)
axes[1, 0].set_title('Alcohol % Distribution', fontweight='bold', fontsize=12)
axes[1, 0].set_xlabel('Alcohol', fontweight='bold')
axes[1, 0].set_ylabel('Count', fontweight='bold')
axes[1, 0].legend(loc='upper right')
axes[1, 0].grid(axis='y', alpha=0.3)

# Sulphates
red_data = df_wines[df_wines['type'] == 'red']['sulphates']
white_data = df_wines[df_wines['type'] == 'white']['sulphates']
axes[1, 1].hist(red_data, bins=30, alpha=0.5, label='Red',
                color=colors['red'], linewidth=0.5)
axes[1, 1].hist(white_data, bins=30, alpha=0.5, label='White',
                color=colors['white'], linewidth=0.5)
axes[1, 1].set_title('Sulphates Distribution', fontweight='bold', fontsize=12)
axes[1, 1].set_xlabel('Sulphates', fontweight='bold')
axes[1, 1].set_ylabel('Count', fontweight='bold')
axes[1, 1].legend(loc='upper right')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.suptitle('Chemical Property Distributions by Wine Type',
             fontweight='bold', fontsize=14, y=1.00)
plt.tight_layout()
plt.show()

From this distribution we can see that:

Fixed Acidity (to measure Acidity Level):

  • White wines sit in the lower range, with a sharp peak around 6.5-7.0 g/L
  • Red wines have a higher and flatter distribution centered around 7.5-8.5 g/L
  • There is a clear separation between the two: red wines have higher fixed acidity than whites.
  • Selecting the bins:
    • Low (<7): Primarily whites, some softer reds
    • Medium (7-9): Mixed selection, but more reds than whites
    • High (>9): Almost exclusively red wines

Residual Sugar (to measure Sweetness Level):

  • The distribution is extremely right skewed. Most wines are around 0-5 g/L.
  • White wines show more variation.
  • White wines generally have higher residual sugar; red wines are nearly all dry.
  • Selecting the bins:
    • Dry (<2): Majority of all wine types
    • Off-Dry (2-10): Mostly white wines
    • Sweet (>10): More of the dessert wines

Alcohol Content (to measure Body Level):

  • Red and white wines have their peak alcohol percentage around 9-10%.
  • Higher alcohol percentage suggests a fuller body.
  • Selecting the bins:
    • Light (<10%): The more delicate wines
    • Medium (10-11.5%): Pretty mixed
    • Full (>11.5%): Fuller-bodied wines with more warmth and richness

Sulphates (to measure Tannin Level):

  • There is a clear separation between the two. Red wines typically have a higher tannin proxy than white wines
  • White wines cluster around 0.4-0.5 g/L
  • Red wines have a broader distribution from 0.4 - 0.8 in general.
  • Selecting the bins
    • Low (<0.5): Almost all white wines
    • Medium (0.5-0.7): Mix of both
    • High (>0.7): Exclusively reds with higher tannins

Creating Chemical Categories¶

Based on these distributions, we can categorize each wine:

In [ ]:
# Create chemical categories for pairing analysis
df_wines['acidity_level'] = pd.cut(df_wines['fixed acidity'],
                                     bins=[0, 7, 9, 15],
                                     labels=['Low', 'Medium', 'High'])

df_wines['sweetness_level'] = pd.cut(df_wines['residual sugar'],
                                       bins=[0, 2, 10, 100],
                                       labels=['Dry', 'Off-Dry', 'Sweet'])

df_wines['body_level'] = pd.cut(df_wines['alcohol'],
                                  bins=[0, 10, 11.5, 15],
                                  labels=['Light', 'Medium', 'Full'])

df_wines['tannin_proxy'] = pd.cut(df_wines['sulphates'],
                                    bins=[0, 0.5, 0.7, 2],
                                    labels=['Low', 'Medium', 'High'])
df_wines.head()
Out[ ]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality type acidity_level sweetness_level body_level tannin_proxy
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red Medium Dry Light Medium
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 red Medium Off-Dry Light Medium
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 red Medium Off-Dry Light Medium
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 red High Dry Light Medium
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red Medium Dry Light Medium

These categories now match the pairing dataset, allowing us to connect chemical measurements to food pairing recommendations. Each wine in the Chemical Properties dataset now has the precise chemical values and the general categories.

Integrating Both Datasets¶

Mapping Wines to Food Pairing Recommendations¶

Since the chemical categories now appear in both datasets, we can aggregate, for each chemical profile, which foods pair well, neutrally, or poorly. Below, we create a rules table.

In [ ]:
# Define ordinal encodings for style categories
acidity_map   = {'Low': 0, 'Medium': 1, 'High': 2}
sweet_map     = {'Dry': 0, 'Off-Dry': 1, 'Sweet': 2}
body_map      = {'Light': 0, 'Medium': 1, 'Full': 2}
tannin_map    = {'Low': 0, 'Medium': 1, 'High': 2}

def encode_style(df): # Convert columns into numeric codes to compute distance
    encoded = {'acid_num':   df['acidity_level'].map(acidity_map),
               'sweet_num':  df['sweetness_level'].map(sweet_map),
               'body_num':   df['body_level'].map(body_map),
               'tannin_num': df['tannin_proxy'].map(tannin_map)}
    encoded_df = pd.DataFrame(encoded)
    return encoded_df

style_cols = ['acidity_level', 'sweetness_level', 'body_level', 'tannin_proxy']

# Aggregate df_pairing into style-level food lists per pairing_label
agg = (df_pairing
       .groupby(style_cols + ['quality_label'])['food_category']
       .apply(lambda s: sorted(set(s.dropna())))  # set to ensure no duplicates
       .reset_index())

# Pivot table so that each col has good, neutral, bad
pivot = agg.pivot_table(index=style_cols,
                        columns='quality_label',
                        values='food_category',
                        aggfunc=lambda x: x).reset_index()

# Making sure that all exist
for col in ['good', 'neutral', 'bad']:
    if col not in pivot.columns:
        pivot[col] = None

# Clean rules table
pivot = pivot[['acidity_level', 'sweetness_level', 'body_level', 'tannin_proxy',
               'good', 'neutral', 'bad']]

rules = pivot.copy()
rules.head()
Out[ ]:
quality_label acidity_level sweetness_level body_level tannin_proxy good neutral bad
0 High Dry Medium Medium [Dessert] [Dessert] [Dessert]
1 High Off-Dry Medium Medium [Acidic, Seafood] [Acidic, Seafood] NaN
2 Low Off-Dry Medium Medium NaN [Acidic] [Acidic]
3 Medium Dry Medium Medium NaN [Dessert] [Dessert]
4 Medium Off-Dry Full Medium [Creamy, Pork] [Creamy, Pork, Poultry] [Poultry]

This creates a rules table where each row represents a unique chemical profile and the columns list the foods that pair well, neutrally, or poorly with that specific profile.

Matching Wines to Rules via Nearest Neighbor¶

Since not every combination of categories appears in the pairing data, we use Euclidean distance over the ordinal encodings to find the most similar style profile.

In [ ]:
# Numerical encodings for rules and wines
rules_style_num = encode_style(rules)
wines_style_num = encode_style(df_wines)

# Convert to numpy to calculate distance
rules_X = rules_style_num.to_numpy()
wines_X = wines_style_num.to_numpy()

# Find index of closest style rule by euclidean distance
nearest_idx = []
for i in range(len(df_wines)):
    diff = rules_X - wines_X[i]
    dist = (diff ** 2).sum(axis=1)
    nearest_idx.append(dist.argmin())

nearest_idx = np.array(nearest_idx)

This approach gives every wine a recommendation even when its exact profile does not appear in the pairing data: we simply fall back to the closest profile.
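At this dataset's size the explicit loop is fine, but the same nearest-rule lookup can be done in one vectorized call. A minimal sketch with made-up ordinal encodings (columns: acidity, sweetness, body, tannin), using `scipy.spatial.distance.cdist`:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Small made-up encodings: 3 style rules and 4 wines
# (columns = acidity, sweetness, body, tannin on the 0/1/2 scale)
rules_X = np.array([[2, 0, 1, 1],
                    [0, 1, 1, 1],
                    [1, 1, 2, 2]])
wines_X = np.array([[2, 0, 1, 1],
                    [0, 0, 1, 1],
                    [1, 1, 2, 1],
                    [2, 1, 0, 0]])

# One wines-by-rules distance matrix, then argmin along the rule axis
nearest_idx = cdist(wines_X, rules_X, metric="euclidean").argmin(axis=1)
print(nearest_idx)
```

cdist builds the full wines × rules distance matrix in one call, and argmin(axis=1) picks the closest rule for each wine, matching what the loop computes element by element.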

Assigning Pairing Recommendations¶

Now we can assign each wine its pairing recommendations based on the matched rule:

In [ ]:
# Convert rule columns to numpy arrays
good_arr = rules['good'].to_numpy()
neutral_arr = rules['neutral'].to_numpy()
bad_arr = rules['bad'].to_numpy()

# Use nearest_idx to pick the rule for each wine
good_vals = [g if isinstance(g, list) else []
             for g in good_arr[nearest_idx]]

neutral_vals = [g if isinstance(g, list) else []
                for g in neutral_arr[nearest_idx]]

bad_vals = [g if isinstance(g, list) else []
            for g in bad_arr[nearest_idx]]

# Assign these cols to df_wines
df_wines['good_foods_to_pair_with'] = good_vals
df_wines['neutral_foods_to_pair_with'] = neutral_vals
df_wines['bad_foods_to_pair_with'] = bad_vals

df_wines[['good_foods_to_pair_with',
          'neutral_foods_to_pair_with',
          'bad_foods_to_pair_with']].head()
Out[ ]:
good_foods_to_pair_with neutral_foods_to_pair_with bad_foods_to_pair_with
0 [] [Dessert] [Dessert]
1 [Pork, Poultry, Seafood] [Creamy, Pork, Poultry, Seafood] [Creamy, Pork]
2 [Pork, Poultry, Seafood] [Creamy, Pork, Poultry, Seafood] [Creamy, Pork]
3 [Dessert] [Dessert] [Dessert]
4 [] [Dessert] [Dessert]

Each wine now has three lists of food categories representing pairing quality.

Removing Duplicates¶

Now we need to clean these lists to ensure each food category appears in only one column.

In [ ]:
def clean_lists(row):
    g = set(row['good_foods_to_pair_with'] or [])
    n = set(row['neutral_foods_to_pair_with'] or []) - g
    b = set(row['bad_foods_to_pair_with'] or []) - g - n

    return pd.Series([list(g), list(n), list(b)],
                     index=['good_foods_to_pair_with', 'neutral_foods_to_pair_with', 'bad_foods_to_pair_with'])

df_wines[['good_foods_to_pair_with',
          'neutral_foods_to_pair_with',
          'bad_foods_to_pair_with']] = df_wines.apply(clean_lists, axis=1)

df_wines.head()
Out[ ]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality type acidity_level sweetness_level body_level tannin_proxy good_foods_to_pair_with neutral_foods_to_pair_with bad_foods_to_pair_with
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red Medium Dry Light Medium [] [Dessert] []
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 red Medium Off-Dry Light Medium [Seafood, Pork, Poultry] [Creamy] []
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 red Medium Off-Dry Light Medium [Seafood, Pork, Poultry] [Creamy] []
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 red High Dry Light Medium [Dessert] [] []
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red Medium Dry Light Medium [] [Dessert] []

Now each wine in the dataset has a personalized food pairing recommendation based on its chemical profile.
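As a small convenience sketch (the function name and the sample measurements below are ours, not from the dataset), the same bin edges can be wrapped in a helper that profiles a new, unseen wine so its recommendations could be looked up against the rules table:

```python
import pandas as pd

def profile_wine(fixed_acidity, residual_sugar, alcohol, sulphates):
    """Map raw measurements to the category labels used above (same bin edges)."""
    return {
        "acidity_level": pd.cut([fixed_acidity], bins=[0, 7, 9, 15],
                                labels=["Low", "Medium", "High"])[0],
        "sweetness_level": pd.cut([residual_sugar], bins=[0, 2, 10, 100],
                                  labels=["Dry", "Off-Dry", "Sweet"])[0],
        "body_level": pd.cut([alcohol], bins=[0, 10, 11.5, 15],
                             labels=["Light", "Medium", "Full"])[0],
        "tannin_proxy": pd.cut([sulphates], bins=[0, 0.5, 0.7, 2],
                               labels=["Low", "Medium", "High"])[0],
    }

# A hypothetical new wine (values invented for illustration)
print(profile_wine(fixed_acidity=8.1, residual_sugar=1.8, alcohol=12.0, sulphates=0.75))
```

Keeping the thresholds in one place like this also guards against the binned columns and any future lookups drifting out of sync.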

Exploratory Data Analysis¶

Our goal is to understand how chemical properties differ between wines, how these properties affect quality, how they influence the food categories a wine pairs with, and how pairing versatility varies across wine qualities.

Chemical Properties by Wine Type¶

We begin by examining how the four key chemical categories used in pairing analysis (acidity, sweetness, body, and tannins) differ between red and white wines. These differences reflect the sensory structure of wines and influence traditional pairing rules.

In [ ]:
# Chemical categories visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 9))

# Acidity levels by wine type
pd.crosstab(df_wines['type'], df_wines['acidity_level']).plot(
    kind='bar', ax=axes[0, 0],
    color=['lightblue', 'skyblue', 'darkblue'],
    alpha=0.7, edgecolor='black'
)
axes[0, 0].set_title('Acidity Levels by Wine Type', fontweight='bold')
axes[0, 0].set_xlabel('Wine Type')
axes[0, 0].set_ylabel('Count')
axes[0, 0].tick_params(axis='x', rotation=0)
axes[0, 0].legend(title='Acidity', loc='upper right')
axes[0, 0].grid(axis='y', alpha=0.3)

# Sweetness levels by wine type
pd.crosstab(df_wines['type'], df_wines['sweetness_level']).plot(
    kind='bar', ax=axes[0, 1],
    color=['tan', 'wheat', 'gold'],
    alpha=0.7, edgecolor='black'
)
axes[0, 1].set_title('Sweetness Levels by Wine Type', fontweight='bold')
axes[0, 1].set_xlabel('Wine Type')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=0)
axes[0, 1].legend(title='Sweetness', loc='upper right')
axes[0, 1].grid(axis='y', alpha=0.3)

# Body levels by wine type
pd.crosstab(df_wines['type'], df_wines['body_level']).plot(
    kind='bar', ax=axes[1, 0],
    color=['lightcoral', 'coral', 'darkred'],
    alpha=0.7, edgecolor='black'
)
axes[1, 0].set_title('Body Levels by Wine Type', fontweight='bold')
axes[1, 0].set_xlabel('Wine Type')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=0)
axes[1, 0].legend(title='Body', loc='upper right')
axes[1, 0].grid(axis='y', alpha=0.3)

# Tannin levels by wine type
pd.crosstab(df_wines['type'], df_wines['tannin_proxy']).plot(
    kind='bar', ax=axes[1, 1],
    color=['lightgreen', 'mediumseagreen', 'darkgreen'],
    alpha=0.7, edgecolor='black'
)
axes[1, 1].set_title('Tannin Levels by Wine Type', fontweight='bold')
axes[1, 1].set_xlabel('Wine Type')
axes[1, 1].set_ylabel('Count')
axes[1, 1].tick_params(axis='x', rotation=0)
axes[1, 1].legend(title='Tannin', loc='upper right')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

Summary

  • White wines show higher counts in the high-acidity and lighter-body levels

  • Red wines exhibit much higher tannin levels and fuller body

  • Sweetness is distributed differently across wine types, with whites more commonly Off-Dry or Sweet

Interpretation

These chemical differences align with established pairing logic:

  • Red wines’ higher tannins and fuller bodies make them ideal for red meat and hearty dishes

  • White wines’ higher acidity and lighter texture make them better suited for seafood, poultry, and creamy foods

Chemical Properties by Wine Quality¶

Wine quality is a high-level indicator of balance, complexity, and structure. Here, we visualize how chemical variables shift across quality scores (3–9) to see whether higher-quality wines exhibit distinct chemical signatures.

In [ ]:
# Chemical properties vs Quality visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 9))

# Acidity vs Quality
df_wines.boxplot(column='fixed acidity', by='quality', ax=axes[0, 0])
axes[0, 0].set_title('Acidity by Quality Category', fontweight='bold')
axes[0, 0].set_xlabel('Quality Category')
axes[0, 0].set_ylabel('Fixed Acidity (g/L)')

# Residual Sugar vs Quality
df_wines.boxplot(column='residual sugar', by='quality', ax=axes[0, 1])
axes[0, 1].set_title('Residual Sugar by Quality Category', fontweight='bold')
axes[0, 1].set_xlabel('Quality Category')
axes[0, 1].set_ylabel('Residual Sugar (g/L)')
axes[0, 1].set_ylim(0, 20)  # Focus on main distribution

# Alcohol vs Quality
df_wines.boxplot(column='alcohol', by='quality', ax=axes[1, 0])
axes[1, 0].set_title('Alcohol Content by Quality Category', fontweight='bold')
axes[1, 0].set_xlabel('Quality Category')
axes[1, 0].set_ylabel('Alcohol (%)')

# Tannins (sulphates as proxy) vs Quality
df_wines.boxplot(column='sulphates', by='quality', ax=axes[1, 1])
axes[1, 1].set_title('Tannins by Quality Category', fontweight='bold')
axes[1, 1].set_xlabel('Quality Category')
axes[1, 1].set_ylabel('Sulphates (g/L)')

plt.suptitle('')  # Suppress the automatic "Boxplot grouped by quality" suptitle
plt.tight_layout()
plt.show()

Summary

  • Alcohol content rises slightly with quality.

  • Acidity, residual sugar, and tannin show wide internal variation

  • There is substantial overlap between quality levels

Interpretation

  • Quality influences chemistry but not strongly enough to cleanly separate wines

  • This suggests that quality alone cannot explain food pairing behavior: a more detailed view of chemistry is required

Wine Type vs Good Food Pairings¶

This heatmap visualizes how red and white wines distribute their GOOD food pairings across food categories, giving a clear view of each wine type's pairing tendencies and revealing whether certain foods are consistently better matches for reds, whites, or both.

In [ ]:
# Explode good pairings: one row per (wine, food)
df_good = (
    df_wines[["type", "good_foods_to_pair_with"]]
    .explode("good_foods_to_pair_with")
    .dropna(subset=["good_foods_to_pair_with"])
    .rename(columns={
        "type": "wine_category",
        "good_foods_to_pair_with": "food_category"
    })
)

# Crosstab counts: wine_category x food_category
cross = pd.crosstab(df_good["wine_category"], df_good["food_category"])

# Normalize rows => convert to proportions
cross_norm = cross.div(cross.sum(axis=1), axis=0)

# Heatmap (Normalized)
plt.figure(figsize=(12, 8))
sns.heatmap(
    cross_norm,
    cmap="Spectral",
    annot=True,
    fmt=".2f",
    linewidths=0.5,
    linecolor="white"
)

plt.xlabel("Food Category")
plt.ylabel("Wine Category")
plt.title("Wine Category vs Good Food Pairings")
plt.tight_layout()
plt.show()

Summary

  • Both red and white wines pair most often with Pork and Seafood

  • White wines show stronger associations with Creamy and Poultry dishes

  • Red wines show higher proportions for Acidic dishes

  • Very low proportions appear for Spicy, Smoky BBQ, Salty Snack, Vegetarian for both wine types

Interpretation

  • Wine category meaningfully shapes good pairing patterns, but not in strictly traditional ways (e.g., red wines pairing well with seafood)

  • The overlap across categories suggests that wine type alone cannot predict perfect pairings

  • This reinforces the need to consider wine chemistry (acidity, sweetness, body, tannins) to understand and model pairing behavior more accurately

Chemical Properties vs Good Food Pairings¶

To understand how specific wine characteristics influence pairing success, we examine how four key chemical properties (acidity, sweetness, body, and tannin level) relate to the distribution of good food pairings in the dataset. Each heatmap visualizes how often wines with a given chemical trait successfully pair with each food category.

In [ ]:
# Explode good foods so every food pairing is a row
df_good = (
    df_wines[["acidity_level","sweetness_level","body_level","tannin_proxy","good_foods_to_pair_with"]]
    .explode("good_foods_to_pair_with")
    .dropna(subset=["good_foods_to_pair_with"])
    .rename(columns={"good_foods_to_pair_with": "food_category"})
)

# Chemical properties and matching color palettes
chem_props = {
    "acidity_level": {
        "title": "Acidity Level",
        "colors": ["lightblue", "skyblue", "darkblue"]
    },
    "sweetness_level": {
        "title": "Sweetness Level",
        "colors": ["tan", "wheat", "gold"]
    },
    "body_level": {
        "title": "Body Level",
        "colors": ["lightcoral", "coral", "darkred"]
    },
    "tannin_proxy": {
        "title": "Tannin Level",
        "colors": ["lightgreen", "mediumseagreen", "darkgreen"]
    }
}

# Subplot figure in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

for ax, (col, cfg) in zip(axes.flat, chem_props.items()):

    # Build crosstab
    ct = pd.crosstab(df_good[col], df_good["food_category"])

    # Heatmap
    sns.heatmap(
        ct,
        cmap=sns.color_palette(cfg["colors"], as_cmap=True),
        annot=True,
        fmt="d",
        linewidths=0.5,
        linecolor="white",
        cbar=False,
        ax=ax
    )

    # Titles + Labels
    ax.set_title(f"{cfg['title']} vs Food Category\n(Good Pairings)", fontweight="bold")
    ax.set_xlabel("Food Category")
    ax.set_ylabel(cfg["title"])

plt.tight_layout()
plt.show()

Summary

  • Acidity: Medium-acidity wines produce the most good pairings (especially with Pork and Seafood) while high acidity is more selective (Acidic dishes, Seafood)

  • Sweetness: Off-Dry wines dominate good pairings across all food types, whereas Dry and Sweet wines appear far less frequently

  • Body: Medium-bodied wines show the strongest and broadest pairing success, with Light body favoring Seafood and Full body aligning with Creamy and Pork dishes

  • Tannin: Medium tannin levels generate the most good pairings, while high and low tannin wines contribute minimally

  • Across all four traits, Pork and Seafood repeatedly stand out as the most common good-pairing foods, while categories like Spicy, Smoky BBQ, Salty Snack, and Vegetarian remain consistently low

Interpretation

  • Acidity: Balanced acidity enhances versatility, while high or low acidity narrows pairing effectiveness

  • Sweetness: A slight amount of sweetness significantly increases compatibility, making Off-Dry wines the most flexible across foods.

  • Body: Moderate body matches a wide range of dish intensities, demonstrating the importance of weight-matching in food–wine pairing.

  • Tannin: Moderate tannins are the most food-friendly; extreme tannin levels restrict which foods pair well.

  • Overall, these patterns show that balanced chemical properties, not wine category alone, shape pairing success, reinforcing the need to incorporate acidity, sweetness, body, and tannin into the predictive model

Frequency of Food Categories¶

Before pairing modeling, we must understand how frequently each food category appears. Class imbalance can strongly influence model performance.

In [ ]:
plt.figure(figsize=(12, 6))
sns.countplot(data=df_pairing, x="food_category", hue="food_category", palette="tab20c")
plt.xticks(rotation=45, ha="right")
plt.xlabel("Food Category")
plt.ylabel("Count")
plt.title("Frequency of Food Categories in Pairing Dataset")
plt.tight_layout()
plt.show()

Summary

  • Red Meat, Cheese, and Acidic dishes are the most frequent food categories in the dataset

  • Categories like Vegetarian, Creamy, Poultry, and Seafood appear moderately often

  • Smoky BBQ, Salty Snack, Spicy, and Dessert show relatively low frequencies compared to the others

Interpretation

  • The dataset is imbalanced, with certain food categories (especially Red Meat and Cheese) appearing far more often than others

  • This imbalance may influence model training, making some food types easier to predict and others harder

  • The uneven distribution also reflects real-world pairing biases: common foods like red meat and cheese naturally appear more frequently in pairing records, while categories like dessert or spicy foods are less frequently paired with wine

Hypothesis¶

The chemical properties of a wine can be used to reliably classify the best matching food category for a given wine.

Why This Hypothesis Is Important¶

  • Choosing the right food pairing has a considerable impact on consumer experience, restaurant recommendations, and retail wine sales

  • A model that predicts food pairing based solely on chemistry would allow pairing suggestions even when wine type, region, or tasting notes are unavailable

  • Understanding which chemical traits drive pairing success provides insight into why certain wines pair well, not just that they do

Why This Hypothesis Is Reasonable (Based on our EDA)¶

Our EDA strongly supports this hypothesis:

  • Specific chemical traits show predictable food associations: the Chemical Properties vs Good Food Pairings heatmaps reveal clear relationships:

    • Higher Acidity => Acidic dishes & Seafood

    • Higher Sweetness => Pork, Creamy, and Spicy dishes

    • Full Body => Creamy and heavy dishes

    • Higher Tannin => Richer, protein-heavy foods

    • These patterns reflect real-world pairing logic, reinforcing that chemistry determines food compatibility

  • Wine category alone is insufficient: the Wine Category vs Good Food Pairings heatmap shows:

    • Red and white wines share many overlapping pairing categories

    • Some unexpected associations appear (e.g., red wines with seafood)

    • This indicates that wine “type” is too broad, and more granular chemical traits must be used to predict pairings

  • Food frequencies and versatility patterns support a learnable structure:

    • Pork and Seafood consistently appear as major good pairing foods across all chemical levels

    • Extreme categories (Smoky BBQ, Spicy, Dessert) remain consistently low

    • This means the dataset has stable patterns a model can learn from

Model: Classification - Predicting Food Category From Chemical Properties¶

Based on our EDA, we are creating a Multi Label K-Nearest Neighbours Classification model that leverages the chemical composition and food pairing datasets.

After completing our ETL, each wine in our dataset includes a list of foods that it pairs well with - good_foods_to_pair_with. This will be the multi-label target for our model. Our goal is to predict these food pairings directly from the wine's chemical composition.

Model Type: Multi-Label K-Nearest Neighbours Classifier

Independent Variables:

These are from the wine chemistry dataset (df_wines) and represent measurable physicochemical properties:

  • Fixed acidity — relates to sourness and pairing with acidic foods
  • Volatile acidity — aroma sharpness, influences light vs. rich food pairings
  • Citric acid — enhances freshness, often linked to seafood pairings
  • Residual sugar — sweetness level, important for dessert/spicy pairings
  • Chlorides — minor impact but relates to saltiness perception
  • Free sulfur dioxide — freshness/preservation
  • Total sulfur dioxide — stability and flavor
  • Density — proxy for sugar and alcohol balance
  • pH — overall acidity, critical in many food interactions
  • Sulphates — proxy for tannin structure
  • Alcohol % — determines body/weight and richness

Dependent Variable:

  • good_foods_to_pair_with - the list of food categories, drawn from the 12 categories in the dataset (Red Meat, Seafood, Poultry, Cheese, Dessert, Spicy, Vegetarian, Acidic, Smoky BBQ, Salty Snack, Creamy, Pork)

A single wine can pair well with multiple foods: multi-label prediction.

Why this model?

  • Multi-Label Output - Wines naturally pair with multiple foods.
  • Chemically similar wines give similar food pairings - KNN is built on exactly this intuition.
  • Food pairing patterns are complex and non-linear, and KNN can capture them without assuming a functional form.
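As a standalone sketch (toy numbers, not our dataset), scikit-learn's KNeighborsClassifier accepts a 2-D binary indicator target directly, so multi-label KNN needs no extra wrapper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy sketch with made-up chemistry values: two features, two food labels
X_toy = np.array([[0.2, 1.0],   # wines 1 and 2 are chemically similar
                  [0.4, 1.2],
                  [5.0, 9.0]])  # wine 3 is very different
Y_toy = np.array([[1, 0],       # each row: [pairs with food A, pairs with food B]
                  [1, 0],
                  [0, 1]])

knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy, Y_toy)

# A new wine close to the first two inherits their full label set
print(knn.predict([[0.22, 1.02]]))
```

This is exactly the intuition behind using KNN here: a chemically similar neighbour hands over its entire pairing list at once.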

Creating the K-Nearest Neighbours Classifier¶

Our dataset for the KNN model has:

  • Features (X): the chemical properties of each wine
  • Target (Y): the list of food categories that each wine pairs well with (good_foods_to_pair_with)

We only keep wines with at least one good pairing, so that every training example has a non-empty target.

In [ ]:
# Imports
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, f1_score, hamming_loss, jaccard_score,
                             precision_score, recall_score, classification_report)

# Keep only wines with at least one good food pairing
df_ml = df_wines[df_wines["good_foods_to_pair_with"].apply(lambda x: len(x) > 0)].copy()
print("Number of wines with at least one good pairing:", len(df_ml))

# Chemical feature columns
feature_cols = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"]

# Features and Target
X = df_ml[feature_cols].values
y_raw = df_ml["good_foods_to_pair_with"]   # list-of-lists

# Multilabel binarizer
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_raw)

print("Food label classes:", mlb.classes_)
Number of wines with at least one good pairing: 1359
Food label classes: ['Acidic' 'Cheese' 'Creamy' 'Dessert' 'Pork' 'Poultry' 'Red Meat'
 'Salty Snack' 'Seafood' 'Smoky BBQ' 'Spicy' 'Vegetarian']

Multi-Label Binarization¶

The MultiLabelBinarizer converts our list-of-foods format into a binary indicator matrix suitable for model training, effectively one-hot encoding each food label.

Each column represents a food category. 1 indicates that wine pairs well with that food.
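A tiny standalone example (hypothetical food lists, not our dataset) shows the transformation:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical pairing lists for three wines
toy_labels = [["Pork", "Seafood"],
              ["Cheese"],
              ["Pork", "Cheese", "Spicy"]]

mlb_demo = MultiLabelBinarizer()
M = mlb_demo.fit_transform(toy_labels)

print(mlb_demo.classes_)  # columns are sorted alphabetically
print(M)                  # one row per wine, 1 = pairs well with that food
```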

Train–test split¶

We split the data into training (80%) and test (20%) sets. The model is trained on the training set and evaluated on the test set, which is held out entirely so no information leaks from it during training.

In [ ]:
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)


print(f"Training Set Size: {X_train.shape[0]} wines ({X_train.shape[0] / len(X) * 100:.1f}%)")
print(f"Test Set Size: {X_test.shape[0]} wines ({X_test.shape[0] / len(X) * 100:.1f}%)")
Training Set Size: 1087 wines (80.0%)
Test Set Size: 272 wines (20.0%)

Standardizing Chemical Properties¶

Since KNN is a distance-based algorithm, it calculates how 'close' wines are to each other based on their chemical properties. However, those properties sit on very different scales: alcohol is a percentage, chlorides are in g/L, and total sulfur dioxide is in mg/L.

Without standardization, properties with larger numeric ranges would dominate the distance calculation even if they were less important. Standardizing puts all chemical properties on the same scale, allowing the model to learn which ones truly matter for pairing.

We fit the scaler on the training data only and then apply it to both sets, avoiding any data leakage from the test set.

In [ ]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_sc = scaler.transform(X_train)
X_test_sc  = scaler.transform(X_test)
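As a sanity check, a self-contained sketch (toy numbers, not our dataset) illustrates what StandardScaler does: after fitting on the training rows, each feature has roughly zero mean and unit variance, so no single unit dominates the distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data: [alcohol %, total SO2 mg/L] -- very different scales
X_train_toy = np.array([[9.0, 30.0],
                        [13.0, 150.0],
                        [11.0, 90.0]])

scaler_demo = StandardScaler().fit(X_train_toy)  # fit on training rows only
X_scaled = scaler_demo.transform(X_train_toy)

print(X_scaled.mean(axis=0))  # ~0 for each feature
print(X_scaled.std(axis=0))   # ~1 for each feature
```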

Finding the Optimal Value of $k$¶

The most important hyperparameter in KNN is $k$, the number of nearest neighbours considered when making predictions. We will plot Model Accuracy and F1 Score against $k$ to see how it affects performance and to determine the optimal value.

Cross-Validation Approach¶

We also use 5-fold cross-validation while testing $k$ values from 1 to 40. We first split the data into 5 folds; then, for each $k$, we train on 4 folds and validate on the 5th, repeating 5 times with a different validation fold each time. This gives a good estimate of how each $k$ would perform on unseen data.

In [ ]:
# 5-fold cross validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

k_values = range(1, 41)

cv_acc = []   # subset accuracy across all folds
cv_f1  = []   # macro F1 across all folds

for k in k_values:
    # Pipeline: scale features inside each fold, then apply KNN
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])

    # Out-of-fold predictions for the entire dataset
    Y_pred_cv = cross_val_predict(pipe, X, Y, cv=cv)

    # Metrics using all out-of-fold predictions
    acc = accuracy_score(Y, Y_pred_cv)  # subset accuracy
    f1  = f1_score(Y, Y_pred_cv, average='macro', zero_division=0)

    cv_acc.append(acc)
    cv_f1.append(f1)

# Best k's according to CV
best_k_acc = k_values[int(np.argmax(cv_acc))]
best_k_f1  = k_values[int(np.argmax(cv_f1))]

# Plot
plt.figure(figsize=(10, 8))

plt.plot(k_values, cv_acc, marker='.', label='Cross Validation Accuracy', color='crimson')
plt.plot(k_values, cv_f1, marker='.', label='Macro F1 Score', color='gold')

# Vertical line at best k
plt.axvline(best_k_f1, linestyle='--', color='green', linewidth=2,
            label=f'Best k (F1) = {best_k_f1}', alpha=0.2)

plt.title("Accuracy and Macro F1 vs k for KNN Multi-Label Classification", fontsize=14)
plt.xlabel("k (Number of Neighbors)")
plt.ylabel("Score")
plt.xticks(k_values)
plt.grid(alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

print(f"Best k by subset accuracy: {best_k_acc} (accuracy = {max(cv_acc):.4f})")
print(f"Best k by macro F1: {best_k_f1} (F1 = {max(cv_f1):.4f})")
Best k by subset accuracy: 1 (accuracy = 0.8337)
Best k by macro F1: 1 (F1 = 0.8465)

Understanding Cross Validation Results & Optimal $k$¶

The graph above shows how the performance of the model changes as $k$ increases from 1 to 40, evaluated using 5-fold cross validation.

The metrics shown in the graph are Cross-Validation Accuracy (red) and Macro F1 Score (yellow). Cross-Validation Accuracy measures the percentage of wines where all predicted food labels exactly match the true labels. This is a strict metric: for example, if a wine pairs well with Red Meat, Cheese, and Acidic foods, the model only gets credit if it predicts all three correctly.

The Macro F1 Score averages the F1 Scores across all food categories equally, which prevents the most common pairings from dominating the metric.

Both of these metrics peak at $k$ = 1:

  • Subset Accuracy = 83.37%
  • Macro F1 Score: 84.65%

The Issue With $k$ = 1¶

The value of $k$ = 1 is concerning because it usually means that the model memorizes training examples rather than learning patterns. This causes overfitting. Because of this, it works well for training data, but fails on unseen examples.

However, for this problem $k$ = 1 genuinely appears to be the best choice. Because we used cross-validation, every score was computed on held-out folds, not on the training data. If $k$ = 1 were merely memorizing, the CV scores would be poor, but they are the highest of any $k$. Averaging over $k > 1$ neighbours instead dilutes the specific wine-and-food pairing signal of the closest chemical match.

Additionally, we can see these trends in the graph. As $k$ increases, both metrics steadily decrease.

Creating a model using $k$ = 1¶

In [ ]:
# Choose k
k = 1

model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train_sc, Y_train)
Out[ ]:
KNeighborsClassifier(n_neighbors=1)

Using the model to predict food pairings on the test set¶

In [ ]:
Y_pred = model.predict(X_test_sc)
Y_pred.shape
Out[ ]:
(272, 12)

Using The Model To Predict Foods For A Single Wine¶

Finally, we can use this model that we just created to predict the list of foods that this wine pairs well with.

In [ ]:
# Take one example wine from df_wines
x_new = df_wines.loc[0, feature_cols].values.reshape(1, -1)

# Standardize using the same scaler
x_new_sc = scaler.transform(x_new)

# Predict the multi-label output
Y_new_pred = model.predict(x_new_sc)

# Decode back to food category names
pred_foods = mlb.inverse_transform(Y_new_pred)[0]

print("Chemical properties of example wine:")
display(df_wines.loc[0, feature_cols])

print("\nPredicted good food pairings for this wine:")
print(pred_foods)
Chemical properties of example wine:
0
fixed acidity 7.4
volatile acidity 0.7
citric acid 0.0
residual sugar 1.9
chlorides 0.076
free sulfur dioxide 11.0
total sulfur dioxide 34.0
density 0.9978
pH 3.51
sulphates 0.56
alcohol 9.4

Predicted good food pairings for this wine:
('Pork', 'Poultry', 'Seafood')

Evaluating the KNN model¶

We will be using different metrics to measure how well the classification model performs.

  • Subset accuracy: proportion of wines where all predicted food labels exactly match the true labels.
  • Micro-averaged precision, recall, and F1: aggregate over all food labels, giving more weight to frequent foods.
  • Macro-averaged precision, recall, and F1: average metric per food label, treating all food categories equally.
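These metrics behave quite differently; a toy example (fabricated label matrices, not our data) shows how strict subset accuracy is compared to the F1 averages:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

Y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0]])
Y_pred_toy = np.array([[1, 0, 0],   # misses one label -> whole row counts as wrong
                       [0, 1, 0]])  # exact match

print(accuracy_score(Y_true_toy, Y_pred_toy))                              # 0.5 (subset accuracy)
print(f1_score(Y_true_toy, Y_pred_toy, average="micro", zero_division=0))  # 0.8
print(f1_score(Y_true_toy, Y_pred_toy, average="macro", zero_division=0))  # ~0.67
```

One missed label halves the subset accuracy, while micro and macro F1 still credit the labels that were correct.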
In [ ]:
# Global metrics
subset_acc = accuracy_score(Y_test, Y_pred)

micro_precision = precision_score(Y_test, Y_pred, average="micro", zero_division=0)
micro_recall    = recall_score(Y_test, Y_pred, average="micro", zero_division=0)
micro_f1        = f1_score(Y_test, Y_pred, average="micro", zero_division=0)

macro_precision = precision_score(Y_test, Y_pred, average="macro", zero_division=0)
macro_recall    = recall_score(Y_test, Y_pred, average="macro", zero_division=0)
macro_f1        = f1_score(Y_test, Y_pred, average="macro", zero_division=0)

print("Metrics:")
print(f"Subset Accuracy: {subset_acc:.4f}\n")
print(f"Micro Precision: {micro_precision:.4f}")
print(f"Micro Recall: {micro_recall:.4f}")
print(f"Micro F1: {micro_f1:.4f}\n")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
Metrics:
Subset Accuracy: 0.8529

Micro Precision: 0.8936
Micro Recall: 0.8944
Micro F1: 0.8940

Macro Precision: 0.8692
Macro Recall: 0.8704
Macro F1: 0.8697

Subset Accuracy: 85.29%

  • This means that for more than 85% of the wines in the test set, our model predicts the complete set of good food categories exactly right.

Micro Precision: 89.36%

  • Of all the food pairings we predicted, 89.36% were actually good pairings
  • The high precision matters because it means that users won't waste money on bad wine choices

Micro Recall: 89.44%

  • Of all the true good pairings in the test set, we successfully identified 89.44%

Macro Precision: 86.92%

  • Average precision across all 12 food categories

Macro Recall: 87.04%

  • Average recall across all 12 food categories

We can also look at the performance for each food category using classification_report.

In [ ]:
print("Performance For Each Food Category:\n")
print(classification_report(Y_test, Y_pred, target_names=mlb.classes_, zero_division=0))
Performance For Each Food Category:

              precision    recall  f1-score   support

      Acidic       0.83      0.81      0.82        98
      Cheese       0.84      0.84      0.84        49
      Creamy       0.90      0.92      0.91        97
     Dessert       0.82      0.80      0.81        80
        Pork       0.96      0.97      0.96       183
     Poultry       0.95      0.96      0.95       135
    Red Meat       0.85      0.88      0.86        58
 Salty Snack       0.84      0.84      0.84        49
     Seafood       0.94      0.93      0.93       184
   Smoky BBQ       0.84      0.84      0.84        49
       Spicy       0.84      0.84      0.84        49
  Vegetarian       0.84      0.84      0.84        49

   micro avg       0.89      0.89      0.89      1080
   macro avg       0.87      0.87      0.87      1080
weighted avg       0.89      0.89      0.89      1080
 samples avg       0.90      0.90      0.88      1080

This looks at the 12 food categories and provides individual performances for each one.

Some food categories with the highest metrics are Pork, Poultry, Seafood, Creamy Food, Red Meat, and Cheese.

Seafood's 0.93 F1 score and Red Meat's 0.86 F1 score suggest that the traditional pairing rules are supported by chemistry.

Confusion Matrix¶

Below, we plot an aggregated confusion matrix that sums the per-category confusion matrices across all 12 food categories.

In [ ]:
# Calculate aggregate statistics across all categories
from sklearn.metrics import multilabel_confusion_matrix

mcm = multilabel_confusion_matrix(Y_test, Y_pred)

# Sum across all categories
aggregate_cm = mcm.sum(axis=0)

# Create visualization
plt.figure(figsize=(8, 6))

sns.heatmap(aggregate_cm, annot=True, fmt='d', cmap='Spectral',
            xticklabels=['Not Paired', 'Paired'],
            yticklabels=['Not Paired', 'Paired'],
            cbar_kws={'label': 'Count'})

plt.title('Aggregated Confusion Matrix Across All Food Categories',
          fontweight='bold', fontsize=14)
plt.xlabel('Predicted', fontweight='bold', fontsize=12)
plt.ylabel('Actual', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()

# Print interpretation
tn, fp, fn, tp = aggregate_cm.ravel()

print("Aggregated Confusion Matrix:")
print(f"\nTrue Negatives (TN):  {tn} - Correctly identified non-pairings")
print(f"False Positives (FP): {fp} - Falsely recommended pairings")
print(f"False Negatives (FN): {fn} - Missed good pairings")
print(f"True Positives (TP):  {tp} - Correctly identified good pairings")


total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"\nMetrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
Aggregated Confusion Matrix:

True Negatives (TN):  2069 - Correctly identified non-pairings
False Positives (FP): 115 - Falsely recommended pairings
False Negatives (FN): 114 - Missed good pairings
True Positives (TP):  966 - Correctly identified good pairings

Metrics:
Accuracy: 0.9298
Precision: 0.8936
Recall: 0.8944
F1-Score: 0.8940

Across 3,264 wine-food pairing predictions (272 wines x 12 categories):

  • 2,069 True Negatives: Correctly identified non-pairings
  • 966 True Positives: Correctly identified good pairings
  • 115 False Positives: Bad recommendations
  • 114 False Negatives: Missed good pairings

Implications¶

Based on these metrics, our hypothesis is strongly supported: the chemical properties of wine can be used to predict complex pairing patterns.

This has several practical implications. Consumers can trust the model's recommendations and make better-informed purchasing decisions. Restaurants can build wine lists more effectively. Winemakers can tune wine chemistry to target the food categories they want their wines to complement.

Conclusion & Future Work¶

Conclusion¶

Overall, the project supported our central hypothesis: chemical properties play a meaningful role in determining which foods pair best with a wine.

Through extensive EDA and model evaluation, we were able to establish the significant links between acidity, sweetness, body, tannin levels, and the successfully paired food categories.

Our KNN classification model demonstrated that wine chemistry does contain predictive structure. While performance varied across food categories, reflecting the complexity and imbalance of the dataset, it confirmed that chemical features influence pairing outcomes and can be used to classify food categories.

Strengths¶

  • Strong EDA Foundation: The project thoroughly explored the datasets before modeling, uncovering wine-type differences, chemical-property distributions, food category imbalances, and chemically driven pairing patterns. This provided a clear, evidence-based justification for the modeling approach.

  • Clear and Interpretable Features: The discretized chemistry features acidity level, sweetness level, body level, and tannin proxy, aligned naturally with real-world pairing rules, making the model interpretable and intuitive.

  • Consistency between EDA and Model Insights: Patterns seen in the heatmaps (e.g., medium traits being most versatile, off-dry sweetness pairing broadly) were reflected in model predictions, reinforcing the dataset’s internal structure.

  • Practical and Scalable Prediction Goal: Predicting food pairings from chemistry alone mimics real recommendation systems and sommelier logic—making the work relevant beyond the academic setting.

Limitations¶

  • Dataset Imbalance: Some food categories (e.g., Red Meat, Cheese, Acidic dishes) were heavily overrepresented while others (Dessert, Smoky BBQ, Spicy) had very few samples, which reduces model fairness, causes the KNN to overpredict common classes, and limits evaluation of rare pairing categories.

  • Limited Feature Richness: The model used categorical chemical levels, not the raw numeric chemistry. This removed nuance, such as differences between moderate vs slightly-high acidity, sugar–acid balance, or alcohol effects.

  • KNN Model Constraints: KNN struggles when classes overlap (as they do here) and many categories exist. More advanced models would likely handle this structure better.

  • Incomplete Pairing Labels Due to Limited Data: We originally attempted to generate good, neutral, and bad food pairings from the dataset, which resulted in highly unstable labels. Many wine–style combinations contained no reliable negative examples, and the cleaning process (which prioritized good over neutral and bad) removed most remaining “bad” entries entirely. This limits the model’s ability to learn full-spectrum pairing behavior.

Future Work¶

  • Expand and Balance the Pairing Dataset: Obtaining a larger, real-world dataset with more evenly represented food categories would address the strong class imbalance and allow the model to learn more robust pairing patterns across all cuisines.

  • Incorporate Raw Numerical Chemistry and Feature Engineering: Using continuous chemical measurements, such as exact pH, residual sugar, acidity ratios, alcohol levels, and sulphates, would capture nuance lost in categorical labels and provide models with richer information about wine structure.

  • Adopt More Advanced Modeling Techniques: Tree-based models (Random Forest, XGBoost) or neural networks could better handle overlapping classes and nonlinear relationships, offering significant performance improvements over KNN.

  • Develop Stable Multi-Label Pairing Predictions: Future work should include reliable data on neutral and bad pairings, enabling models to recommend multiple food categories per wine and capture the full spectrum of pairing outcomes rather than only “good” matches.

Relevant Resources¶

Puckette, M. (n.d.). Food and Wine Pairing Basics (Start Here!). Wine Folly. Retrieved December 8, 2025, from https://winefolly.com/wine-pairing/getting-started-with-food-and-wine-pairing/

  • How Wine Folly Supports Our Findings: Wine Folly identifies acidity, sweetness, bitterness (tannins), and intensity (body) as the core drivers of food–wine interactions. This directly reinforces our conclusion that chemical composition, not wine type alone, determines pairing success.

  • Alignment With Project Strengths: Wine Folly’s framework validates our chemistry-based approach. Our discretized features (acidity, sweetness, body, and tannin level) parallel the exact taste dimensions the article highlights. The consistency between our heatmaps and Wine Folly’s pairing rules strengthens the credibility and interpretability of our EDA and KNN model.

  • Reinforcing Our Limitations: The article stresses nuance in acidity, sweetness, and intensity that we lose with categorical labels. Wine Folly also discusses clashing interactions, illuminating why sparse negative labels limited our model. Finally, the article’s diverse pairing examples highlight how imbalanced food categories can distort predictive modeling.

  • Implications for Future Work: Wine Folly’s principles support our recommendations: use raw chemical values, incorporate richer pairing data (including negative matches), and adopt more expressive models to capture nonlinear taste interactions. This would better reflect the multidimensional pairing logic described in Wine Folly.

In [ ]:
%shell jupyter nbconvert --to html /content/data-science-final-project/DataScienceProject.ipynb --output index.html
[NbConvertApp] Converting notebook /content/data-science-final-project/DataScienceProject.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 6 image(s).
[NbConvertApp] Writing 1093093 bytes to /content/data-science-final-project/index.html