Analyzing the Iris dataset with ChatGPT's Noteable plugin

created 2023-05-27

The OpenCALM data I wrote about earlier was far too small to be interesting, so this time I asked Noteable to analyze the Iris dataset, the classic dataset that everyone loves and that has probably been analyzed a hundred million times. The results were about what I expected, but it also quickly wrote the code for plotting graphs and for trying several algorithms, which was very convenient. This is a short note about that.

The summary statistics were unsurprising, but the pair plot was nice. I usually end up reading the documentation whenever I write a pair plot, but it generated a clean seaborn plot and colored the points by target (that is, by species).

The only instruction I gave it was something like: "I want to build a model that predicts target using data other than target. What algorithms would be good for building the prediction model? Please answer using Noteable." From that, it split the data into train and test sets at 80/20, wrote code that fits five scikit-learn classifiers, and displayed the actual results. That was convenient, because writing this by hand every time is a chore. This time all five models reached 100% accuracy, but if the accuracy had differed by algorithm, I could probably have asked why and gotten an explanation.

The notebook code automatically created by Noteable looked like this:

from sklearn.model_selection import train_test_split

X = iris_df.drop('target', axis=1)
y = iris_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

models = [
    ('Logistic Regression', LogisticRegression()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('SVM', SVC()),
    ('KNN', KNeighborsClassifier())
]

results = []

for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((name, accuracy))

results

Output:

[('Logistic Regression', 1.0),
 ('Decision Tree', 1.0),
 ('Random Forest', 1.0),
 ('SVM', 1.0),
 ('KNN', 1.0)]

ChatGPT then returned an easy-to-understand explanation of these results.
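A perfect score on a 30-row test set is a coarse measurement, so a single split cannot really separate these models. One way to look closer, which was not part of the notebook Noteable produced, is stratified 5-fold cross-validation over all 150 rows. The sketch below is an illustration under my own assumptions: the fold count, the fixed seeds, and `max_iter=1000` for logistic regression (to avoid convergence warnings on unscaled features) are choices I made here, not settings from the original run.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# Same five model families as the notebook; seeds are fixed for repeatability.
models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("SVM", SVC()),
    ("KNN", KNeighborsClassifier()),
]

# Stratified folds keep the three species balanced in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = []
for name, model in models:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_results.append((name, scores.mean(), scores.std()))

for name, mean, std in cv_results:
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Each model is scored on every row exactly once, so small differences between algorithms are more likely to show up than with one 80/20 split.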

Next came clustering and visualization with dimensionality reduction. This is another task that is mildly tedious to write yourself, because you end up checking the documentation, but it generated the code quickly. It first used PCA for the dimensionality reduction, and when I asked what would happen with t-SNE instead, the graph appeared right away.
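The code for this step is not reproduced in this note either. A minimal sketch of the same idea follows, with several assumptions of my own: k-means with three clusters as the clustering algorithm (this note does not say which one was used), standardized features, and a perplexity of 30 for t-SNE.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; drop this line in a notebook

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Assumption: standardize features so no single measurement dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Assumption: k-means with 3 clusters, matching the three species.
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Project the four features down to two dimensions with each method.
X_pca = PCA(n_components=2, random_state=42).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

# Side-by-side scatter plots, colored by cluster assignment.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, embedding, title in zip(axes, (X_pca, X_tsne), ("PCA", "t-SNE")):
    ax.scatter(embedding[:, 0], embedding[:, 1], c=clusters, cmap="viridis", s=15)
    ax.set_title(f"{title}, colored by k-means cluster")
fig.tight_layout()
# fig.savefig("iris_embeddings.png")  # uncomment to write the figure to disk
```

PCA is a linear projection, so its axes are directions of maximum variance. t-SNE only tries to preserve local neighborhoods, so distances between well-separated groups in its plot should not be read literally.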

[Figure: clustering and dimensionality-reduction plots with PCA and t-SNE]

On its own, ChatGPT cannot properly analyze data it has never seen, although few-shot examples can help. A service that lets it execute notebook cells, observe the results, and then continue the conversation makes up for that weakness well. It reminded me again how impressive Noteable is.

People who do data analysis probably see Iris, think "Ah, another Iris tutorial", and do not feel like analyzing it again. I never expected the day would come when I would voluntarily want to analyze the Iris dataset again.