
What is Machine Learning

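Machine learning is the process of having a computer learn from data and make decisions, loosely in the way a human brain does, rather than completing a specific task by following explicitly programmed instructions as in traditional computing.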

Another way of putting it:

Carbon-based life is the bootloader for silicon-based life.

Who knows?

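Machine learning consists of data and a model. It depends on data, which can be information in any form. The model is the core of machine learning: a mathematical structure used to make predictions from data.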
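
To make "a mathematical structure used to make predictions from data" concrete, here is a minimal sketch that fits a straight line y = w * x + b to a handful of points by ordinary least squares. The toy data and the helper names `fit_line` and `predict` are invented for illustration; real projects would normally use a library rather than hand-written formulas.

```python
# Toy data, made up for illustration: y is exactly 2 * x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Fit y = w * x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(w, b, x):
    """Use the learned parameters to predict y for an unseen x."""
    return w * x + b


w, b = fit_line(xs, ys)  # "learning": the parameters come from the data
print(f"w = {w:.2f}, b = {b:.2f}")
print(f"prediction for x = 10: {predict(w, b, 10.0):.2f}")
```

Here the data are the `xs` and `ys` lists, the model is the line with its two parameters `w` and `b`, and learning means choosing those parameters from the data instead of hard-coding them.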

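Depending on the type of data provided and how the model learns, machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning: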

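  1. In supervised learning, the model is given feature data for a set of target objects together with the corresponding labels, and it learns to predict those labels. Representative algorithms include linear regression, logistic regression, and decision trees.

  2. In unsupervised learning, the computer is given only input data, and the model tries to find patterns and groupings in it. Representative algorithms include clustering and principal component analysis (PCA).

  3. Reinforcement learning is similar to training a pet: the model is rewarded or penalized after each trial-and-error attempt until it behaves as desired. Representative algorithms include Q-learning and the deep Q-network (DQN).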
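
Supervised learning gets a full walkthrough later in this post, so the sketch below illustrates the reward-and-penalty idea behind reinforcement learning instead. It is a minimal tabular Q-learning loop in plain Python. The five-cell corridor environment, the reward values, and the hyperparameters are all assumptions chosen for illustration, not taken from any library.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate


def step(state, action):
    """Move one cell; reward +1 at the goal, a small penalty otherwise."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.01
    return next_state, reward, done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Epsilon-greedy: mostly exploit the table, sometimes explore.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate towards
            # reward + discounted value of the best next action.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy action for every non-goal cell after training.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy should prefer action 1 (step right) in every cell, because only moving right ever leads to the reward.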

Machine Learning Library

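To make it easier for developers to build, train, and evaluate machine learning models, a number of machine learning libraries have been developed. These libraries consist of pre-written code and tools that provide many commonly used algorithms and data-processing utilities, making machine learning more efficient and convenient to implement.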

Commonly used machine learning libraries include Scikit-learn, TensorFlow, and PyTorch.

How to Do Machine Learning

The following uses Scikit-learn as an example to run quickly through a machine learning workflow.

Install Necessary Packages

pip install scikit-learn pandas
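
As an optional check that the installation worked, the following sketch uses the standard-library `importlib.metadata` module (Python 3.8+) to look up the installed versions. The helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names as they are passed to pip.
for name in ("scikit-learn", "pandas"):
    print(name, installed_version(name) or "not installed")
```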

Hello World

# Import the required packages
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset (the iris dataset bundled with Scikit-learn)
iris = datasets.load_iris()
X = iris.data  # features
y = iris.target  # labels

# X = array([[5.1, 3.5, 1.4, 0.2],
#        [4.9, 3. , 1.4, 0.2],
#        [4.7, 3.2, 1.3, 0.2],
#        [4.6, 3.1, 1.5, 0.2],
#        [5. , 3.6, 1.4, 0.2],
#        [5.4, 3.9, 1.7, 0.4],
#        [4.6, 3.4, 1.4, 0.3],
#        [5. , 3.4, 1.5, 0.2],
#        [4.4, 2.9, 1.4, 0.2],
#        [4.9, 3.1, 1.5, 0.1],
#        [5.4, 3.7, 1.5, 0.2],
#        [4.8, 3.4, 1.6, 0.2],
#        [4.8, 3. , 1.4, 0.1],
#        [4.3, 3. , 1.1, 0.1],
#        [5.8, 4. , 1.2, 0.2],
#        [5.7, 4.4, 1.5, 0.4],
#        [5.4, 3.9, 1.3, 0.4],
#        [5.1, 3.5, 1.4, 0.3],
#        [5.7, 3.8, 1.7, 0.3],
#        [5.1, 3.8, 1.5, 0.3],
#        [5.4, 3.4, 1.7, 0.2],
#        [5.1, 3.7, 1.5, 0.4],
#        [4.6, 3.6, 1. , 0.2],
#        [5.1, 3.3, 1.7, 0.5],
#        [4.8, 3.4, 1.9, 0.2],
#        [5. , 3. , 1.6, 0.2],
#        [5. , 3.4, 1.6, 0.4],
#        [5.2, 3.5, 1.5, 0.2],
#        [5.2, 3.4, 1.4, 0.2],
#        [4.7, 3.2, 1.6, 0.2],
#        [4.8, 3.1, 1.6, 0.2],
#        [5.4, 3.4, 1.5, 0.4],
#        [5.2, 4.1, 1.5, 0.1],
#        [5.5, 4.2, 1.4, 0.2],
#        [4.9, 3.1, 1.5, 0.2],
#        [5. , 3.2, 1.2, 0.2],
#        [5.5, 3.5, 1.3, 0.2],
#        [4.9, 3.6, 1.4, 0.1],
#        [4.4, 3. , 1.3, 0.2],
#        [5.1, 3.4, 1.5, 0.2],
#        [5. , 3.5, 1.3, 0.3],
#        [4.5, 2.3, 1.3, 0.3],
#        [4.4, 3.2, 1.3, 0.2],
#        [5. , 3.5, 1.6, 0.6],
#        [5.1, 3.8, 1.9, 0.4],
#        [4.8, 3. , 1.4, 0.3],
#        [5.1, 3.8, 1.6, 0.2],
#        [4.6, 3.2, 1.4, 0.2],
#        [5.3, 3.7, 1.5, 0.2],
#        [5. , 3.3, 1.4, 0.2],
#        [7. , 3.2, 4.7, 1.4],
#        [6.4, 3.2, 4.5, 1.5],
#        [6.9, 3.1, 4.9, 1.5],
#        [5.5, 2.3, 4. , 1.3],
#        [6.5, 2.8, 4.6, 1.5],
#        [5.7, 2.8, 4.5, 1.3],
#        [6.3, 3.3, 4.7, 1.6],
#        [4.9, 2.4, 3.3, 1. ],
#        [6.6, 2.9, 4.6, 1.3],
#        [5.2, 2.7, 3.9, 1.4],
#        [5. , 2. , 3.5, 1. ],
#        [5.9, 3. , 4.2, 1.5],
#        [6. , 2.2, 4. , 1. ],
#        [6.1, 2.9, 4.7, 1.4],
#        [5.6, 2.9, 3.6, 1.3],
#        [6.7, 3.1, 4.4, 1.4],
#        [5.6, 3. , 4.5, 1.5],
#        [5.8, 2.7, 4.1, 1. ],
#        [6.2, 2.2, 4.5, 1.5],
#        [5.6, 2.5, 3.9, 1.1],
#        [5.9, 3.2, 4.8, 1.8],
#        [6.1, 2.8, 4. , 1.3],
#        [6.3, 2.5, 4.9, 1.5],
#        [6.1, 2.8, 4.7, 1.2],
#        [6.4, 2.9, 4.3, 1.3],
#        [6.6, 3. , 4.4, 1.4],
#        [6.8, 2.8, 4.8, 1.4],
#        [6.7, 3. , 5. , 1.7],
#        [6. , 2.9, 4.5, 1.5],
#        [5.7, 2.6, 3.5, 1. ],
#        [5.5, 2.4, 3.8, 1.1],
#        [5.5, 2.4, 3.7, 1. ],
#        [5.8, 2.7, 3.9, 1.2],
#        [6. , 2.7, 5.1, 1.6],
#        [5.4, 3. , 4.5, 1.5],
#        [6. , 3.4, 4.5, 1.6],
#        [6.7, 3.1, 4.7, 1.5],
#        [6.3, 2.3, 4.4, 1.3],
#        [5.6, 3. , 4.1, 1.3],
#        [5.5, 2.5, 4. , 1.3],
#        [5.5, 2.6, 4.4, 1.2],
#        [6.1, 3. , 4.6, 1.4],
#        [5.8, 2.6, 4. , 1.2],
#        [5. , 2.3, 3.3, 1. ],
#        [5.6, 2.7, 4.2, 1.3],
#        [5.7, 3. , 4.2, 1.2],
#        [5.7, 2.9, 4.2, 1.3],
#        [6.2, 2.9, 4.3, 1.3],
#        [5.1, 2.5, 3. , 1.1],
#        [5.7, 2.8, 4.1, 1.3],
#        [6.3, 3.3, 6. , 2.5],
#        [5.8, 2.7, 5.1, 1.9],
#        [7.1, 3. , 5.9, 2.1],
#        [6.3, 2.9, 5.6, 1.8],
#        [6.5, 3. , 5.8, 2.2],
#        [7.6, 3. , 6.6, 2.1],
#        [4.9, 2.5, 4.5, 1.7],
#        [7.3, 2.9, 6.3, 1.8],
#        [6.7, 2.5, 5.8, 1.8],
#        [7.2, 3.6, 6.1, 2.5],
#        [6.5, 3.2, 5.1, 2. ],
#        [6.4, 2.7, 5.3, 1.9],
#        [6.8, 3. , 5.5, 2.1],
#        [5.7, 2.5, 5. , 2. ],
#        [5.8, 2.8, 5.1, 2.4],
#        [6.4, 3.2, 5.3, 2.3],
#        [6.5, 3. , 5.5, 1.8],
#        [7.7, 3.8, 6.7, 2.2],
#        [7.7, 2.6, 6.9, 2.3],
#        [6. , 2.2, 5. , 1.5],
#        [6.9, 3.2, 5.7, 2.3],
#        [5.6, 2.8, 4.9, 2. ],
#        [7.7, 2.8, 6.7, 2. ],
#        [6.3, 2.7, 4.9, 1.8],
#        [6.7, 3.3, 5.7, 2.1],
#        [7.2, 3.2, 6. , 1.8],
#        [6.2, 2.8, 4.8, 1.8],
#        [6.1, 3. , 4.9, 1.8],
#        [6.4, 2.8, 5.6, 2.1],
#        [7.2, 3. , 5.8, 1.6],
#        [7.4, 2.8, 6.1, 1.9],
#        [7.9, 3.8, 6.4, 2. ],
#        [6.4, 2.8, 5.6, 2.2],
#        [6.3, 2.8, 5.1, 1.5],
#        [6.1, 2.6, 5.6, 1.4],
#        [7.7, 3. , 6.1, 2.3],
#        [6.3, 3.4, 5.6, 2.4],
#        [6.4, 3.1, 5.5, 1.8],
#        [6. , 3. , 4.8, 1.8],
#        [6.9, 3.1, 5.4, 2.1],
#        [6.7, 3.1, 5.6, 2.4],
#        [6.9, 3.1, 5.1, 2.3],
#        [5.8, 2.7, 5.1, 1.9],
#        [6.8, 3.2, 5.9, 2.3],
#        [6.7, 3.3, 5.7, 2.5],
#        [6.7, 3. , 5.2, 2.3],
#        [6.3, 2.5, 5. , 1.9],
#        [6.5, 3. , 5.2, 2. ],
#        [6.2, 3.4, 5.4, 2.3],
#        [5.9, 3. , 5.1, 1.8]])

# y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
#        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
#        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Classify with the K-Nearest Neighbors algorithm: create a KNN model
model = KNeighborsClassifier(n_neighbors=3)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
# Model accuracy: 1.00

# Print the classification report
print("Classification report:")
print(classification_report(y_test, y_pred))

# Classification report:
#               precision    recall  f1-score   support

#            0       1.00      1.00      1.00        10
#            1       1.00      1.00      1.00         9
#            2       1.00      1.00      1.00        11

#     accuracy                           1.00        30
#    macro avg       1.00      1.00      1.00        30
# weighted avg       1.00      1.00      1.00        30

# Print the confusion matrix
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
# [[10  0  0]
#  [ 0  9  0]
#  [ 0  0 11]]
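
A single test split of 30 samples is a small basis for an accuracy estimate. One possible follow-up, not part of the walkthrough above, is k-fold cross-validation. The sketch below assumes the same dataset and the same KNN settings, and wraps the scaler and the classifier in a Scikit-learn pipeline. The choice of 5 folds is arbitrary.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Chaining the scaler and the classifier means the scaler is re-fitted
# on the training part of every fold, so no test data leaks into it.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validation: five different train/test splits, five scores.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f}")
```

The mean over the folds is usually a steadier figure than the score from any one split, and the spread between folds gives a rough sense of how much the estimate depends on the split.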

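To go deeper and master machine learning through practice, besides the official websites of the libraries mentioned above, you can find datasets on Kaggle and build projects that interest you, join discussions with other learners in machine learning communities (Stack Overflow, Kaggle Discussions, Reddit), and keep up with developments in the field.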

Skills you will need: linear algebra, probability and statistics, data processing (NumPy and pandas), and visualization (Matplotlib and Seaborn).
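
As a small taste of the data-processing side, the sketch below loads the same iris dataset as a pandas DataFrame and prints a few summary statistics. It assumes a Scikit-learn version that supports the `as_frame=True` option of `load_iris` (0.23 or later) and that pandas is installed.

```python
from sklearn import datasets

# as_frame=True returns the data as pandas objects instead of NumPy arrays.
iris = datasets.load_iris(as_frame=True)
df = iris.frame  # four feature columns plus a "target" column

print(df.shape)
print(df.describe().loc[["mean", "std"]].round(2))

# Mean of each feature per class: a quick look at which features
# differ most between the three species.
print(df.groupby("target").mean().round(2))
```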

Get hands-on, observe closely, and summarize often.

To Be Continued…


BTW, the Royal Swedish Academy of Sciences announced on October 8 that the Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks”.