What is Machine Learning
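Machine learning is a way of getting computers to learn from data and make decisions, loosely analogous to how people learn from experience, instead of completing a specific task by following explicitly programmed instructions as in traditional programming.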
As one saying goes,
carbon-based life is the bootloader for silicon-based life.
Who knows?
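Machine learning consists of data and models. It depends on data, which can be information in any form. The model is the core of machine learning: a mathematical structure used to make predictions based on data.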
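To make "a mathematical structure that predicts from data" concrete, here is a minimal sketch. The data is hypothetical (hours studied versus exam score, invented for illustration), and the model is assumed to be a straight line, score ≈ w * hours + b, fitted with NumPy's `polyfit`. It is one simple instance of a model, not the only form a model can take.

```python
import numpy as np

# Hypothetical toy data, invented for illustration:
# hours studied (feature) and exam score (target), exactly score = 2 * hours + 50.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# The "model" is the line score = w * hours + b.
# polyfit returns coefficients from the highest degree down, so [w, b] for degree 1.
w, b = np.polyfit(hours, scores, deg=1)


def predict(h):
    """Predict a score for h hours using the fitted line."""
    return w * h + b


print(f"w = {w:.2f}, b = {b:.2f}")
print(f"predicted score for 6 hours: {predict(6.0):.1f}")
```

The data determines the parameters `w` and `b`; once they are fixed, the model can make predictions for inputs it has never seen.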
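Depending on the type of data provided and how the model learns, machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning.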
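- In supervised learning, the model is given feature data for a set of objects together with the corresponding labels, and learns to predict those labels. Representative algorithms include linear regression, logistic regression, and decision trees.
- In unsupervised learning, the model is given only input data and tries to find patterns and groupings in it. Representative algorithms include clustering and principal component analysis (PCA).
- Reinforcement learning is similar to training a pet: the model acts by trial and error and receives rewards or penalties, which steer it toward the desired behavior. Representative algorithms include Q-learning and the deep Q-network (DQN).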
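The difference between the first two categories shows up directly in what gets passed to the model. The sketch below is an illustration on Scikit-learn's bundled iris dataset, with arbitrarily chosen hyperparameters (`max_depth=3`, `n_clusters=3`, `n_components=2` are assumptions, not recommendations): the decision tree is fitted on features and labels, while k-means clustering and PCA see the features only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the model is fitted on features AND labels.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
supervised_pred = tree.predict(X[:5])
print("decision tree predictions:", supervised_pred)

# Unsupervised (clustering): only the features are passed in.
# The cluster ids are arbitrary group numbers, not the original labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("cluster ids of the first 5 samples:", cluster_ids[:5])

# Unsupervised (dimensionality reduction): PCA compresses 4 features into 2.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("shape after PCA:", X_2d.shape)
```

Reinforcement learning is left out of this sketch because it needs an environment to interact with rather than a fixed dataset.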
Machine Learning Library
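To make it easier for developers to build, train, and evaluate machine learning models, a number of machine learning libraries have been developed. They consist of prewritten code and tools, and provide many commonly used algorithms and data-processing utilities, which makes machine learning more efficient and convenient to implement.
Commonly used machine learning libraries include Scikit-learn, TensorFlow, and PyTorch.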
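One practical benefit of such a library is that different algorithms share the same interface. As a rough illustration in Scikit-learn, the three classifiers below are all trained with `fit` and evaluated with `score`, so swapping one for another is a one-line change. The choice of models and their settings here is an assumption made for the example, not a benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three different algorithms, one shared fit/score interface.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    # For classifiers, score() returns the accuracy on the given data.
    results[name] = estimator.score(X_test, y_test)
    print(f"{name}: {results[name]:.2f}")
```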
How to Do Machine Learning
Below, using Scikit-learn as an example, we run through a basic machine learning workflow.
Install Necessary Packages
pip install scikit-learn pandas
Hello World
# Import the required packages
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the dataset (the iris dataset bundled with Scikit-learn)
iris = datasets.load_iris()
X = iris.data  # features
y = iris.target  # labels
# X = array([[5.1, 3.5, 1.4, 0.2],
# [4.9, 3. , 1.4, 0.2],
# [4.7, 3.2, 1.3, 0.2],
# [4.6, 3.1, 1.5, 0.2],
# [5. , 3.6, 1.4, 0.2],
# [5.4, 3.9, 1.7, 0.4],
# [4.6, 3.4, 1.4, 0.3],
# [5. , 3.4, 1.5, 0.2],
# [4.4, 2.9, 1.4, 0.2],
# [4.9, 3.1, 1.5, 0.1],
# [5.4, 3.7, 1.5, 0.2],
# [4.8, 3.4, 1.6, 0.2],
# [4.8, 3. , 1.4, 0.1],
# [4.3, 3. , 1.1, 0.1],
# [5.8, 4. , 1.2, 0.2],
# [5.7, 4.4, 1.5, 0.4],
# [5.4, 3.9, 1.3, 0.4],
# [5.1, 3.5, 1.4, 0.3],
# [5.7, 3.8, 1.7, 0.3],
# [5.1, 3.8, 1.5, 0.3],
# [5.4, 3.4, 1.7, 0.2],
# [5.1, 3.7, 1.5, 0.4],
# [4.6, 3.6, 1. , 0.2],
# [5.1, 3.3, 1.7, 0.5],
# [4.8, 3.4, 1.9, 0.2],
# [5. , 3. , 1.6, 0.2],
# [5. , 3.4, 1.6, 0.4],
# [5.2, 3.5, 1.5, 0.2],
# [5.2, 3.4, 1.4, 0.2],
# [4.7, 3.2, 1.6, 0.2],
# [4.8, 3.1, 1.6, 0.2],
# [5.4, 3.4, 1.5, 0.4],
# [5.2, 4.1, 1.5, 0.1],
# [5.5, 4.2, 1.4, 0.2],
# [4.9, 3.1, 1.5, 0.2],
# [5. , 3.2, 1.2, 0.2],
# [5.5, 3.5, 1.3, 0.2],
# [4.9, 3.6, 1.4, 0.1],
# [4.4, 3. , 1.3, 0.2],
# [5.1, 3.4, 1.5, 0.2],
# [5. , 3.5, 1.3, 0.3],
# [4.5, 2.3, 1.3, 0.3],
# [4.4, 3.2, 1.3, 0.2],
# [5. , 3.5, 1.6, 0.6],
# [5.1, 3.8, 1.9, 0.4],
# [4.8, 3. , 1.4, 0.3],
# [5.1, 3.8, 1.6, 0.2],
# [4.6, 3.2, 1.4, 0.2],
# [5.3, 3.7, 1.5, 0.2],
# [5. , 3.3, 1.4, 0.2],
# [7. , 3.2, 4.7, 1.4],
# [6.4, 3.2, 4.5, 1.5],
# [6.9, 3.1, 4.9, 1.5],
# [5.5, 2.3, 4. , 1.3],
# [6.5, 2.8, 4.6, 1.5],
# [5.7, 2.8, 4.5, 1.3],
# [6.3, 3.3, 4.7, 1.6],
# [4.9, 2.4, 3.3, 1. ],
# [6.6, 2.9, 4.6, 1.3],
# [5.2, 2.7, 3.9, 1.4],
# [5. , 2. , 3.5, 1. ],
# [5.9, 3. , 4.2, 1.5],
# [6. , 2.2, 4. , 1. ],
# [6.1, 2.9, 4.7, 1.4],
# [5.6, 2.9, 3.6, 1.3],
# [6.7, 3.1, 4.4, 1.4],
# [5.6, 3. , 4.5, 1.5],
# [5.8, 2.7, 4.1, 1. ],
# [6.2, 2.2, 4.5, 1.5],
# [5.6, 2.5, 3.9, 1.1],
# [5.9, 3.2, 4.8, 1.8],
# [6.1, 2.8, 4. , 1.3],
# [6.3, 2.5, 4.9, 1.5],
# [6.1, 2.8, 4.7, 1.2],
# [6.4, 2.9, 4.3, 1.3],
# [6.6, 3. , 4.4, 1.4],
# [6.8, 2.8, 4.8, 1.4],
# [6.7, 3. , 5. , 1.7],
# [6. , 2.9, 4.5, 1.5],
# [5.7, 2.6, 3.5, 1. ],
# [5.5, 2.4, 3.8, 1.1],
# [5.5, 2.4, 3.7, 1. ],
# [5.8, 2.7, 3.9, 1.2],
# [6. , 2.7, 5.1, 1.6],
# [5.4, 3. , 4.5, 1.5],
# [6. , 3.4, 4.5, 1.6],
# [6.7, 3.1, 4.7, 1.5],
# [6.3, 2.3, 4.4, 1.3],
# [5.6, 3. , 4.1, 1.3],
# [5.5, 2.5, 4. , 1.3],
# [5.5, 2.6, 4.4, 1.2],
# [6.1, 3. , 4.6, 1.4],
# [5.8, 2.6, 4. , 1.2],
# [5. , 2.3, 3.3, 1. ],
# [5.6, 2.7, 4.2, 1.3],
# [5.7, 3. , 4.2, 1.2],
# [5.7, 2.9, 4.2, 1.3],
# [6.2, 2.9, 4.3, 1.3],
# [5.1, 2.5, 3. , 1.1],
# [5.7, 2.8, 4.1, 1.3],
# [6.3, 3.3, 6. , 2.5],
# [5.8, 2.7, 5.1, 1.9],
# [7.1, 3. , 5.9, 2.1],
# [6.3, 2.9, 5.6, 1.8],
# [6.5, 3. , 5.8, 2.2],
# [7.6, 3. , 6.6, 2.1],
# [4.9, 2.5, 4.5, 1.7],
# [7.3, 2.9, 6.3, 1.8],
# [6.7, 2.5, 5.8, 1.8],
# [7.2, 3.6, 6.1, 2.5],
# [6.5, 3.2, 5.1, 2. ],
# [6.4, 2.7, 5.3, 1.9],
# [6.8, 3. , 5.5, 2.1],
# [5.7, 2.5, 5. , 2. ],
# [5.8, 2.8, 5.1, 2.4],
# [6.4, 3.2, 5.3, 2.3],
# [6.5, 3. , 5.5, 1.8],
# [7.7, 3.8, 6.7, 2.2],
# [7.7, 2.6, 6.9, 2.3],
# [6. , 2.2, 5. , 1.5],
# [6.9, 3.2, 5.7, 2.3],
# [5.6, 2.8, 4.9, 2. ],
# [7.7, 2.8, 6.7, 2. ],
# [6.3, 2.7, 4.9, 1.8],
# [6.7, 3.3, 5.7, 2.1],
# [7.2, 3.2, 6. , 1.8],
# [6.2, 2.8, 4.8, 1.8],
# [6.1, 3. , 4.9, 1.8],
# [6.4, 2.8, 5.6, 2.1],
# [7.2, 3. , 5.8, 1.6],
# [7.4, 2.8, 6.1, 1.9],
# [7.9, 3.8, 6.4, 2. ],
# [6.4, 2.8, 5.6, 2.2],
# [6.3, 2.8, 5.1, 1.5],
# [6.1, 2.6, 5.6, 1.4],
# [7.7, 3. , 6.1, 2.3],
# [6.3, 3.4, 5.6, 2.4],
# [6.4, 3.1, 5.5, 1.8],
# [6. , 3. , 4.8, 1.8],
# [6.9, 3.1, 5.4, 2.1],
# [6.7, 3.1, 5.6, 2.4],
# [6.9, 3.1, 5.1, 2.3],
# [5.8, 2.7, 5.1, 1.9],
# [6.8, 3.2, 5.9, 2.3],
# [6.7, 3.3, 5.7, 2.5],
# [6.7, 3. , 5.2, 2.3],
# [6.3, 2.5, 5. , 1.9],
# [6.5, 3. , 5.2, 2. ],
# [6.2, 3.4, 5.4, 2.3],
# [5.9, 3. , 5.1, 1.8]])
# y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
# 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
# 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (fit the scaler on the training set only, then apply it to both sets)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create a K-Nearest Neighbors (KNN) model for classification
model = KNeighborsClassifier(n_neighbors=3)
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
# Model accuracy: 1.00
# Print the classification report
print("Classification report:")
print(classification_report(y_test, y_pred))
# Classification report:
#               precision    recall  f1-score   support
#
#            0       1.00      1.00      1.00        10
#            1       1.00      1.00      1.00         9
#            2       1.00      1.00      1.00        11
#
#     accuracy                           1.00        30
#    macro avg       1.00      1.00      1.00        30
# weighted avg       1.00      1.00      1.00        30
# Print the confusion matrix
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
# [[10  0  0]
#  [ 0  9  0]
#  [ 0  0 11]]
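A single split leaves only 30 test samples, so a perfect score can partly reflect a favorable split. One common way to get a steadier estimate is k-fold cross-validation. The sketch below is an optional extension, not part of the walkthrough above: it wraps the same scaler and KNN model in a pipeline (so the scaler is refitted on each training fold) and uses 5 folds, a number chosen here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline applies scaling and KNN as one estimator, so each fold
# fits the scaler on its own training portion only.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# With an integer cv and a classifier, Scikit-learn uses stratified folds.
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print("accuracy per fold:", cv_scores)
print(f"mean accuracy: {cv_scores.mean():.2f}")
```

If the per-fold accuracies vary noticeably, the mean is a more trustworthy summary than any single split.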
Useful Links
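To go deeper and get hands-on with machine learning, besides the official sites mentioned above, you can find datasets on Kaggle and build projects that interest you, join discussions with other learners in machine learning communities (Stack Overflow, Kaggle Discussions, Reddit), and keep track of how the technology develops.
Skills you will need: linear algebra, probability and statistics, data processing (NumPy & Pandas), and visualization (Matplotlib & Seaborn).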
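As a small taste of the data-processing side, the following sketch loads the same iris dataset into a Pandas DataFrame and summarizes it. The `species` column is added here purely for readability; it is an assumption of this example, not something the dataset ships with.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True returns the data as Pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four feature columns plus a numeric 'target' column

# Map the numeric labels 0/1/2 to readable species names.
label_names = {i: str(name) for i, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(label_names)

print(df.head())      # first five rows
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column

# Mean petal length per species: a quick look at how well a feature separates classes.
petal_means = df.groupby("species")["petal length (cm)"].mean()
print(petal_means)
```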
Practice a lot, observe a lot, and summarize what you learn.
To Be Continued…
BTW, on October 8, 2024, the Royal Swedish Academy of Sciences announced that the Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks”.