Top 10 machine learning algorithms



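1. KNN classification

  • Use cases: classification and regression, and multi-class problems in particular. It is also used to classify rare events, although class imbalance works against it (see the drawbacks).
  • Advantages: simple, easy to understand and implement, with no parameters to estimate. It handles multi-class problems naturally and, for a reasonably large K, is fairly insensitive to outliers.
  • Drawbacks: prediction is computationally expensive, especially with many features. It is hard to interpret because it produces no explicit rules the way a decision tree does. It is a lazy learner that does almost no work at training time, so prediction is slower than with models such as logistic regression. With imbalanced classes, accuracy on the rare classes is low. It depends heavily on the training data and tolerates errors in it poorly.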

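2. Linear regression

  • Use cases: a supervised learning algorithm for predicting numeric values; it works best when the target is roughly linear in the features and the feature space is not too large.
  • Advantages: a simple model that is easy to understand and implement, and cheap to compute.
  • Drawbacks: sensitive to outliers and to correlated features (multicollinearity), limited in how large a feature space it handles well, and weak on nonlinear relationships.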

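3. Logistic regression

  • Use cases: most often used for binary classification, and it extends to multi-class problems. It can estimate the probability that an event occurs.
  • Advantages: highly interpretable, well suited to linearly separable data, cheap to compute, and usable for multi-class problems.
  • Drawbacks: sensitive to outliers and to correlated features, limited in how large a feature space it handles well, and weak when the decision boundary is nonlinear.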
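The probability estimate mentioned above can be sketched as follows, assuming scikit-learn is installed. The synthetic dataset and the names clf, proba, and labels are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 200 samples, 2 informative features, 2 classes
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

# Each row holds [P(class 0), P(class 1)] for one sample
proba = clf.predict_proba(X[:5])
print(proba)

# predict() returns the class with the larger probability
labels = clf.predict(X[:5])
print(labels)
```

Each row of proba sums to 1, so the second column can be read directly as the estimated probability of the positive class.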

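4. Support vector machine (SVM)

  • Use cases: classification and regression, especially nonlinear problems and high-dimensional data.
  • Advantages: handles high-dimensional problems, works with small sample sizes, captures interactions between nonlinear features, has no local-minimum problem, and generalizes well.
  • Drawbacks: training is inefficient when there are many samples; there is no general recipe for nonlinear problems, and a suitable kernel function can be hard to find; the high-dimensional mapping done by a kernel, the radial basis function in particular, is hard to interpret; the standard SVM supports only binary classification; it is sensitive to missing data.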
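To illustrate why the choice of kernel matters, here is a small sketch that compares a linear kernel with an RBF kernel on data that no straight line can separate. The make_circles dataset and its settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Same classifier, two different kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On data like this the RBF kernel should score clearly higher, because it can draw a closed boundary around the inner ring.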

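5. Decision tree

  • Use cases: classification and regression, with both continuous and categorical fields.
  • Advantages: produces rules that people can read, needs relatively little computation, handles continuous and categorical fields, and shows clearly which fields matter most.
  • Drawbacks: continuous fields are harder to predict; time-ordered data needs a lot of preprocessing; errors can grow quickly when there are many classes; it does not do well on data with strongly correlated features.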
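The readable rules mentioned above can be printed with scikit-learn's export_text helper. This is a minimal sketch on the iris dataset; the depth limit of 2 is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the fitted tree as nested if/else style rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```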

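6. Random forest

  • Use cases: classification and regression on high-dimensional data, usually without a separate feature-selection step, and with some tolerance for missing values and outliers (depending on the implementation).
  • Advantages: randomly sampling features and samples reduces the risk of overfitting; it handles high-dimensional data without feature selection, tolerates missing values and outliers, and scores the importance of each feature, which helps with feature selection and model interpretation.
  • Drawbacks: training takes longer than for a single decision tree because many trees have to be built, and the resulting model is complex and hard to interpret.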
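The feature-importance scores mentioned above are exposed by scikit-learn as the feature_importances_ attribute. A minimal sketch on the iris dataset, with illustrative variable names:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# One score per feature; the scores sum to 1
importances = forest.feature_importances_

# Print the features from most to least important
ranked = sorted(zip(iris.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```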

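7. Naive Bayes

  • Use cases: text classification, sentiment analysis, diagnostic support in medicine, and similar tasks.
  • Advantages: simple to understand and implement, performs well on small datasets, and is not very sensitive to missing data.
  • Drawbacks: it assumes the features are mutually independent, which often does not hold in practice, and it is sensitive to how the input data is prepared (discretization, feature selection, and so on).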
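As a sketch of the text-classification use case, the example below feeds word counts into a multinomial Naive Bayes model. The four-sentence corpus and its labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive review, 0 = negative review
texts = [
    "great movie, loved it",
    "wonderful acting, great plot",
    "terrible movie, hated it",
    "awful plot, terrible acting",
]
labels = [1, 1, 0, 0]

# CountVectorizer turns each text into word counts; MultinomialNB models those counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["great acting", "terrible plot"])
print(pred)  # [1 0]
```

The words "acting" and "plot" appear in both classes, so the prediction is driven by "great" and "terrible", each of which occurs in only one class.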

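8. Gradient boosting

  • Use cases: regression (linear and nonlinear); also binary classification (apply a threshold: above it is positive, otherwise negative) and multi-class classification.
  • Advantages: handles many kinds of data, both continuous and discrete values; reaches fairly high accuracy with relatively little tuning; is robust to outliers when a robust loss function is used; and combines weak learners in sequence while weighting each one's contribution.
  • Drawbacks: the weak learners depend on one another, so training is hard to parallelize.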
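A minimal sketch of the nonlinear regression use case with a robust loss, assuming scikit-learn's GradientBoostingRegressor. The noisy sine data and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical nonlinear data: a sine curve with Gaussian noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# loss="huber" is one of the robust losses; it limits the influence of outliers
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=2, loss="huber", random_state=0)
gbr.fit(X, y)

# R^2 on the training data, just to confirm the curve was captured
r2 = gbr.score(X, y)
print("training R^2:", r2)
```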

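9. Ensemble learning

  • Use cases: situations that call for better generalization and performance; combining models can reduce overfitting and make predictions more robust.
  • Advantages: reduces overfitting and improves generalization, performance, and robustness.
  • Drawbacks: costs more compute and time, and the combined model is more complex and usually harder to interpret than a single model.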
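One simple way to combine models is majority voting. The sketch below uses scikit-learn's VotingClassifier; the three base models and the iris dataset are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Three different base models; voting="hard" takes the majority class
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)

acc = voter.score(X_test, y_test)
print("ensemble accuracy:", acc)
```

Bagging (as in random forest) and boosting (as in gradient boosting) are other ensemble strategies covered elsewhere in this article.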

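10. Neural network

  • Use cases: complex tasks such as image recognition, speech recognition, and natural language processing.
  • Advantages: learns complex feature representations from the data automatically and is highly expressive.
  • Drawbacks: training is complicated and compute-intensive, and the model overfits easily.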
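As a small-scale sketch of the image-recognition use case, a multilayer perceptron can be trained on scikit-learn's 8x8 digit images. The single hidden layer of 64 units and the other settings are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# 8x8 grayscale digit images, flattened to 64 features each
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Neural networks train more reliably on standardized inputs
scaler = StandardScaler().fit(X_train)

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)

acc = mlp.score(scaler.transform(X_test), y_test)
print("test accuracy:", acc)
```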



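1. KNN classification

How it works: KNN (K-Nearest Neighbors) is an instance-based, or lazy, learning method. To predict the class of a new data point it does not learn a mapping from inputs to outputs during training. Instead, at classification time it compares the point with the training data, finds the K most similar training instances, and decides the new point's label from their labels.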


from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)
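The predictions above are only useful once they are scored. The following sketch repeats the same split and measures test accuracy for a few values of K; the candidate values (1, 3, 5, 7) are an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Test accuracy for several neighbourhood sizes
scores = {}
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))

for k, acc in scores.items():
    print(f"k={k}: accuracy={acc:.3f}")
```

For a real project, cross-validation on the training set would be a safer way to pick K than reusing the test set.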

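2. Linear regression

How it works: linear regression is a supervised learning algorithm for predicting numeric values. It models the relationship between the independent and dependent variables by fitting a best-fit straight line, called the regression line, written as Y = a * X + b.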


from sklearn.linear_model import LinearRegression
import numpy as np

# Create the dataset
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Create the linear regression model
lin_reg = LinearRegression()

# Train the model
lin_reg.fit(X, y)

# Predict
y_pred = lin_reg.predict(X)
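To connect the fitted model back to the line Y = a * X + b, the slope and intercept can be read from the coef_ and intercept_ attributes. This sketch reuses the same five data points and cross-checks the result with np.polyfit, which solves the same least-squares problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 1.3, 3.75, 2.25])

lin_reg = LinearRegression().fit(X, y)

# Slope a and intercept b of the regression line Y = a * X + b
a = lin_reg.coef_[0]
b = lin_reg.intercept_
print(f"a = {a:.3f}, b = {b:.3f}")  # a = 0.425, b = 0.785

# Cross-check with a degree-1 polynomial fit
a_ref, b_ref = np.polyfit(X.ravel(), y, deg=1)
print(f"polyfit: a = {a_ref:.3f}, b = {b_ref:.3f}")
```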

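3. Logistic regression



from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create the dataset
# With only 2 features, n_informative and n_redundant must be set explicitly:
# the defaults (2 informative + 2 redundant) do not fit into 2 features
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Create the logistic regression model
log_reg = LogisticRegression()

# Train the model
log_reg.fit(X, y)

# Predict
y_pred = log_reg.predict(X)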

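4. Support vector machine (SVM)



from sklearn.svm import SVC

# Create the SVM classifier
svm_clf = SVC(kernel='linear')

# Train the model (using X_train, y_train from the KNN example above)
svm_clf.fit(X_train, y_train)

# Predict (using X_test from the KNN example above)
y_pred = svm_clf.predict(X_test)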

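5. Decision tree



from sklearn.tree import DecisionTreeClassifier

# Create the decision tree classifier
dec_tree = DecisionTreeClassifier()

# Train the model (using X_train, y_train from the KNN example above)
dec_tree.fit(X_train, y_train)

# Predict (using X_test from the KNN example above)
y_pred = dec_tree.predict(X_test)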

6. Random forest



from sklearn.ensemble import RandomForestClassifier

# Create the random forest classifier
rand_forest = RandomForestClassifier(n_estimators=100)

# Train the model (using X_train, y_train from the KNN example above)
rand_forest.fit(X_train, y_train)

# Predict (using X_test from the KNN example above)
y_pred = rand_forest.predict(X_test)

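7. Naive Bayes



from sklearn.naive_bayes import GaussianNB

# Create the Naive Bayes classifier
nb = GaussianNB()

# Train the model (using X_train, y_train from the KNN example above)
nb.fit(X_train, y_train)

# Predict (using X_test from the KNN example above)
y_pred = nb.predict(X_test)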

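8. K-means clustering



from sklearn.cluster import KMeans

# Create the K-means model
kmeans = KMeans(n_clusters=3)

# Fit the model (using X_train from the KNN example above; no labels are needed)
kmeans.fit(X_train)

# Assign each sample to its nearest cluster
y_pred = kmeans.predict(X_train)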
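After fitting, a K-means model exposes what it learned. A short sketch on the iris features, where the n_init and random_state settings are assumptions added to make the run repeatable:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One centroid per cluster, in the original feature space
centers = kmeans.cluster_centers_
# Cluster index assigned to each training sample
labels = kmeans.labels_
# Sum of squared distances of samples to their nearest centroid
inertia = kmeans.inertia_

print(centers)
print("inertia:", inertia)
```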

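9. Principal component analysis (PCA)



from sklearn.decomposition import PCA

# Create the PCA model
pca = PCA(n_components=2)

# Fit the model (using X_train from the KNN example above)
pca.fit(X_train)

# Reduce the dimensionality
X_train_pca = pca.transform(X_train)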
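How much information survives the projection can be checked through explained_variance_ratio_. A minimal sketch on the iris features:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance captured by each kept component
ratio = pca.explained_variance_ratio_
print("explained variance ratio:", ratio)
print("total kept:", ratio.sum())
```

If the total is close to 1, the two components preserve most of the variation in the original four features.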

10. Gradient boosting



from sklearn.ensemble import GradientBoostingClassifier

# Create the gradient boosting classifier
gb_clf = GradientBoostingClassifier(n_estimators=100)

# Train the model (using X_train, y_train from the KNN example above)
gb_clf.fit(X_train, y_train)

# Predict (using X_test from the KNN example above)
y_pred = gb_clf.predict(X_test)


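The core idea of KNN (K-Nearest Neighbors) is that, for a sample to be classified, the algorithm finds the K nearest samples in the training set (its K neighbors) and assigns a label based on the neighbors' labels. In Python, scikit-learn makes KNN easy to use, but to show how each algorithm works, the sections below give hand-written versions that do not depend on scikit-learn.


1. KNN classification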


import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))


class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        """X is the feature matrix and y is the label vector"""
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """X is the feature matrix of the test data"""
        predicted_labels = [self._predict(x) for x in X]
        return np.array(predicted_labels)

    def _predict(self, x):
        # Distance from the query sample to every training sample
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        # Labels of those neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Majority vote: the most frequent label is the prediction
        most_common = np.argmax(np.bincount(k_nearest_labels))
        return most_common


# Some training data and labels
X_train = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Some test data
X_test = np.array([[1, 1], [5, 5]])

# Create the KNN classifier with k=3
knn = KNN(k=3)

# Train the model
knn.fit(X_train, y_train)

# Predict
predictions = knn.predict(X_test)

print(predictions)  # Print the predictions



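2. Linear regression

Linear regression is a supervised learning algorithm for predicting numeric values. Its basic form is y = wx + b, where w is the weight, b is the bias term, x is the feature, and y is the target value.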


import numpy as np

# Linear regression trained with gradient descent
class LinearRegression:
    def __init__(self):
        self.w = None
        self.b = None

    # Train the model
    def fit(self, X, y, learning_rate=0.01, n_iterations=1000):
        # Initialize the weights and the bias
        self.w = np.zeros(X.shape[1])
        self.b = 0
        # Gradient descent
        for _ in range(n_iterations):
            # Current predictions
            y_pred = np.dot(X, self.w) + self.b
            # Gradients of the mean squared error
            dw = (-2/X.shape[0]) * np.dot(X.T, (y - y_pred))
            db = (-2/X.shape[0]) * np.sum(y - y_pred)
            # Update the weights and the bias
            self.w -= learning_rate * dw
            self.b -= learning_rate * db

    # Predict new data
    def predict(self, X):
        return np.dot(X, self.w) + self.b

# For simplicity, create some synthetic data
# The true weight is 2.5 and the true bias is 3
X = 2 * np.random.rand(100, 1)
y = 3 + 2.5 * X.squeeze() + np.random.randn(100)

# Create the linear regression model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Predict
predictions = model.predict(X)

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)
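Gradient descent is not the only way to fit this model: ordinary least squares also has a closed-form solution. The sketch below is an illustrative cross-check, not part of the original code. It solves the least-squares problem with np.linalg.lstsq and compares the result with the same gradient-descent update rule used in the class above, run with a larger step size and more iterations (both assumptions) so that it fully converges.

```python
import numpy as np

# Synthetic data with true weight 2.5 and true bias 3
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 3 + 2.5 * X.squeeze() + rng.normal(size=100)

# Closed form: append a column of ones so the bias is estimated with the weight
X_b = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
w_closed, b_closed = theta

# Gradient descent on the mean squared error
w, b = 0.0, 0.0
x = X.squeeze()
for _ in range(5000):
    error = y - (w * x + b)
    w -= 0.05 * (-2 / len(y)) * np.sum(x * error)
    b -= 0.05 * (-2 / len(y)) * np.sum(error)

print("closed form:     ", w_closed, b_closed)
print("gradient descent:", w, b)
```

Both approaches should agree to several decimal places; the closed form is exact, while gradient descent scales better to very large feature sets.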






3. Logistic regression


import numpy as np

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Logistic regression trained with gradient descent
class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    # Train the model
    def fit(self, X, y):
        # Initialize the parameters
        self.weights = np.zeros(X.shape[1])
        self.bias = 0
        # Gradient descent
        for _ in range(self.n_iterations):
            # Linear combination of the inputs
            linear_model = np.dot(X, self.weights) + self.bias
            # Apply the sigmoid function
            y_predicted = sigmoid(linear_model)
            # Gradients of the cross-entropy loss
            dw = (1 / X.shape[0]) * np.dot(X.T, (y_predicted - y))
            db = (1 / X.shape[0]) * np.sum(y_predicted - y)
            # Update the weights and the bias
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    # Predict class labels for new data
    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = sigmoid(linear_model)
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return np.array(y_predicted_cls)

    # Predict probabilities
    def predict_proba(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = sigmoid(linear_model)
        return y_predicted

# Create some synthetic data
# for a binary classification problem
X = np.array([[0.5, 1.5], [1, 2], [2, 2.5], [3, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1])

# Create the logistic regression model
model = LogisticRegression(learning_rate=0.01, n_iterations=1000)

# Train the model
model.fit(X, y)

# Predict
predictions = model.predict(X)

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)







4. Support vector machine (SVM)


import numpy as np

class SVM:
    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iterations = n_iterations
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0

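        for _ in range(self.n_iterations):
            for idx, x_i in enumerate(X):
                # Is the sample on the correct side of the margin?
                condition = y[idx] * (np.dot(x_i, self.w) - self.b) >= 1
                if condition:
                    # Only the regularization term contributes to the gradient
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # The hinge loss is active: also push x_i toward its correct side
                    self.w -= self.lr * (2 * self.lambda_param * self.w - np.dot(x_i, y[idx]))
                    self.b -= self.lr * y[idx]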

    def predict(self, X):
        linear_output = np.dot(X, self.w) - self.b
        return np.sign(linear_output)

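# Some synthetic data with labels -1 and 1
# Note: the two classes overlap ([3, 5] appears with both labels), so a linear SVM cannot fit this set perfectly
X = np.array([[5, 5], [3, 5], [4, 3], [2, 3], [5, 3], [5, 4], [3, 5], [4, 4], [3, 3], [4, 2], [3, 2], [2, 4]])
y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1])

# Create the SVM model
svm = SVM(learning_rate=0.001, lambda_param=0.01, n_iterations=1000)

# Train the model
svm.fit(X, y)

# Predict
predictions = svm.predict(X)

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)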



5. Decision tree


import numpy as np

# Decision tree classifier that splits on discrete feature values
class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    # Information gain of splitting on one feature
    def information_gain(self, X, y, split_attribute_name):
        # Entropy of the original dataset
        parent_entropy = self.calculate_entropy(y)
        # Weighted entropy of the subsets after the split
        values = X[:, split_attribute_name]
        unique_values = np.unique(values)
        weighted_entropy = 0.0
        for value in unique_values:
            sub_y = y[values == value]
            weighted_entropy += (len(sub_y) / len(y)) * self.calculate_entropy(sub_y)
        # Information gain
        information_gain = parent_entropy - weighted_entropy
        return information_gain

    # Entropy of a label vector
    def calculate_entropy(self, y):
        hist = np.bincount(y)
        ps = hist / len(y)
        return -np.sum([p * np.log2(p) for p in ps if p > 0])

    # Find the best feature to split on (-1 if no feature gives a positive gain)
    def best_split(self, X, y):
        best_gain = 0.0
        best_attribute = -1
        num_features = X.shape[1]
        # Try every feature
        for feature in range(num_features):
            gain = self.information_gain(X, y, feature)
            if gain > best_gain:
                best_gain = gain
                best_attribute = feature
        return best_attribute

    # Build a subtree, or return a class label when splitting should stop
    def to_terminal(self, X, y, depth=0):
        # Stop splitting when every sample belongs to the same class
        if len(np.unique(y)) <= 1:
            return np.unique(y)[0]
        # Stop splitting when the maximum depth is reached
        if self.max_depth is not None and depth >= self.max_depth:
            return np.bincount(y).argmax()
        # Return the most common class when no feature gives any information gain
        best_feature = self.best_split(X, y)
        if best_feature == -1:
            return np.bincount(y).argmax()
        return self.graft(X, y, best_feature, depth)

    # Grow the tree: one branch per distinct value of the chosen feature
    def graft(self, X, y, feature, depth=0):
        ret = {feature: {}}

        values = X[:, feature]
        unique_values = np.unique(values)
        for value in unique_values:
            sub_X = X[values == value]
            sub_y = y[values == value]
            ret[feature][value] = self.to_terminal(sub_X, sub_y, depth + 1)
        return ret

    # Train the model
    def fit(self, X, y):
        self.tree = self.to_terminal(X, y)

    # Predict one sample by walking from the root to a leaf
    def predict(self, sample):
        tree = self.tree
        while isinstance(tree, dict):
            feature_index, branches = next(iter(tree.items()))
            # A feature value never seen in training has no branch and yields None
            tree = branches.get(sample[feature_index])
        return tree

# Example data
X = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Create the decision tree classifier
dt = DecisionTreeClassifier(max_depth=2)

# Train the model
dt.fit(X, y)

# Predict
predictions = [dt.predict(sample) for sample in X]

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)



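6. Random forest

Random forest is an ensemble learning method that builds many decision trees and combines their predictions to improve the overall performance and accuracy of the model. Below is a simple hand-written random forest classifier in Python: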

import numpy as np

class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        self.tree = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
        if len(np.unique(y)) == 1 or depth == self.max_depth:
            return np.bincount(y).argmax()

        num_samples, num_features = X.shape
        if num_samples <= 1 or num_features == 0:
            return np.bincount(y).argmax()

        best_feature, best_threshold = self._best_split(X, y)
        if best_feature is None:
            return np.bincount(y).argmax()

        left_indices = X[:, best_feature] < best_threshold
        right_indices = X[:, best_feature] >= best_threshold

        left_sub_tree = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_sub_tree = self._grow_tree(X[right_indices], y[right_indices], depth + 1)

        return {best_feature: (best_threshold, left_sub_tree, right_sub_tree)}

    def _best_split(self, X, y):
        best_info_gain = -1
        best_feature, best_threshold = None, None
        num_samples, num_features = X.shape

        for feature in range(num_features):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_indices = X[:, feature] < threshold
                right_indices = X[:, feature] >= threshold

                # Skip splits that leave one side empty
                # (the masks are boolean, so count members with sum, not len)
                if left_indices.sum() == 0 or right_indices.sum() == 0:
                    continue

                p_left = left_indices.sum() / num_samples
                p_right = right_indices.sum() / num_samples

                left_entropy = self._calculate_entropy(y[left_indices])
                right_entropy = self._calculate_entropy(y[right_indices])
                info_gain = self._calculate_info_gain(
                    self._calculate_entropy(y), p_left * left_entropy + p_right * right_entropy
                )

                if info_gain > best_info_gain:
                    best_info_gain = info_gain
                    best_feature, best_threshold = feature, threshold

        return best_feature, best_threshold

    def _calculate_info_gain(self, parent_entropy, child_entropy):
        return parent_entropy - child_entropy

    def _calculate_entropy(self, y):
        hist = np.bincount(y)
        ps = hist / len(y)
        return -np.sum([p * np.log2(p) for p in ps if p > 0])

    def predict(self, X):
        return [self._predict(sample, self.tree) for sample in X]

    def _predict(self, sample, tree):
        # Internal nodes are dicts; leaves are plain class labels
        if isinstance(tree, dict):
            feature, (threshold, left, right) = next(iter(tree.items()))
            if sample[feature] < threshold:
                return self._predict(sample, left)
            return self._predict(sample, right)
        return tree

class RandomForestClassifier:
    def __init__(self, n_trees=10, max_depth=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.trees = []

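    def fit(self, X, y):
        self.trees = []
        for _ in range(self.n_trees):
            tree = DecisionTreeClassifier(max_depth=self.max_depth)
            # Bootstrap sample: draw len(X) rows with replacement
            # (this simple version does not also subsample the features)
            indices = np.random.choice(len(X), len(X), replace=True)
            tree.fit(X[indices], y[indices])
            self.trees.append(tree)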

    def predict(self, X):
        predictions = [tree.predict(X) for tree in self.trees]
        return np.array([np.bincount(pred).argmax() for pred in np.array(predictions).T])

# Example data
X = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Create the random forest classifier
rf = RandomForestClassifier(n_trees=10, max_depth=2)

# Train the model
rf.fit(X, y)

# Predict
predictions = rf.predict(X)

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)



7. Naive Bayes


import numpy as np

class NaiveBayesClassifier:
    def __init__(self):
        self.class_prior_prob = None
        self.conditional_prob = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)

        # Initialize the probability tables
        self.class_prior_prob = np.zeros(n_classes)
        self.conditional_prob = np.zeros((n_classes, n_features))

        for idx, c in enumerate(self.classes):
            X_c = X[y == c]
            # Prior probability of each class
            self.class_prior_prob[idx] = len(X_c) / n_samples
            # P(feature = 1 | class), assuming binary 0/1 features.
            # Laplace smoothing (+1 / +2) keeps every probability away from 0 and 1.
            self.conditional_prob[idx, :] = (X_c.sum(axis=0) + 1) / (len(X_c) + 2)

    def predict(self, X):
        y_pred = [self._predict(sample) for sample in X]
        return np.array(y_pred)

    def _predict(self, sample):
        # Log posterior (up to a constant) of each class
        posterior_prob = []
        for idx, c in enumerate(self.classes):
            prior = np.log(self.class_prior_prob[idx])
            p = self.conditional_prob[idx, :]
            # Bernoulli likelihood: use p where the feature is 1 and (1 - p) where it is 0
            likelihood = np.sum(sample * np.log(p) + (1 - sample) * np.log(1 - p))
            posterior = prior + likelihood
            posterior_prob.append(posterior)

        # Return the class with the highest posterior
        return self.classes[np.argmax(posterior_prob)]

# Example data (binary features)
X = np.array([
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1]
])
y = np.array([0, 1, 0, 1])

# Create the Naive Bayes classifier
nb = NaiveBayesClassifier()

# Train the model
nb.fit(X, y)

# Predict
predictions = nb.predict(X)

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)



8. K-means clustering


import numpy as np

class KMeans:
    def __init__(self, K=3, max_iters=100):
        self.K = K
        self.max_iters = max_iters
        self.centroids = None
        self.clusters = None

    def fit(self, X):
        # Initialize the centroids
        self.centroids = self._init_centroids(X, self.K)
        for _ in range(self.max_iters):
            # Assign every point to its nearest centroid
            self.clusters = self._assign_clusters(X, self.centroids)
            # Update the centroids
            self.centroids = self._update_centroids(X, self.clusters)

    def _init_centroids(self, X, K):
        # Pick K random data points as the initial centroids
        indices = np.random.choice(X.shape[0], K, replace=False)
        return X[indices, :]

    def _assign_clusters(self, X, centroids):
        # Compute the distance from each point to each centroid and assign the point to the nearest one
        clusters = {}
        for x in X:
            distances = np.linalg.norm(x - centroids, axis=1)
            cluster_idx = np.argmin(distances)
            if cluster_idx not in clusters:
                clusters[cluster_idx] = []
            clusters[cluster_idx].append(x)
        return clusters

    def _update_centroids(self, X, clusters):
        # New centroid of each cluster, kept in centroid-index order
        new_centroids = []
        for idx in range(self.K):
            if idx in clusters:
                new_centroids.append(np.mean(clusters[idx], axis=0))
            else:
                # No point was assigned to this centroid: keep it where it is
                new_centroids.append(self.centroids[idx])
        return np.array(new_centroids)

    def predict(self, X):
        # Group new data points by their nearest centroid
        return self._assign_clusters(X, self.centroids)

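# Example data
X = np.array([
    [1, 2],
    [1, 4],
    [1, 0],
    [10, 2],
    [10, 4],
    [10, 0]
])

# Create the K-means model
kmeans = KMeans(K=2, max_iters=100)

# Train the model
kmeans.fit(X)

# Print the centroids
print("Centroids:", kmeans.centroids)

# Assign the data points to clusters
clusters = kmeans.predict(X)

# Print the clustering result
for idx, cluster in clusters.items():
    print(f"Cluster {idx}: {cluster}")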





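9. Principal component analysis (PCA)


import numpy as np

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components_ = None
        self.explained_variance_ = None
        self.mean_ = None

    def fit(self, X):
        # Remember the feature means so transform() can center the data
        self.mean_ = np.mean(X, axis=0)
        # Covariance matrix of the features (np.cov centers the data itself)
        cov_matrix = np.cov(X.T)
        # Eigenvalues and eigenvectors of the covariance matrix
        eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
        # Sort both by decreasing eigenvalue
        idx = eigen_values.argsort()[::-1]
        eigen_values = eigen_values[idx]
        eigen_vectors = eigen_vectors[:, idx]
        # Keep the first n_components principal components
        self.components_ = eigen_vectors[:, :self.n_components]
        # Fraction of the total variance explained by each kept component
        total_variance = np.sum(eigen_values)
        explained_variance_ratio = (eigen_values / total_variance)[:self.n_components]
        self.explained_variance_ = explained_variance_ratio

    def transform(self, X):
        # Center the data, then project it onto the principal components
        return np.dot(X - self.mean_, self.components_)

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

# Example data
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

# Create the PCA model
pca = PCA(n_components=2)

# Fit the model and transform the data
X_pca = pca.fit_transform(X)

# Print the transformed data
print("Transformed data:")
print(X_pca)

# Print the explained variance ratio
print("Explained variance:")
print(pca.explained_variance_)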


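10. Gradient boosting

Gradient boosting is a powerful ensemble learning algorithm that minimizes a loss function by training decision trees iteratively. Below is a simple hand-written gradient boosting classifier in Python: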

import numpy as np

# Decision stump (a one-level regression tree)
class DecisionStump:
    def __init__(self):
        self.threshold = None
        self.feature_idx = None
        self.value_left = None
        self.value_right = None

    def fit(self, X, y, loss_fn):
        n_samples, n_features = X.shape
        best_loss = np.inf

        for feature_idx in range(n_features):
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                values_left = y[left_mask]
                values_right = y[~left_mask]

                # Skip splits that leave one side empty
                if len(values_left) == 0 or len(values_right) == 0:
                    continue

                # Each leaf predicts the mean of the targets that fall into it
                mean_left = np.mean(values_left)
                mean_right = np.mean(values_right)
                loss_left = loss_fn(values_left, mean_left)
                loss_right = loss_fn(values_right, mean_right)

                # Sample-weighted loss of the split
                loss = (len(values_left) * loss_left + len(values_right) * loss_right) / n_samples

                if loss < best_loss:
                    best_loss = loss
                    self.threshold = threshold
                    self.feature_idx = feature_idx
                    self.value_left = mean_left
                    self.value_right = mean_right

    def predict(self, X):
        go_left = X[:, self.feature_idx] <= self.threshold
        return np.where(go_left, self.value_left, self.value_right)

# Gradient boosting classifier for 0/1 labels
class GradientBoostingClassifier:
    def __init__(self, n_estimators=100, learning_rate=1.0):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.estimators = []

    def fit(self, X, y):
        n_samples, _ = X.shape
        # Start from the mean label as the initial prediction
        self.y_mean_ = np.mean(y)

        self.predictions_ = np.full(n_samples, self.y_mean_)
        self.estimators = []

        for _ in range(self.n_estimators):
            stump = DecisionStump()
            # Residuals are the negative gradient of the squared error
            residuals = y - self.predictions_
            stump.fit(X, residuals, self._loss_fn)

            predictions = stump.predict(X)
            self.predictions_ += self.learning_rate * predictions
            self.estimators.append(stump)

    def predict(self, X):
        predictions = np.full(X.shape[0], self.y_mean_)
        for estimator in self.estimators:
            predictions += self.learning_rate * estimator.predict(X)

        # Scores above 0.5 are classified as 1
        return np.where(predictions > 0.5, 1, 0)

    def _loss_fn(self, y_true, y_pred):
        # Mean squared error. The stumps fit real-valued residuals,
        # so binary cross-entropy is not applicable here.
        return np.mean((y_true - y_pred) ** 2)

# Example data
X = np.array([
    [1, 2],
    [2, 3],
    [3, 1],
    [6, 5],
    [7, 7],
    [8, 6]
])
y = np.array([0, 0, 0, 1, 1, 1])

# Create the gradient boosting classifier
gbc = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1)

# Train the model
gbc.fit(X, y)

# Predict
predictions = gbc.predict(X)

# Print the predictions and the true values
print("Predictions:", predictions)
print("Real values:", y)



