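Preface

This article grew out of a homework assignment for my "Machine Learning and Artificial Intelligence" course. The task was to classify samples as male or female with a decision tree. The problem itself is quite simple, but since the instructor asked for a written submission, I did a brief analysis and wrote it up here.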

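Dataset Analysis

Figure 1 shows the dataset for this problem. Each sample has four attributes: hair (头发), voice (声音), face shape (脸型), and skin texture (肤质). Each attribute takes one of two values: long or short (长, 短), deep or thin (粗, 细), square or round (方, 圆), and rough or smooth (粗糙, 细腻). There are 10 samples in total, 5 male and 5 female.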

Building the Decision Tree

Since each sample has several attributes, which attribute should we split on?

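Splitting follows a principle: the samples that reach each branch node should belong to a single class as far as possible, that is, the purity of the nodes should keep increasing. This brings to mind information entropy from information theory. Entropy measures disorder: the larger the entropy, the greater the uncertainty and the more information is carried. Conversely, the purer the sample set, the lower its entropy. Suppose the proportion of class-$k$ samples in the sample set $D$ is $p_k$ ($k = 1, 2, \ldots, |\mathcal{Y}|$, where $|\mathcal{Y}|$ is the number of classes). The information entropy of $D$ is then:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$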

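The smaller $\mathrm{Ent}(D)$ is, the higher the purity of $D$. Take the male/female dataset above as an example: there are two classes, with 5 male and 5 female samples, so the information entropy of the sample set is:

$$\mathrm{Ent}(D) = -\left(\frac{5}{10}\log_2\frac{5}{10} + \frac{5}{10}\log_2\frac{5}{10}\right) = 1$$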
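As a quick check of the entropy calculation, here is a minimal sketch that computes entropy directly from class counts. The function name `entropy_from_counts` is made up for this illustration and is not part of the program later in the article.

```python
import math

def entropy_from_counts(counts):
    """Base-2 information entropy of a class distribution given as raw counts."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c == 0:
            continue  # by convention, 0 * log2(0) contributes nothing
        p = c / total
        ent -= p * math.log2(p)
    return ent

print(entropy_from_counts([5, 5]))   # 5 males, 5 females -> 1.0
print(entropy_from_counts([10, 0]))  # a pure set -> 0.0
```

A perfectly balanced two-class set reaches the maximum of 1 bit, while a pure set has entropy 0, which matches the idea that lower entropy means higher purity.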

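Besides the entropy of the sample set, one more concept is needed: information gain. The information gain obtained by splitting the sample set on an attribute is the total entropy of the sample set minus the weighted sum of the entropies of that attribute's branches, where each branch's weight is its sample count divided by the total sample count. In general, the larger the information gain, the greater the "purity improvement" obtained by splitting on that attribute. We therefore prefer to split on the attribute with the largest information gain.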
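The verbal definition can also be written as a formula. The notation here is an assumption chosen for illustration: $a$ is the attribute, $V$ is the number of values it can take, and $D^v$ is the subset of $D$ whose samples take the $v$-th value of $a$.

```latex
\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
```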

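For convenience, denote the four attributes hair, voice, face shape, and skin texture as $a_1$, $a_2$, $a_3$, and $a_4$. Consider the hair attribute $a_1$: the subset $D^1$ for one hair value contains 1 female and 3 males, and the subset $D^2$ for the other value contains 4 females and 2 males. This gives:

$$\mathrm{Ent}(D^1) = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right) \approx 0.811$$

$$\mathrm{Ent}(D^2) = -\left(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}\right) \approx 0.918$$

The resulting information gain is:

$$\mathrm{Gain}(D, a_1) = \mathrm{Ent}(D) - \left(\frac{4}{10}\,\mathrm{Ent}(D^1) + \frac{6}{10}\,\mathrm{Ent}(D^2)\right) \approx 1 - (0.4 \times 0.811 + 0.6 \times 0.918) \approx 0.125$$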
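The hair-attribute calculation can be reproduced numerically from the counts stated in the text. This is only a sketch: `entropy_from_counts` and `gain_from_counts` are hypothetical helper names, and each branch is given as a (female, male) count pair.

```python
import math

def entropy_from_counts(counts):
    """Base-2 entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_from_counts(parent_counts, branch_counts):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy_from_counts(b) for b in branch_counts)
    return entropy_from_counts(parent_counts) - weighted

# Whole set: 5 females, 5 males. Hair split: (1 female, 3 males) and (4 females, 2 males).
hair_gain = gain_from_counts([5, 5], [[1, 3], [4, 2]])
print(round(hair_gain, 3))  # 0.125
```

Repeating the same calculation for the other three attributes and taking the largest value gives the attribute to split on at the root.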

Implementation

import pandas as pd
import numpy as np

# Compute the information entropy of the labels (the label is the last column)
def cal_information_entropy(data):
    data_label = data.iloc[:,-1]
    label_class = data_label.value_counts() # sample count for each class
    Ent = 0
    for k in label_class.keys():
        p_k = label_class[k]/len(data_label)
        Ent += -p_k*np.log2(p_k)
    return Ent

# Compute the information gain of attribute a on the given data
def cal_information_gain(data, a):
    Ent = cal_information_entropy(data)
    feature_class = data[a].value_counts() # sample count for each value of the attribute
    weighted_ent = 0 # size-weighted sum of the branch entropies
    for v in feature_class.keys():
        weight = feature_class[v]/data.shape[0]
        Ent_v = cal_information_entropy(data.loc[data[a] == v])
        weighted_ent += weight*Ent_v
    return Ent - weighted_ent

# Return the most frequent label
def get_most_label(data):
    data_label = data.iloc[:,-1]
    label_sort = data_label.value_counts(sort=True)
    return label_sort.keys()[0]

# Pick the best feature, i.e. the one with the largest information gain
def get_best_feature(data):
    features = data.columns[:-1]
    res = {}
    for a in features:
        temp = cal_information_gain(data, a)
        res[a] = temp
    res = sorted(res.items(),key=lambda x:x[1],reverse=True)
    return res[0][0]

# Split the data into (attribute value, subset) tuples and drop the feature column that was split on
def drop_exist_feature(data, best_feature):
    attr = pd.unique(data[best_feature])
    new_data = [(nd, data[data[best_feature] == nd]) for nd in attr]
    new_data = [(n[0], n[1].drop([best_feature], axis=1)) for n in new_data]
    return new_data

# Build the decision tree recursively
def create_tree(data):
    data_label = data.iloc[:,-1]
    if len(data_label.value_counts()) == 1: # only one class left
        return data_label.values[0]
    if all(len(data[i].value_counts()) == 1 for i in data.iloc[:,:-1].columns): # all samples share the same feature values: return the majority class
        return get_most_label(data)
    best_feature = get_best_feature(data) # best splitting feature according to information gain
    Tree = {best_feature:{}} # the tree is stored as nested dictionaries
    exist_vals = pd.unique(data[best_feature]) # values of the best feature present in the current data
    if len(exist_vals) != len(column_count[best_feature]): # fewer values than in the full dataset
        no_exist_attr = set(column_count[best_feature]) - set(exist_vals) # the values that are missing here
        for no_feat in no_exist_attr:
            Tree[best_feature][no_feat] = get_most_label(data) # missing values are assigned the current majority class

    for item in drop_exist_feature(data,best_feature): # recurse into each value of the feature
        Tree[best_feature][item[0]] = create_tree(item[1])
    return Tree

# Classify one sample (a dict of feature name -> value) by walking the tree
def predict(Tree , test_data):
    first_feature = list(Tree.keys())[0]
    second_dict = Tree[first_feature]
    input_first = test_data.get(first_feature)
    input_value = second_dict[input_first]
    if isinstance(input_value , dict): # a dict means this branch is a subtree, so keep descending
        class_label = predict(input_value, test_data)
    else:
        class_label = input_value
    return class_label
    
    
# Read the data
data = pd.read_csv('data_word.csv',encoding='gbk')

# Record the values each feature takes in the full dataset, kept as a global variable
column_count = dict([(ds, list(pd.unique(data[ds]))) for ds in data.iloc[:, :-1].columns])

# Build the decision tree
decision_tree = create_tree(data)
print(decision_tree)

test_data_1 = {'头发':'长','声音':'粗','脸型':'方','肤质':'粗糙'}
test_data_2 = {'头发':'短','声音':'粗','脸型':'圆','肤质':'细腻'}
result = predict(decision_tree,test_data_2)
# The conditional is parenthesized so that the prefix is printed for both outcomes
print('Classification result: ' + ('male' if result == 1 else 'female'))
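To make the nested-dictionary representation more concrete, here is a small sketch that walks a hand-written tree. The tree below is hypothetical and is not the output of `create_tree` on the actual dataset. The label 1 for male follows the script, while 0 for female is an assumption. `walk` is an illustrative iterative variant of `predict`, and its `default` parameter is an added assumption for attribute values the tree has no branch for.

```python
# Hypothetical tree in the {feature: {value: subtree_or_label}} format
toy_tree = {'声音': {'细': 0, '粗': {'头发': {'短': 1, '长': 0}}}}

def walk(tree, sample, default=None):
    """Follow the sample's attribute values down the tree until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))      # each internal node has a single feature key
        branches = node[feature]
        value = sample.get(feature)
        if value not in branches:       # unseen or missing value: fall back to the default
            return default
        node = branches[value]
    return node

print(walk(toy_tree, {'头发': '短', '声音': '粗'}))  # 1
print(walk(toy_tree, {'头发': '长', '声音': '细'}))  # 0
```

Returning a default instead of raising a `KeyError` is one possible design choice; the recursive `predict` in the article raises an error if a test sample contains a value with no branch in the tree.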

References

https://blog.csdn.net/IT23131/article/details/121068259
https://zhuanlan.zhihu.com/p/499238588