CMI-PIU: Features EDA

【Understanding the task】


The aim of this competition is to predict the Severity Impairment Index (sii), which measures the level of problematic internet use among children and adolescents, based on physical activity data and other features.

sii is derived from PCIAT-PCIAT_Total, the sum of scores from the Parent-Child Internet Addiction Test (PCIAT: 20 questions, each scored 0-5, so the total ranges from 0 to 100).


Target Variable (sii) is defined as:

  • 0: None (PCIAT-PCIAT_Total from 0 to 30)

  • 1: Mild (PCIAT-PCIAT_Total from 31 to 49)

  • 2: Moderate (PCIAT-PCIAT_Total from 50 to 79)

  • 3: Severe (PCIAT-PCIAT_Total of 80 or more)

This makes sii an ordinal categorical variable with four levels, where the order of categories is meaningful.
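The thresholds above can be sketched as a small lookup. This is only an illustration, not the competition's official scoring code: the names `total_to_sii`, `SII_UPPER_BOUNDS` and `SII_LABELS` are hypothetical, and the total is assumed to lie between 0 and 100.

```python
from bisect import bisect_left

# Inclusive upper bound of PCIAT-PCIAT_Total for sii levels 0, 1 and 2;
# any total above the last bound falls into level 3 (Severe).
SII_UPPER_BOUNDS = [30, 49, 79]
SII_LABELS = ["None", "Mild", "Moderate", "Severe"]


def total_to_sii(total):
    """Map a PCIAT-PCIAT_Total score (0-100) to its sii level (0-3)."""
    if not 0 <= total <= 100:
        raise ValueError(f"PCIAT total out of range: {total}")
    # bisect_left returns the index of the first bound that is >= total
    return bisect_left(SII_UPPER_BOUNDS, total)


if __name__ == "__main__":
    for score in (0, 30, 31, 49, 50, 79, 80, 100):
        level = total_to_sii(score)
        print(score, level, SII_LABELS[level])
```

Keeping the bounds in one list means the same mapping can be reused later, for example when converting predicted totals back to sii levels.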


Types of machine learning problem we can set up with sii as the target:

  • Ordinal classification (ordinal logistic regression, models with custom ordinal loss functions)

  • Multiclass classification (treat sii as a nominal categorical variable without considering the order)

  • Regression (ignore the discrete nature of the categories, treat sii as a continuous variable, then round the predictions)

  • Custom (e.g. loss functions that penalize errors based on the distance between categories)

We can also use PCIAT-PCIAT_Total as a continuous target variable: fit a regression on PCIAT-PCIAT_Total and then map the predictions to sii categories.
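To make the regression and distance-penalty ideas concrete, here is a minimal sketch in plain Python. The helper names `round_to_sii` and `quadratic_penalty` are hypothetical, and the squared distance is just one possible penalty, chosen here for illustration.

```python
import math


def round_to_sii(prediction):
    """Turn a continuous regression output into an sii level (0-3)."""
    # Round half up, then clip into the valid range of levels
    rounded = math.floor(prediction + 0.5)
    return min(3, max(0, rounded))


def quadratic_penalty(y_true, y_pred):
    """Mean squared distance between true and predicted sii levels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up regression outputs, for illustration only
raw_predictions = [0.2, 1.6, 2.4, 3.9, -0.3]
predicted_levels = [round_to_sii(p) for p in raw_predictions]

# One mistake in four samples: off by one level vs. off by three levels
near_miss = quadratic_penalty([0, 0, 0, 0], [1, 0, 0, 0])
far_miss = quadratic_penalty([0, 0, 0, 0], [3, 0, 0, 0])
```

Plain accuracy would score `near_miss` and `far_miss` identically (one error out of four), whereas a distance-based penalty charges the three-level error more heavily, which is the point of respecting the order of the categories.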


Finally, another strategy is to predict the response to each question of the Parent-Child Internet Addiction Test: predict the individual question scores as separate targets, sum the predicted scores to get PCIAT-PCIAT_Total, and map that total to the corresponding sii category.
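The per-question strategy can be sketched as follows. This is an assumption-laden illustration: `items_to_sii` is a hypothetical helper, the 20 item predictions are assumed to come from separate models that are not shown, and clipping each item to the 0-5 range is a design choice rather than a requirement stated by the competition.

```python
import math


def items_to_sii(item_predictions):
    """Sum 20 predicted PCIAT item scores and map the total to an sii level.

    Returns a (total, sii) tuple.
    """
    if len(item_predictions) != 20:
        raise ValueError("expected one prediction per PCIAT item (20 in total)")
    # Each item is scored 0-5, so clip every prediction into that range
    clipped = [min(5.0, max(0.0, p)) for p in item_predictions]
    # Round the summed score half up to an integer total
    total = math.floor(sum(clipped) + 0.5)
    if total <= 30:
        return total, 0  # None
    if total <= 49:
        return total, 1  # Mild
    if total <= 79:
        return total, 2  # Moderate
    return total, 3      # Severe


if __name__ == "__main__":
    print(items_to_sii([2.5] * 20))
```

Clipping before summing keeps a single wild item prediction from pushing the total outside the 0 to 100 range that the thresholds assume.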

But first, let's do some exploratory data analysis.


【Data Preview】

import pandas as pd
from IPython.display import display  # available by default in notebooks; imported so the code also runs as a script

train = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/train.csv')
test = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/test.csv')
data_dict = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/data_dictionary.csv')

【Train data】

display(train.head())
print(f"Train shape: {train.shape}")

【Test data】

display(test.head())
print(f"Test shape: {test.shape}")

【Data dictionary】

data_dict.head()

【Helper functions】

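def calculate_stats(data, columns):
    """Summarize one or more columns of a DataFrame.

    Object/category columns get a 'count (%)' table of value frequencies
    (missing values included); all other columns get describe() statistics
    plus a 'missing' count.
    """
    # Accept a single column name as well as a list of names
    if isinstance(columns, str):
        columns = [columns]

    stats = []
    for col in columns:
        if data[col].dtype in ['object', 'category']:
            # Categorical column: count and percentage per level, NaN included
            counts = data[col].value_counts(dropna=False, sort=False)
            percents = data[col].value_counts(normalize=True, dropna=False, sort=False) * 100
            formatted = counts.astype(str) + ' (' + percents.round(2).astype(str) + '%)'
            stats_col = pd.DataFrame({'count (%)': formatted})
            stats.append(stats_col)
        else:
            # Numeric column: one row of describe() statistics plus the missing count
            stats_col = data[col].describe().to_frame().transpose()
            stats_col['missing'] = data[col].isnull().sum()
            stats_col.index.name = col
            stats.append(stats_col)

    # Stack the per-column summaries into a single table
    return pd.concat(stats, axis=0)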

【Target Variables and Internet use】

Let's identify the features that are related to the target variable and are not present in the test set.

train_cols = set(train.columns)
test_cols = set(test.columns)
columns_not_in_test = sorted(list(train_cols - test_cols))
data_dict[data_dict['Field'].isin(columns_not_in_test)]

Parent-Child Internet Addiction Test (PCIAT): contains 20 items (PCIAT-PCIAT_01 to PCIAT-PCIAT_20), each assessing a different aspect of a child's behavior related to internet use. Each item is answered on a scale from 0 to 5, and the total score indicates the severity of internet addiction.
We also have the season of participation in PCIAT-Season and the total score in PCIAT-PCIAT_Total, so there are 22 PCIAT-related columns in total.
Let's verify that PCIAT-PCIAT_Total aligns with the corresponding sii categories by calculating its minimum and maximum for each sii category.
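One way to run this check is a groupby on sii. The sketch below uses a small made-up DataFrame named `toy` as a stand-in for `train` (its values are invented for illustration and are not results from the competition data).

```python
import pandas as pd

# Hypothetical stand-in for the train DataFrame; values are made up
toy = pd.DataFrame({
    "PCIAT-PCIAT_Total": [0, 30, 31, 49, 50, 79, 80, 93, None],
    "sii": [0, 0, 1, 1, 2, 2, 3, 3, None],
})

# Minimum and maximum total score within each sii level.
# groupby drops rows whose sii is missing by default.
score_range = toy.groupby("sii")["PCIAT-PCIAT_Total"].agg(["min", "max"])
print(score_range)
```

On the real data the same call would be `train.groupby('sii')['PCIAT-PCIAT_Total'].agg(['min', 'max'])`; if the definition of sii holds, each level's range should stay within the thresholds listed earlier.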

posted @ 2025-01-10 22:35  HaibaraYuki