International Contest 2016, Problem C

Posted on 2019-11-07 17:35  Volcano3511

2016 MCM Problem C The Goodgrant Challenge
The Goodgrant Foundation

Problem Summary

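The Goodgrant Foundation is a charitable organization that aims to improve the educational performance of undergraduates attending colleges and universities in the United States. The foundation plans to donate US$100 million per year for five years, starting in July 2016. In doing so, it does not want to duplicate the investments of other large grant organizations.
Your team has been asked by the Goodgrant Foundation to build a model that determines an optimal investment strategy: which schools to fund, the investment amount per school, the return on that investment, and how long the funding should last, so as to have the highest likelihood of producing a strong positive effect on student performance.
The strategy should contain a 1-to-N optimized and prioritized list of recommended schools, based on each candidate school's demonstrated potential for effective use of private funding and on an estimated return on investment (ROI) defined in a manner appropriate for a charitable organization such as the Goodgrant Foundation.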

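To assist your work, the attached data file (problemcdata.zip) contains information extracted from the U.S. National Center on Education Statistics (www.nces.ed.gov), which maintains an extensive database of survey information on nearly all colleges and universities in the United States, and from the College Scorecard data set, which contains various institutional performance data. Your model and subsequent strategy must be based on some meaningful and defendable subset of these two data sets.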

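In addition to the required one-page summary for your MCM submission, your report must include a letter to the Chief Financial Officer (CFO) of the Goodgrant Foundation, Mr. Alpha Chiang, that describes the optimal investment strategy, your modeling approach and major results, and briefly discusses your proposed concept of return on investment (ROI) that the Goodgrant Foundation should adopt for assessing the 2016 donations and future philanthropic educational investments within the United States. This letter should be no more than two pages long.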

An Educational Donation Mechanism Based On Data Insight

Summary

In recent years, Big Data has become increasingly popular, and guidance from big data is required in many fields, including the charitable field. In our paper, we construct a new ROI evaluation system for charitable organizations, using data mining methods to process the data, and succeed in determining an optimal investment strategy for the Goodgrant Foundation.

First, we process the data. We screen the data according to the integrity and redundancy of the information, deleting records whose information falls below a threshold and merging different attributes using linear fitting and PCA. For the retained attributes and schools, we impute the missing data based on K-means clustering. Then we normalize all the data to make them comparable in the following analysis.

Second, we construct ROI evaluation criteria: the ratio of output to input, multiplied by an adjustment coefficient named "urgency". The ratio reflects the benefits relative to the cost, while urgency reflects the demand for money, an important factor that a charitable organization should consider. We use PCA to select attributes, letting salary, education quality and some others represent output, tuition represent input, and federal loans, debt and some others represent urgency. Then AHP is used to measure the relative importance of the different factors and to allocate weights.
Note: there are three main quantities (output, input and the urgency coefficient); the ROI is the output-to-input ratio multiplied by the coefficient.

Third, we put forward two models: a basic model for one year and a time series model for five years. Treating ROI as the benefit of an investment, we introduce the fluctuation of output as "risk", borrowing the concept of Modern Portfolio Theory from the financial sector. In the basic model, we develop a Mixed Integer Linear Programming (MILP) algorithm and succeed in finding 14 schools for investment. We then take the time factor into account and extend the model into a time series model, using MILP and Grey Prediction to determine the long-term investment strategy. 16 schools are chosen, with different durations and different amounts of money.

Finally, we perform a sensitivity analysis of our model, varying the number of schools, the restrictions on money, whether the money is allocated equally, and so on, to analyze how the output changes and to find better parameters for ideal results.
To sum up, our model is feasible and reasonable, with technical and data support. Because of its subjectivity, the model can be used flexibly after data training.

Main Text

Introduction

1.1 Background
A decade ago, you could not have imagined that the page views of "Facebook" could exceed millions in one minute; you could not have imagined that when you open "Google Maps", all the travel information is already in the palm of your hand; you could not have imagined that through data mining you could gain insight into the development trend of an enterprise to guide your investment. Nowadays big data has penetrated our work and lives and has brought huge changes. In turn, it has become particularly important for us to find useful information in the mass of data to guide our work and life.

As a special column in "The New York Times" in February 2012 put it, the Big Data Era has come. In commercial, economic and other areas, decisions will increasingly be made based on data and analysis, rather than on experience and intuition. The field of charity is in the same situation. In the past, it was more difficult to give money away intelligently than to earn it in the first place.[1] We did not know how to do charitable giving rationally, so calculating ROI was out of the question. But now new and faster information could make charitable giving more effective and efficient. Moreover, it makes it possible to link charitable giving issues with investment issues in the financial sector.

This article is about charitable donations to universities in the U.S. We aim to design a system for measuring return on investment, based on large quantities of data and data mining methods. To solve the problem, we use Portfolio Theory, Linear Programming, Grey Theory and some other methods to determine the optimal strategy over the time dimension.

1.2 Overview of Our Work
First, we identify a few key points in this problem:
• The volume of data is large and the data are of different types. How should the data be normalized?
• The files contain many missing values, so the affected records carry less information and are not easy to batch process.
• How should the large number of university attributes be classified?
• Different attributes focus on different aspects. How should their importance be judged?
• How should schools be chosen from the candidate list, and how should the investment amount be allocated among them?
• The charitable investment process lasts for 5 years, so time is an important factor that will influence our ROI criteria.
On the basis of the above discussion, to determine the optimal investment strategy, we boil the tasks down to the following four steps:
• First, we screen the data according to the integrity and usefulness of the information. For the retained attributes and schools, we use K-means clustering to fill in the missing data. We then normalize all the data.
• Second, we use PCA to choose and classify the different attributes. Then we use AHP and knowledge of finance to construct the ROI concept, and we use this ROI concept to process the data and rank the candidate schools.
• Third, we introduce the concept of Modern Portfolio Theory and construct two models. In the basic model, we develop a Mixed Integer Linear Programming algorithm to determine an optimal investment strategy. We then take the time factor into account and extend the model into a timing model, using a Dynamic Programming algorithm and Grey Prediction to determine the long-term investment strategy.
• Fourth, we analyze and discuss the model further.

1.3 Assumptions

  • Ignore inflation, deflation and any other time value of money. The value of money remains unchanged.
  • As a charitable organization, our aim is to improve educational performance and to achieve greater social benefits rather than to gain profit.
  • Our charitable donation is a comprehensive scholarship, not a reward for outstanding contributions in specific fields.
  • If our goals and strategies differ from those of other charitable organizations, we believe this reduces the possibility of duplicating investments to a great extent.
  • The object of evaluation and scholarship granting is the school, not the individual, though some individual information is included in our criteria. It is the school's business to allocate the money.
  • We focus on the fairness of our strategy, so as to invest in more schools regardless of their reputation.
  • Because of marginal utility, we try not to invest a large amount of money in one school.
  • In the timing model, we assume that the influence of factors excluded from the model can be ignored. That is to say, the future is predictable.

2 Data Processing

2.1 Data Screening
The amount of raw data is large, so we first screen the data according to the integrity and usefulness of the information.
First, we screen the 7805 schools:
• We only consider the 2978 candidate schools in the file Problem C - IPEDS UID for Potential Candidate Schools, and match the schools with their 95 attributes in the file Problem C - Most Recent Cohorts Data (Scorecard Elements).xlsx.
• Delete the schools that are not currently operating institutions, those on Heightened Cash Monitoring 2 by the Department of Education (meaning that they face economic difficulties and lack students), and those with no or very limited information about the percentage of degrees awarded. It is meaningless to invest in these schools.
• Delete the schools for which 50% of the attributes are "NULL". If the percentage of missing data exceeds 50%, imputation will result in large errors, so we treat 50% as the threshold for missing data.[2]

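Note: screening first removes the schools that have almost no chance of being funded, which reduces the amount of data,
and then the schools with too little information, which makes the data more reasonable and consistent.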
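
The missing-data rule above can be sketched in a few lines of Python. This is only an illustration: the record layout, the attribute names and the three toy schools are hypothetical, and the choice to drop a school that sits exactly at 50% is an assumption, since the text does not say how the boundary case is handled.

```python
def missing_fraction(record):
    """Share of attribute values that are missing ("NULL" or None)."""
    values = list(record.values())
    missing = sum(1 for v in values if v is None or v == "NULL")
    return missing / len(values)


def screen_schools(schools, threshold=0.5):
    """Keep schools whose missing share is strictly below the threshold.

    Assumption: a school at exactly the threshold is dropped.
    """
    return {name: rec for name, rec in schools.items()
            if missing_fraction(rec) < threshold}


# Hypothetical toy records, not taken from the contest data.
schools = {
    "A": {"net_price": 12000, "retention": 0.81, "debt": "NULL", "pell": 0.4},
    "B": {"net_price": "NULL", "retention": "NULL", "debt": "NULL", "pell": 0.3},
    "C": {"net_price": 9000, "retention": "NULL", "debt": "NULL", "pell": 0.5},
}
kept = screen_schools(schools)
print(sorted(kept))
```

Here school A is missing one of four attributes and is kept, while B (three of four) and C (two of four) are removed.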

Second, we screen the 95 attributes:
• We combine some binary variables. For example, we delete the net price for different income classes and combine the total net price for public and private schools into one attribute, named "Net Price". And we use the total retention rate for weighting to combine the retention rates for full-time and part-time students into one attribute, named "Retention Rate".
• Though a large share of the data is missing for "SAT scores" and "ACT scores", we retain their midpoints as categorical variables. High entrance requirements in SAT or ACT scores indicate higher education quality, so in the ROI evaluation system that follows we treat such schools differently from schools with low entrance requirements.
• Besides, we retain all the flag attributes, such as "flag for Historically Black College and University" and "flag for women-only college", and all the subject attributes, such as "percentage of degrees awarded in Architecture And Related Services", and use them for the clustering and data imputation in the next step.
• We delete location, age of students and some other information that contributes little to our ROI evaluation system.

Note: attributes are merged by weighting and similar means, for example the retention rates of full-time and part-time students.

Note: the analysis reduces and simplifies the set of attributes, which makes it easier to consider the different attributes when building the model later.

After data screening, about 2700 schools remain in the candidate list, each with about 60 attributes.

2.2 Data Imputation
In statistics, missing data occur when no data value is stored for a variable in an observation. Missing data are common and can have a significant effect on the conclusions that can be drawn from the data.

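K-means clustering
Many methods can be used for data imputation, such as listwise deletion, mean imputation, regression imputation and multiple imputation. Here we use K-means clustering: we first group similar schools using the complete and meaningful data in the file, and then use the means of these similar schools to fill in the missing data.[2] It can be seen as an improved form of mean imputation.
Given a set of observations
(x1, x2, . . . , xn)
Determining the value of K is difficult, so we use the value of R2 to choose it.

A better K is one for which the between-cluster sum of squares takes a larger share and the within-cluster sum of squares a smaller share. So we choose
the K value with the largest R2.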
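
As a rough illustration of the criterion, the sketch below computes R2 as one minus the within-cluster sum of squares divided by the total sum of squares, which is the same as the between-cluster share. The function name and the four toy points are hypothetical, and the notes do not show the exact formula the authors used, so this definition is an assumption.

```python
def r_squared(points, labels):
    """Between-cluster sum of squares as a share of the total sum of squares."""
    dim = len(points[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(points)
    total = sum(sq_dist(p, grand_mean) for p in points)
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean(members)
        within += sum(sq_dist(p, center) for p in members)
    return 1.0 - within / total


# Hypothetical toy points: two tight groups far apart along the first axis.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]   # groups the nearby points together
bad = [0, 1, 0, 1]    # mixes the two groups

print(r_squared(points, good), r_squared(points, bad))
```

The grouping that keeps nearby points together scores much higher than the one that mixes them. Note that for the best partition at each K this ratio cannot decrease as K grows, so comparing values across K usually means looking for the point where the gain levels off; how the authors handled this is not stated in these notes.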
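Schools in the same cluster are assumed to be highly similar.
Schools with no missing data are used to fill the corresponding columns of the other schools in the same cluster.

Important data that need to be filled

Basis for the data imputation

These data are complete and significant, so they can guide the imputation. They describe the characteristics of the schools well and reveal the similarities and differences between schools.

K-means clustering is run only once for this problem; it is not repeated.

A cluster count of 5 turns out to be best. Clustering results:

Next, the blanks within each cluster are filled, which gives attribute values for all schools.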

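The K-means algorithm first picks K objects at random as the initial cluster centers. It then computes the distance between each object and each seed cluster center and assigns each object to its nearest center. A cluster center together with the objects assigned to it represents one cluster. Once all objects have been assigned, the center of each cluster is recomputed from the objects currently in it. This process repeats until a termination condition is met. The termination condition can be that no (or a minimum number of) objects are reassigned to a different cluster, that no (or a minimum number of) cluster centers change, or that the sum of squared errors reaches a local minimum.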
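
A minimal sketch of the procedure just described, followed by cluster-mean imputation, might look like the following. It is written in plain Python for readability and is not the authors' implementation: the function names, the fixed random seed, the toy features and the "salary" column are all hypothetical, and the stopping rule used here is "no object changes cluster".

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means: returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # stop when no point changes cluster
            break
        labels = new_labels
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def impute_by_cluster(column, labels):
    """Replace None entries with the mean of observed values in the same cluster."""
    filled = list(column)
    for i, value in enumerate(column):
        if value is None:
            peers = [v for v, lab in zip(column, labels)
                     if lab == labels[i] and v is not None]
            if peers:  # stays None if the whole cluster lacks this attribute
                filled[i] = sum(peers) / len(peers)
    return filled


# Hypothetical toy data: two complete features per school ...
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
# ... and one attribute with gaps to be filled.
salary = [30.0, None, 34.0, 60.0, 64.0, None]

labels = kmeans(features, k=2)
print(impute_by_cluster(salary, labels))
```

Clustering runs only on the complete features, and each gap is then filled from schools in the same cluster, which is why the approach can be read as a refinement of plain mean imputation.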

2.3 Data Normalization
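After filling in all the missing data, we normalize the data to facilitate further analysis. Normalization provides a way to compare different kinds of data and to reflect the combined result of different factors.

Here we use min-max normalization, a linear transformation of the raw data that maps the data set x = {x1, x2, . . . , xn} to [0, 1].

All data below are normalized.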
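
For reference, min-max normalization maps each value x_i to (x_i - min) / (max - min). The short sketch below applies it to one column; the function name is hypothetical, and mapping a constant column to zeros is an assumption made here to avoid dividing by zero.

```python
def min_max(values):
    """Linearly rescale a list of numbers to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column of raw values.
print(min_max([10, 20, 40]))
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.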

3 ROI Evaluation System

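3.1 Concept of ROI
Establish the basic ROI formula

Analyze the three variables in detail

  • OUTPUT
    It consists of four elements:
    Salary After Graduation, Retention Rate, Repayment Ability and Education Enhance Rate
    The reasons for choosing each element follow.

First, four weakly correlated elements can each serve as a factor affecting output; using PCA, the single best attribute is chosen to represent the several attributes highly correlated with it.
Second, they must be closely related to output.
Each is then analyzed in detail:
1/2/3/4
AHP is used to judge the relative importance of the four factors.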
Finally, the comprehensive calculation formula is described as:
$$\text{Output} = \begin{cases} V_1 \cdot (SAG,\ RR,\ RA,\ EER)^{T} \\ V_2 \cdot (SAG,\ RR,\ RA)^{T} \end{cases}$$

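Considering that evaluating the importance of these four indicators is a subjective judgment, weights determined by a subjective weighting method are more scientifically reasonable.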
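
To make the AHP weighting step concrete, the sketch below derives weights from a pairwise comparison matrix by power iteration toward its principal eigenvector, and also reports the consistency index (lambda_max - n) / (n - 1). The comparison matrix is hypothetical: it is built from made-up importance scores 4 : 3 : 2 : 1 for the four output factors and is not the matrix used in the paper.

```python
def ahp_weights(matrix, iterations=100):
    """Weights from a pairwise comparison matrix via power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w_next = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w_next)
        w = [x / total for x in w_next]  # keep the weights summing to 1
    # Estimate lambda_max from A w = lambda w, averaged over the rows.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return w, consistency_index


# Hypothetical judgments: entry [i][j] says how much more important factor i
# is than factor j. Built from made-up scores so the matrix is fully consistent.
scores = [4.0, 3.0, 2.0, 1.0]  # stand-ins for SAG, RR, RA, EER
comparison = [[si / sj for sj in scores] for si in scores]

weights, ci = ahp_weights(comparison)
print([round(x, 3) for x in weights], round(ci, 6))
```

Because this toy matrix is perfectly consistent, the weights simply recover the normalized scores and the consistency index is zero. With real expert judgments the index is usually positive and is checked against a tolerance before the weights are accepted.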

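  • INPUT
    Input alone would have too large an effect on the output.

Input = (1 − α) + α × NP

In economics, input and output alone would be enough, but in our problem recipients differ in how much they need donations, so a correction parameter, Urgency, has to be introduced.
It covers three aspects: Pell Grants, Federal Loan, Debt

  1. Pell Grants: how much the students need the money
  2. FL:
  3. Debt: how reliable the school is
    So we choose "GRAD_DEBT_MDN_SUPP", median debt of completers, suppressed for n=30, to represent Debt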


Although these three parameters look similar, PCA shows that they are independent.

On the basis of the discussion above, we draw a picture to visually represent our ROI evaluation system.
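
Putting the pieces together, the ROI described in the summary (output divided by input, multiplied by urgency) could be computed as in the sketch below. Several parts are assumptions rather than the paper's definitions: NP is taken to be the normalized Net Price, urgency is taken to be a weighted sum of the three normalized urgency factors, and every weight and input value is made up for illustration.

```python
def weighted_sum(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def roi(output_factors, output_weights, net_price, alpha,
        urgency_factors, urgency_weights):
    """ROI = Output / Input * Urgency on normalized data."""
    output = weighted_sum(output_weights, output_factors)
    # Input = (1 - alpha) + alpha * NP, as given in the text.
    input_value = (1 - alpha) + alpha * net_price
    # Assumption: urgency is a weighted sum of its three factors.
    urgency = weighted_sum(urgency_weights, urgency_factors)
    return output / input_value * urgency


# Hypothetical normalized values for one school.
value = roi(
    output_factors=[0.8, 0.6, 0.5, 0.4],   # SAG, RR, RA, EER
    output_weights=[0.4, 0.3, 0.2, 0.1],   # made-up AHP-style weights
    net_price=0.5,
    alpha=0.5,
    urgency_factors=[0.6, 0.3, 0.6],       # Pell Grants, Federal Loan, Debt
    urgency_weights=[1 / 3, 1 / 3, 1 / 3],
)
print(round(value, 4))
```

One consequence of the input formula is worth noting: with NP normalized to [0, 1] and α below 1, the input stays within [1 − α, 1], so the ratio never divides by zero and a very cheap school cannot dominate the ranking through the denominator alone.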

3.2 Using Grey Theory to predict ROI

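Grey prediction is a method for forecasting systems that contain uncertain factors. It identifies how much the development trends of the system's factors differ (that is, it performs a relational analysis) and applies a generating operation to the raw data to find the regularity in the system's variation, producing a data series with stronger regularity; it then builds the corresponding differential equation model to forecast the future trend. A grey prediction model is constructed from a series of values, observed at equal time intervals, that reflect the characteristics of the object being forecast; it predicts the characteristic quantity at some future time, or the time at which a given characteristic quantity will be reached.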
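
The notes do not say which grey model the authors used. A common choice is GM(1,1), and the sketch below assumes it: the series is accumulated, the parameters a and b of the grey differential equation x0(k) + a·z1(k) = b are fitted by least squares, and forecasts are recovered by differencing the fitted accumulated series. The function names and the sample series are hypothetical.

```python
import math


def gm11_fit(x0):
    """Least-squares estimate of (a, b) in x0[k] = -a * z1[k] + b."""
    # Accumulated generating operation: running sum of the raw series.
    x1, total = [], 0.0
    for v in x0:
        total += v
        x1.append(total)
    # Background values: means of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, len(x0))]
    y = x0[1:]
    u = [-v for v in z]  # regressor multiplying a
    m = len(y)
    su, sy = sum(u), sum(y)
    suu = sum(v * v for v in u)
    suy = sum(p * q for p, q in zip(u, y))
    a = (m * suy - su * sy) / (m * suu - su * su)
    b = (sy - a * su) / m
    return a, b


def gm11_predict(x0, steps):
    """Forecast the next `steps` values of the raw series."""
    a, b = gm11_fit(x0)

    def x1_hat(k):  # fitted accumulated series, k = 0, 1, 2, ...
        return (x0[0] - b / a) * math.exp(-a * k) + b / a

    n = len(x0)
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]


# Hypothetical series growing by 10% per period.
history = [100.0, 110.0, 121.0, 133.1, 146.41]
print(gm11_predict(history, steps=2))
```

For this toy series the first forecast lands close to the true continuation of 161.051, which reflects the fact that GM(1,1) is designed for short, roughly exponential series. It divides by a, so it is not suitable for a series with no trend at all.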