Machine Learning Engineer - Udacity Project 0: Predict Your Next World Cuisine

Step 1. Download and import the data

1.1 Dataset: https://www.kaggle.com/c/whats-cooking/data

1.2 Load the data

# Import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the datasets (read_json opens and closes the files itself)
train_filename = 'train.json'
train_content = pd.read_json(train_filename, encoding='utf-8')

test_filename = 'test.json'
test_content = pd.read_json(test_filename, encoding='utf-8')

# Print the number of loaded samples
print("The cuisine dataset contains {} training samples and {} test samples.\n".format(len(train_content), len(test_content)))
if len(train_content) == 39774 and len(test_content) == 9944:
    print("Data loaded successfully!")
else:
    print("There was a problem loading the data; please check the file paths!")

The cuisine dataset contains 39774 training samples and 9944 test samples.
Data loaded successfully!

1.3 Preview the data
To get a sense of how the data is distributed and which cuisine categories it covers, we print a few sample records.

pd.set_option('display.max_colwidth',120)

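Programming exercise
Use the head() function to preview the training set train_content (print the first 5 rows).

### TODO: print the first 5 samples of train_content to preview the data
print(train_content.head())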

cuisine id \
0 greek 10259
1 southern_us 25693
2 filipino 20130
3 indian 22213
4 indian 13162

ingredients
0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...

## List all cuisine categories
categories = np.unique(train_content['cuisine'])
print("There are {} cuisines in total:\n{}".format(len(categories), categories))

There are 20 cuisines in total:
['brazilian' 'british' 'cajun_creole' 'chinese' 'filipino' 'french' 'greek'
'indian' 'irish' 'italian' 'jamaican' 'japanese' 'korean' 'mexican'
'moroccan' 'russian' 'southern_us' 'spanish' 'thai' 'vietnamese']

 

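Step 2. Analyze the data
Because the final goal of this project is to build a model that predicts the world cuisine of a dish, we need to split the dataset into features and a target variable.

Feature: 'ingredients', the list of ingredient names each dish contains.
Target variable: 'cuisine', the cuisine category we want to predict.
They are stored in the variables train_ingredients and train_targets, respectively.

Programming exercise: data extraction
Assign the ingredients column of train_content to train_ingredients
Assign the cuisine column of train_content to train_targets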

### TODO: assign the features and the target variable
train_ingredients = train_content['ingredients']
train_targets = train_content['cuisine']

### TODO: print the results to check that the assignment is correct
print(train_ingredients)
print(train_targets)

0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...
5 [plain flour, sugar, butter, eggs, fresh ginger root, salt, ground cinnamon, milk, vanilla extract, ground ginger, p...
6 [olive oil, salt, medium shrimp, pepper, garlic, chopped cilantro, jalapeno chilies, flat leaf parsley, skirt steak,...
7 [sugar, pistachio nuts, white almond bark, flour, vanilla extract, olive oil, almond extract, eggs, baking powder, d...
8 [olive oil, purple onion, fresh pineapple, pork, poblano peppers, corn tortillas, cheddar cheese, ground black peppe...
9 [chopped tomatoes, fresh basil, garlic, extra-virgin olive oil, kosher salt, flat leaf parsley]
10 [pimentos, sweet pepper, dried oregano, olive oil, garlic, sharp cheddar cheese, pepper, swiss cheese, provolone che...
11 [low sodium soy sauce, fresh ginger, dry mustard, green beans, white pepper, sesame oil, scallions, canola oil, suga...
12 [Italian parsley leaves, walnuts, hot red pepper flakes, extra-virgin olive oil, fresh lemon juice, trout fillet, ga...
13 [ground cinnamon, fresh cilantro, chili powder, ground coriander, kosher salt, ground black pepper, garlic, plum tom...
14 [fresh parmesan cheese, butter, all-purpose flour, fat free less sodium chicken broth, chopped fresh chives, gruyere...
15 [tumeric, vegetable stock, tomatoes, garam masala, naan, red lentils, red chili peppers, onions, spinach, sweet pota...
16 [greek yogurt, lemon curd, confectioners sugar, raspberries]
17 [italian seasoning, broiler-fryer chicken, mayonaise, zesty italian dressing]
18 [sugar, hot chili, asian fish sauce, lime juice]
19 [soy sauce, vegetable oil, red bell pepper, chicken broth, yellow squash, garlic chili sauce, sliced green onions, b...
20 [pork loin, roasted peanuts, chopped cilantro fresh, hoisin sauce, creamy peanut butter, chopped fresh mint, thai ba...
21 [roma tomatoes, kosher salt, purple onion, jalapeno chilies, lime, chopped cilantro]
22 [low-fat mayonnaise, pepper, salt, baking potatoes, eggs, spicy brown mustard]
23 [sesame seeds, red pepper, yellow peppers, water, extra firm tofu, broccoli, soy sauce, orange bell pepper, arrowroo...
24 [marinara sauce, flat leaf parsley, olive oil, linguine, capers, crushed red pepper flakes, olives, lemon zest, garlic]
25 [sugar, lo mein noodles, salt, chicken broth, light soy sauce, flank steak, beansprouts, dried black mushrooms, pepp...
26 [herbs, lemon juice, fresh tomatoes, paprika, mango, stock, chile pepper, onions, red chili peppers, oil]
27 [ground black pepper, butter, sliced mushrooms, sherry, salt, grated parmesan cheese, heavy cream, spaghetti, chicke...
28 [green bell pepper, egg roll wrappers, sweet and sour sauce, corn starch, molasses, vegetable oil, oil, soy sauce, s...
29 [flour tortillas, cheese, breakfast sausages, large eggs]
...
39744 [extra-virgin olive oil, oregano, potatoes, garlic cloves, pepper, salt, yellow mustard, fresh lemon juice]
39745 [quinoa, extra-virgin olive oil, fresh thyme leaves, scallion greens]
39746 [clove, bay leaves, ginger, chopped cilantro, ground turmeric, white onion, cinnamon, cardamom pods, serrano chile, ...
39747 [water, sugar, grated lemon zest, butter, pitted date, blanched almonds]
39748 [sea salt, pizza doughs, all-purpose flour, cornmeal, extra-virgin olive oil, shredded mozzarella cheese, kosher sal...
39749 [kosher salt, minced onion, tortilla chips, sugar, tomato juice, cilantro leaves, avocado, lime juice, roma tomatoes...
39750 [ground black pepper, chicken breasts, salsa, cheddar cheese, pepper jack, heavy cream, red enchilada sauce, unsalte...
39751 [olive oil, cayenne pepper, chopped cilantro fresh, boneless chicken skinless thigh, fine sea salt, low salt chicken...
39752 [self rising flour, milk, white sugar, butter, peaches in light syrup]
39753 [rosemary sprigs, lemon zest, garlic cloves, ground black pepper, vegetable broth, fresh basil leaves, minced garlic...
39754 [jasmine rice, bay leaves, sticky rice, rotisserie chicken, chopped cilantro, large eggs, vegetable oil, yellow onio...
39755 [mint leaves, cilantro leaves, ghee, tomatoes, cinnamon, oil, basmati rice, garlic paste, salt, coconut milk, clove,...
39756 [vegetable oil, cinnamon sticks, water, all-purpose flour, piloncillo, salt, orange zest, baking powder, hot water]
39757 [red bell pepper, garlic cloves, extra-virgin olive oil, feta cheese crumbles]
39758 [milk, salt, ground cayenne pepper, ground lamb, ground cinnamon, ground black pepper, pomegranate, chopped fresh mi...
39759 [red chili peppers, sea salt, onions, water, chilli bean sauce, caster sugar, garlic, white vinegar, chili oil, cucu...
39760 [butter, large eggs, cornmeal, baking powder, boiling water, milk, salt]
39761 [honey, chicken breast halves, cilantro leaves, carrots, soy sauce, Sriracha, wonton wrappers, freshly ground pepper...
39762 [curry powder, salt, chicken, water, vegetable oil, basmati rice, eggs, finely chopped onion, lemon juice, pepper, m...
39763 [fettuccine pasta, low-fat cream cheese, garlic, nonfat evaporated milk, grated parmesan cheese, corn starch, nonfat...
39764 [chili powder, worcestershire sauce, celery, red kidney beans, lean ground beef, stewed tomatoes, dried parsley, pep...
39765 [coconut, unsweetened coconut milk, mint leaves, plain yogurt]
39766 [rutabaga, ham, thick-cut bacon, potatoes, fresh parsley, salt, onions, pepper, carrots, pork sausages]
39767 [low-fat sour cream, grated parmesan cheese, salt, dried oregano, low-fat cottage cheese, butter, onions, olive oil,...
39768 [shredded cheddar cheese, crushed cheese crackers, cheddar cheese soup, cream of chicken soup, hot sauce, diced gree...
39769 [light brown sugar, granulated sugar, butter, warm water, large eggs, all-purpose flour, whole wheat flour, cooking ...
39770 [KRAFT Zesty Italian Dressing, purple onion, broccoli florets, rotini, pitted black olives, Kraft Grated Parmesan Ch...
39771 [eggs, citrus fruit, raisins, sourdough starter, flour, hot tea, sugar, ground nutmeg, salt, ground cinnamon, milk, ...
39772 [boneless chicken skinless thigh, minced garlic, steamed white rice, baking powder, corn starch, dark soy sauce, kos...
39773 [green chile, jalapeno chilies, onions, ground black pepper, salt, chopped cilantro fresh, green bell pepper, garlic...
Name: ingredients, Length: 39774, dtype: object
0 greek
1 southern_us
2 filipino
3 indian
4 indian
5 jamaican
6 spanish
7 italian
8 mexican
9 italian
10 italian
11 chinese
12 italian
13 mexican
14 italian
15 indian
16 british
17 italian
18 thai
19 vietnamese
20 thai
21 mexican
22 southern_us
23 chinese
24 italian
25 chinese
26 cajun_creole
27 italian
28 chinese
29 mexican
...
39744 greek
39745 spanish
39746 indian
39747 moroccan
39748 italian
39749 mexican
39750 mexican
39751 moroccan
39752 southern_us
39753 italian
39754 vietnamese
39755 indian
39756 mexican
39757 greek
39758 greek
39759 korean
39760 southern_us
39761 chinese
39762 indian
39763 italian
39764 mexican
39765 indian
39766 irish
39767 italian
39768 mexican
39769 irish
39770 italian
39771 irish
39772 chinese
39773 mexican
Name: cuisine, Length: 39774, dtype: object

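Programming exercise: basic statistics
Which 10 ingredients are used most frequently overall?
Which 10 ingredients are most common in Italian cuisine?

## TODO: count how often each ingredient occurs and store the counts in the dictionary sum_ingredients
m = []
for row in train_ingredients:
    m += row
sum_ingredients = pd.Series(m).value_counts().to_dict()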

or:

from collections import defaultdict
sum_ingredients = defaultdict(int)
for row in train_ingredients:
    for item in row:
        sum_ingredients[item] += 1
sum_ingredients = dict(sum_ingredients)
# Finally, plot the 10 most used ingredients
plt.style.use('ggplot')
ax = pd.Series(sum_ingredients).sort_values(ascending=False)[:10].plot(kind='barh')
ax.invert_yaxis()
fig = ax.get_figure()
fig.tight_layout()

## TODO: count ingredient occurrences within Italian cuisine and store the counts in the dictionary italian_ingredients
list_italian = train_content.loc[train_content['cuisine'].isin(['italian'])]['ingredients'].reset_index(drop=True)
n = []
for j in range(len(list_italian)):
    n += list_italian[j]
italian_ingredients = pd.Series(n).value_counts().to_dict()

or:

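italian_ingredients = {}  # start from an empty dictionary so counts are not added on top of an earlier result
cuisine_ingredients = zip(train_targets, train_ingredients)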
for cuisine, ingredients in cuisine_ingredients:
    if cuisine == 'italian':
        for item in ingredients:
            if item in italian_ingredients:
                italian_ingredients[item] += 1
            else:
                italian_ingredients[item] = 1
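
Both counting exercises can also be written with collections.Counter from the standard library. The snippet below is a minimal sketch on made-up data: toy_targets and toy_ingredients are hypothetical stand-ins for train_targets and train_ingredients, and most_common(n) plays the role of "top n".

```python
from collections import Counter

# Toy stand-ins for train_targets / train_ingredients (hypothetical data).
toy_targets = ["italian", "mexican", "italian"]
toy_ingredients = [
    ["salt", "olive oil", "garlic"],
    ["salt", "corn tortillas"],
    ["olive oil", "garlic", "fresh basil"],
]

# Count every ingredient across all recipes.
all_counts = Counter(item for row in toy_ingredients for item in row)

# Count only the ingredients that belong to Italian recipes.
italian_counts = Counter(
    item
    for cuisine, row in zip(toy_targets, toy_ingredients)
    if cuisine == "italian"
    for item in row
)

# most_common(n) returns the n highest counts as (ingredient, count) pairs.
print(all_counts.most_common(2))
print(italian_counts.most_common(2))
```

On the real data, dict(all_counts) would give the same kind of dictionary as sum_ingredients, and most_common(10) would answer the two questions of the exercise.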

 

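Step 3. Build the model

3.1 Word cleaning
Dishes contain many ingredients, and the same ingredient can appear in different surface forms (singular vs. plural, different tenses, and so on). To reduce these differences, we normalize the ingredients text.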

import re
from nltk.stem import WordNetLemmatizer
import numpy as np

def text_clean(ingredients):
    # Replace punctuation and every other non-letter character with a space, keeping only a..z and A..Z
    ingredients = np.array(ingredients).tolist()
    print("Ingredients of one dish:\n{}".format(ingredients[9]))
    ingredients = [[re.sub('[^A-Za-z]', ' ', word) for word in component] for component in ingredients]
    print("After removing punctuation:\n{}".format(ingredients[9]))

    # Lemmatize each word. WordNetLemmatizer treats words as nouns by default (pos='n'),
    # so plurals are reduced ("tomatoes" -> "tomato") but verb forms such as "chopped" stay unchanged.
    # Requires the WordNet corpus: nltk.download('wordnet')
    lemma = WordNetLemmatizer()
    ingredients = [" ".join([" ".join([lemma.lemmatize(w) for w in words.split(" ")]) for words in component]) for component in ingredients]
    print("After lemmatization:\n{}".format(ingredients[9]))
    return ingredients

print("\nProcessing the training set...")
train_ingredients = text_clean(train_content['ingredients'])
print("\nProcessing the test set...")
test_ingredients = text_clean(test_content['ingredients'])

Processing the training set...
Ingredients of one dish:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra-virgin olive oil', 'kosher salt', 'flat leaf parsley']
After removing punctuation:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra virgin olive oil', 'kosher salt', 'flat leaf parsley']
After lemmatization:
chopped tomato fresh basil garlic extra virgin olive oil kosher salt flat leaf parsley

Processing the test set...
Ingredients of one dish:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After removing punctuation:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After lemmatization:
egg cherry date dark muscovado sugar ground cinnamon mixed spice cake vanilla extract self raising flour sultana rum raisin prune glace cherry butter port
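
To see the character-filtering step in isolation, here is a small sketch that uses only the standard library. The helper name strip_non_letters and the sample strings are made up for illustration, and collapsing repeated spaces is an extra step that text_clean above does not perform.

```python
import re

def strip_non_letters(ingredient):
    """Replace every non-letter character with a space, then collapse repeated spaces."""
    spaced = re.sub(r"[^A-Za-z]", " ", ingredient)
    return " ".join(spaced.split())

# Hyphens, digits and symbols all become spaces; plain words pass through unchanged.
cleaned = [strip_non_letters(s) for s in ["extra-virgin olive oil", "2% milk", "kosher salt"]]
print(cleaned)
```

Note that the pattern also removes digits, so an ingredient such as "2% milk" is reduced to "milk".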

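3.2 Feature extraction
In this step we convert each dish's ingredients into a numeric feature vector. Because the vast majority of dishes contain salt, water, sugar, butter, and similar staples, vectors built with one-hot encoding would not separate the cuisines well. Instead we weight each ingredient by how often it occurs: the more often an ingredient appears, the less discriminative it is. The feature we use is TF-IDF; for an introduction, see "TF-IDF and Applications of Cosine Similarity (Part 1): Automatic Keyword Extraction".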
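
The weighting idea can be checked by hand on a tiny corpus. The sketch below computes only the IDF part, using the smoothed form ln((1 + n) / (1 + df)) + 1, which is the form scikit-learn's TfidfVectorizer applies with its default smooth_idf=True. The three recipes and the helper name smoothed_idf are hypothetical.

```python
import math

# Toy recipes, already cleaned into space-separated strings (hypothetical data).
recipes = [
    "salt olive oil basil",
    "salt soy sauce ginger",
    "salt tortilla cumin",
]

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts the documents containing the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

idf_salt = smoothed_idf("salt", recipes)    # appears in every recipe
idf_basil = smoothed_idf("basil", recipes)  # appears in a single recipe
print(idf_salt, idf_basil)
```

"salt" occurs in every recipe and therefore gets the smallest possible weight, while "basil" is rarer and scores higher. The max_df=.57 argument used in this tutorial goes one step further and ignores any term that appears in more than 57% of the recipes.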

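from sklearn.feature_extraction.text import TfidfVectorizer
# Convert the ingredients into feature vectors

# Process the training set
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 1),
                analyzer='word', max_df=.57, binary=False,
                token_pattern=r"\w+", sublinear_tf=False)
# toarray() returns a plain ndarray; todense() returns np.matrix, which recent scikit-learn releases reject
train_tfidf = vectorizer.fit_transform(train_ingredients).toarray()

## Process the test set (reuse the vocabulary fitted on the training set)
test_tfidf = vectorizer.transform(test_ingredients)
train_targets = np.array(train_content['cuisine']).tolist()
train_targets[:10]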

['greek',
'southern_us',
'filipino',
'indian',
'indian',
'jamaican',
'spanish',
'italian',
'mexican',
'italian']

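Programming exercise
To keep errors accumulated in the earlier steps from breaking the steps that follow, we check here that the processed data looks correct. Print the first five entries of train_tfidf and train_targets.

# train_tfidf is a NumPy object and train_targets is a plain list; neither has head(), so preview them with slicing
print(train_tfidf[:5])
print(train_targets[:5])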

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
['greek', 'southern_us', 'filipino', 'indian', 'indian']

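3.3 Splitting off a validation set
To get a rough estimate of the model's accuracy during experiments, we set aside 20% of the original training data as a validation set.

Programming exercise: splitting and shuffling the data
Call train_test_split to divide the training set into a new training set and a validation set, so that model accuracy can be checked later.

Import train_test_split from sklearn.model_selection
Pass train_tfidf and train_targets as the inputs to train_test_split
Set test_size to 0.2, which holds out 20% of the data for validation and keeps 80% as the new training set.
Set random_state to a fixed seed so that every run produces the same split. (With a fixed seed, the generated random sequence is deterministic.)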

### TODO: split off the validation set
from sklearn.model_selection import train_test_split
X_train , X_valid , y_train, y_valid = train_test_split(train_tfidf, train_targets, test_size = 0.2, random_state=0)
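
Why a fixed seed makes the split reproducible can be shown with a few lines of plain Python. This is only a sketch of the idea, not how train_test_split is implemented; the function split_indices and its arguments are made up for illustration.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the indices with a seeded generator and use the first test_size fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_samples * test_size))
    return indices[n_valid:], indices[:n_valid]

# The same seed produces the same shuffle, and therefore the same split.
train_a, valid_a = split_indices(10, 0.2, seed=0)
train_b, valid_b = split_indices(10, 0.2, seed=0)
print(valid_a, valid_a == valid_b)
```

Every sample lands in exactly one of the two parts, and rerunning with the same seed returns the same validation indices.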

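3.4 Build the model
We use the logistic regression model (Logistic Regression) from sklearn.

Programming exercise: train the model

Import LogisticRegression from sklearn.linear_model
Import GridSearchCV from sklearn.model_selection. It searches hyperparameters automatically: given the candidate values, it returns the best-scoring combination and its result. Because every candidate is trained and cross-validated, this approach suits small datasets.
Define parameters: a dictionary for the C parameter whose value is the list of integers from 1 to 10;
Define classifier: a classifier object created from the imported LogisticRegression;
Define grid: a grid search object created from the imported GridSearchCV, with classifier and parameters passed to its constructor;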

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

## TODO: build the logistic regression model
parameters = {'C':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
classifier = LogisticRegression()
grid = GridSearchCV(classifier, parameters)

grid = grid.fit(X_train, y_train)
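
What a grid search does internally can be sketched without scikit-learn: score every candidate value with cross-validation and keep the best one. The example below swaps logistic regression for a tiny 1-D k-nearest-neighbour classifier so that it stays self-contained. The data, the helper names knn_predict and cv_accuracy, and the interleaved fold assignment are all assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(xs, ys, k, n_folds=3):
    """Mean accuracy of k-NN over n_folds interleaved folds."""
    fold_scores = []
    for fold in range(n_folds):
        valid_idx = [i for i in range(len(xs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in valid_idx)
        fold_scores.append(hits / len(valid_idx))
    return sum(fold_scores) / n_folds

# Hypothetical 1-D data: small values are class "a", large values are class "b".
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6]
ys = ["a"] * 6 + ["b"] * 6

# The grid: every candidate value is scored, and the best-scoring one is kept.
param_grid = {"k": [1, 3, 5]}
scores = {k: cv_accuracy(xs, ys, k) for k in param_grid["k"]}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

After grid.fit in the tutorial code, the corresponding information is available as grid.best_params_ (the selected C) and grid.best_score_ (its mean cross-validated score).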

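After training, we compute the model's predictions on the validation set X_valid and measure the prediction accuracy by comparing them with y_valid element by element.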

from sklearn.metrics import accuracy_score  ## computes the model's accuracy

valid_predict = grid.predict(X_valid)
valid_score = accuracy_score(y_valid, valid_predict)

print("Score on the validation set: {}".format(valid_score))

Score on the validation set: 0.7967316153362665
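
The score above is simply the fraction of validation samples whose predicted cuisine equals the true one. A hand-written version of that comparison, with a hypothetical helper accuracy and made-up labels, might look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Three of the four toy predictions match the true labels.
toy_true = ["greek", "indian", "italian", "mexican"]
toy_pred = ["greek", "indian", "french", "mexican"]
toy_score = accuracy(toy_true, toy_pred)
print(toy_score)
```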

 

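Step 4. Model prediction (optional)

4.1 Predict on the test set

Programming exercise
Use the model grid to predict on the test set test_tfidf, then inspect the predictions.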

### TODO: predict the test results
predictions = grid.predict(test_tfidf)

print("Number of test samples predicted: {}".format(len(predictions)))
test_content['cuisine'] = predictions
test_content.head(10)

Number of test samples predicted: 9944

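4.2 Submit the results

## Load the submission format
submit_frame = pd.read_csv("sample_submission.csv")
## Save the results
# Both frames have a 'cuisine' column, so the merge suffixes them: cuisine_x (sample file) and cuisine_y (our predictions)
result = pd.merge(submit_frame, test_content, on="id", how='left')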
result = result.rename(index=str, columns={"cuisine_y": "cuisine"})
test_result_name = "tfidf_cuisine_test.csv"
result[['id','cuisine']].to_csv(test_result_name,index=False)

 

posted on 2018-11-08 21:06  paulonetwo