Preparing Data for SQL:

Sometimes we would like to stroe data into SQL server. However , the dataset need to be cleaned before it is sent. So here we use pandas to deal with dataset(.csv)file.

1. read_csv, set encoding:

  file = pd.read_csv("academy_awards.csv",encoding = 'ISO-8859-1')

2. Use str function to read first 4 letters of all the strings in the column.

  file["Year"] = file["Year"].str[0:4]

3. Use .isin function to get the target rows I need:

  award_categories = ["Actor -- Leading Role","Actor -- Supporting Role",'Actress -- Leading Role','Actress -- Supporting Role']

  nominations = later_than_2000[later_than_2000.isin(award_categories)['Category']]

4. Use .map() function to replace all the element in the column as I need:

  won_dic = {
  'NO' : 0,
  'YES': 1
  }
  nominations.is_copy = False #Attention, here we can not directly modify the copied dataframe, we have to run this line to make copied dataframe changable.
  nominations["Won?"] = nominations["Won?"].map(won_dic)

5. Use .drop() function to get rid of columns I do not need:

  final_nominations = nominations.drop(delete_list,axis = 1)

6. Use vectorized string method to modify each string in a column in the dataframe:

  additional_info_one = final_nominations["Additional Info"].str.rstrip("'}") #rstrip is to get rid of all the strings on the right side of the target string in the bracket.
  additional_info_two = additional_info_one.str.split("{.")  
  movie_names = additional_info_two.str[0]
  characters = additional_info_two.str[1]

7.  Use to_sql request to save the dataset into the sql:

  final_nominations.to_sql("nominations",conn,index = False)

posted on 2016-11-24 05:31  阿难1020  阅读(113)  评论(0编辑  收藏  举报