Data Cleaning 1

1. Read multiple data files:

　　import pandas as pd

　　data_files = [
　　"ap_2010.csv",
　　"class_size.csv",
　　"demographics.csv",
　　"graduation.csv",
　　"hs_directory.csv",
　　"sat_results.csv"
　　]

　　data = {}

　　for f in data_files:
　　　　file = pd.read_csv("schools/{0}".format(f))  # build the file path with str.format()
　　　　f = f.replace(".csv", "")  # strip the ".csv" extension and use the rest as the dictionary key
　　　　data[f] = file
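
The loop above can be tried without the real files. The following is a minimal sketch that assumes two tiny made-up CSV texts held in memory (the `fake_files` dictionary and its contents are hypothetical, not the real school data); it only shows how each file name becomes a dictionary key.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the CSV files on disk.
fake_files = {
    "sat_results.csv": "DBN,score\n01M292,404\n",
    "class_size.csv": "CSD,SCHOOL CODE\n1,M292\n",
}

data = {}
for name, text in fake_files.items():
    frame = pd.read_csv(io.StringIO(text))  # StringIO lets read_csv parse a string
    key = name.replace(".csv", "")          # "sat_results.csv" -> "sat_results"
    data[key] = frame

print(sorted(data))
```

Each value in `data` is an ordinary DataFrame, so `data["sat_results"]` can be used exactly like a DataFrame read from disk.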

2. Read .txt files and combine them with concat():

　　all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")  # delimiter="\t": the columns are separated by tabs; encoding: the character encoding used to decode the file
　　d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding="windows-1252")
　　survey = pd.concat([all_survey, d75_survey], axis=0)  # axis=0 stacks the rows of the second DataFrame below the first
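
To see what `delimiter`, `encoding` and `axis=0` do, here is a small sketch under stated assumptions: the two tab-separated texts and their column names (`dbn`, `school`) are invented for illustration, and they are encoded to bytes in memory instead of being read from the real survey files.

```python
import io
import pandas as pd

# Made-up tab-separated content, encoded as windows-1252 bytes.
all_raw = "dbn\tschool\n01M015\tCafé School\n01M019\tEast School\n".encode("windows-1252")
d75_raw = "dbn\tschool\n75K004\tWest School\n".encode("windows-1252")

all_part = pd.read_csv(io.BytesIO(all_raw), delimiter="\t", encoding="windows-1252")
d75_part = pd.read_csv(io.BytesIO(d75_raw), delimiter="\t", encoding="windows-1252")

# axis=0 appends the rows of d75_part below the rows of all_part.
combined = pd.concat([all_part, d75_part], axis=0)
print(combined.shape)
print(list(combined.index))

# The original row labels are kept, so 0 appears twice.
# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([all_part, d75_part], axis=0, ignore_index=True)
print(list(renumbered.index))
```

Whether to pass `ignore_index=True` depends on whether the old row labels matter later; the notes above keep the default.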

3. apply() function

　　data['hs_directory']["DBN"] = data['hs_directory']["dbn"]  # copy the lowercase "dbn" column into a new column named "DBN"

　　def data_change(data):
　　　　if len(str(data)) == 1:
　　　　　　return '0'+str(data)
　　　　else:
　　　　　　return str(data)

　　data['class_size']['padded_csd'] = data["class_size"]["CSD"].apply(data_change)  # apply() calls the function once for every value in the column, so it can often replace a for loop. When the function takes a single parameter, pass only the function name (without parentheses) to apply().

　　data['class_size']['DBN'] = data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']  # both columns must hold strings; a string column and a numeric column cannot be added together
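
The padding step can be checked on a few rows. This is a minimal sketch on a small made-up DataFrame (the three rows are hypothetical). It also shows `str.zfill`, a built-in string method that pads with zeros, as one possible alternative to the hand-written `if`/`else`.

```python
import pandas as pd

# Hypothetical sample rows, not the real class_size data.
class_size = pd.DataFrame({
    "CSD": [1, 12, 7],
    "SCHOOL CODE": ["M015", "K001", "X123"],
})

def pad_csd(value):
    # Convert to text and left-pad with "0" to a width of 2.
    return str(value).zfill(2)

# apply() runs pad_csd on every value of the column.
class_size["padded_csd"] = class_size["CSD"].apply(pad_csd)

# The same result without apply(), using the vectorized string methods.
vectorized = class_size["CSD"].astype(str).str.zfill(2)

# Both operands are strings, so "+" concatenates them.
class_size["DBN"] = class_size["padded_csd"] + class_size["SCHOOL CODE"]
print(class_size["DBN"].tolist())
```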

4. to_numeric function

　　data['sat_results']["SAT Math Avg. Score"] = pd.to_numeric(data["sat_results"]['SAT Math Avg. Score'], errors="coerce")  # convert the column to numbers; errors="coerce" turns values that cannot be parsed into NaN
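
A short sketch of what `errors="coerce"` does, using a made-up Series in which one value is not a number (the values here are illustrative only):

```python
import pandas as pd

# Hypothetical scores stored as text; "s" stands for a value that is not a number.
scores = pd.Series(["404", "s", "512"])

# Values that cannot be parsed become NaN instead of raising an error.
numeric = pd.to_numeric(scores, errors="coerce")
print(numeric)
print(numeric.mean())  # NaN is skipped when computing the mean
```

Because NaN is a float, the resulting column has a float dtype even though the valid scores are whole numbers.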

5. Regular Expression

　　import re

　　def extract_lat(data):
　　　　# Find every "(...,...)" substring. "\(" and "\)" match literal parentheses, "." matches any character except a newline, and "+" means one or more of the preceding item.
　　　　lat_lon = re.findall(r"\(.+,.+\)", data)
　　　　# findall() returns a list of matches, so lat_lon[0] is the first match.
　　　　lat = re.split(",", lat_lon[0])
　　　　final_lat = lat[0].replace("(", "")
　　　　return final_lat

　　data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(extract_lat)
　　print(data["hs_directory"].loc[0,"Location 1"])
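
The extraction can be tested on a single string. The sketch below assumes the location text ends with a "(latitude, longitude)" pair; the address is made up. Instead of `findall()` plus `split()`, it uses capture groups, which is one possible variation and not the method used in the notes above. The helper name `extract_lat_lon` is hypothetical.

```python
import re

# Hypothetical location text in the assumed "address + (lat, lon)" layout.
location = "123 Example Street\nBrooklyn, NY 11225\n(40.67, -73.96)"

def extract_lat_lon(text):
    # Group 1: everything between "(" and the first comma.
    # Group 2: everything after the comma (and optional spaces) up to ")".
    match = re.search(r"\(([^,]+),\s*([^)]+)\)", text)
    if match is None:
        return None, None  # no coordinate pair found
    return match.group(1), match.group(2)

lat, lon = extract_lat_lon(location)
print(lat, lon)
```

Both values come back as strings; `pd.to_numeric` or `float()` can convert them when numbers are needed.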

Conclusion:

In this section, we first read multiple CSV files and stored them in one dictionary of DataFrames. We then read two tab-delimited .txt survey files and combined them into one DataFrame with pd.concat(). After that, we built a DBN column with apply() and converted the SAT math scores to numbers with pd.to_numeric(). Finally, we extracted the latitude from the location text with a regular expression.

posted on 2016-10-19 07:58 阿难1020