Data Cleaning 4

1. Read the data:

　　1.1 If the data is not in a .csv file, we have to find the right options for reading it. Here the file is tab-delimited and encoded as Windows-1252:

　　import pandas as pd
　　all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")  # see http://kunststube.net/encoding/ for an introduction to encodings
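
To see what the delimiter and encoding arguments do without needing the real survey file, here is a minimal sketch that reads a small in-memory sample. The sample contents and the names raw_bytes and sample_survey are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: one header row and one data row, tab-separated,
# stored as Windows-1252 bytes (the "é" is a single byte in that encoding).
raw_bytes = "dbn\tschoolname\n01M015\tCafé School\n".encode("windows-1252")

# delimiter="\t" splits each line on tabs; encoding tells pandas how to
# turn the bytes back into text.
sample_survey = pd.read_csv(io.BytesIO(raw_bytes), delimiter="\t", encoding="windows-1252")
print(sample_survey)
```

If the encoding does not match the file, pandas will usually either raise a decoding error or silently produce garbled characters, so it is worth checking a few non-ASCII values after reading.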

　　1.2 When there are many data files, we use a for loop to read them one by one.

　　data = {}
　　for f in data_files:
　　    file = pd.read_csv("schools/{0}".format(f))  # a variable name written inside the quotes is just literal text, so use format() to insert its value
　　    f = f.replace(".csv", "")  # use the file name without ".csv" as the key
　　    data[f] = file
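
The loop above can be tried end to end with the sketch below. It assumes nothing about the real schools folder: it writes two tiny made-up CSV files into a temporary folder first, and the file names and contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

data = {}
with tempfile.TemporaryDirectory() as folder:
    # Create two small demo files (hypothetical names and contents).
    for name in ["sat_results.csv", "class_size.csv"]:
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("DBN,value\n01M015,1\n")

    # Read every file and store it under its name without the extension.
    for f in sorted(os.listdir(folder)):
        path = "{0}/{1}".format(folder, f)  # format() inserts the variable values
        key = f.replace(".csv", "")         # "class_size.csv" -> "class_size"
        data[key] = pd.read_csv(path)

print(sorted(data))
```

Keeping the DataFrames in a dictionary means each one can later be looked up by name, for example data["class_size"].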

　　1.3 Combine several DataFrames into one with the concat() function.

　　survey = pd.concat([all_survey, d75_survey], axis=0)  # axis=0 stacks the rows of the second DataFrame under the first
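
A small sketch of what concat() does when the two DataFrames do not have exactly the same columns. The two frames and their column values here are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that share only the "dbn" column.
all_part = pd.DataFrame({"dbn": ["01M015"], "rr_s": [89]})
d75_part = pd.DataFrame({"dbn": ["75K004"], "saf_p_11": [8.1]})

# axis=0 stacks rows; columns missing from one frame are filled with NaN.
combined = pd.concat([all_part, d75_part], axis=0)
print(combined)
```

Note that the original row labels are kept, so both rows are labelled 0 here. Passing ignore_index=True to concat() renumbers the rows instead.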

2. Cleaning up the data:

　　A combined DataFrame will inevitably contain many NaN values, so we need to deal with them.

　　2.1 Work out which columns are relevant and extract them from the original DataFrame.

　　2.2 Some columns have different names but hold the same content. Rename them so that they share one name.

　　2.3 To make the strings consistent, we can add to, strip from, replace, or convert to numeric the relevant column names.
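
One possible way to carry out these three steps with pandas is sketched below. The DataFrame survey_demo and its columns are made up, and applying the numeric conversion to column values is an assumption about what step 2.3 is for.

```python
import pandas as pd

# Hypothetical data: a lowercase key column, a numeric column stored as text,
# and a column we do not need.
survey_demo = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_s": ["89", "n/a"],
    "unused": [1, 2],
})

# 2.2: rename the key column so it matches the name used in other datasets.
survey_demo = survey_demo.rename(columns={"dbn": "DBN"})

# 2.1: keep only the relevant columns.
survey_demo = survey_demo[["DBN", "rr_s"]]

# 2.3: convert text to numbers; errors="coerce" turns unparseable text into NaN.
survey_demo["rr_s"] = pd.to_numeric(survey_demo["rr_s"], errors="coerce")
print(survey_demo)
```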

3. Filtering the data:

　　3.1 We can use re.findall() and re.split() to extract the substring we need from a longer string.

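　　import re

　　def extract_lat(data):
　　    lat_lon = re.findall(r"\(.+,.+\)", data)  # find the "(latitude, longitude)" part
　　    lat = re.split(",", lat_lon[0])  # split it at the comma
　　    final_lat = lat[0].replace("(", "")  # drop the opening parenthesis
　　    return final_lat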

　　data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(extract_lat)  # apply() calls the function on every value in the column
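
As a variation on the function above, the sketch below uses capture groups so that one helper can return either coordinate, and converts the result to a number. The helper name extract_coordinate and the sample address are hypothetical.

```python
import re

import pandas as pd


def extract_coordinate(text, position):
    """Return the latitude (position 0) or longitude (position 1) as text."""
    match = re.search(r"\(([^,()]+),([^,()]+)\)", text)
    if match is None:
        return None  # no "(lat, lon)" part in this value
    return match.group(position + 1).strip()


# Hypothetical value in the same shape as the "Location 1" column.
locations = pd.Series(["1 Example Street\nBrooklyn, NY 11225\n(40.67, -73.96)"])

lat = pd.to_numeric(locations.apply(lambda t: extract_coordinate(t, 0)))
lon = pd.to_numeric(locations.apply(lambda t: extract_coordinate(t, 1)))
print(lat.iloc[0], lon.iloc[0])
```

Returning None when nothing matches avoids the IndexError that lat_lon[0] would raise on a value without coordinates.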

　　3.2 Find the relevant data in each column and store it in another DataFrame.
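
A common way to do this in pandas is boolean filtering. The sketch below is only an illustration: the column names GRADE and PROGRAM TYPE and their values are assumptions, not taken from the notes above.

```python
import pandas as pd

# Hypothetical class-size rows.
class_size_demo = pd.DataFrame({
    "DBN": ["01M015", "01M015", "01M292"],
    "GRADE": ["09-12", "0K", "09-12"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "SPEC ED"],
})

# Build one True/False mask per condition, then combine them with &.
is_high_school = class_size_demo["GRADE"] == "09-12"
is_gen_ed = class_size_demo["PROGRAM TYPE"] == "GEN ED"

# copy() makes the result an independent DataFrame.
filtered = class_size_demo[is_high_school & is_gen_ed].copy()
print(filtered)
```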

4. Combining the data

　　4.1 Sometimes we want each category in a column to appear only once; otherwise the data is hard to categorize. So we group the rows by that column and calculate the mean of each group.

　　import numpy as np
　　group_by = class_size.groupby("DBN")  # groupby() puts rows with the same DBN into one group
　　class_size = group_by.aggregate(np.mean)  # aggregate() applies the function to each group
　　class_size.reset_index(inplace=True)  # turn DBN from the index back into a regular column
　　data["class_size"] = class_size
　　print(data["class_size"].head(5))
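
The effect of grouping is easier to see on a tiny made-up table. This sketch uses the mean() method of the grouped object with numeric_only=True, which is a swapped-in alternative to aggregate(np.mean), and the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: school 01M015 appears twice.
sizes = pd.DataFrame({
    "DBN": ["01M015", "01M015", "01M292"],
    "AVERAGE CLASS SIZE": [20.0, 30.0, 25.0],
})

# One row per DBN, holding the mean of each numeric column.
per_school = sizes.groupby("DBN").mean(numeric_only=True).reset_index()
print(per_school)
```

After this step every DBN appears exactly once, which makes it usable as a key when combining with other datasets.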

5. http://boundingbox.klokantech.com/ is useful for looking up the coordinates of a city.

posted on 2016-10-25 03:00 by 阿难1020