Python Tutorial: More Advanced Data Wrangling

2017-12-12 15:18  nuswgg
  • Drop observations with missing information.
import numpy as np
import pandas as pd
# Notice the use of the fish data set because it has some missing
# observations
fish = pd.read_csv('/Users/fish.csv')
# First sort by Weight, requesting those with NA for Weight first 
fish = fish.sort_values(by='Weight', kind='mergesort', na_position='first')
print(fish.head())
new_fish = fish.dropna()
print(new_fish.head())

pandas.DataFrame.dropna
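
By default dropna() removes a row if any of its columns is missing, which can discard more rows than intended. The sketch below compares that default with the subset argument. The toy DataFrame and its values are made up for illustration and are not the real fish data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the fish data set
toy = pd.DataFrame({
    "Species": ["Bream", "Perch", "Pike", "Smelt"],
    "Weight": [242.0, np.nan, 200.0, 6.7],
    "Height": [11.5, 6.2, np.nan, 1.7],
})

# Default behaviour: drop a row if ANY column is missing
complete = toy.dropna()

# Drop a row only when Weight is missing; a missing Height is tolerated
has_weight = toy.dropna(subset=["Weight"])

# Count how many rows each approach removed
dropped_any = len(toy) - len(complete)
dropped_weight = len(toy) - len(has_weight)
print(dropped_any, dropped_weight)
```

If only one variable matters for the analysis, restricting the check with subset keeps more observations.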

  • Merge two data sets together on a common variable.

a) First, select specific columns of a data set to create two smaller data sets.

# Notice the use of the student data set again, however we want to reload it
# without the changes we've made previously
student = pd.read_csv('/Users/class.csv')
student1 = pd.concat([student["Name"], student["Sex"], student["Age"]],
axis = 1)
print(student1.head())

 

student2 = pd.concat([student["Name"], student["Height"], student["Weight"]], axis = 1)
print(student2.head())
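
Selecting columns with a list of names is a common shorthand for the pd.concat() calls above. The following is a minimal sketch on made-up rows (the names and values are illustrative, not the real class data), showing that both approaches give the same result.

```python
import pandas as pd

# Hypothetical stand-in for the student data set
toy = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara"],
    "Sex": ["M", "F", "F"],
    "Age": [14, 13, 13],
    "Height": [69.0, 56.5, 65.3],
    "Weight": [112.5, 84.0, 98.0],
})

# Column selection as written in the tutorial
via_concat = pd.concat([toy["Name"], toy["Sex"], toy["Age"]], axis=1)

# Equivalent selection with a list of column names
via_list = toy[["Name", "Sex", "Age"]]

identical = via_concat.equals(via_list)
print(identical)
```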

b) Second, we want to merge the two smaller data sets on the common variable.

new = pd.merge(student1, student2, on="Name")
print(new.head())

c) Finally, we want to check whether the merged data set is the same as the original data set.

print(student.equals(new))
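
The equality check only passes when the key is unique and the column order matches the original. The sketch below illustrates both points on made-up rows (not the real class data). The validate argument of pd.merge() is used here as an optional safeguard and is not part of the tutorial's code.

```python
import pandas as pd

# Hypothetical stand-in for the student data set
toy = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara"],
    "Sex": ["M", "F", "F"],
    "Age": [14, 13, 13],
    "Height": [69.0, 56.5, 65.3],
    "Weight": [112.5, 84.0, 98.0],
})

left = toy[["Name", "Sex", "Age"]]
right = toy[["Name", "Height", "Weight"]]

# validate raises MergeError if Name is duplicated on either side
merged = pd.merge(left, right, on="Name", validate="one_to_one")
same = toy.equals(merged)

# Swapping the inputs changes the column order, so equals() reports False
swapped = pd.merge(right, left, on="Name")
same_swapped = toy.equals(swapped)

# Restoring the original column order makes the frames equal again
same_after_reorder = toy.equals(swapped[toy.columns])
print(same, same_swapped, same_after_reorder)
```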
  • Merge two data sets together by index number only.

a) First, select specific columns of a data set to create two smaller data sets.

newstudent1 = pd.concat([student["Name"], student["Sex"], student["Age"]], axis = 1)
print(newstudent1.head())

newstudent2 = pd.concat([student["Height"], student["Weight"]], axis = 1)
print(newstudent2.head())

b) Second, we want to join the two smaller data sets.

new2 = newstudent1.join(newstudent2)
print(new2.head())
c) Finally, we want to check to see if the joined data set is the same as the original data set.
print(student.equals(new2))
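
Note that join() matches rows by index label, not by their position in the data set. A minimal sketch with made-up frames (the names and heights are illustrative only) shows why this matters once rows have been reordered or renumbered.

```python
import pandas as pd

# Hypothetical toy frames; the index labels are the only link between them
left = pd.DataFrame({"Name": ["Alfred", "Alice", "Barbara"]}, index=[0, 1, 2])
right = pd.DataFrame({"Height": [65.3, 69.0, 56.5]}, index=[2, 0, 1])

# join() aligns on index labels, so the shuffled right frame still lines up
joined = left.join(right)
heights = joined["Height"].tolist()

# Renumbering the right frame discards the original labels,
# so each height is now attached to a different name
renumbered = left.join(right.reset_index(drop=True))
heights_renumbered = renumbered["Height"].tolist()
print(heights, heights_renumbered)
```

This is why the tutorial splits and joins the freshly loaded student data set: both pieces still carry the same index.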
  • Create a pivot table to summarize information about a data set.

# Notice we are using a new data set that needs to be read into the
# environment
price = pd.read_csv('/Users/price.csv')
# The following code removes the "," and "$" characters from
# the ACTUAL column so that the values can be summed
from re import sub
from decimal import Decimal
def trim_money(money):
    # Keep only digits and the decimal point, then convert to float
    return float(Decimal(sub(r'[^\d.]', '', money)))
price["REVENUE"] = price["ACTUAL"].apply(trim_money)
# aggfunc="sum" is equivalent to np.sum here and avoids a deprecation
# warning in recent pandas versions
table = pd.pivot_table(price, index=["COUNTRY", "STATE", "PRODTYPE", "PRODUCT"], values="REVENUE",
                       aggfunc="sum")
print(table.head())

pd.pivot_table(), pd.pivot()
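
To see what the pivot table computes without the price.csv file, here is a small sketch on made-up rows. The column names mirror the ones used above, but the values are invented. The currency clean-up uses the vectorized str.replace() as an alternative to apply(), which is a swapped-in technique rather than the tutorial's own.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the price data set
toy = pd.DataFrame({
    "COUNTRY": ["U.S.A.", "U.S.A.", "U.S.A.", "Canada"],
    "STATE": ["Texas", "Texas", "Florida", "Ontario"],
    "ACTUAL": ["$1,200.50", "$300.00", "$45.25", "$10.00"],
})

# Strip everything except digits and "." and convert to float
toy["REVENUE"] = toy["ACTUAL"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# One row per (COUNTRY, STATE) pair, with REVENUE summed within each pair
table = pd.pivot_table(toy, index=["COUNTRY", "STATE"], values="REVENUE",
                       aggfunc="sum")
print(table)

# Individual cells are addressed with a tuple of index values
texas_total = table.loc[("U.S.A.", "Texas"), "REVENUE"]
```

The two Texas rows collapse into a single row whose REVENUE is their sum.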

  • Return all unique values from a text variable.
print(np.unique(price["STATE"]))

np.unique()
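
np.unique() returns the values in sorted order. When the order of first appearance or the frequency of each value matters, the pandas methods unique() and value_counts() are possible alternatives. The sketch below uses made-up state names rather than the real price data.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for price["STATE"]
states = pd.Series(["Texas", "Florida", "Texas", "California", "Florida"])

# np.unique() sorts the distinct values
sorted_unique = np.unique(states).tolist()

# Series.unique() keeps the order in which values first appear
appearance_unique = states.unique().tolist()

# value_counts() also reports how often each value occurs
counts = states.value_counts()
print(sorted_unique, appearance_unique)
```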