Data Analysis with Pandas-(2)-Dataframe basics

Read in a CSV file

Let's get started using pandas. Because pandas builds on numpy, some things will be familiar, but others will take some time to get used to. The first step, as always, is reading in a CSV file, which pandas does with the read_csv function. The pandas equivalent of a two-dimensional numpy array, or matrix, is called a dataframe, and that is what read_csv returns.
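As a minimal sketch of reading a CSV into a dataframe: the column names other than "Iron_(mg)" and all of the values below are made up for illustration, and the text is read from an in-memory string (an assumption, so the example runs without a file on disk). With a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; names and values are made up.
csv_text = """Shrt_Desc,Protein_(g),Lipid_Tot_(g),Iron_(mg)
BUTTER,0.85,81.11,0.02
CHEESE,21.40,28.74,0.31
EGG,12.56,9.51,1.75
"""

# read_csv accepts a file path or any file-like object and returns a DataFrame.
food_info = pd.read_csv(io.StringIO(csv_text))

print(type(food_info))
print(food_info.shape)  # (number of rows, number of columns)
```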

Indexing data with pandas

Indexing in pandas is slightly different from indexing in numpy. To select by position, we use the .iloc[] indexer (note that it takes square brackets instead of parentheses, because it is an indexer rather than a method call).

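For a numpy array named arr, the expressions arr[0, 0], arr[:, 0], and arr[0, :] get the first element of the first row, the whole first column, and the whole first row, respectively.

The equivalent indexing in a pandas dataframe named food_info is food_info.iloc[0, 0], food_info.iloc[:, 0], and food_info.iloc[0, :].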
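The comparison between numpy and pandas indexing can be sketched side by side. The array contents and the column names "a", "b", "c" are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small made-up matrix, and the same data wrapped in a DataFrame.
arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
df = pd.DataFrame(arr, columns=["a", "b", "c"])

# numpy: index the array directly.
first_value = arr[0, 0]    # first element of the first row
first_column = arr[:, 0]   # whole first column
first_row = arr[0, :]      # whole first row

# pandas: the same positions, but through .iloc[].
df_first_value = df.iloc[0, 0]
df_first_column = df.iloc[:, 0]   # a Series
df_first_row = df.iloc[0, :]      # also a Series
```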

Index a series

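What we call a vector or a one-dimensional array is called a series in pandas. We can index a series directly with square brackets, without .iloc[]. Plain square brackets look values up by index label, which matches the position when the series has the default 0, 1, 2, ... index.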
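A short sketch of indexing a series, using a made-up dataframe. It assumes the default integer index, in which case the label 1 and the position 1 refer to the same element.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]})

# Selecting one column by position gives a Series.
col = df.iloc[:, 0]

# Plain square brackets look up by index label.
# With the default index, label 1 is also the second position.
second = col[1]

# .iloc[] still works on a Series and is always positional.
second_by_position = col.iloc[1]
```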

Get column by name

One cool thing about pandas is that we can get columns with their names, instead of only by number.
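For instance, selecting a column by its name might look like the sketch below. The dataframe is built inline with made-up values, and "Shrt_Desc" is an assumed column name.

```python
import pandas as pd

# Made-up values; "Shrt_Desc" is an assumed column name.
food_info = pd.DataFrame({
    "Shrt_Desc": ["BUTTER", "CHEESE", "EGG"],
    "Iron_(mg)": [0.02, 0.31, 1.75],
})

# Pass the column name in square brackets to get that column as a Series.
iron = food_info["Iron_(mg)"]
print(iron)
```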

Getting multiple columns by name

We can get a single column by passing in a name, but we can also get multiple columns by passing in a list of names.
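A sketch of selecting several columns at once, again with made-up values and assumed column names. Note the double brackets: the outer pair does the indexing and the inner pair is the list of names.

```python
import pandas as pd

# Made-up values; column names other than "Iron_(mg)" are assumptions.
food_info = pd.DataFrame({
    "Shrt_Desc": ["BUTTER", "CHEESE", "EGG"],
    "Protein_(g)": [0.85, 21.40, 12.56],
    "Iron_(mg)": [0.02, 0.31, 1.75],
})

# A list of names returns a new DataFrame with just those columns,
# in the order the names were given.
subset = food_info[["Iron_(mg)", "Protein_(g)"]]
print(subset)
```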

Math with columns

We can do math with vectors (or columns, as we are more used to calling them). Adding two columns adds the values element by element: first the values at index 0 are added, then the values at index 1, and so on. The result is a new vector holding all of the sums. For this kind of math the vectors should be the same length (have the same number of elements); pandas matches values by index label, so two columns from the same dataframe always line up.
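Element-wise addition of two columns can be sketched as follows. The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up columns of equal length.
df = pd.DataFrame({
    "protein": [1, 2, 3],
    "fat": [10, 20, 30],
})

# Values at the same index are added together, producing a new Series.
total = df["protein"] + df["fat"]
print(total)
```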

Math with scalars

We can also do math with a vector (column) and a scalar (a single number).

Dividing the "Iron_(mg)" column by 1000, as in food_info["Iron_(mg)"] / 1000, divides each item in the column by 1000, converting the values from milligrams to grams. Multiplying or dividing by scalars can be a good way to rescale values in columns.
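The milligram-to-gram conversion might look like this sketch. The iron values are made up and chosen to be round numbers.

```python
import pandas as pd

# Made-up iron amounts in milligrams.
food_info = pd.DataFrame({"Iron_(mg)": [1000.0, 2500.0, 500.0]})

# The scalar is applied to every element of the column.
iron_grams = food_info["Iron_(mg)"] / 1000
print(iron_grams)
```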

Sorting columns

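We can also sort the dataframe by the values in a column. To do this, we use the .sort_values method, which returns a new dataframe sorted according to the specified criteria. (Older pandas releases called this .sort; it was deprecated in pandas 0.17 and later removed.)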
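A minimal sketch of sorting by one column with sort_values, the method current pandas provides for this. The data and column names are made up.

```python
import pandas as pd

# Made-up values; column names are assumptions.
food_info = pd.DataFrame({
    "Shrt_Desc": ["A", "B", "C"],
    "Protein_(g)": [12.5, 0.9, 21.4],
})

# Ascending order is the default; the original dataframe is left unchanged.
by_protein = food_info.sort_values("Protein_(g)")

# Pass ascending=False to put the largest values first.
by_protein_desc = food_info.sort_values("Protein_(g)", ascending=False)
```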

Multicolumn sort

We only sorted on the value of a single column before, but we can also sort based on multiple columns by passing a list of column names. The first column in the list will be sorted on first. Any rows that have the same value for the first column will be sorted based on the second column, and so on until all of the given columns are used. We can specify a different sort order for each column by passing a list of booleans as the ascending argument.
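A sketch of a two-column sort with a different order per column, using made-up data. Here protein is sorted from high to low, and ties are broken by fat from low to high.

```python
import pandas as pd

# Made-up values with ties in the first sort column.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "protein": [20, 20, 10, 10],
    "fat": [5, 1, 8, 3],
})

# One boolean per column: protein descending, then fat ascending.
ranked = df.sort_values(["protein", "fat"], ascending=[False, True])
print(ranked)
```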

Creating a rating

Let's say that we have a friend, Superman, who's trying to gain muscle, and so he wants to eat foods that have a lot of protein. But he's also very health conscious, so he wants to avoid fat. We could sort based on those two columns and give him an answer, but what if he cares more about a food having a lot of protein than it being too fatty? What we need to do is construct a rating for each food based on Superman's criteria. Then we can recommend the food that scores the highest.

Normalizing columns

One of the simplest ways to normalize a column is to divide all of the values by the maximum value in the column. It will change all of the values to be between 0 and 1. It doesn't work so well with negative values, but we don't have any (you can't have negative amounts of protein, fat, etc). This isn't necessarily the best way to normalize, and we'll learn some better methods soon.
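Dividing by the column maximum can be sketched like this. The protein values are made up; the approach assumes the column has no negative values and a nonzero maximum.

```python
import pandas as pd

# Made-up, non-negative protein amounts.
df = pd.DataFrame({"protein": [5.0, 10.0, 20.0]})

# .max() is a single number, so this is ordinary scalar division.
normalized_protein = df["protein"] / df["protein"].max()
print(normalized_protein)
```

The largest value maps to 1, and every other value becomes its fraction of that maximum.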

Adding a new column

We can add a column to a dataframe by assigning to it.

Assigning twice the lipid column to food_info["double_fat"], for example, stores double the amount of lipids in a new "double_fat" column in food_info.
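The assignment could look like the sketch below. The lipid column name "Lipid_Tot_(g)" is an assumption, since the original code is not shown, and the values are made up.

```python
import pandas as pd

# Made-up values; "Lipid_Tot_(g)" is an assumed column name.
food_info = pd.DataFrame({"Lipid_Tot_(g)": [1.5, 2.0, 0.0]})

# Assigning to a name that does not exist yet creates the column.
food_info["double_fat"] = food_info["Lipid_Tot_(g)"] * 2
print(food_info)
```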

posted @ 2015-11-06 22:15 每天灬进步一点