Data Analysis with Pandas-(1)-Getting started with matrices

1. Reading data into NumPy

NumPy is a Python module with many functions for working with data. If you want to do serious work with data in Python, you'll be using NumPy a lot. We'll work through importing NumPy and loading a CSV file with its genfromtxt function.
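As a minimal sketch of that step: the rows and column names below are hypothetical stand-ins for the tutorial's CSV file, and they are read from an in-memory string so the example runs without a file on disk.

```python
import io
import numpy as np

# Hypothetical sample rows standing in for the tutorial's CSV file.
csv_text = """Year,Region,Country,Type,Liters
1986,Western Pacific,Viet Nam,Wine,0
1986,Americas,Uruguay,Other,0.5
1985,Africa,Algeria,Beer,0.26
"""

# With no dtype given, genfromtxt tries to parse every field as a float.
world_alcohol = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(world_alcohol)
```

With a real file, the first argument would be the file's path instead of the StringIO object.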

2. Fixing the data types

If you looked at the data you read in on the last screen, you may have noticed that it looked very strange. This is because genfromtxt reads the data into a NumPy array, and every element in an array has to be the same data type: everything is a string, or everything is an integer, and so on. By default, NumPy tried to convert all of our data to floats, so every value that isn't a number (such as a country name) became nan. We'll need to specify the data type when we read our data in so we can avoid that.
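One way to do this, sketched on hypothetical sample rows, is to pass a string dtype to genfromtxt and skip the header line. The dtype "U75" (Unicode strings of up to 75 characters) and the column names are assumptions made for illustration.

```python
import io
import numpy as np

# Hypothetical sample rows standing in for the tutorial's CSV file.
csv_text = """Year,Region,Country,Type,Liters
1986,Western Pacific,Viet Nam,Wine,0
1986,Americas,Uruguay,Other,0.5
1985,Africa,Algeria,Beer,0.26
"""

# Read every field as a string and drop the header row.
world_alcohol = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1
)
print(world_alcohol)
```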

3. Indexing the data

Now that we know how to read in a file, let's start pulling values out. Remember how all elements in a matrix have an index? Indexes start at 0, so we can print the item at row 1, column 2, by typing print(world_alcohol[0,1])
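A short sketch of this indexing, using a small hand-built array of hypothetical rows in place of the real data:

```python
import numpy as np

# Hypothetical sample rows; the real world_alcohol array comes from the CSV file.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
])

# Row 1, column 2 in everyday counting is index [0, 1] in NumPy.
first_row_second_col = world_alcohol[0, 1]
print(first_row_second_col)
```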

4. Vectors

When we grab a whole row or column from the matrix, we actually end up with a vector. Just like a matrix is a 2-dimensional array because it has rows and columns, a vector is a 1-dimensional array. Vectors are similar to Python lists in that they can be indexed with only one number. Think of a vector as just a single row, or a single column.
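As an illustration on hypothetical sample rows, a colon in one position selects everything along that dimension, and the result is a vector that takes a single index:

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
])

first_row = world_alcohol[0, :]      # the whole first row, as a vector
fourth_column = world_alcohol[:, 3]  # the whole fourth column, as a vector

# A vector is indexed with only one number.
print(fourth_column[2])
```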

5. Array shape

All arrays, whether they are 1-dimensional (vectors), 2-dimensional (matrices), or even larger, have a number of elements in each dimension. For example, a matrix may have 200 rows and 10 columns. We can use the shape attribute to find these dimensions.
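A quick sketch on hypothetical sample rows shows what shape returns for a matrix and for a vector:

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
])

matrix_shape = world_alcohol.shape        # (rows, columns)
vector_shape = world_alcohol[:, 3].shape  # a one-element tuple for a vector
print(matrix_shape, vector_shape)
```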

6. Boolean elements

We can also use boolean statements on arrays to get truth values. The interesting part about this is that the booleans are computed elementwise.

For example, world_alcohol[:,3] == "Beer" compares each element of the fourth column of world_alcohol with "Beer" and creates a new vector of True/False values.
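The elementwise comparison might look like this on hypothetical sample rows; the name beer matches the vector used on the following screens:

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
])

# One True/False value per row of the fourth column.
beer = world_alcohol[:, 3] == "Beer"
print(beer)
```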

7. Subsets of vectors

We can subset vectors based on boolean vectors like the ones we generated in the last screen.

The expression world_alcohol[:,3][beer] selects only the elements in the fourth column whose value is "Beer". It goes through each position in the fourth column vector (from 0 to the last index) and checks whether the beer vector is True at the same position. If the beer vector is True, the element of the fourth column at that position is included in the subset. If the beer vector is False, the element is skipped.
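A minimal sketch of this subsetting, again on hypothetical sample rows:

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
])

beer = world_alcohol[:, 3] == "Beer"

# Keep only the positions of the fourth column where beer is True.
beer_values = world_alcohol[:, 3][beer]
print(beer_values)
```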

8. Subsets of matrices

We can subset a matrix in the same way that we can subset a vector.

An expression such as world_alcohol[beer,:] selects all of the rows in world_alcohol where the "Type" column equals "Beer". Note that because matrices are indexed using two numbers, we are substituting the boolean vector beer for the first number. We can alter the second number to select different columns.

For example, world_alcohol[beer,1] would select the second column for the rows where the "Type" column equals "Beer".
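Both forms of matrix subsetting can be sketched on hypothetical sample rows:

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
])

beer = world_alcohol[:, 3] == "Beer"

beer_rows = world_alcohol[beer, :]        # every column of the matching rows
beer_second_col = world_alcohol[beer, 1]  # only the second column of those rows
print(beer_rows)
print(beer_second_col)
```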

9. Subsets with multiple conditions

So now we can find all of the rows that correspond to "Algeria", for example. But what if what we really want is to find all the rows for "Algeria" in "1985"?

We'll have to use multiple conditions to generate our vector.

Joining two parenthesized comparisons with the & symbol generates a boolean vector that uses multiple conditions. The parentheses specify that the two component vectors should be generated first (order of operations). Then the two vectors are compared index by index. If both vectors are True at index 1, then the resulting vector will be True at index 1. If either vector is False at index 1, the result will be False at index 1.
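To make the index-by-index comparison concrete, here is a small sketch with two hand-written boolean vectors (the names first and second are illustrative only):

```python
import numpy as np

first = np.array([True, False, True, False])
second = np.array([True, True, False, False])

# & keeps True only where both vectors are True at the same index.
combined = first & second
print(combined)
```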

We can add more than 2 conditions if we want -- we just have to put an & symbol between each one. The resulting vector will contain True in the position corresponding to rows where all conditions are True, and False for rows where any condition is False.
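Applied to the data, this could look like the sketch below. The rows are hypothetical, and the column positions for year (index 0) and country (index 2) are assumptions, since the text only fixes the positions of the "Type" and liters columns.

```python
import numpy as np

# Hypothetical sample rows; year in column 0 and country in column 2 are assumed.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1985", "Africa", "Algeria", "Wine", "0.16"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
])

# Two conditions: each comparison is wrapped in parentheses.
algeria_1985 = (world_alcohol[:, 2] == "Algeria") & (world_alcohol[:, 0] == "1985")

# Three conditions: one more & and one more parenthesized comparison.
algeria_1985_beer = algeria_1985 & (world_alcohol[:, 3] == "Beer")

print(world_alcohol[algeria_1985, :])
print(world_alcohol[algeria_1985_beer, :])
```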

10. Convert a column to floats

We now know almost everything we need to compute how much alcohol the people in a country drank in a given year! But there are a couple of things we need to work through first. First, we need to convert the "Liters of alcohol drunk" column (the fifth one) to floats. The values are strings right now, and we can't take a meaningful sum of strings. We can use the astype method on the array to do this.
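A minimal sketch of astype on a vector of numeric strings (the values are hypothetical and contain nothing that fails to convert):

```python
import numpy as np

# Hypothetical liters values, stored as strings like the rest of the matrix.
liters_as_strings = np.array(["0.5", "1.62", "0.26"])

# astype returns a new array with the requested type; the original is unchanged.
liters = liters_as_strings.astype(float)
print(liters)
```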

11. Replace values in an array

There are values in our alcohol consumption column that are preventing us from converting the column from strings to floats. In order to fix this, we first have to learn how to replace values. We can replace values in a NumPy array by just assigning to them with the equals sign.

For example, assigning '10' to the elements selected by world_alcohol[:,4] == '0' will replace any item in the alcohol consumption column that equals '0' (remember that the world alcohol matrix is all string values) with '10'.
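The replacement can be sketched as follows on hypothetical sample rows. Slicing a column gives a view of the matrix, so assigning through it changes world_alcohol itself.

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
])

# Boolean vector marking the rows whose fifth column is exactly '0'.
is_zero = world_alcohol[:, 4] == "0"

# Assign '10' to just those elements of the fifth column.
world_alcohol[:, 4][is_zero] = "10"
print(world_alcohol[:, 4])
```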

12. Convert the alcohol consumption column to floats

Now that you know what the bad value is, we can replace it and then convert the column to floats.
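As a sketch of the two steps together: the bad value is assumed here to be an empty string, and replacing it with '0' is likewise an assumption made for illustration, so check both against the actual data.

```python
import numpy as np

# Hypothetical liters column; the empty string is an assumed bad value.
liters_as_strings = np.array(["0.5", "", "1.62"])

# Step 1: replace the bad value with a string that can be parsed as a number.
liters_as_strings[liters_as_strings == ""] = "0"

# Step 2: convert the cleaned column to floats.
liters = liters_as_strings.astype(float)
print(liters)
```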

13. Compute the total alcohol consumption

We can compute the total value of a column using the sum method.
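For instance, on hypothetical sample rows, the fifth column can be converted and summed like this:

```python
import numpy as np

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
])

# Convert the liters column to floats, then add up all of its elements.
liters = world_alcohol[:, 4].astype(float)
total = liters.sum()
print(total)
```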

14. Finding how much alcohol a person in a country drank in a year

We can subset a vector with another vector, as we learned earlier. This means that we can find the total alcohol consumed by any given country in any given year now.
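Putting the pieces together might look like the sketch below. The rows are hypothetical, and the year (index 0) and country (index 2) column positions are assumptions.

```python
import numpy as np

# Hypothetical sample rows; year in column 0 and country in column 2 are assumed.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1985", "Africa", "Algeria", "Wine", "0.16"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
])

# Boolean vector for the rows we care about.
algeria_1985 = (world_alcohol[:, 2] == "Algeria") & (world_alcohol[:, 0] == "1985")

# Subset the liters vector with the boolean vector, then sum what is left.
liters = world_alcohol[:, 4].astype(float)
total = liters[algeria_1985].sum()
print(total)
```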

15. A function to sum yearly alcohol consumption

Now that we know how to find the total alcohol consumption of the average person in a country in a given year, we can make a function out of it. A function will make it easier for us to calculate the alcohol consumption for all countries.
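One possible shape for such a function is sketched below. The name total_consumption, its parameters, the sample rows, and the year and country column positions are all hypothetical.

```python
import numpy as np

def total_consumption(data, country, year):
    """Sum the liters column for the rows matching country and year.

    Assumes year in column 0, country in column 2, liters in column 4,
    and that every liters value can be parsed as a float.
    """
    matches = (data[:, 2] == country) & (data[:, 0] == year)
    return data[matches, 4].astype(float).sum()

# Hypothetical sample rows standing in for the real data.
world_alcohol = np.array([
    ["1985", "Africa", "Algeria", "Beer", "0.26"],
    ["1985", "Africa", "Algeria", "Wine", "0.16"],
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
    ["1989", "Americas", "Uruguay", "Beer", "1.5"],
])

print(total_consumption(world_alcohol, "Algeria", "1985"))
```

Note that the year is passed as a string, because every value in the matrix is a string.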

16. Finding the country that drinks the least

We can now loop over the keys of our dictionary of per-country totals to find the country with the lowest amount of alcohol consumed per person in 1989.
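A sketch of that loop, assuming a dictionary named totals that maps each country to its 1989 total. The dictionary name, the sample rows, and the year and country column positions are hypothetical.

```python
import numpy as np

# Hypothetical sample rows; year in column 0 and country in column 2 are assumed.
world_alcohol = np.array([
    ["1989", "Africa", "Algeria", "Beer", "0.24"],
    ["1989", "Africa", "Algeria", "Wine", "0.11"],
    ["1989", "Americas", "Uruguay", "Beer", "1.5"],
    ["1989", "Western Pacific", "Viet Nam", "Beer", "0.19"],
])

# Build the dictionary of totals for 1989, one entry per country.
is_1989 = world_alcohol[:, 0] == "1989"
totals = {}
for country in np.unique(world_alcohol[:, 2]):
    rows = is_1989 & (world_alcohol[:, 2] == country)
    totals[str(country)] = world_alcohol[rows, 4].astype(float).sum()

# Loop over the keys, remembering the smallest total seen so far.
lowest_country = None
lowest_total = None
for country in totals:
    if lowest_total is None or totals[country] < lowest_total:
        lowest_country = country
        lowest_total = totals[country]

print(lowest_country, lowest_total)
```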

posted @ 2015-11-05 00:18  每天灬进步一点