[Machine Learning with Python] How to get your data?


Using the Pandas Library

The simplest way is to read data from a .csv file and store it as a DataFrame object:

import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)  # use the first column as the index; skip the first row of the file
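
A quick check such as the following confirms the file was parsed as expected:

print(df.head())   # show the first five rows
print(df.shape)    # (number of rows, number of columns)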

You can also read .xls files and directly select the rows and columns you are interested in by setting the skiprows and usecols parameters; the index_col parameter indicates which column to use as the index.

energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=8, usecols='E,G', index_col=None, na_values=['NA'])

For .txt files, you can also use the read_csv function by specifying the separator:

university_towns = pd.read_csv('university_towns.txt', sep='\n', header=None)  # each line becomes one row

See more about pandas I/O operations at http://pandas.pydata.org/pandas-docs/stable/io.html

Using the os Module

Read .csv files:

import os
import csv

for file in os.listdir("objective_folder"):
    with open('objective_folder/' + file, newline='') as csvfile:
        rows = csv.reader(csvfile)  # read the .csv file
        for row in rows:  # print each line in the file
            print(row)
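
If you want all of these .csv files in a single DataFrame rather than printed line by line, the same loop can feed pandas directly (a minimal sketch, still assuming the objective_folder directory above):

import os
import pandas as pd

frames = []
for file in os.listdir("objective_folder"):
    if file.endswith(".csv"):  # ignore anything that is not a .csv file
        frames.append(pd.read_csv(os.path.join("objective_folder", file)))
combined = pd.concat(frames, ignore_index=True)  # stack all files row-wise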

Read .xls files:

import os
import xlrd

for file in os.listdir("objective_folder/"):
    data = xlrd.open_workbook('objective_folder/' + file)
    table = data.sheet_by_index(0)  # the first sheet in the workbook
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        if i == 0:  # skip the first row if it defines variable names
            continue
        row_values = table.row_values(i)  # read each row's values
        print(row_values)
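
Alternatively, pandas can read the same folder of Excel files without calling xlrd directly (a sketch under the same folder-layout assumption; for .xls files pandas relies on xlrd behind the scenes):

import os
import pandas as pd

for file in os.listdir("objective_folder/"):
    # sheet_name=0 selects the first sheet; header=0 uses the first row as column names
    sheet = pd.read_excel(os.path.join("objective_folder", file), sheet_name=0, header=0)
    print(sheet.values)  # each remaining row as an array of values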

Download from a Website Automatically

We can also read data directly from a URL. This time, the .csv file is compressed as housing.tgz, so we need to download the file and then decompress it. You can write a small function like the one below to do this. It is a worthwhile effort because you can get the most recent data every time you run the function.

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)  # create the target directory if needed
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)  # extract housing.csv
    housing_tgz.close()

When you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts housing.csv from it into this directory.
Now let’s load the data using pandas. Once again, you should write a small function to load the data:

import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH): 
    csv_path = os.path.join(housing_path, "housing.csv") 
    return pd.read_csv(csv_path)
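
Putting the two functions together, a typical first look at the dataset is just:

fetch_housing_data()           # download and extract housing.csv
housing = load_housing_data()  # read it into a DataFrame
print(housing.head())          # inspect the first few rows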


What’s more?

These are the methods I have encountered so far. In typical environments, your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations and familiarize yourself with the data schema (see the sketch below). I will add more methods as I encounter them.
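
For the relational-database case, pandas can read query results straight into a DataFrame. Here is a minimal sketch using Python's built-in sqlite3 module; the company.db file and the employees table are hypothetical stand-ins:

import sqlite3
import pandas as pd

conn = sqlite3.connect('company.db')  # hypothetical database file
df = pd.read_sql_query('SELECT * FROM employees', conn)  # hypothetical table
conn.close()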
