hdf5文件、tqdm模块、nunique、read_csv、sort_values、astype、fillna

pandas.DataFrame.to_hdf(self, path_or_buf, key, **kwargs):

Hierarchical Data Format (HDF) ,to add another DataFrame or Series to an existing HDF file, please use append mode and a different a key.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},  index=['a', 'b', 'c'])
df.to_hdf('data.h5', key='df', mode='w', format='table')
# format : {‘fixed’, ‘table’}, default ‘fixed’
# ‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable
# ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data
s
= pd.Series([1, 2, 3, 4]) s.to_hdf('data.h5', key='s') pd.read_hdf('data.h5', 'df') pd.read_hdf('data.h5', 's')

 

tqdm模块显示进度条:

tqdm(self, iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, gui=False, **kwargs)

iterable : iterable, optional

total : int, optional. The number of expected iterations. If unspecified, len(iterable) is used if possible. 

for x in tqdm(train_df['request_timestamp'].values,total=len(train_df)):
    localtime=time.localtime(x)
    wday.append(localtime[6])
    hour.append(localtime[3])

 https://lorexxar.cn/2016/07/21/python-tqdm/

https://tqdm.github.io/docs/tqdm/

 

pandas.DataFrame.nuniquehttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

DataFrame.nunique(selfaxis=0dropna=True)

Count distinct observations over requested axis. Return Series with number of distinct observations. Can ignore NaN values.

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df
   A  B
0  1  1
1  2  1
2  3  1
>>> df.nunique()
A    3
B    1
dtype: int64
>>> df.nunique(axis=1)
0    1
1    2
2    2
dtype: int64

 

pandas.read_csv:

pandas.read_csv(...)常见参数:

sep str, default ‘,’

header int, list of int, default ‘infer’. Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None

names array-like, optional. List of column names to use.  Duplicates in this list are not allowed.

df=pd.read_csv('data/testA/totalExposureLog.out', sep='\t',names=['id','request_timestamp','position','uid','aid','imp_ad_size','bid','pctr','quality_ecpm','totalEcpm'])

 

pandas.DataFrame.sort_values:

DataFrame.sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
# axis这个参数的默认值为0,匹配的是index,跨行进行排序,当axis=1时,匹配的是columns,跨列进行排序
# by这个参数要求传入一个字符或者是一个字符列表,用来指定按照axis的中的哪个元素来进行排序
# ascending这个参数的默认值是True,按照升序排序,当传入False时,按照降序进行排列
# kind这个参数表示按照什么样算法来进行排序,默认值是quicksort(快速排序),也可以传入mergesort(归并排序)或者是heapsort(堆排序)

df.sort_values(by='col1')
df.sort_values(by=['col1', 'col2'])

pandas.DataFrame.astype:

DataFrame.astype(self, dtype, copy=True, errors='raise', **kwargs)
# dtype : data type, or dict of column name
# Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.dtypes

df.astype('int32').dtypes
df.astype({'col1': 'int32'}).dtypes

 

pandas.DataFrame.fillna: 

DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
# fillna()会填充nan数据,返回填充后的结果。如果希望在原DataFrame中修改,则把inplace设置为True

 

posted @ 2019-08-27 20:17  合唱团abc  阅读(603)  评论(0编辑  收藏  举报