HDF5 files, the tqdm module, nunique, read_csv, sort_values, astype, fillna

pandas.DataFrame.to_hdf(self, path_or_buf, key, **kwargs):

Writes the object to a Hierarchical Data Format (HDF) file. To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},  index=['a', 'b', 'c'])
df.to_hdf('data.h5', key='df', mode='w', format='table')
# format : {‘fixed’, ‘table’}, default ‘fixed’
# ‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable
# ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data

s = pd.Series([1, 2, 3, 4])
s.to_hdf('data.h5', key='s')

pd.read_hdf('data.h5', 'df')
pd.read_hdf('data.h5', 's')

Showing a progress bar with the tqdm module:

tqdm(self, iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, gui=False, **kwargs)

iterable : iterable, optional

total : int, optional. The number of expected iterations. If unspecified, len(iterable) is used if possible.

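import time
from tqdm import tqdm
# train_df is assumed to be an existing DataFrame whose request_timestamp column holds Unix timestamps
wday, hour = [], []
for x in tqdm(train_df['request_timestamp'].values, total=len(train_df)):
    localtime = time.localtime(x)  # converted using the machine's local time zone
    wday.append(localtime[6])  # tm_wday: day of the week, Monday is 0
    hour.append(localtime[3])  # tm_hour: 0-23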
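
The total argument matters most when the iterable has no len(), for example a generator. The following is a minimal sketch of that case, assuming tqdm is installed; the three timestamps and the iter_timestamps helper are made up for illustration and are not part of the original data.

```python
import time
from tqdm import tqdm

# Hypothetical sample data: three Unix timestamps in seconds.
timestamps = [1566907200, 1566910800, 1566997200]

def iter_timestamps():
    # A generator has no len(), so tqdm cannot work out the total on its own.
    for ts in timestamps:
        yield ts

wday, hour = [], []
# Passing total lets tqdm draw a full bar with percentage and ETA.
for x in tqdm(iter_timestamps(), total=len(timestamps), desc="timestamps"):
    localtime = time.localtime(x)    # depends on the machine's time zone
    wday.append(localtime.tm_wday)   # same field as localtime[6]
    hour.append(localtime.tm_hour)   # same field as localtime[3]
```

Without total, tqdm would still count iterations here, but it could only show a running counter instead of a bar.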

https://lorexxar.cn/2016/07/21/python-tqdm/

https://tqdm.github.io/docs/tqdm/

pandas.DataFrame.nunique: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

DataFrame.nunique(self, axis=0, dropna=True)

Count distinct observations over requested axis. Return Series with number of distinct observations. Can ignore NaN values.

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df
   A  B
0  1  1
1  2  1
2  3  1
>>> df.nunique()
A    3
B    1
dtype: int64
>>> df.nunique(axis=1)
0    1
1    2
2    2
dtype: int64
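
The dropna parameter controls whether NaN counts as a distinct value. A small sketch of the difference, using a made-up frame with one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column A contains one NaN.
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [1, 1, 1]})

# Default dropna=True: NaN is ignored, so A has two distinct values.
without_nan = df.nunique()

# dropna=False: NaN is counted as its own value, so A has three.
with_nan = df.nunique(dropna=False)

print(without_nan)
print(with_nan)
```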

pandas.read_csv:

Common parameters of pandas.read_csv(...):

sep : str, default ‘,’

header : int, list of int, default ‘infer’. Row number(s) to use as the column names and as the start of the data. By default the column names are inferred: if no names are passed, the behavior is identical to header=0 and the column names are taken from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None.

names : array-like, optional. List of column names to use. Duplicates in this list are not allowed.

df=pd.read_csv('data/testA/totalExposureLog.out', sep='\t',names=['id','request_timestamp','position','uid','aid','imp_ad_size','bid','pctr','quality_ecpm','totalEcpm'])
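
The interaction between header and names is easy to get wrong, so here is a minimal sketch that reads the same tab-separated text three ways. The text and the column names a and b are invented for illustration; io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical tab-separated content whose first line is a header.
text = "id\tbid\n1\t10\n2\t20\n"

# No names passed: behaves like header=0, the first line becomes the column names.
inferred = pd.read_csv(io.StringIO(text), sep='\t')

# names passed: behaves like header=None, so the first line is kept as a data row.
named = pd.read_csv(io.StringIO(text), sep='\t', names=['a', 'b'])

# names plus header=0: the first line is consumed and replaced by the given names.
replaced = pd.read_csv(io.StringIO(text), sep='\t', names=['a', 'b'], header=0)

print(inferred.shape, named.shape, replaced.shape)
```

The totalExposureLog.out call above passes names without header, which suits a log file that has no header line of its own.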

pandas.DataFrame.sort_values:

DataFrame.sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
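# axis defaults to 0 ('index'): the rows are reordered and by refers to column labels. With axis=1 ('columns') the columns are reordered and by refers to index labels.
# by takes a label (str) or a list of labels naming what to sort by along the chosen axis.
# ascending defaults to True (ascending order); pass False to sort in descending order.
# kind selects the sorting algorithm: the default is quicksort; mergesort and heapsort can also be passed.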

df.sort_values(by='col1')
df.sort_values(by=['col1', 'col2'])
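
A short sketch of how ascending and na_position combine with a multi-column sort. The frame is invented for illustration; ascending accepts one boolean per entry in by.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in col1.
df = pd.DataFrame({'col1': ['A', 'A', 'B', np.nan],
                   'col2': [1, 2, 9, 8]})

# col1 ascending, then col2 descending within equal col1 values.
# NaN goes last by default (na_position='last').
by_two = df.sort_values(by=['col1', 'col2'], ascending=[True, False])

# Put the row with the missing col1 first instead.
nan_first = df.sort_values(by='col1', na_position='first')

print(by_two)
print(nan_first)
```

Both calls return a new DataFrame and leave df in its original order, since inplace defaults to False.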

pandas.DataFrame.astype:

DataFrame.astype(self, dtype, copy=True, errors='raise', **kwargs)
# dtype : data type, or dict of column name
# Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.dtypes

df.astype('int32').dtypes
df.astype({'col1': 'int32'}).dtypes
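
As a further sketch, the dict form can cast several columns to different types at once, and the default errors='raise' makes a failed cast raise an exception. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings in col1, floats in col2.
df = pd.DataFrame({'col1': ['1', '2'], 'col2': [3.0, 4.0]})

# Cast each column to its own target type; a new DataFrame is returned.
converted = df.astype({'col1': 'int64', 'col2': 'int32'})
print(converted.dtypes)

# With errors='raise' (the default), a value that cannot be cast raises.
try:
    pd.Series(['x']).astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```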

pandas.DataFrame.fillna:

DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
# fillna() fills NaN values and returns the filled result. To modify the original DataFrame instead, set inplace=True.
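
A minimal sketch of the three common ways to call fillna, using an invented frame: a single scalar, a per-column dict, and inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

# Scalar value: every NaN is replaced; df itself is left unchanged.
filled = df.fillna(0)

# Dict value: a different fill value per column.
per_column = df.fillna({'A': -1, 'B': 0})

# df still contains NaN at this point.
had_nan_before = bool(df.isna().any().any())

# inplace=True modifies df itself instead of producing a separate result.
df.fillna(0, inplace=True)

print(filled)
print(per_column)
print(df)
```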

posted @ 2019-08-27 20:17 合唱团abc
