Miscellaneous

Segment text with jieba directly from the command line

  python -m jieba -d ' ' allTrain.txt > train_contents.txt

Using Redis

  cmd1: redis-server.exe redis.windows.conf

  cmd2: redis-cli.exe -h 127.0.0.1 -p 6379

  scrapy-redis: copy the scrapy-redis package from the scrapy-redis src directory into the Scrapy project

redis

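  keys * : list all keys

  https://github.com/rmax/scrapy-redis

  type jobbole:requests : type of the value stored at the key

  zrange jobbole:requests 0 1 : the first two elements of the zset

  scard jobbole:dupefilter : number of elements in the set

  smembers jobbole:dupefilter : all members of the set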

Find the location of the MySQL data directory

  show global variables like "%datadir%"

Point TensorBoard at the directory that holds the TensorFlow summaries by running tensorboard --logdir=C:\redis\logs

  Once it prints TensorBoard 0.1.6 at http://DESKTOP-FIPG2GH:6006 (Press CTRL+C to quit), open localhost:6006 in a browser to view the model's GRAPHS and HISTOGRAMS.

jupyter

  'sha1:f0147912cfac:fe72a5a54b1bb234881e4fdc5d04419d70dc4e58'

Batch-rename the files in a folder on Linux (replace ext with the target extension)

  i=1; for x in *; do mv "$x" "$i.ext"; let i=i+1; done
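The same renaming can be sketched in Python, which avoids shell quoting problems with spaces in file names. The function name `rename_sequentially` and its parameters are made up for illustration, and the sketch assumes that no existing file already carries one of the target names.

```python
import os

def rename_sequentially(folder, ext, start=1):
    """Rename every regular file in `folder` to 1.ext, 2.ext, ... in sorted name order."""
    # Assumption: none of the target names already exists in the folder.
    names = sorted(
        n for n in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, n))
    )
    new_names = []
    for i, name in enumerate(names, start=start):
        new_name = "{0}.{1}".format(i, ext)
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        new_names.append(new_name)
    return new_names
```

Calling `rename_sequentially("some_folder", "txt")` returns the list of new file names in the order they were assigned.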

Delete a folder and everything under it

  rm -rf folder
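From Python, a roughly equivalent call is `shutil.rmtree`. The sketch below builds a throwaway temporary folder first so that it is safe to run; the folder layout is invented for the example.

```python
import os
import shutil
import tempfile

# Build a throwaway folder with a nested file so the example is safe to run.
folder = tempfile.mkdtemp()
os.makedirs(os.path.join(folder, "sub"))
with open(os.path.join(folder, "sub", "a.txt"), "w") as f:
    f.write("x")

# Remove the folder and everything under it.
# ignore_errors=True roughly mirrors -f: failures such as a missing path are ignored.
shutil.rmtree(folder, ignore_errors=True)
removed = not os.path.exists(folder)
```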

Python: replace the newlines in a string

  str.replace('\n',' ')
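Two details are easy to miss here: `replace` returns a new string rather than modifying the original, and it leaves a stray `\r` behind when the text uses Windows line endings. A small sketch with a made-up sample string:

```python
text = "first line\nsecond line\r\nthird line"

# str.replace returns a new string; `text` itself is left unchanged.
replaced = text.replace("\n", " ")

# splitlines() also splits on "\r\n" and "\r", so no stray "\r" is left behind.
joined = " ".join(text.splitlines())
```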

Processing data with re (regular expressions)

import re
import os

# Collect every JSON file in the current directory.
dir_list = [name for name in sorted(os.listdir()) if name.endswith('.json')]
print("JSON files: {0}".format(len(dir_list)))
path = '../pubmedData/'
if not os.path.exists(path):
    os.makedirs(path)

for file in dir_list:
    print("Processing: {0}".format(file))
    with open(file, 'r') as f:
        x = f.read()
    # One match per citation block; re.DOTALL lets '.' span newlines.
    cit_pubmed = re.findall(r'cit \{(.*?)Pubmed-entry', x, re.DOTALL)
    print("Total matches: {0}".format(len(cit_pubmed)))

    i = 0  # entries with a title block
    j = 0  # abstracts written to disk
    k = 0  # entries with exactly one ISSN
    set_title_list = []
    set_abstract_list = []
    set_issn_list = []
    issn_class = []
    for entry in cit_pubmed:
        # title
        title = re.findall(r'title \{(.*?)authors \{', entry, re.DOTALL)
        set_title_list.append(len(title))
        # Count the entry once, whether one or two title blocks matched.
        if len(title) in (1, 2):
            title = re.findall(r'name "(.*?)."', title[0], re.DOTALL)
            i += 1

        # issn
        issn = re.findall(r'issn "(.*?)",', entry, re.DOTALL)
        if len(issn) == 1:
            # abstract
            abstract = re.findall(r'abstract "(.*?).",', entry, re.DOTALL)
            if len(abstract) == 1:
                # Append the abstract as a single line to the file named after its ISSN.
                with open(path + issn[0] + '.txt', 'a') as f:
                    f.write(abstract[0].replace("\n", " ") + '\n')
                j += 1
            set_abstract_list.append(len(abstract))

            issn_class.append(issn[0])
            k += 1
        set_issn_list.append(len(issn))

    set_title_list = set(set_title_list)
    set_abstract_list = set(set_abstract_list)
    set_issn_list = set(set_issn_list)
    print("TITLE match counts seen: {0}, total: {1}".format(set_title_list, i))
    print("ABSTRACT match counts seen: {0}, total: {1}".format(set_abstract_list, j))
    print("ISSN match counts seen: {0}, total: {1}".format(set_issn_list, k))
    print("ISSN_CLASS: {0} classes".format(len(set(issn_class))))
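The script leans on two regex features: non-greedy groups (`.*?`) and `re.DOTALL`. The sketch below shows their effect on a made-up sample shaped like the text the patterns expect. The sample is not real PubMed data, and the abstract pattern here escapes the final period as `\.`, a small variation on the pattern used above.

```python
import re

# Made-up sample in the shape the patterns expect; not real PubMed data.
sample = '''cit {
  title { name "A short title." },
  authors { },
  issn "1234-5678",
}
abstract "First line
second line.",
Pubmed-entry'''

# Without re.DOTALL the dot stops at a newline, so the multi-line abstract never matches.
no_dotall = re.findall(r'abstract "(.*?)\.",', sample)

# With re.DOTALL the non-greedy group spans the line break and stops at the first '.",'.
with_dotall = re.findall(r'abstract "(.*?)\.",', sample, re.DOTALL)

# The ISSN sits on a single line, so it matches with or without the flag.
issn = re.findall(r'issn "(.*?)",', sample)
```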

numpy argsort()

import numpy as np
x = np.array([5, 4, 3, 2, 1])
y = x.argsort()
# output: array([4, 3, 2, 1, 0])
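`argsort()` returns indices rather than values, which is easier to see on an array that is not already in reverse order. A short sketch with illustrative values:

```python
import numpy as np

x = np.array([30, 10, 20])

# Indices that would sort x in ascending order.
order = x.argsort()

# Indexing with those indices yields the sorted values.
sorted_x = x[order]

# Negating the values gives the indices for descending order.
desc = (-x).argsort()
```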

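Get the indices of the five largest values in each row of an ndarray

import numpy as np
x = np.array([[5,4,3,2,1,7,8,9],[1,2,3,4,5,9,8,6]])
# Sort each row ascending, then walk backwards over the last five indices.
t = [row.argsort()[-1:-6:-1] for row in x]
#result [array([7, 6, 5, 0, 1]), array([5, 6, 7, 4, 3])]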
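For wide rows, a full sort per row does more work than needed. One alternative, sketched here as an option and not as what the snippet above does, is `np.argpartition`, which only separates the k largest entries and leaves the ordering of those k to a second, much smaller sort.

```python
import numpy as np

x = np.array([[5, 4, 3, 2, 1, 7, 8, 9],
              [1, 2, 3, 4, 5, 9, 8, 6]])
k = 5

# argpartition puts the indices of the k largest entries of each row
# in the last k columns, in no particular order.
top_unsorted = np.argpartition(x, -k, axis=1)[:, -k:]

# Look up the values behind those indices and order them, largest first.
top_values = np.take_along_axis(x, top_unsorted, axis=1)
order = np.argsort(top_values, axis=1)[:, ::-1]
top_sorted = np.take_along_axis(top_unsorted, order, axis=1)
```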

numpy.hstack() (horizontal): for a = array([1,2,3]) and b = array([4,5,6]), c = hstack((a, b)) gives array([1,2,3,4,5,6])

numpy.vstack() (vertical): for the same a and b, c = vstack((a, b)) gives array([[1,2,3],[4,5,6]])
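A runnable version of the two calls makes the difference in shape explicit. Note that both functions take the arrays as a single tuple argument.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# hstack joins the arrays end to end: one row of six elements.
h = np.hstack((a, b))

# vstack stacks them as rows: a 2 x 3 array.
v = np.vstack((a, b))
```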

Find the two least frequent values in a list

from collections import Counter
a = [1,2,3,4,2,3,4,5]
x = Counter(a).most_common()[-2:]
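When several values share the same count, `most_common()` keeps them in first-seen order, so which two land at the tail depends on the input order. Sorting explicitly by count and then by value, as in this sketch, makes the tie-breaking rule visible; the sort key is an illustrative choice.

```python
from collections import Counter

a = [1, 2, 3, 4, 2, 3, 4, 5]
counts = Counter(a)

# most_common() sorts by count, highest first; equal counts keep first-seen order.
least_two = counts.most_common()[-2:]

# Explicit ordering: lowest count first, ties broken by the smaller value.
least_two_sorted = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:2]
```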

Check the size of a folder

  du -h --max-depth=1 pubmedData

Check the size of a single file

  ls -sh 1932-6203.txt

List the ten largest files and directories under the current folder

  du -a | sort -n -r | head -n 10
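A Python sketch of the same idea, restricted to regular files, is shown below. The function name `largest_files` is hypothetical. It reports file sizes in bytes, whereas `du` reports disk usage in blocks, so the numbers will not match exactly.

```python
import heapq
import os

def largest_files(root=".", n=10):
    """Return up to n (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Skip files that vanish or cannot be read while walking.
                continue
    return heapq.nlargest(n, sizes)
```

`largest_files("pubmedData", n=10)` would then list the ten largest files under that folder, including those in subfolders.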

Python references

x = [1, 2, 3]
y = x          # y refers to the same list object as x
print(y)       # [1, 2, 3]
x.pop()
print(y)       # [1, 2]
x = [1, 2, 3]
y = x[:]       # slicing creates a new list (a shallow copy)
print(y)       # [1, 2, 3]
x.pop()
print(y)       # [1, 2, 3]
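A slice copy is shallow: it creates a new outer list, but nested lists are still shared. The sketch below contrasts it with `copy.deepcopy` on an illustrative nested list.

```python
import copy

x = [[1, 2], [3, 4]]
shallow = x[:]              # new outer list, same inner lists
deep = copy.deepcopy(x)     # new outer list and new inner lists

x[0].append(99)             # mutates an inner list shared with `shallow`
x.append([5])               # changes only the outer list of x
```

After these two operations `shallow` sees the `99` but not the appended `[5]`, while `deep` sees neither.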

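Every Python object carries two header fields: 1. a type designator, which identifies the object's type; 2. a reference counter, which is used to decide whether the object can be reclaimed.

Types belong to objects, not to variables. A Python variable simply refers to a particular object at a particular time: a = 123 (int), a = '123' (str), a = 1.23 (float).

Garbage collection of objects: with a = 123 (int), a = '123' (str), a = 1.23 (float), when a changes from referring to the int object 123 to referring to the str object '123', the int object 123 is reclaimed, provided nothing else still refers to it. The reclaimed space is automatically returned to the free memory pool.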
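Both points can be observed from Python itself. The sketch below is specific to CPython, where an object is freed as soon as its reference count drops to zero. The `Payload` class is a made-up stand-in, used because plain ints cannot be weakly referenced.

```python
import weakref

class Payload:
    """Hypothetical stand-in for any object; plain ints cannot be weakly referenced."""

a = Payload()
type_before = type(a).__name__        # the type belongs to the object
watcher = weakref.ref(a)              # observes the object without keeping it alive
alive_before = watcher() is not None

a = "123"                             # rebind the name; the Payload loses its last reference
type_after = type(a).__name__         # same name, different object, different type
alive_after = watcher() is not None   # False on CPython: the object was reclaimed at once
```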

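Recursively sum the elements of an arbitrarily nested list

# Named nested_sum so that the built-in sum() is not shadowed.
def nested_sum(l):
    total = 0
    for x in l:
        if not isinstance(x, list):
            total += x
        else:
            total += nested_sum(x)
    return total
nested_sum([[1,2,3],[1,[2]]])  # 9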
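Recursion depth is bounded by Python's recursion limit, so extremely deep nesting can raise `RecursionError`. An explicit stack gives the same result without recursing; the name `nested_sum_iterative` is chosen here only for illustration.

```python
def nested_sum_iterative(items):
    """Sum every number in an arbitrarily nested list using an explicit stack."""
    total = 0
    stack = [items]
    while stack:
        current = stack.pop()
        for x in current:
            if isinstance(x, list):
                # Defer nested lists instead of recursing into them.
                stack.append(x)
            else:
                total += x
    return total

result = nested_sum_iterative([[1, 2, 3], [1, [2]]])
```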

 

posted @ 2017-11-21 15:28  WangLC