ssh xxzhang@192.168.79.84
cd /home/xxzhang/data/Epigenome/Roadmap/
1、拆分原始bed文件,并重命名文件名
(base) [xxzhang@cu08 Roadmap]$ python /home/xxzhang/workplace/software/giggle/examples/rme/rename.py /home/xxzhang/workplace/software/giggle/examples/rme/states.txt /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt "Roadmap_hg38_127/*gz" "split/"
Traceback (most recent call last):
File "/home/xxzhang/workplace/software/giggle/examples/rme/rename.py", line 5, in <module>
from itertools import imap
ImportError: cannot import name 'imap'
主要的原因是imap函数在python2中,需要通过itertools加载的。但是在python3中,直接用map函数作为替代。
参考链接:https://mlog.club/article/2009306
将代码中的imap函数替换为map函数即可。并注释掉“from itertools import imap”这一行。
代码即可运行。
但是,我们又出现了新的问题,我们的这个候选的文件其实并不包含gz的压缩(也是因为之前,我把它们解压的缘故)。
我现在需要重新解压原始的压缩包(.tgz格式)。
tar zxvf all_hg38lift.mnemonics.bedFiles.tgz -C orig/
python /home/xxzhang/workplace/software/giggle/examples/rme/rename.py /home/xxzhang/workplace/software/giggle/examples/rme/states.txt /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt "orig/*gz" "split/"
这个代码可真是厉害啊
我们通过看下图,发现它将原始的组织的全部的bed文件拆分成了组织+染色质状态的bed文件。这样的话,我们就可以直接去看,候选的位点在什么组织什么染色质状态下富集,这是很有意思的角度。
而且是多线程处理,运行的速度极快。
这个代码我要学习一下。我还有很多要学的地方啊。自己要迭代升级了。
2、批量压缩bed文件
cd split/
ls *.bed |gargs -p 30 "bgzip {}" #这个应该批量的对其进行压缩
3、对序列进行排序
mkdir split_sort
/home/xxzhang/workplace/software/giggle/scripts/sort_bed "split/*gz" split_sort/ 30
这里也很值得学习,这里是对sort文件夹进行压缩备份
mkdir split_sort_raw
cd split_sort_raw/
cp ../split_sort/* .
ls *.bed.gz |gargs -p 30 "bgzip -d {}"
cd ..
4、构建索引
time giggle index -i "split_sort/*gz" -o split_sort_b -f -s
Indexed 56440237 intervals.
real 4m5.078s
user 3m9.876s
sys 0m8.330s
这一步真的让我战战兢兢,但是似乎到这里,是比较顺利的。
5、构建测试的query序列,与roadmap的索引进行比较
我相信真实的东西,是会非常简洁美丽的(这是我的核心价值观,如果你很费力的完成,说明用的方法不对,这是自然的基本的定律)。
(1)下载测试的数据集,并进行初步的筛选
wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1218nnn/GSM1218850/suppl/GSM1218850_MB135DMMD.peak.txt.gz
zcat GSM1218850_MB135DMMD.peak.txt.gz |awk '$8>=100' |bgzip -c >GSM1218850_MB135DMMD.peak.q100.bed.gz
(2)检索
time giggle search -i split_sort_b -q GSM1218850_MB135DMMD.peak.q100.bed.gz -s >GSM1218850_MB135DMMD.peak.q100.bed.gz.giggle.result
6、画图
python /home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py -s /home/xxzhang/workplace/software/giggle/examples/rme/states.txt -c /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt -i GSM1218850_MB135DMMD.peak.q100.bed.gz.giggle.result -o GSM1218850_MB135DMMD.peak.q100.bed.gz.3x11.pdf -n /home/xxzhang/workplace/software/giggle/examples/rme/new_groups.txt --x_size 3 --y_size 11 --stat combo --ablines 15,26,31,43,52,60,72,82,87,89,93,101,103,116,120,122,127 --state_names /home/xxzhang/workplace/software/giggle/examples/rme/short_states.txt --group_names /home/xxzhang/workplace/software/giggle/examples/rme/new_groups_names.txt
这里出现了一个报错。
Traceback (most recent call last):
File "/home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py", line 2, in
import matplotlib
ImportError: No module named matplotlib
后来尝试安装matplotlib。
pip install matplotlib
Collecting matplotlib
Downloading matplotlib-3.3.4-cp36-cp36m-manylinux1_x86_64.whl (11.5 MB)
|████████████████████████████████| 11.5 MB 1.1 MB/s
Requirement already satisfied: numpy>=1.15 in /home/xxzhang/miniconda3/lib/python3.6/site-packages (from matplotlib) (1.19.5)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3
Downloading pyparsing-3.0.9-py3-none-any.whl (98 kB)
|████████████████████████████████| 98 kB 3.9 MB/s
Collecting pillow>=6.2.0
Downloading Pillow-8.4.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
|████████████████████████████████| 3.1 MB 23.1 MB/s
Collecting cycler>=0.10
Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Requirement already satisfied: python-dateutil>=2.1 in /home/xxzhang/miniconda3/lib/python3.6/site-packages (from matplotlib) (2.8.1)
Collecting kiwisolver>=1.0.1
Downloading kiwisolver-1.3.1-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB)
|████████████████████████████████| 1.1 MB 80.8 MB/s
Requirement already satisfied: six>=1.5 in /home/xxzhang/miniconda3/lib/python3.6/site-packages (from python-dateutil>=2.1->matplotlib) (1.15.0)
Installing collected packages: pyparsing, pillow, kiwisolver, cycler, matplotlib
Successfully installed cycler-0.11.0 kiwisolver-1.3.1 matplotlib-3.3.4 pillow-8.4.0 pyparsing-3.0.9
似乎没啥用。
后来发现是numpy的版本不对。
>>> import matplotlib
RuntimeError: **module compiled against API version 0xc but this version of numpy is 0xb**
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxzhang/miniconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 174, in <module>
_check_versions()
File "/home/xxzhang/miniconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 159, in _check_versions
from . import ft2font
ImportError: numpy.core.multiarray failed to import
这个错误主要产生的原因,是numpy和matplotlib之间版本不匹配的问题。
解决的方法就是:
(1)卸载干净numpy
(2)重新安装numpy
(3)安装matplotlib
pip uninstall numpy
Found existing installation: numpy 1.16.4
Uninstalling numpy-1.16.4:
Would remove:
/home/xxzhang/miniconda3/bin/f2py
/home/xxzhang/miniconda3/bin/f2py3
/home/xxzhang/miniconda3/bin/f2py3.6
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy-1.16.4.dist-info/*
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/*
Would not remove (might be manually added):
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/_import_tools.py
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/add_newdocs.py
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray_tests.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/operand_flag_tests.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/struct_ufunc_test.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/test_rational.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/umath.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/core/umath_tests.cpython-36m-x86_64-linux-gnu.so
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/distutils/environment.py
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy/distutils/site.cfg
Proceed (y/n)? y
Successfully uninstalled numpy-1.16.4
pip uninstall numpy
Found existing installation: numpy 1.13.1
Uninstalling numpy-1.13.1:
Would remove:
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy
/home/xxzhang/miniconda3/lib/python3.6/site-packages/numpy-1.13.1-py3.6.egg-info
Proceed (y/n)? y
Successfully uninstalled numpy-1.13.1
pip uninstall numpy
WARNING: Skipping numpy as it is not installed.
pip install numpy==1.19.5 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Looking in indexes: http://pypi.douban.com/simple
Collecting numpy==1.19.5
Downloading http://pypi.doubanio.com/packages/14/32/d3fa649ad7ec0b82737b92fefd3c4dd376b0bb23730715124569f38f3a08/numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
|████████████████████████████████| 14.8 MB 6.7 MB/s
Installing collected packages: numpy
Successfully installed numpy-1.19.5
【参考链接】https://blog.csdn.net/road_of_god/article/details/121218570
测试bug是否解决
Python 3.6.13 |Anaconda, Inc.| (default, Feb 23 2021, 21:15:04)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib #无报错输出
接下来,重新加载测试代码画图。
当然还有一些其他的报错。
Traceback (most recent call last):
File "/home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py", line 269, in <module>
plt.yticks(np.arange(0.5,len(group_names)+0.5,1.0),[x.decode("utf8") for x in group_names], fontsize=10)
File "/home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py", line 269, in <listcomp>
plt.yticks(np.arange(0.5,len(group_names)+0.5,1.0),[x.decode("utf8") for x in group_names], fontsize=10)
AttributeError: 'str' object has no attribute 'decode'
这个同样也还是python2和python3的区别,解决方法是直接去掉“.decode("utf8")”即可。
plt.yticks(np.arange(0.5,len(group_names)+0.5,1.0),[x.decode("utf8") for x in group_names], fontsize=10)
plt.yticks(np.arange(0.5,len(group_names)+0.5,1.0),[x for x in group_names], fontsize=10)
终于最后得到了测试结果的heatmap的图。
7、用我们自己的文件进行测试。
在测试之前,需要了解人家的测试文件长什么样子。
zcat GSM1218850_MB135DMMD.peak.txt.gz |head
chr1 714140 714505 . 21 . -1 9.05234147781908 -1
chr1 766123 766513 . 40 . -1 20.6816682872462 -1
chr1 773683 774053 . 39 . -1 20.1453797317428 -1
chr1 785319 785660 . 25 . -1 12.6626688107994 -1
chr1 805180 805679 . 89 . -1 53.0106435758877 -1
chr1 821480 822377 . 66 . -1 37.4996160251318 -1
chr1 876990 877319 . 24 . -1 8.81398119324741 -1
chr1 894436 894717 . 17 . -1 6.04980472690903 -1
chr1 896632 896875 . 21 . -1 7.62492868913338 -1
chr1 934524 934836 . 15 . -1 5.26813111266616 -1
也不过是一个普通的peak文件,并且作者进一步将quality>100的peak,作为进一步的处理的对象。
接下来,我们想要加载我们的数据。
(1)所有的人特异性的序列
cp /home/xxzhang/workplace/project/giggle_workplace/new_trial/human_specific_repeat_uniq.bed.gz ./
time giggle search -i split_sort_b -q human_specific_repeat_uniq.bed.gz -s >human_specific_repeat_uniq.bed.gz.giggle.result
real 0m5.337s
user 0m3.662s
sys 0m0.715s
python /home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py -s /home/xxzhang/workplace/software/giggle/examples/rme/states.txt -c /home/xxzhang/workplace
/software/giggle/examples/rme/EDACC_NAME.txt -i human_specific_repeat_uniq.bed.gz.giggle.result -o human_specific_repeat_uniq.bed.gz.3x11.pdf -n /home/xxzhang/workplace/software/giggle/examples/rme/new_groups.txt --x_size 3 --y_size 11 --stat combo --ablines 15,26,31,43,52,60,72,82,87,89,93,101,103,116,120,122,127 --state_names /home/xxzhang/workplace/software/giggle/examples/rme/short_states.txt --group_names /home/xxzhang/workplace/software/giggle/examples/rme/new_groups_names.txt
-911.6124454198172 427.1105279465215
得到的结果就挺让人失望的。
(2)分四大家族来看结果
- LINE
cp /home/xxzhang/workplace/project/giggle_workplace/sub_family/LINE/Hs_LINE.bed.gz ./
time giggle search -i split_sort_b -q Hs_LINE.bed.gz -s >Hs_LINE.bed.gz.giggle.result
real 0m2.052s
user 0m1.268s
sys 0m0.332s
python /home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py -s /home/xxzhang/workplace/software/giggle/examples/rme/states.txt -c /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt -i Hs_LINE.bed.gz.giggle.result -o Hs_LINE.bed.gz.3x11.pdf -n /home/xxzhang/workplace/software/giggle/examples/rme/new_groups.txt --x_size 3 --y_size 11 --stat combo --ablines 15,26,31,43,52,60,72,82,87,89,93,101,103,116,120,122,127 --state_names /home/xxzhang/workplace/software/giggle/examples/rme/short_states.txt --group_names /home/xxzhang/workplace/software/giggle/examples/rme/new_groups_names.txt
-1145.4813572909884 403.57550238181346
- LTR
cp /home/xxzhang/workplace/project/giggle_workplace/sub_family/LTR/Hs_LTR.bed.gz ./
time giggle search -i split_sort_b -q Hs_LTR.bed.gz -s >Hs_LTR.bed.gz.giggle.result
real 0m0.654s
user 0m0.289s
sys 0m0.191s
python /home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py -s /home/xxzhang/workplace/software/giggle/examples/rme/states.txt -c /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt -i Hs_LTR.bed.gz.giggle.result -o Hs_LTR.bed.gz.3x11.pdf -n /home/xxzhang/workplace/software/giggle/examples/rme/new_groups.txt --x_size 3 --y_size 11 --stat combo --ablines 15,26,31,43,52,60,72,82,87,89,93,101,103,116,120,122,127 --state_names /home/xxzhang/workplace/software/giggle/examples/rme/short_states.txt --group_names /home/xxzhang/workplace/software/giggle/examples/rme/new_groups_names.txt
-93.71706066802582 295.22627402031566
- SINE
cp /home/xxzhang/workplace/project/giggle_workplace/sub_family/SINE/Hs_SINE.bed.gz ./
time giggle search -i split_sort_b -q Hs_SINE.bed.gz -s >Hs_SINE.bed.gz.giggle.result
real 0m3.344s
user 0m2.045s
sys 0m0.547s
python /home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py -s /home/xxzhang/workplace/software/giggle/examples/rme/states.txt -c /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt -i Hs_SINE.bed.gz.giggle.result -o Hs_SINE.bed.gz.3x11.pdf -n /home/xxzhang/workplace/software/giggle/examples/rme/new_groups.txt --x_size 3 --y_size 11 --stat combo --ablines 15,26,31,43,52,60,72,82,87,89,93,101,103,116,120,122,127 --state_names /home/xxzhang/workplace/software/giggle/examples/rme/short_states.txt --group_names /home/xxzhang/workplace/software/giggle/examples/rme/new_groups_names.txt
-353.71735915110963 174.10855052614107
- SVA
cp /home/xxzhang/workplace/project/giggle_workplace/sub_family/SVA/Hs_SVA.bed.gz ./
time giggle search -i split_sort_b -q Hs_SVA.bed.gz -s >Hs_SVA.bed.gz.giggle.result
real 0m1.071s
user 0m0.521s
sys 0m0.204s
python /home/xxzhang/workplace/software/giggle/scripts/giggle_heat_map.py -s /home/xxzhang/workplace/software/giggle/examples/rme/states.txt -c /home/xxzhang/workplace/software/giggle/examples/rme/EDACC_NAME.txt -i Hs_SVA.bed.gz.giggle.result -o Hs_SVA.bed.gz.3x11.pdf -n /home/xxzhang/workplace/software/giggle/examples/rme/new_groups.txt --x_size 3 --y_size 11 --stat combo --ablines 15,26,31,43,52,60,72,82,87,89,93,101,103,116,120,122,127 --state_names /home/xxzhang/workplace/software/giggle/examples/rme/short_states.txt --group_names /home/xxzhang/workplace/software/giggle/examples/rme/new_groups_names.txt
-395.7527136893095 937.5406208326722