pybedtools：在Python中使用BEDTools

I have a Python, I have a BEDTools.
Ah,pybedtools!

前言

pybedtools 是采用 Python 对BEDTools 软件的封装和扩展。为啥要用pybedtools ，而不是直接使用BEDTools 呢？当然是我们想在Python 使用 BEDTools 啦（😀）

Why pybedtools? 作者也是给了一个例子
找出在基因间隔区的SNPs, 然后获取离SNPs距离<5kb以内的基因名字。
若用pybedtools的话：

import pybedtools

snps = pybedtools.example_bedtool('snps.bed.gz')
genes = pybedtools.example_bedtool('hg19.gff')

intergenic_snps = snps.subtract(genes)                  
nearby = genes.closest(intergenic_snps, d=True, stream=True) 

for gene in nearby:             
    if int(gene[-1]) < 5000:    
        print(gene.name)

而若用shell 脚本的话,你可能需要这些代码：

snps=snps.bed.gz
genes=hg19.gff
intergenic_snps=/tmp/intergenic_snps

snp_fields=`zcat $snps | awk '(NR == 2){print NF; exit;}'`
gene_fields=9
distance_field=$(($gene_fields + $snp_fields + 1))

intersectBed -a $snps -b $genes -v > $intergenic_snps

closestBed -a $genes -b $intergenic_snps -d \
  awk '($'$distance_field' < 5000){print $9;}' \
  perl -ne 'm/[ID|Name|gene_id]=(.*?);/; print "$1\n"'

rm $intergenic_snps

通过比较，若用pybedtools，你只需要熟悉Python 和pybedtools。而用shell实现相同功能的话，你可能需要Perl, bash, awk 和bedtools。

对于我而言，主要还是因为使用pybedtools，可以让我全程使用Python代码得到和bedtools同样的结果。

使用

安装

通过conda

$ conda install --channel conda-forge --channel bioconda pybedtools

通过pip 安装

$ pip install pybedtools -i https://pypi.tuna.tsinghua.edu.cn/simple
注：不能在ｗｉｎｄｏｗ系统下安装

Create a BedTool

使用pybedtools ，第一步就是导入pybedtools 包和创建BedTool对象。BedTool对象封装了BedTools程序所有可用的程序功能，使它们可以更好的在Python中使用。所以，大多情况，我们用pybedtools ，即在操作BedTool对象。BedTool对象的创建可以使用interval file (BED, GFF, GTF, VCF, SAM, or BAM format)。
由文件创建BedTool对象：

# 其中'test.bed' 是文件路径。
test = pybedtools.BedTool('test.bed')

此外，也可以从字符串里创建：

s = '''
chrX  1  100
chrX 25  800
'''
## 由参数from_string控制
a = pybedtools.BedTool(s, from_string=True)

pybedtools 也提供了一些测试文件，下文将使用这些文件作为演示。

# list_example_files会列出示例文件
>>> pybedtools.list_example_files()

['164.gtf',
...
 'a.bed',
 'a.bed.gz',
 'b.bed',
 'bedpe.bed',
 'bedpe2.bed',
 'c.gff',
 'd.gff',
...
 'venn.b.bed',
 'venn.c.bed',
 'x.bam',
 'x.bed',
 'y.bam']

使用示例文件来创建BedTool对象

## 传入文件名'a.bed'即可
a = pybedtools.example_bedtool('a.bed')

查看前几行数据

>>> a.head()
chr1	1	100	feature1	0	+
 chr1	100	200	feature2	0	+
 chr1	150	500	feature3	0	-
 chr1	900	950	feature4	0	+

Intersections

BEDTools 常用来取交集，来看看在pybedtools怎样操作

a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
a_and_b = a.intersect(b)

>>> a.head()
chr1    1   100 feature1    0   +
chr1    100 200 feature2    0   +
chr1    150 500 feature3    0   -
chr1    900 950 feature4    0   +

>>> b.head()
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +

>>> a_and_b.head()
chr1    155 200 feature2    0   +
chr1    155 200 feature3    0   -
chr1    900 901 feature4    0   +

若是查看，是否重叠，和intersectBed 用法一样，添加参数u：

a_with_b = a.intersect(b, u=True)

>>> a_with_b.head()
chr1	100	200	feature2	0	+
 chr1	150	500	feature3	0	-
 chr1	900	950	feature4	0	+

运行intersect方法后，返回的是一个新的 BedTool 示例，其会将内容暂时保存在一个硬盘临时地址上。临时地址获取：

>>> a_and_b.fn
'/tmp/pybedtools.kfnxvuuz.tmp'
>>> a_with_b.fn
'/tmp/pybedtools.qre024y9.tmp'

Saving the results

BedTool.saveas() 方法可以复制BedTool实例指向的临时文件到一个新的地址。同时可以添加trackline，用于直接上传UCSC Genome Browser,，而不需要再次打开文件添加 trackline。

c = a_with_b.saveas('intersection-of-a-and-b.bed', trackline='track name="a and b"')
print(c.fn)
# intersection-of-a-and-b.bed

BedTool.saveas() 方法返回一个新的BedTool实例指向新的地址。

而BedTool.moveto()用于直接移动文件到一个新的地址，若你不需要添加trackline，文件又很大时，这个方法会很快。

d = a_with_b.moveto('another_location.bed')
print(d.fn)
# 'another_location.bed'

既然已经移动的了，也即老的文件，不存在了,若再次查看其内容会报错.

>>> a_with_b.head()
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_2075/544676037.py in <module>
.........

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pybedtools.4uqthcml.tmp'

Default arguments

BEDTools 使用中，我们会指定-a 和-b参数，而在之前的操作中，我们并没有指定。
当这是因为pybedtools给出了默认设置，进行方法调用的实例作为-a, 传入的另一个实例作为-b，即exons对应-a ("a.bed"), snps对应-b("b.bed")

import pybedtools
exons = pybedtools.example_bedtool('a.bed')
snps = pybedtools.example_bedtool('b.bed')
exons_with_snps = exons.intersect(snps, u=True)

$ intersectBed -a a.bed -b b.bed -u > tmpfile

上面两者是一样的。
不过也可以显示的指定a,b

# these all have identical results
x1 = exons.intersect(snps)
x2 = exons.intersect(a=exons.fn, b=snps.fn)
x3 = exons.intersect(b=snps.fn)
x4 = exons.intersect(snps, a=exons.fn)
x1 == x2 == x3 == x4
# True

Chaining methods together (pipe)

方法的链式调用，类似linux shell下的管道操作
例如：
a 和b查看交集与否，然后合并坐标区间，分开运行是这样

x1 = a.intersect(b, u=True)
x2 = x1.merge()

链式调用可以这样：

x3 = a.intersect(b, u=True).merge()

再来个例子

x4 = a.intersect(b, u=True).saveas('a-with-b.bed').merge().saveas('a-with-b-merged.bed')

Operator overloading

操作符重载，一个amazing 的方法！

x5 = a.intersect(b, u=True)
x6 = a + b

x5 == x6
# True

x7 = a.intersect(b, v=True)
x8 = a - b

x7 == x8
# True

即 + 操作符表示了 intersectBed 使用 -u 蚕食, - 操作符表示了 intersectBed 使用 -v 参数。

使用了操作符重载还是可以继续链式调用的

x9 = (a + b).merge()

Intervals

在pybedtools中，以Interval对象来表示BED，GFF，GTF或VCF文件中的一行数据。注上文及下文都是是在Python3版本进行演示（不会还有人用Python2吧）

还是继续创建一个BedTool对象作为例子，

a = pybedtools.example_bedtool('a.bed')
feature = a[0]
features = a[1:3]

BedTool 支持切片操作, 获取单行元素是一个Interval对象

>>> type(feature)
pybedtools.cbedtools.Interval

Common Interval attributes

打印Interval对象，会返回其所代表行数据。

>>> print(feature)
chr1	1	100	feature1	0	+

所有feature（指Interval），不管由什么文件类型创建而来BedTool，都具有chrom, start, stop, name, score, and strand 属性。其中，start, stop是整数，而其他所有其他（包括score）都是字符串。
不过，我们应该经常会有，只有chrom, start, stop 的数据，我们看看pybedtools如何处理其余缺失属性。

>>> feature2 = pybedtools.BedTool('chrX 500 1000', from_string=True)[0]
>>> print(feature2)
chrX	500	1000

貌似，print(feature2)打印的原始行数据。

>>> feature2.chrom
'chrX'
>>> feature2.start
500
>>> feature2.stop
1000 
>>> feature2.name
'.'
>>>feature2.score
'.'
>>>feature2.strand
'.'

缺失默认值是“.”.

Indexing into Interval objects

Interval 可以通过list 和 dict 的方式进行索引。

>>> feature2[:]
['chrX', '500', '1000']
>>> feature2[0]
'chrX'
>>> feature2["chrom"]
'chrX'

不过以字典方式进行索引有个好处，对缺失值可以自动返回默认值，而通过整数索引会报错

>>> feature2["name"]
'.'
>>> feature2[3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_2075/566460392.py in <module>
----> 1 feature2[3]

/opt/conda/lib/python3.9/site-packages/pybedtools/cbedtools.pyx in pybedtools.cbedtools.Interval.__getitem__()

IndexError: field index out of range

Fields

Interval 有一个 Interval.fields 属性，将原始行分割成一个list. 使用整数索引时，对这个fields list 进行索引。

>>> f = pybedtools.BedTool('chr1 1 100 asdf 0 + a b c d', from_string=True)[0]
>>> f.fields
['chr1', '1', '100', 'asdf', '0', '+', 'a', 'b', 'c', 'd']
>>> len(f.fields)
10

BED is 0-based, others are 1-based

多格式使用时，一个麻烦的事，BED文件的坐标系统不同于GFF/GTF/VCF文件.

BED 文件是0-based，（染色体上第一个位置为0），特征不包含stop 位置
GFF, GTF, and VCF 文件是 1-based （即染色体上第一个位置为1）特征包含stop位置

为了方便，pybedtools 做了这些约定：

Interval.start 是0-based start position，不管什么文件。即使是GFF,或者其他1-based feature。
使用len() 获取Interval的长度，总是返回Interval.stop - Interval.start.这样不管什么文件格式，都能保证长度正确。也简化了潜在代码。我们可以人为所有Intervals 是一样的
Interval.fields 内容全是字符串，是对原始行的分割。
- 所以对于GFF feature， Interval.fields[3] 和 Interval[3] 与 Interval.start 是不一样的。即Interval.fields[3] = Interval.start +1.因为Interval.fields[3]是原始的1-based 数据，而Interval.start 采用0-based

Worked example

下面举两个例子
创建一个GFF Interval

gff = ["chr1",
       "fake",
       "mRNA",
       "51",   # <- start is 1 greater than start for the BED feature below
       "300",
       ".",
       "+",
       ".",
       "ID=mRNA1;Parent=gene1;"]
gff = pybedtools.create_interval_from_list(gff)

创建一个和之前 gff 坐标一样的BED Interval。但它们的坐标的系统不一样。

bed = ["chr1",
       "50",
       "300",
       "mRNA1",
       ".",
       "+"]
bed = pybedtools.create_interval_from_list(bed)

确认下各自的文件格式，

>>> gff.file_type
'gff'
>>> bed.file_type
'bed'

各自的原始表示

>>> print(gff)
chr1	fake	mRNA	51	300	.	+	.	ID=mRNA1;Parent=gene1;
>>> print(bed)
chr1	50	300	mRNA1	.	+

>>> bed.start == gff.start == 50
True
>>> bed.end == gff.end == 300
300
>>> len(bed) == len(gff) == 250
True

GFF features have access to attributes

GFF and GTF files have lots of useful information in their attributes field (the last field in each line). 这些attributes 可以通过Interval.attrs访问。

>>> print(gff)
chr1	fake	mRNA	51	300	.	+	.	ID=mRNA1;Parent=gene1;
>>> gff.attrs
{'ID': 'mRNA1', 'Parent': 'gene1'}
>>> gff.attrs['Awesomeness'] = "99"
>>> gff.attrs['ID'] = 'transcript1'

可以直接修改attrs

gff.attrs['Awesomeness'] = "99"
gff.attrs['ID'] = 'transcript1'
print(gff.attrs)
# {'ID': 'transcript1', 'Parent': 'gene1', 'Awesomeness': '99'}
print(gff)
# chr1	fake	mRNA	51	300	.	+	.	ID=transcript1;Parent=gene1;Awesomeness=99;

Filtering

BedTool.filter() 可以对BedTool 对象进行过滤。传递一个函数，其接收的第一个参数是一个Interval. 返回True/False来进行过滤。

挑选>100bp的特征

a = pybedtools.example_bedtool('a.bed')
b = a.filter(lambda x: len(x) > 100)
print(b)
# chr1	150	500	feature3	0	-

也可以定义的更加通用。挑选长度由传入参数决定

def len_filter(feature, L):
    "Returns True if feature is longer than L"
    return len(feature) > L

a = pybedtools.example_bedtool('a.bed')
print(a.filter(len_filter, L=200))
# chr1        150     500     feature3        0       -

也是支持链式调用

a.intersect(b).filter(lambda x: len(x) < 100).merge()

Fast filtering functions in Cython

featurefuncs 模块里包含了一些用Cython实现的功能，相比用纯Python,速度大约提升70%左右。

from pybedtools.featurefuncs import greater_than

def L(x,width=100):
    return len(x) > 100

### Cython 的长度比较大于实现
test.filter(greater_than, 100)
### 纯Python的大于实现
test.filter(L, 100)

Each

相似BedTool.filter(), 作用函数应用于每个Interval,根据返回布尔值来判断保留与否。BedTool.each()也是将函数应用于每个Interval，但主要是对Interval进行修改。

a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')

## intersect" with c=True, 重复特征的计数
with_counts = a.intersect(b, c=True)

然后定义一个函数，进行计数的归一化，

def normalize_count(feature, scalar=0.001):
    """
    assume feature's last field is the count
    """
    counts = float(feature[-1])
    normalized = round(counts / (len(feature) * scalar), 2)

    # need to convert back to string to insert into feature
    feature.score = str(normalized)
    return feature

>>> with_counts.head()
chr1	1	100	feature1	0	+	0
 chr1	100	200	feature2	0	+	1
 chr1	150	500	feature3	0	-	1
 chr1	900	950	feature4	0	+	1

>>> normalized = with_counts.each(normalize_count)
>>> print(normalized)
chr1        1       100     feature1        0.0     +       0
chr1        100     200     feature2        10.0    +       1
chr1        150     500     feature3        2.86    -       1
chr1        900     950     feature4        20.0    +       1

链式调用：

a.intersect(b, c=True).each(normalize_count).filter(lambda x: float(x[4]) < 1e-5)

参考

https://daler.github.io/pybedtools/main.html

觉得有用的话扫码关注下~

posted @ 2022-01-03 16:46 何物昂阅读(3526) 评论(0) 收藏举报

刷新页面返回顶部

Loading

何物昂