A simple Python script for merging Chinese and English subtitle files

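Today I downloaded a few movies, and when I went looking for subtitles I could only find Chinese-only or English-only files, with no bilingual ones. I'm used to watching with bilingual Chinese-English subtitles, so a single language feels a bit off. Opening the two files in Notepad++ showed that an srt subtitle file is really just a simple format of time lines plus subtitle text, so I figured a short Python script could merge them. I'm sharing the code here for reference~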

Usage

Subtitle files are assumed to be in srt format by default; change the sub_type variable to use another text format.

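The Chinese and English subtitle files must share the same file name, with the Chinese file ending in Chs and the English file ending in Eng (before the extension), for example: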

The.Adventures.of.Tintin.丁丁历险记.S01E01.The.Crab.with.the.Golden.Claws.(Part1).金钳螃蟹贩毒集团(上).720p.BluRay.Chs.srt
The.Adventures.of.Tintin.丁丁历险记.S01E01.The.Crab.with.the.Golden.Claws.(Part1).金钳螃蟹贩毒集团(上).720p.BluRay.Eng.srt
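
The naming rule above can be sketched as a small helper. This is only an illustration: the function name english_counterpart is hypothetical and not part of the script, which instead calls file.replace('Chs', 'Eng') and therefore swaps every occurrence of Chs in the name. The sketch below only touches the trailing Chs.<sub_type> part.

```python
import re


def english_counterpart(chs_name, sub_type="srt"):
    """Return the expected English file name for a Chinese subtitle file.

    Returns None when chs_name does not end with 'Chs.<sub_type>'.
    (Hypothetical helper, written for illustration only.)
    """
    pattern = r"Chs\." + re.escape(sub_type) + r"$"
    if not re.search(pattern, chs_name):
        return None
    # Replace only the trailing 'Chs.<sub_type>', not other 'Chs' substrings.
    return re.sub(pattern, lambda m: "Eng." + sub_type, chs_name)


if __name__ == "__main__":
    print(english_counterpart("sub.Chs.srt"))
    print(english_counterpart("sub.Eng.srt"))
```

A file whose name does not match the convention yields None, so the caller can skip it.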

The subtitle files must follow a fixed format, for example:

3
00:01:46,691 --> 00:01:47,646
Did you get it?

4
00:01:48,372 --> 00:01:51,332
Yes. This is what to look for.

The text lines (lines 3 and 7 of the example above) may span multiple lines.
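
To make the format concrete, here is a minimal sketch that splits srt text into cues of (index, time line, text lines). It is not how the script itself works (the script operates on raw line numbers); parse_cues and TIME_LINE are hypothetical names, and the second cue in the sample has been split over two lines purely to show the multi-line case.

```python
import re

# A time line looks like: 00:01:46,691 --> 00:01:47,646
TIME_LINE = re.compile(r"^\d\d:\d\d:\d\d,\d+ --> \d\d:\d\d:\d\d,\d+")


def parse_cues(text):
    """Split srt text into (index, time_line, text_lines) tuples.

    Assumes cues are separated by blank lines and that each cue is
    an index line, a time line, then one or more text lines.
    """
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) >= 2 and TIME_LINE.match(lines[1]):
            cues.append((lines[0].strip(), lines[1].strip(), lines[2:]))
    return cues


sample = """3
00:01:46,691 --> 00:01:47,646
Did you get it?

4
00:01:48,372 --> 00:01:51,332
Yes. This is
what to look for.
"""

if __name__ == "__main__":
    for index, time_line, text_lines in parse_cues(sample):
        print(index, time_line, text_lines)
```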

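Subtitle files must be placed in the \merge directory. The script resolves this folder relative to the current working directory, so run it from the folder that contains merge.py. For example: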

folder
-- merge.py
--\merge
   --sub.Chs.srt
   --sub.Eng.srt
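
Before running the merge it can help to check that the layout above is in place. The sketch below lists the Chinese subtitle files found in a folder; find_chinese_subtitles is a hypothetical helper and is not part of merge.py.

```python
from pathlib import Path


def find_chinese_subtitles(folder, sub_type="srt"):
    """Return the sorted names of files in folder ending with 'Chs.<sub_type>'."""
    suffix = "Chs." + sub_type
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


if __name__ == "__main__":
    # Same location the script uses: ./merge under the working directory.
    merge_dir = Path.cwd() / "merge"
    if merge_dir.is_dir():
        print(find_chinese_subtitles(merge_dir))
    else:
        print("No merge folder found in the working directory:", merge_dir)
```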

Source code

import os
import re


def doublication_index(s1='00:00:01,000 --> 00:00:06,000', s2='00:00:01,000 --> 00:00:06,000'):  # measure how much two time ranges overlap
    time1 = []  # time1 holds the four timestamps in h:m:s,ms format
    time2 = []  # time2 holds the same timestamps converted to seconds
    time1.extend(s1.split(' --> '))
    time1.extend(s2.split(' --> '))
    for i in time1:
        split_time = i.split(':')
        calc_time = int(split_time[0]) * 3600 + int(split_time[1]) * 60 + int(split_time[2].split(',')[0]) + int(split_time[2].split(',')[1]) / 1000
        time2.append(calc_time)

    if time2[2] > time2[1] or time2[0] > time2[3]:
        return 0  # the ranges do not overlap, so the ratio is 0
    time2.sort()
    # shared duration divided by the span from the earliest start to the latest end
    return (time2[2] - time2[1]) / (time2[3] - time2[0])

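path = os.path.join(os.getcwd(), 'merge')  # the subtitle folder, ./merge
os.chdir(path)  # make ./merge the working directory so the bare file names used below resolve
sub_type = 'srt'  # subtitle file type, e.g. srt, txt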

for file in os.listdir(path):
    if re.search(r"Chs\." + sub_type + '$', file):  # look for the Chinese subtitle files first
        print('Processing file: ' + file)
        eng_file = file.replace('Chs', 'Eng')
        if os.path.isfile(eng_file):  # check that the matching English file exists
            with open(file, 'r') as f:
                chs = f.readlines()  # read the Chinese subtitle file
            with open(eng_file, 'r') as f:
                eng = f.readlines()  # read the English subtitle file

            chs_time, eng_time = [], []  # line numbers of the time lines in the Chinese and English files
            for i in range(len(chs)):
                if re.search(r'^\d\d:\d\d:\d\d', chs[i]):
                    chs_time.append(i)
            for i in range(len(eng)):
                if re.search(r'^\d\d:\d\d:\d\d', eng[i]):
                    eng_time.append(i)
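            # For each Chinese time line, find the English time line that matches it best
            for i in range(len(chs_time) - 1):
                for j in range(len(eng_time) - 1):
                    if doublication_index(chs[chs_time[i]], eng[eng_time[j]]) > 0.5:
                        # English text runs from the line after its time line up to the blank line before the next cue
                        for xh in range(eng_time[j] + 1, eng_time[j + 1]-2):
                            # append it to the last text line of the Chinese cue
                            chs[chs_time[i + 1] - 3] += eng[xh]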
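            # The loop above finds each cue's text via the next cue's time line, so the last cue is handled separately
            rows_chs, rows_eng = 0, 0
            # count the lines of the last cue, stopping at a blank line or at the end of the file
            while chs_time[-1] + rows_chs < len(chs) and chs[chs_time[-1] + rows_chs] != '\n':
                rows_chs += 1
            while eng_time[-1] + rows_eng < len(eng) and eng[eng_time[-1] + rows_eng] != '\n':
                rows_eng += 1
            if not chs[chs_time[-1] + rows_chs - 1].endswith('\n'):
                chs[chs_time[-1] + rows_chs - 1] += '\n'  # the file may end without a trailing newline
            for xh in range(1, rows_eng):
                chs[chs_time[-1] + rows_chs - 1] += eng[eng_time[-1] + xh]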

            # write the merged subtitles back to the Chinese file (this overwrites it)
            with open(file, 'w') as f:
                f.writelines(chs)
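
The matching rule in doublication_index is the shared duration of two time ranges divided by the span from the earliest start to the latest end; a pair of cues is merged when that ratio is above 0.5. The sketch below restates the same calculation in a standalone form so the numbers are easy to try out. The names to_seconds and overlap_ratio are hypothetical, and returning 1.0 for two identical zero-length ranges is my own assumption (the script would divide by zero in that case).

```python
def to_seconds(stamp):
    """Convert an srt timestamp such as '00:01:46,691' to seconds."""
    hms, ms = stamp.strip().split(",")
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s + int(ms) / 1000


def overlap_ratio(line1, line2):
    """Shared duration of two time lines divided by their combined span."""
    start1, end1 = (to_seconds(t) for t in line1.split(" --> "))
    start2, end2 = (to_seconds(t) for t in line2.split(" --> "))
    if start2 > end1 or start1 > end2:
        return 0.0  # the ranges do not overlap at all
    span = max(end1, end2) - min(start1, start2)
    if span == 0:
        return 1.0  # assumption: identical zero-length ranges count as a full match
    shared = min(end1, end2) - max(start1, start2)
    return shared / span


if __name__ == "__main__":
    chs_line = "00:01:46,700 --> 00:01:47,700"
    eng_line = "00:01:46,691 --> 00:01:47,646"
    ratio = overlap_ratio(chs_line, eng_line)
    print(ratio, "merge" if ratio > 0.5 else "skip")
```

Two ranges that each run for two seconds but are shifted by one second share one second out of a three-second span, a ratio of about 0.33, so they would not be merged.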

posted @ 2020-04-26 11:28 海淀区小吴同学
