Page content objects in PDF and using the MuPDF library

Goal: read the start and end points of the straight lines in a PDF, and the center and radius of the circles.

Structure of a PDF file

A PDF file consists of four parts: the header, the body, the cross-reference table, and the trailer.

Header:

%PDF-1.7

Trailer (the file ends with this marker):

%%EOF

Body:

//Objects: each object in the body is written in the following form
N 0 obj ... endobj

Cross-reference table:

//xref: the cross-reference table lists the byte offset and status (in use or free) of every object in the PDF file.
xref
0 10
0000000000 65535 f
0000000015 00000 n
0000000119 00000 n
0000000173 00000 n
0000000413 00000 n
0000000443 00000 n
0000000464 00000 n
0000000573 00000 n
0000000675 00000 n
0000000751 00000 n
11 6
0000000782 00000 n
0000000814 00000 n
0000000846 00000 n
0000001090 00000 n
0000001164 00000 n
0000001240 00000 n
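
To make the table layout concrete, here is a minimal sketch of how its two kinds of lines could be read. A subsection header such as "0 10" gives the first object number and the number of entries that follow; each entry gives a 10-digit offset, a 5-digit generation number, and "n" (in use) or "f" (free). In the table above, "0 10" covers objects 0 to 9 and "11 6" covers objects 11 to 16. The names XrefEntry, XrefSubsection, parseXrefEntry and parseXrefSubsection are made up for this illustration and are not part of MuPDF. The sketch splits on whitespace and does not validate the fixed 20-byte entry width that real files use.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One entry of a classic (non-stream) cross-reference table.
struct XrefEntry {
    std::uint64_t offset;  // byte offset for "n" entries (next free object number for "f" entries)
    unsigned generation;   // generation number
    bool inUse;            // true for "n", false for "f"
};

// Header of one xref subsection, e.g. "11 6".
struct XrefSubsection {
    unsigned firstObject;  // object number of the first entry
    unsigned count;        // number of entries that follow
};

// Parses a line such as "0000000015 00000 n". Returns false on any other shape.
bool parseXrefEntry(const std::string& line, XrefEntry& out) {
    std::istringstream in(line);
    std::uint64_t offset = 0;
    unsigned generation = 0;
    char kind = 0;
    if (!(in >> offset >> generation >> kind)) return false;
    if (kind != 'n' && kind != 'f') return false;
    out = XrefEntry{offset, generation, kind == 'n'};
    return true;
}

// Parses a subsection header such as "0 10". Returns false if anything
// other than exactly two numbers is present.
bool parseXrefSubsection(const std::string& line, XrefSubsection& out) {
    std::istringstream in(line);
    unsigned first = 0, count = 0;
    if (!(in >> first >> count)) return false;
    std::string rest;
    if (in >> rest) return false;  // trailing token: not a header line
    out = XrefSubsection{first, count};
    return true;
}
```

For example, parseXrefEntry("0000000119 00000 n", e) would report that the object lives at byte offset 119 and is in use.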

Here is a simple PDF file that contains just one straight line and one circle.

You can open the file directly in VS Code to look at its contents:

%PDF-1.7
%〕抛
1 0 obj
<</Names<</Dests 4 0 R >>/Outlines 5 0 R /Pages 2 0 R /Type/Catalog/PieceInfo 16 0 R >>
endobj
2 0 obj
<</Count 1/Kids[ 6 0 R ]/Type/Pages>>
endobj
3 0 obj
<</Author()/Comments()/Company()/CreationDate(D:20250206231954+08'00')/Creator(WPS PDF)/Keywords()/ModDate(D:20250207003347+08'00')/Producer(WPS PDF)/SourceModified(D:20250206231954+08'00')/Subject()/Title()/Trapped/False>>
endobj
4 0 obj
<</Names []>>
endobj
5 0 obj
<<>>
endobj
6 0 obj
<</Contents 13 0 R /MediaBox[ 0 0 595.276 841.89]/Parent 2 0 R /Type/Page/Resources 8 0 R >>
endobj
7 0 obj
<</Filter /FlateDecode /Length 29>>
stream
x�+�5T0�B]eab╣a⿶溗��<��
endstream
endobj
8 0 obj
<</ExtGState<</KSPE1 9 0 R /KSPE2 11 0 R /KSPE3 12 0 R >>>>
endobj
9 0 obj
<</CA 1/ca 1>>
endobj
11 0 obj
<</CA 1/ca 1>>
endobj
12 0 obj
<</CA 1/ca 1>>
endobj
13 0 obj
<</Length 175/Filter/FlateDecode>>stream
x湑�;A唟N�	`�G谺-喳'蝞beH`蛄�<n励阐轶8l`�=钪偋葎迆競4%�4琝�v糂�$I甔暽4/p�=荥鯦频⒆F�竊!Ni諴跇g腊燻T
c蘢�岯�9/扔cV辘��ufLE琡WrN
螮縉鑃u6o骍o€髏�'NWK|
endstream
endobj
14 0 obj
<</ICV(BB2AFF3995078B501AD3A46727A0D885_41)/TYPE(TEXT)>>
endobj
15 0 obj
<</DocICV 14 0 R /LastModified(D:20250207003347--16'00')>>
endobj
16 0 obj
<</WPS_ICV 15 0 R >>
endobj
xref
0 10
0000000000 65535 f
0000000015 00000 n
0000000119 00000 n
0000000173 00000 n
0000000413 00000 n
0000000443 00000 n
0000000464 00000 n
0000000573 00000 n
0000000675 00000 n
0000000751 00000 n
11 6
0000000782 00000 n
0000000814 00000 n
0000000846 00000 n
0000001090 00000 n
0000001164 00000 n
0000001240 00000 n
trailer
<</Root 1 0 R /Info 3 0 R /Size 17/ID[<DF8B2D478E7140429B03CF1A7423C082><66E4328EF960ABDCE331A483259B11FB>]>>
startxref
1278
%%EOF

Using the MuPDF library to decode the PDF stream

When the PDF is opened directly in VS Code, the stream section shows up as unreadable bytes like this:

<</Length 175/Filter/FlateDecode>>stream
x湑�;A唟N�	`�G谺-喳'蝞beH`蛄�<n励阐轶8l`�=钪偋葎迆競4%�4琝�v糂�$I甔暽4/p�=荥鯦频⒆F�竊!Ni諴跇g腊燻T
c蘢岯�9/扔cV辘��ufLE琡WrN
螮縉鑃u6o骍o€髏�'NWK|
endstream

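The dictionary << /Length 175 /Filter /FlateDecode >> tells us that the stream data, as stored in the file, is 175 bytes long and is compressed with FlateDecode (the zlib/deflate algorithm).

I chose the MuPDF library for this work. Running mutool clean -d input.pdf output.pdf decompresses the streams and writes the file back out.

Now open the output PDF in VS Code; the content stream is readable text: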

13 0 obj
<<
  /Length 384
>>
stream
q
0 0 0 rg
0 0 0 RG
/KSPE1 gs
1 0 0 1 0 0 cm
182.4594 707.3305 m
329.1907 720.4292 l
S
Q
q
0 0 0 rg
0 0 0 RG
/KSPE1 gs
1 0 0 1 0 0 cm
157.4077 578.1296 m
157.4077 615.9448 188.1182 646.6001 226.0016 646.6001 c
263.8850 646.6001 294.5956 615.9448 294.5956 578.1296 c
294.5956 540.3143 263.8850 509.6591 226.0016 509.6591 c
188.1182 509.6591 157.4077 540.3143 157.4077 578.1296 c
h
S
Q
endstream
endobj

Parsing the stream

In this stream, each line is one drawing instruction: its operands (if any) followed by an operator.

q / Q

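q saves the current graphics state.
Q restores the most recently saved graphics state.
They normally appear in pairs, so that color, transformation and other changes made between q and Q do not leak out to the surrounding content.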

0 0 0 rg / 0 0 0 RG

rg sets the fill color; here it is 0 0 0, i.e. black (RGB = (0,0,0)).
RG sets the stroke color, which is also black here.

1 0 0 1 0 0 cm

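This is a coordinate transformation matrix of the form a b c d e f cm; it changes the coordinate system used by the drawing operations that follow.
1 0 0 1 0 0 cm is the identity transformation (no scaling, no rotation, no translation), so it leaves the following instructions unchanged.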
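
For a stream whose cm is not the identity, the coordinates given to m, l and c have to be mapped through the current transformation matrix (CTM) before they can be compared with page coordinates. The sketch below shows one way to track that. It uses the PDF convention x' = a*x + c*y + e, y' = b*x + d*y + f, lets cm concatenate the new matrix in front of the current CTM, and lets q / Q push and pop the CTM. Matrix, Point and GraphicsState are names invented for this example (MuPDF has its own fz_matrix type), and only the CTM part of the graphics state is modeled.

```cpp
#include <vector>

struct Matrix { double a, b, c, d, e, f; };  // same order as the cm operands
struct Point { double x, y; };

// Applies m to p: x' = a*x + c*y + e, y' = b*x + d*y + f.
Point apply(const Matrix& m, Point p) {
    return { m.a * p.x + m.c * p.y + m.e,
             m.b * p.x + m.d * p.y + m.f };
}

// Returns the matrix that applies `first` and then `second`.
Matrix concat(const Matrix& first, const Matrix& second) {
    return { first.a * second.a + first.b * second.c,
             first.a * second.b + first.b * second.d,
             first.c * second.a + first.d * second.c,
             first.c * second.b + first.d * second.d,
             first.e * second.a + first.f * second.c + second.e,
             first.e * second.b + first.f * second.d + second.f };
}

// Tracks only the CTM part of the graphics state.
class GraphicsState {
    Matrix ctm{1, 0, 0, 1, 0, 0};  // starts as the identity
    std::vector<Matrix> saved;
public:
    void save() { saved.push_back(ctm); }              // q
    void restore() {                                   // Q
        if (saved.empty()) return;                     // unbalanced Q: ignore
        ctm = saved.back();
        saved.pop_back();
    }
    void cm(const Matrix& m) { ctm = concat(m, ctm); } // new matrix is applied first
    Point toPage(Point p) const { return apply(ctm, p); }
};
```

With the identity matrix from the sample stream, toPage returns every point unchanged, which is why the sample coordinates can be read off directly.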

m / l / c / h / S

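m (move to): move to the given coordinates without drawing; this starts a new subpath.
l (line to): add a straight line from the current point to the given coordinates.
c (curve to): add a cubic Bezier curve; it takes three coordinate pairs (two control points and the end point).
h: close the subpath (join the current point to the subpath's start point with a straight line).
S: stroke the path built so far.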

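Applying these rules:

The line is drawn by 182.4594 707.3305 m (move to the start point), 329.1907 720.4292 l (line to the end point), and then S to stroke it.
The circle is drawn by four c (curve) operators whose control points trace a closed path approximating a circle (or an ellipse), followed by S to stroke it.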
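
As a quick sanity check that the four c segments really approximate a circle, the sketch below evaluates the first segment at its midpoint and measures the distance to the center. The coordinates are copied from the stream above, and the center (226.0016, 578.1296) is taken as the mean of the four segment end points. Pt, cubicBezier, dist and sampleMidArcDistance are illustrative names, not MuPDF functions.

```cpp
#include <cmath>

struct Pt { double x, y; };

// Evaluates B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3.
Pt cubicBezier(Pt p0, Pt p1, Pt p2, Pt p3, double t) {
    const double s = 1.0 - t;
    const double b0 = s * s * s;
    const double b1 = 3 * s * s * t;
    const double b2 = 3 * s * t * t;
    const double b3 = t * t * t;
    return { b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
             b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y };
}

// Euclidean distance between two points.
double dist(Pt a, Pt b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Distance from the sample circle's center to the midpoint (t = 0.5)
// of its first "c" segment.
double sampleMidArcDistance() {
    const Pt p0{157.4077, 578.1296};  // current point set by "m"
    const Pt p1{157.4077, 615.9448};  // first control point
    const Pt p2{188.1182, 646.6001};  // second control point
    const Pt p3{226.0016, 646.6001};  // end point
    const Pt center{226.0016, 578.1296};
    return dist(cubicBezier(p0, p1, p2, p3, 0.5), center);
}
```

The midpoint comes out at about 68.5 units from the center, in line with the end points, which lie about 68.59 (horizontally) and 68.47 (vertically) away. So the shape is a circle to within a small fraction of a unit.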

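How do we get the line and circle information in code?

Collect every segment of the path (straight lines and cubic Bezier curves).
An "h" marks the current subpath as closed and a new "m" ends it; once a subpath is complete, check whether it could be a circle.
If it is a circle, work out its center and radius.
Other operators (rg, gs, cm and so on) can be ignored or handled separately later, so that color or matrix numbers are not mistaken for coordinates.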

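Core idea

Initialize the MuPDF context: create an fz_context for all later PDF processing.
Open the PDF document: load the document at the given path and get its page count.
Iterate over the pages: load each page in turn.
Extract the content stream: get the page's content object and, if it is a stream (pdf_is_stream), load the stream data.
Parse the content stream: pass the stream data as a string to parse_stream_content, which extracts the lines and circles on the page.
Print the result: finally print the extracted lines and circles, including coordinates and radii.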

Full code

#include <mupdf/fitz.h>
#include <mupdf/pdf.h>
#include <iostream>
#include <vector>
#include <cmath>
#include <string>
#include <sstream>

// A straight line segment
struct Line{
    float x1,y1,x2,y2;
};

// A circle
struct Circle{
    float cx,cy,radius;
};

// Kind of path segment
enum SegmentType{
    SEG_LINE,
    SEG_CUBIC,
};

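// Data for one segment
struct Segment{
    SegmentType type;
    // A line only needs its end point (x3, y3); its start is the current point,
    // i.e. the previous segment's end point or the position set by "m".
    // A cubic Bezier stores two control points (x1,y1), (x2,y2) and its end point (x3,y3).
    float x1=0,y1=0,x2=0,y2=0,x3=0,y3=0;
};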

// A subpath: from one "m" up to the next "m" (an "h" in between marks it as closed)
struct SubPath{
    float start_x,start_y;
    bool closed =false;
    std::vector<Segment> segments;
};

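// A deliberately simple circle test
bool detectCircle(const SubPath &path,float &cx,float &cy,float &r){
    // The subpath must be closed
    if(!path.closed) return false;
    // Only the common "four cubic Bezier segments" construction is recognized; other ways of drawing a circle are not handled
    if(path.segments.size()!=4) return false;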
    for(auto &seg:path.segments){
        if(seg.type!=SEG_CUBIC) return false;
    }

    // Check that the last segment ends back at the first segment's start (i.e. start_x, start_y)
    auto &lastSeg=path.segments.back();
    float end_x=lastSeg.x3;
    float end_y=lastSeg.y3;
    float dx=end_x - path.start_x;
    float dy=end_y - path.start_y;
    if(std::sqrt(dx*dx+dy*dy)>1.0f){
        return false;
    }

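    // Estimate the center as the mean of the four segment end points. This is only accurate for the
    // standard construction, where the end points are spaced 90 degrees apart.
    // A least-squares fit or another more robust method may be needed later.
    std::vector<std::pair<float,float>> points;
    // The start point coincides with the last end point (checked above), so it is not added again.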
    for(auto &seg:path.segments){
        points.push_back({seg.x3,seg.y3});
    }

    // Mean of the coordinates
    float sumx=0,sumy=0;
    for(auto &p:points){
        sumx+=p.first;
        sumy+=p.second;
    }

    cx=sumx/points.size();
    cy=sumy/points.size();

    // Mean radius
    float sumr=0;
    for(auto &p:points){
        float ddx=p.first-cx;
        float ddy=p.second-cy;
        sumr +=std::sqrt(ddx*ddx+ddy*ddy);
    }

    r=sumr/points.size();

    // Reject the shape if any end point deviates too much from the mean radius
    float max_dev = 0.05f * r; // allow 5% deviation
    for (auto &p : points) {
        float ddx = p.first - cx;
        float ddy = p.second - cy;
        float dist = std::sqrt(ddx*ddx + ddy*ddy);
        if (std::fabs(dist - r) > max_dev) {
            return false;
        }
    }
    return true;
}

// Parse a content stream and extract lines and circles
void parse_stream_content(const std::string& content,
                          std::vector<Line>& lines,
                          std::vector<Circle>& circles)
{
    std::istringstream stream(content);
    std::string token;
    std::vector<float> points;

    // Collect the drawing operators into subpaths
    std::vector<SubPath> subpaths;
    SubPath* currentPath=nullptr;

    // Finish the current subpath and store it in subpaths
    auto closeCurrentPath=[&](bool forcedClose){
        if(currentPath){
            subpaths.push_back(*currentPath);
            delete currentPath;
            currentPath=nullptr;
        }
    };

    bool debug = true;

    while(stream>>token){
        if(debug){
            std::cout<<"Token:"<<token<<std::endl;
        }
        // "m": finish the previous subpath first
        if(token=="m"){
            if(points.size()>=2){
                float mx=points[0];
                float my=points[1];
                
                // Finish the previous subpath first
                closeCurrentPath(true);

                // Start a new SubPath and record where "m" put the current point
                currentPath=new SubPath();
                currentPath->start_x=mx;
                currentPath->start_y=my;
            }
            // Always clear the operands here so they are not picked up by the next operator
            points.clear();
        }
        else if(token =="l"){
            if(currentPath&&points.size()>=2){
                float x=points[0];
                float y=points[1];
                // Record a line segment
                Segment seg;
                seg.type=SEG_LINE;
                // A line only needs its end point
                seg.x3=x;
                seg.y3=y;
                currentPath->segments.push_back(seg);
            }
            points.clear();
        }else if(token=="c"){// "c": store both control points and the end point
            if(currentPath&&points.size()>=6){
                float x1 = points[0], y1 = points[1];
                float x2 = points[2], y2 = points[3];
                float x3 = points[4], y3 = points[5];
                // Record a cubic Bezier segment
                Segment seg;
                seg.type = SEG_CUBIC;
                seg.x1 = x1;
                seg.y1 = y1;
                seg.x2 = x2;
                seg.y2 = y2;
                seg.x3 = x3;
                seg.y3 = y3;
                currentPath->segments.push_back(seg);  
            }
            points.clear();
        }else if(token=="h"){
            if(currentPath){
                currentPath->closed=true;
            }
            points.clear();
        }else{
            // Anything else is either a number (an operand) or another operator/name (rg, RG, cm, ...); a non-number clears points
            try {
                float val = std::stof(token);
                points.push_back(val);
            }
            catch (...) {
                points.clear();
            }
        }
    }
    closeCurrentPath(true);

    // Now classify each subpath and extract circles and lines
    for(auto &sp:subpaths){
        float cx,cy,r;
        // Try the circle test first; it fills in the center and radius
        if(detectCircle(sp,cx,cy,r)){
            circles.push_back({cx,cy,r});
            if (debug) {
                std::cout << "Detected circle: center=(" << cx << "," << cy
                          << "), radius=" << r << std::endl;
            }
        }else{
            // Not a circle: collect its straight segments
            float curx = sp.start_x;
            float cury = sp.start_y;
            for (auto &seg : sp.segments) {
                if (seg.type == SEG_LINE) {
                    float dx = seg.x3 - curx;
                    float dy = seg.y3 - cury;
                    float length = std::sqrt(dx*dx + dy*dy);
                    if (length > 10.0f) { // segments of 10 units or shorter are ignored
                        lines.push_back({ curx, cury, seg.x3, seg.y3 });
                    }
                    // Update the current point
                    curx = seg.x3;
                    cury = seg.y3;
                }
                else if (seg.type == SEG_CUBIC) {
                    curx = seg.x3;
                    cury = seg.y3;
                }
            }
        }
    }
}



void extract_shapes_from_pdf(const char* filename) {
    fz_context* ctx = fz_new_context(nullptr, nullptr, FZ_STORE_UNLIMITED);
    if(!ctx){
        std::cerr << "Failed to create MuPDF context" << std::endl;
        return; 
    }
    std::vector<Line> lines;
    std::vector<Circle> circles;

    fz_try(ctx){
        fz_register_document_handlers(ctx);
        fz_document* doc = fz_open_document(ctx, filename);
        if (!doc) {
            fz_throw(ctx, FZ_ERROR_GENERIC, "cannot open PDF document: %s", filename);
        }

        int page_count=fz_count_pages(ctx,doc);
        std::cout << "PDF page count: " << page_count << std::endl;

        for (int i = 0; i < page_count; ++i) {
            fz_page* page =fz_load_page(ctx,doc,i);
            // Convert to a PDF page
            pdf_page* pdf_pg = pdf_page_from_fz_page(ctx, page);
            if (!pdf_pg) {
                std::cerr << "Not a valid PDF page" << std::endl;
                fz_drop_page(ctx, page);
                continue;
            }
            // Get the page's content object
            pdf_obj* contents = pdf_page_contents(ctx, pdf_pg);
            if (!contents) {
                std::cerr << "Cannot get page contents" << std::endl;
                fz_drop_page(ctx, page);
                continue;
            }
            // /Contents may also be an array of streams; only the single-stream case is handled here.
            if (pdf_is_stream(ctx, contents)) {
                fz_buffer* buffer = pdf_load_stream(ctx, contents);
                std::string content(reinterpret_cast<char*>(buffer->data), buffer->len);
                std::cout << content << std::endl;

                parse_stream_content(content, lines, circles);

                fz_drop_buffer(ctx, buffer);
            }
            fz_drop_page(ctx, page);
        }
        fz_drop_document(ctx, doc);
        // Print the results
        std::cout << "Extracted lines:" << std::endl;
        for (const auto& line : lines) {
            std::cout << "  start (" << line.x1 << ", " << line.y1 << "), end (" << line.x2 << ", " << line.y2 << ")" << std::endl;
        }

        std::cout << "Extracted circles:" << std::endl;
        for (const auto& circle : circles) {
            std::cout << "  center (" << circle.cx << ", " << circle.cy << "), radius " << circle.radius << std::endl;
        }
    }  
    fz_catch(ctx) {
        std::cerr << "Error: " << fz_caught_message(ctx) << std::endl;
    }
    fz_drop_context(ctx);

}

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <PDF file>" << std::endl;
        return 1;
    }
    
    extract_shapes_from_pdf(argv[1]);
    return 0;
}
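
The comments in detectCircle note that a least-squares fit might be needed later. As one possible direction, here is a sketch of an algebraic least-squares circle fit: the points are first shifted so that their mean is the origin, and a 2x2 linear system is then solved for the center. This is a swapped-in technique, not what the code above does. FittedCircle and fitCircle are hypothetical names, and the 1e-12 degeneracy threshold is an arbitrary choice that would need tuning to the coordinate scale.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct FittedCircle { double cx, cy, r; };

// Algebraic least-squares circle fit.
// Returns false for fewer than 3 points or (nearly) collinear points.
bool fitCircle(const std::vector<std::pair<double, double>>& pts, FittedCircle& out) {
    const std::size_t n = pts.size();
    if (n < 3) return false;

    // Shift the points so that their mean becomes the origin.
    double mx = 0, my = 0;
    for (const auto& p : pts) { mx += p.first; my += p.second; }
    mx /= n;
    my /= n;

    // Moments of the shifted coordinates (u, v).
    double suu = 0, suv = 0, svv = 0, suuu = 0, svvv = 0, suvv = 0, suuv = 0;
    for (const auto& p : pts) {
        const double u = p.first - mx, v = p.second - my;
        suu += u * u;  suv += u * v;  svv += v * v;
        suuu += u * u * u;  svvv += v * v * v;
        suvv += u * v * v;  suuv += u * u * v;
    }

    // Solve  suu*uc + suv*vc = (suuu + suvv)/2
    //        suv*uc + svv*vc = (svvv + suuv)/2   by Cramer's rule.
    const double det = suu * svv - suv * suv;
    if (std::fabs(det) < 1e-12) return false;  // degenerate: points on a line
    const double bu = 0.5 * (suuu + suvv);
    const double bv = 0.5 * (svvv + suuv);
    const double uc = (bu * svv - bv * suv) / det;
    const double vc = (suu * bv - suv * bu) / det;

    out.cx = uc + mx;
    out.cy = vc + my;
    out.r = std::sqrt(uc * uc + vc * vc + (suu + svv) / n);
    return true;
}
```

Unlike averaging the four end points, a fit like this does not assume the points are evenly spaced around the circle, so it could also be fed extra points sampled along each Bezier segment.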