域名已变更 请手动修改文章中域名指向carlzeng.com

How to extract text from PDF(Image) files, OCR

Background: below is SS1.0 as example since it came from NetSuite email plugin, SS2.0 is the same thing.

1. Registry a API key throw https://ocr.space/OCRAPI

There are limitations for Free Plan

2. Save the email attachment(PDF file) to NetSuite FileCabinet, set it to available without login, get the full url address, encode it.

var importFile = attachments[indexAtt];importFile.setIsOnline(true);
var intFileId = nlapiSubmitFile(importFile);
var strInvFileUrl = "https://" + nlapiGetContext().getCompany() + ".app.netsuite.com"+ objInvoiceFileRec.getURL();
strInvFileUrl = encodeURIComponent(strInvFileUrl);

 

3. Send Request to https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url=

var response = nlapiRequestURL(strReqUrl, null, a);
There are varience of parameters for this API, in my case, it's invoice formated as table, that's why I send isTable=true to identify it; then it will help me to locate the expected cell and values.


4. Got and parsed the Response, we will get the Text messages on the PDF or Images.

var arrParsedLines = (objOcrRes['ParsedResults'] && objOcrRes['ParsedResults'][0]) ? objOcrRes['ParsedResults'][0]['TextOverlay']['Lines']: null;
var objVndBillData = parseDataFromInvPdf(arrParsedLines);

 

posted @   CarlZeng  阅读(215)  评论(0编辑  收藏  举报
编辑推荐:
· go语言实现终端里的倒计时
· 如何编写易于单元测试的代码
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
阅读排行:
· 分享一个免费、快速、无限量使用的满血 DeepSeek R1 模型,支持深度思考和联网搜索!
· 基于 Docker 搭建 FRP 内网穿透开源项目(很简单哒)
· ollama系列01:轻松3步本地部署deepseek,普通电脑可用
· 按钮权限的设计及实现
· 25岁的心里话
历史上的今天:
2014-11-25 财务报表 > 现金流表的直接法,间接法,Cash Flow from Operating Activites
2009-11-25 如何在网络包传输中捕获cookie
域名已变更 请手动修改文章中域名指向carlzeng.com
点击右上角即可分享
微信分享提示