How to extract text from PDF (image) files with OCR

Background: the example below uses SuiteScript 1.0 (SS1.0) because it came from a NetSuite email plug-in; the same approach works in SS2.0.

1. Register an API key through https://ocr.space/OCRAPI

Note that the Free Plan has limitations.

2. Save the email attachment (PDF file) to the NetSuite File Cabinet, set it to be available without login, get its full URL, and encode it.

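var importFile = attachments[indexAtt];
importFile.setIsOnline(true); // mark the file as "Available Without Login"
var intFileId = nlapiSubmitFile(importFile);
// Reload the saved file to read its URL (this step is missing from the original snippet; nlapiLoadFile is assumed)
var objInvoiceFileRec = nlapiLoadFile(intFileId);
var strInvFileUrl = "https://" + nlapiGetContext().getCompany() + ".app.netsuite.com" + objInvoiceFileRec.getURL();
strInvFileUrl = encodeURIComponent(strInvFileUrl);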
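
The URL-building part of this step can be pulled out into a small helper, which makes it easier to test outside NetSuite. This is only a sketch: buildPublicFileUrl is a hypothetical name, the sample account ID and file path are made up, and the lowercase/hyphen normalization is an assumption about how account-specific domains are written (for example for a sandbox ID such as 123456_SB1), so check it against your own account.

```javascript
// Hypothetical helper, not part of the original script.
// accountId:   the value returned by nlapiGetContext().getCompany()
// relativeUrl: the value returned by nlobjFile.getURL()
function buildPublicFileUrl(accountId, relativeUrl) {
    // Assumption: the host name is lowercase and uses "-" where the account ID has "_".
    var host = String(accountId).toLowerCase().replace(/_/g, "-") + ".app.netsuite.com";
    return "https://" + host + relativeUrl;
}

// Made-up sample values for illustration only.
var fullUrl = buildPublicFileUrl("123456_SB1", "/core/media/media.nl?id=1&c=2&h=abc");

// Encode the whole URL so it can travel as a single query-string value.
var encodedUrl = encodeURIComponent(fullUrl);
```

Encoding the complete URL matters here because the file URL carries its own "?" and "&" characters, which would otherwise be read as parameters of the OCR request.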

3. Send a request to https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url= with the encoded file URL appended.

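var strReqUrl = "https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url=" + strInvFileUrl;
// The third argument of nlapiRequestURL is the request headers; "a" is not shown in this snippet.
var response = nlapiRequestURL(strReqUrl, null, a);

This API accepts a variety of parameters. In my case the invoice is formatted as a table, so I send isTable=true to tell the API so; this helps me locate the expected cells and values.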
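
Since the set of parameters can vary, it may be convenient to assemble the request URL from a plain object instead of a hand-written string. The following is a minimal sketch: buildOcrRequestUrl is a hypothetical helper name, and example.com stands in for the real file URL.

```javascript
// Hypothetical helper: builds the OCR request URL from an API key,
// an already-encoded file URL, and an object of extra parameters.
function buildOcrRequestUrl(apiKey, encodedFileUrl, params) {
    var base = "https://api.ocr.space/parse/imageurl";
    var parts = ["apikey=" + encodeURIComponent(apiKey)];
    for (var name in params) {
        if (Object.prototype.hasOwnProperty.call(params, name)) {
            parts.push(encodeURIComponent(name) + "=" + encodeURIComponent(String(params[name])));
        }
    }
    // The file URL was encoded in step 2, so it is appended as-is.
    parts.push("url=" + encodedFileUrl);
    return base + "?" + parts.join("&");
}

// Placeholder file URL for illustration only.
var exampleReqUrl = buildOcrRequestUrl(
    "abcAPIKEYabc",
    encodeURIComponent("https://example.com/invoice.pdf"),
    { filetype: "PDF", isTable: true }
);
```

Keeping the file URL as the last parameter mirrors the request format shown in step 3.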

4. Get and parse the response to obtain the text found in the PDF or image.

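// Parse the JSON body of the response from step 3.
var objOcrRes = JSON.parse(response.getBody());
var arrParsedLines = (objOcrRes['ParsedResults'] && objOcrRes['ParsedResults'][0]) ? objOcrRes['ParsedResults'][0]['TextOverlay']['Lines'] : null;
// parseDataFromInvPdf is a custom helper (not shown here) that picks the invoice fields out of the parsed lines.
var objVndBillData = parseDataFromInvPdf(arrParsedLines);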
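
The body of parseDataFromInvPdf is not shown in this post. As a rough idea of what such a helper might do, the sketch below turns the parsed lines into plain strings and looks up a value by its label. The function names are hypothetical, the sample response is made up, and the field names (IsErroredOnProcessing, Words, WordText) reflect my understanding of the response shape, so verify them against a real response before relying on them.

```javascript
// Returns the text of each OCR line, or an empty array when nothing usable came back.
// Field names are assumptions about the response shape.
function extractLineTexts(objOcrRes) {
    if (!objOcrRes || objOcrRes.IsErroredOnProcessing) { return []; }
    var first = objOcrRes.ParsedResults && objOcrRes.ParsedResults[0];
    var lines = first && first.TextOverlay && first.TextOverlay.Lines;
    if (!lines) { return []; }
    var texts = [];
    for (var i = 0; i < lines.length; i++) {
        var words = lines[i].Words || [];
        var pieces = [];
        for (var j = 0; j < words.length; j++) { pieces.push(words[j].WordText); }
        texts.push(pieces.join(" "));
    }
    return texts;
}

// Hypothetical lookup: returns the rest of the first line that starts with the label.
function findValueAfterLabel(lineTexts, label) {
    for (var i = 0; i < lineTexts.length; i++) {
        if (lineTexts[i].indexOf(label) === 0) {
            // Drop the label, then any colon or spaces that follow it.
            return lineTexts[i].substring(label.length).replace(/^[\s:]+/, "");
        }
    }
    return null;
}

// Made-up sample data for illustration only.
var sampleResponse = {
    IsErroredOnProcessing: false,
    ParsedResults: [{ TextOverlay: { Lines: [
        { Words: [{ WordText: "Invoice" }, { WordText: "No:" }, { WordText: "INV-001" }] },
        { Words: [{ WordText: "Total" }, { WordText: "150.00" }] }
    ] } }]
};
var lineTexts = extractLineTexts(sampleResponse);
var invoiceNo = findValueAfterLabel(lineTexts, "Invoice No");
```

Returning an empty array or null instead of throwing keeps a scheduled script or email plug-in running when a single attachment cannot be read.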

posted @ 2020-11-25 10:43 CarlZeng
