Data mining in PDF

https://towardsdatascience.com/how-to-extract-keywords-from-pdfs-and-arrange-in-order-of-their-weights-using-python-841556083341

 

Problem Statement

Given a particular PDF or text document, how can we extract keywords and arrange them in order of their weight using Python?

Dependencies:

(I used Python 2.7.15 for this tutorial.)

You will need the libraries listed below installed on your machine. In case you don’t have them, each dependency is followed by its install command, which you can type at the Command Prompt on Windows or in the Terminal on macOS.

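  • PyPDF2 (to convert simple, text-based PDF files into text readable by Python)
pip install PyPDF2
  • textract (to convert non-trivial, scanned PDF files into text readable by Python)
pip install textract
  • re (to find keywords; it is part of the Python standard library, so no installation is needed)
  • pandas and numpy (imported in the notebook below for the DataFrame steps)
pip install pandas numpy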
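
Before opening the notebook, it can help to confirm that every library is importable. The snippet below is a minimal sketch, assuming Python 3 (the tutorial itself used 2.7.15); the helper name `missing_modules` and the `REQUIRED` list are made up for illustration.

```python
import importlib.util

# Import names used in this tutorial. `re` ships with Python itself,
# while the others are third-party packages installed with pip.
REQUIRED = ["PyPDF2", "textract", "re", "pandas", "numpy"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

Anything listed as missing can be installed with the matching pip command above.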

Note: I attempted three approaches for this task. The libraries above suffice for approach 1. I have only touched on the two other approaches, which I found online; treat them as alternatives. Below is the Jupyter notebook with all three approaches. Take a look!

Jupyter Notebook:

All necessary remarks are marked with ‘#’.

  • Approach 1 unboxed

Step 1: Import all libraries.

Step 2: Convert the PDF file to text and read the data.

Step 3: Use the “.findall()” function of the re module to extract keywords.

Step 4: Save the list of extracted keywords in a DataFrame.

Step 5: Apply TF-IDF to calculate the weight of each keyword.

Step 6: Save the results in a DataFrame and use “.sort_values()” to arrange the keywords in order of weight.
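
Steps 5 and 6 can be sketched without any third-party library. This is not the notebook's code: it is a minimal illustration that assumes the text has been split into several chunks (here, three made-up "pages"), because with a single document the IDF term is the same for every word. The smoothed IDF formula, the `tokenize` and `tf_idf` helpers, and the sample sentences are all assumptions; plain dictionaries stand in for the DataFrame, and `sorted()` plays the role of “.sort_values()”.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and pull out alphabetic words of 2+ letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


def tf_idf(documents):
    """Return one {word: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in token_lists for word in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty pages
        weights.append({
            # term frequency * smoothed inverse document frequency
            word: (count / total) * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in counts.items()
        })
    return weights


# Made-up sample pages, standing in for the pages of a PDF.
pages = [
    "Java is an object oriented language. Java runs on the JVM.",
    "Python is a scripting language.",
    "The JVM executes bytecode.",
]

weights = tf_idf(pages)
# Rank the keywords of the first page from heaviest to lightest.
ranked = sorted(weights[0].items(), key=lambda item: item[1], reverse=True)
for word, weight in ranked[:5]:
    print(f"{word:10s} {weight:.4f}")
```

Words that occur often on one page but on few other pages end up at the top of the ranking, which is the ordering Step 6 asks for.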

 
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re


Reading Text

 
  • Convert the PDF file to text for easier pre-processing.
filename = 'JavaBasics-notes.pdf'

# Open the PDF in binary mode, which is what PyPDF2 expects.
pdfFileObj = open(filename, 'rb')
# pdfReader is the parsed PDF. PdfFileReader, numPages, getPage and
# extractText are the legacy PyPDF2 names; newer releases renamed them.
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Knowing the number of pages lets us loop over every page.
num_pages = pdfReader.numPages

count = 0
text = ""

# Read each page in turn and append its text.
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

pdfFileObj.close()

# PyPDF2 cannot read scanned (image-based) files, so an empty string means
# no text was found. In that case, fall back to textract, which runs OCR.
# The original notebook passed 'http://bit.ly/epo_keyword_extraction_document'
# here; textract.process expects a local file path, so we pass the same file.
if text == "":
    text = textract.process(filename, method='tesseract', language='eng')

# Now the text variable contains all the text derived from our PDF file.
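
With the text in hand, Steps 3 and 4 come down to pulling words out with “.findall()” and counting them. The sketch below is an illustration rather than the notebook's own code: the sample string stands in for the real `text` variable, and the `extract_keywords` helper, the regular expression, and the tiny stop-word list are assumptions. A `Counter` is used in place of a DataFrame to keep it dependency-free.

```python
import re
from collections import Counter

# Hypothetical stand-in for the `text` variable built from the PDF.
text = "Java is a language. The Java compiler turns Java source into bytecode."

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "into", "of", "and", "to", "in"}


def extract_keywords(raw_text, stop_words=STOP_WORDS):
    """Return the lower-cased words of raw_text, minus stop words."""
    # textract returns bytes while PyPDF2 returns str, so normalise first.
    if isinstance(raw_text, bytes):
        raw_text = raw_text.decode("utf-8", errors="ignore")
    # Keep runs of two or more letters; digits and punctuation are dropped.
    words = re.findall(r"[a-zA-Z]{2,}", raw_text.lower())
    return [word for word in words if word not in stop_words]


keyword_counts = Counter(extract_keywords(text))
print(keyword_counts.most_common(3))
```

The resulting counts are the raw material for the weighting step; they can be loaded into a DataFrame if you prefer to follow the steps listed earlier literally.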


For more details, see the GitHub repo linked from the original article.


I hope you find this tutorial useful and worth reading. I am also sure there are many other approaches to this task. Do share them in the comments section if you have come across any.

Code for the Masked Word Cloud:

The GitHub repo for this part is also linked from the original article.

# Modules for generating the word cloud
from os import path, getcwd
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# Jupyter magic: render plots inside the notebook
%matplotlib inline

d = getcwd()
# Read the text that the word cloud is built from
with open('nlp.txt', 'r') as f:
    text = f.read()

# Image link = 'https://produto.mercadolivre.com.br/MLB-693994282-adesivo-decorativo-de-parede-batman-rosto-e-simbolo-grande-_JM'
# Load the mask image from the working directory; it controls where words may be drawn
mask = np.array(Image.open(path.join(d, "batman.jpg")))

wc = WordCloud(background_color="black", max_words=3000, mask=mask,
               max_font_size=30, min_font_size=0.00000001,
               random_state=42)
wc.generate(text)
plt.figure(figsize=[100, 80])
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig('bat_wordcloud.jpg', bbox_inches='tight', pad_inches=0.3)
Posted 2018-10-26 11:15 by 兔子的尾巴_Mini