PDFs are one of the most used data formats for business documents. Many businesses and organizations depend on various tools to create and read these PDF documents.
However, it’s hard to extract specific/important data from PDFs selectively.
JSON, on the other hand, is one of the most loved data formats for information exchange. In web applications especially, most data is communicated as JSON through APIs and database queries.
In this blog post, we’ll be looking at:
- How Nanonets automates complex data conversion from complicated business PDF documents to structured JSON files.
- How to extract specific/complex data from PDFs such as tables and specific strings of text.
- Custom workflows that can help automate the process of converting PDFs to JSON.
Want to extract specific data from PDF documents and convert to JSON? Check out Nanonets API to automate batch PDF to JSON conversion from any kind of technical document!
Nanonets Automated PDF to JSON Converter
- Sign up for Nanonets’ free plan that offers a 100 page credit – no credit card needed.
- Add a batch of your business PDF files
- Nanonets auto-captures fields from a range of document types (invoices, receipts, driver’s license, passports & tables)
- You can also train Nanonets’ AI to detect/capture just the data fields of your interest from any kind of document!
- Verify the extracted data and export as JSON outputs
- You can also integrate Nanonets with a host of ERP software – schedule a call with our AI experts to test-drive your use case.
- Check out our OCR API to automate PDF to JSON workflows
Want to capture data from PDF documents and convert to JSON, csv or Excel? Find out how Nanonets can help.
The Need for PDF to JSON Conversion
Almost every business relies on documents for information sharing. These can be documentation, invoices, tax filings, receipts, medical reports and so much more.
These documents are often shared/received as PDFs.
But if you want to search for critical information or build a dashboard to analyse and store all the important information, manually collecting data from these PDFs can be an uphill task.
If the PDFs are electronically generated, we can copy-paste information into data sources; else, we might have to use OCR and machine learning techniques to extract information.
Also, the data in the PDFs is not organised or directly machine-readable. Therefore, we might have to search for information manually.
But when it comes to JSON, everything is organised in key-value pairs. Here’s an example.
{
    "company_name": "Company Name",
    "Invoice_date": "Date",
    "Invoice_total": "$0.00",
    "Invoice_line_items": "",
    "Invoice_tax": ""
}
As the JSON above shows, the data is more organised, and it can also be shared on the web more conveniently. This is why exporting data from PDFs into JSON is crucial for a lot of companies.
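To make the key-value idea concrete, here is a minimal Python sketch that loads an invoice object shaped like the example above and reads individual fields by key. The values are the same placeholders used in the example, not real invoice data.

```python
import json

# Placeholder invoice payload; field names follow the example above
raw = """
{
    "company_name": "Company Name",
    "Invoice_date": "Date",
    "Invoice_total": "$0.00",
    "Invoice_line_items": "",
    "Invoice_tax": ""
}
"""

invoice = json.loads(raw)           # parse JSON text into a Python dict
total = invoice["Invoice_total"]    # look a value up directly by its key
field_names = sorted(invoice)       # every key is available for searching

print(total)
print(field_names)
```

Because each value sits behind a named key, a dashboard or search script can read exactly the field it needs without scanning the whole document.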
Business Benefits that Come with JSON
JSON data format has a lot of advantages over PDFs for businesses:
- JSON is Faster: JSON syntax is lightweight and easy to use, so a program can parse JSON data and return a response much faster than it can pull the same values out of PDFs and other heavier data formats.
- More Readable: JSON data is more readable; we’ll have a straightforward data mapping with keys and values. Therefore, if you’re searching for something or organising the data from PDFs, JSON will be more convenient. Additionally, JSON supports the nesting of data, and with this, data from tables can be stored more efficiently.
- Convenient Schema: JSON is supported across most operating systems and programming languages. Therefore, if you’re building any software or web application to automate your business, JSON should be the right data format. Also, most web browsers support the JSON format; hence we don’t have to put in additional effort to use third-party software to read through JSON data.
- Easy Sharing: JSON works well for sharing data of any size, including large tables or long text. Because JSON can store data in arrays, tabular data is easy to transfer and access. For this reason, JSON is a widely used file format for web APIs and web development.
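As a rough illustration of the nesting and array points in the list above, the sketch below stores a small table of line items as an array of objects inside an invoice. All field names and values are made up for the example.

```python
import json

# Hypothetical invoice: the table of line items becomes a nested array of objects
invoice = {
    "company_name": "Company Name",
    "line_items": [
        {"description": "Item A", "quantity": 2, "unit_price": 10.0},
        {"description": "Item B", "quantity": 1, "unit_price": 5.5},
    ],
}

# Nested values stay addressable, so totals can be computed straight from the structure
subtotal = sum(row["quantity"] * row["unit_price"] for row in invoice["line_items"])

# Serialise to JSON text and read it back to confirm the round trip is lossless
payload = json.dumps(invoice, indent=2)
restored = json.loads(payload)

print(payload)
print(subtotal)
```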
In the next section, let’s look at some of the challenges that we may face when converting PDFs to JSON format.
Nanonets has many interesting use cases that could optimize your business performance, save costs and boost growth. Find out how Nanonets’ use cases can apply to your product.
Challenges with Converting from PDF to JSON
Let’s look at some of the challenges in exporting from PDFs to JSON.
- Detecting fonts: People use different fonts, colours, and alignments inside PDF documents, which makes them hard for parsers to read. When exporting, we also have to define specific rules so that, after the parser extracts the data, every piece of information is mapped to the correct place in the JSON format. In such cases, regular expressions are widely used to pick out specific text and export it to the correct key in the JSON format.
- Detecting text from scanned documents: As discussed, when the PDFs are not electronically generated, we have to use OCR, and choosing the OCR engine is crucial. Though a lot of users try open-source tools like Tesseract, these have their own set of limitations. For example, if the text is improperly captured or misaligned in the scan, Tesseract might not work, and choosing other tools can be expensive.
- Identifying Tables: Most business documents contain tabular information, and determining these tables from PDF documents and converting them into JSON is a challenging task. There are some libraries based on Python and Java that can help extract tables from electronically made PDF documents.
- Identifying Tables from Scanned PDFs: When the PDFs are scanned, most packages don’t work. In this case, if we choose an open-source OCR like Tesseract, it can extract the text but may lose all the table formatting. Therefore, it’s challenging to pick out line items in the correct format. This is where we’ll have to use Machine Learning and Deep Learning-based algorithms. Some popular algorithms are based on CNNs, and there is a lot of ongoing research into improving them.
In the next section, let’s look at how to parse data from PDF to generate JSON files.
Parsing Data from PDFs and Generating JSON Files using Python and Linux
Parsing through PDFs isn’t a complicated task if you have developer experience.
First, we have to check whether our PDF files contain text data or consist of scanned images. We try to extract the text directly and, if no text is returned, pipe the files through an OCR library.
This could be achieved using a Python library or by relying on some Linux command-line utilities.
pdftotext, a command-line utility that ships with Poppler, is one of the most popular tools for parsing electronic PDFs. We can use it to convert all the PDF data into text format and then push it into a JSON format.
Here are the instructions for using pdftotext to parse a PDF on a Linux machine.
First, install command-line tools:
sudo apt-get install poppler-utils
Next, run the pdftotext command with the PDF file’s source path and the destination text file location.
pdftotext PDF-file text-file
With this, we should be able to extract all the readable text from the PDF files.
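The check described earlier (is there a text layer, or do we need OCR?) can be sketched in Python by calling pdftotext through subprocess. This is only a sketch under stated assumptions: the function names and the 20-character threshold are illustrative choices of ours, and the OCR step itself is left out.

```python
import shutil
import subprocess


def pdf_to_text(pdf_path: str) -> str:
    """Run `pdftotext <pdf> -` and return the extracted text from stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout instead of a file
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): scanned PDFs return almost no characters."""
    return len("".join(text.split())) >= min_chars


def needs_ocr(pdf_path: str) -> bool:
    """True when the file should be piped through an OCR step instead."""
    return not has_text_layer(pdf_to_text(pdf_path))
```

Calling needs_ocr("invoice.pdf") on a file of your own would then tell you which branch of the pipeline to take; the threshold likely needs tuning for very short documents.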
To generate a JSON file, we then need a script, tailored to our data, that parses the text and exports it into relevant key-value pairs.
Here’s an example script that we wrote in Python that converts a simple .txt file into JSON format.
import json

filename = "data.txt"
dict1 = {}

# Each line holds a key followed by its value, separated by whitespace
with open(filename) as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        command, description = line.split(None, 1)
        dict1[command] = description.strip()

# Create the JSON file, named test1.json
with open("test1.json", "w") as out_file:
    json.dump(dict1, out_file, indent=4, sort_keys=False)
Consider the data inside the text file to be:
invoice_id #234
invoice_name Invoice from AWS
invoice_total $345
Here, we first import the built-in json library and create a dictionary to store all the key-value pairs from the text file. Next, we iterate through every line in the file, split it into command and description at the first run of whitespace, and store the pair in the dictionary. Lastly, we create a new JSON file and use the json.dump method to write the dictionary into it, with four-space indentation and without sorting the keys.
However, our data from PDFs will not be as organised as in this example; therefore, we might have to use custom pipelines and scripts to handle complicated text formatting. In such cases, tools like Nanonets will be a great choice, and we’ll look at how Nanonets solves this problem in a much easier way in the following sections.
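For text that isn’t neatly split into one key and one value per line, one possible custom script keeps a small table of regular expressions, one per output key. The sample text, labels, and patterns below are hypothetical and would need to be adapted to your own documents.

```python
import json
import re

# Hypothetical text as it might come out of pdftotext for a simple invoice
text = """
ACME Cloud Services
Invoice No: 234        Date: 2021-03-01
Description            Amount
Compute                $300.00
Storage                $45.00
Total Due: $345.00
"""

# One pattern per output key; each captures the value that follows a label
FIELD_PATTERNS = {
    "invoice_id": r"Invoice No:\s*(\S+)",
    "invoice_date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "invoice_total": r"Total Due:\s*(\$[\d,]+\.\d{2})",
}


def extract_fields(raw_text: str) -> dict:
    """Return a dict of matched fields; fields with no match are set to None."""
    fields = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, raw_text)
        fields[key] = match.group(1) if match else None
    return fields


record = extract_fields(text)
print(json.dumps(record, indent=4))
```

Keeping the patterns in a dictionary means a new field only needs one extra entry, but every new document layout still needs its own set of rules, which is where this approach gets time-consuming.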
Before that, let’s look at one more library that converts PDF to JSON using node.js:
pdf2json is a node.js module that parses and converts PDF from binary to JSON format; it’s built with pdf.js and extends it with interactive form elements and text content parsing outside the browser.
Here’s an example of using this module to parse your PDF files:
First, make sure you have npm installed, then install the module using the following command:
npm install pdf2json
Next, in your Node server, you can use the following snippet, which loads pdf2json and exports PDFs to JSON:
const fs = require("fs");
const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();

// Log parsing errors
pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));

// Once parsing finishes, write the result to a JSON file
pdfParser.on("pdfParser_dataReady", pdfData => {
    fs.writeFile("./pdf2json/test/F1040EZ.json", JSON.stringify(pdfData), err => {
        if (err) console.error(err);
    });
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");
The above code snippet uses an example PDF file from the module and exports it into a JSON file; we can check this out in the ./test/target/ folder in your project.
Below, you’ll find a screenshot of how the module exports the JSON files:
For parsing tables in PDFs, however, these libraries might just not work!
You’ll have to leverage OCR and machine learning algorithms to extract tabular data into JSON. Nanonets does just that, as you can see below:
Customised Data Conversion from PDF to JSON
Sometimes, while extracting the data from business documents, we might require customisation. For example, say if we only want certain pages or tables, we can’t do it directly. In this case, we might need to provide additional rules to the parsers, which is again time-consuming. But let’s see how we can do the customisation and the actions that most people need.
Below are some of the actions that are required for customisation in PDF to JSON conversion:
- Extract only particular text or pages from PDFs
- Extract all the tables from PDF documents
- Extract particular columns from certain tables in PDFs
- Filter text from PDFs before exporting them into JSON
- Creating nested JSON based on the extracted data from PDFs
- Format JSON structure based on data
- Create, delete, update values of certain fields in JSON after extraction
These are some of the actions that are often required for storing our data in different ways, or say if we are building APIs for an application. Let’s see how we can achieve these.
Extracting Particular Text: In PDFs, we can extract particular text using regular expressions; for example, if we want all the emails and phone numbers, we can pick them out with regex. If the PDFs are in scanned format, we need to train a deep learning model that can understand the layout of the PDFs and extract fields based on the coordinates and annotations in the training data. One of the most popular open-source models for understanding document layouts and extracting text is LayoutLM, which builds on BERT-style models for custom text extraction. However, we need enough training data to achieve high accuracy in extracting text.
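As a small sketch of the regex approach for electronically generated PDFs, the snippet below pulls emails and phone numbers out of already-extracted text. The two patterns are deliberately simplified assumptions; real-world emails and phone formats vary far more than this.

```python
import re

# Simplified patterns for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_contacts(text: str) -> dict:
    """Collect every email and phone-like string found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
    }


# Hypothetical text extracted from a PDF
sample = "Contact billing@example.com or call +1 (555) 010-9999. Sales: sales@example.org"
contacts = find_contacts(sample)
print(contacts)
```

The resulting dictionary can be passed straight to json.dump, so each pattern effectively defines one key of the exported JSON.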
Table Customisation: As discussed, tables can be extracted using libraries like Camelot and Tabula-py or using OCR and deep learning-based algorithms. For customisation, however, we need a library like pandas, which lets us create, update, and serialise the data from the tables. It uses a custom data type called a DataFrame, which is widely used for manipulating and customising table data. Another advantage of pandas is that we can write custom functions that perform math operations on the data during the extraction process.
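A minimal pandas sketch of this kind of customisation follows. It assumes the table has already been extracted into rows (the rows, column names, and values here are invented), then adds a computed column, keeps only selected columns, and serialises the result.

```python
import json

import pandas as pd

# Hypothetical rows, standing in for a table already extracted from a PDF
rows = [
    {"item": "Compute", "quantity": 3, "unit_price": 100.0},
    {"item": "Storage", "quantity": 9, "unit_price": 5.0},
]
df = pd.DataFrame(rows)

# Create a derived column with a simple math operation
df["amount"] = df["quantity"] * df["unit_price"]

# Keep only the columns of interest, one dict per table row
line_items = df[["item", "amount"]].to_dict(orient="records")

# Nest the rows under a key and add a computed total before exporting
output = json.dumps(
    {"line_items": line_items, "total": float(df["amount"].sum())},
    indent=2,
)
print(output)
```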
Formatting JSON Data: After exporting PDFs into JSON, formatting the output is a straightforward task, as we now have a more customisable data type: key-value pairs. We can either develop simple scripts or use online tools to search through these key-value pairs and format them. Some of the most common formatting parameters include indentation, separators, sorting keys, circular checks, and data checks. If the JSON is served through an API, we can use Postman or a browser extension to format the data and interact with the API.
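With Python’s built-in json module, these formatting parameters might look like the sketch below. Mapping “circular checks” to check_circular and “data checks” to allow_nan is our reading of the list above, and the record is sample data.

```python
import json

record = {"invoice_total": "$345", "invoice_id": "#234", "invoice_name": "Invoice from AWS"}

# Human-readable: indented, keys sorted alphabetically
pretty = json.dumps(record, indent=4, sort_keys=True)

# Compact: no whitespace after separators, useful for API responses
compact = json.dumps(record, separators=(",", ":"))

# Circular check (on by default): raises instead of recursing forever
loop = {}
loop["self"] = loop
try:
    json.dumps(loop, check_circular=True)
    circular_error = None
except ValueError as exc:
    circular_error = str(exc)

# Strict data check: reject NaN/Infinity, which are not valid JSON values
try:
    json.dumps({"tax": float("nan")}, allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True

print(pretty)
print(compact)
```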
Want to extract information from PDF documents and convert them into a JSON format? Check out Nanonets to automate export of any information from any PDF document into JSON.