PDF to Excel conversion: Your ultimate guide to the best tools

In an ideal world, the data we need to analyze would be available in a ready-to-use format. In the world we live in, though, a lot of useful data is locked inside Portable Document Format (PDF) documents. How do you extract that data from PDFs into an Excel spreadsheet? You have a number of PDF to Excel converters to choose from.

There’s software from major vendors like Microsoft and Adobe, task-specific cloud services including PDFTables and Cometdocs, services from general-purpose cloud providers such as Amazon, and even free open-source options.

Which is the “best” PDF to Excel converter? As with the “best computer,” the answer depends on your specific circumstances.

3 questions to ask when choosing a PDF to Excel converter

There are several important considerations when selecting a PDF converter.

1. Was my PDF generated by an application or is it a scanned image? There are two types of PDF files. One is generated by an application like Microsoft Word; the other comes from a scanned or other image file. You can tell which one you have by trying to highlight some text in the document. If a click and drag works to highlight text, your PDF is app-generated. If it doesn’t, you’ve got a scan. Not all PDF conversion tools work on scanned PDFs.

2. How complex is the data structure? Almost every tool will work well on a simple one-page table. Things get more complicated if tables are spread over multiple pages, table cells are merged, or some data within a table cell wraps over multiple lines.

3. Do I have a large number of files that need batch conversion or automation?
Our best-performing tool on app-generated PDFs may not be the best choice for you if you want to automate frequent batch conversions.

In addition, as with any software choice, you need to decide how much you value performance versus cost and ease of use.

How we tested the converter tools

To help you find what’s best for your tasks, we tested seven PDF to Excel conversion tools using four different PDF files ranging from simple to nightmare. You’ll see how all the tools perform in each scenario and find out the strengths and weaknesses of each one.

PDF to Excel converters we tested

Here are the tools we tested, starting with our overall best performers (but keep in mind that “best” depends in part on the specific source document). All these tools did pretty well on at least some of our tasks, so ratings range from “Excellent” to “Good.”

Adobe Acrobat Export PDF subscription

Adobe created the Portable Document Format standard, so you’d expect it to do well at parsing PDFs, and it does. A full-featured conversion subscription is somewhat pricey, but there’s also an inexpensive $2/month plan (annual subscription required) that includes an unlimited number of PDF to Excel conversions. (You can output Microsoft Word files with this tool as well.)

The Excel conversions include any text on pages that have both text and tables. This can be a benefit if you’d like to keep that context or a drawback if you just want data for additional analysis.
Rating: Excellent. Our hands-down winner for non-scanned PDFs.
Cost: $24/yr
Pros: Outstanding results; preserves much of the original formatting; deals well with tables spanning multiple pages; unlimited conversions of files up to 100MB; affordable for frequent users.
Cons: No built-in scripting/automation workflow; expensive if you only convert a few documents a year.
Bottom line: If you don’t need to script or automate a lot of conversions and don’t mind paying $24 per year, this is a great choice.

Amazon AWS Textract

For an AWS cloud service, Textract is surprisingly easy to use. While you certainly can go through the usual multi-step AWS setup and coding process for Textract, Amazon also offers a drag-and-drop web demo that lets you download results as zipped CSVs. You just need to sign up for a (free) Amazon AWS account.

Rating: Excellent. This was our best option for a complicated scanned PDF.
Cost: 1.5 cents per page (100 pages per month free for your first three months at AWS)
Pros: Best option tested for a complicated scanned PDF; performed extremely well on all the app-generated PDFs; offers a choice of viewing results with merged or unmerged cell layout; easy to use; affordable.
Cons: Uploaded files are limited to 10 pages at a time. For those who want to automate, using this API is more complicated than some other options.
Bottom line: An excellent choice if you don’t mind the AWS setup and either manual upload or coding with a complex API.

Tabula

If you’re looking for free and open source, give Tabula a try. Unlike some free options from the Python world, Tabula is easy both to install and to use.
And it has both a command-line and a browser interface, making it equally useful for batch conversions and point-and-click use.

Tabula did very well on PDFs of low or moderate complexity, although it did have an issue with the complex one (as did many of the paid platforms). Tabula requires a separate Java installation on Windows and Linux.

Rating: Very good, and you can’t beat the price.
Cost: Free
Pros: Free; easy to install; has both a GUI and scripting options; allows you to manually change which areas of the page should be analyzed for tables; can save results as a CSV, TSV, JSON, or script; offers two different data extraction methods.
Cons: Needed some manual data cleanup on complex formatting; works on app-generated PDFs only.
Bottom line: A good choice if cost, ease of use, and automation options are high on your list of desired features and your PDFs aren’t scanned.

PDFTables

A key advantage of this service is automation. Its API is well documented and supports everything from Windows PowerShell and VBA (Office Visual Basic for Applications) to programming languages like Java, C++, PHP, Python, and R.

PDFTables performed well on most of the app-generated PDF tables, even understanding that a two-column header would be best as a single-column header row. It did have some difficulty with data in columns that were mostly empty but also had some data in cells spread over two lines. And while it choked on the scanned nightmare PDF, at least it didn’t charge me for that.

Rating: Very good overall; excellent on automation.
Cost: 50 pages free at signup, including API use.
After that it’s $40 for up to 1,000 pages, and your credits are only good for a year.
Pros: Very good API; better performance on the moderately complex PDF than several of its paid rivals.
Cons: Pricey, especially if you use more than the 50 free pages but fewer than 1,000 page conversions in a year. Doesn’t work on scanned PDFs.
Bottom line: Performs well and is easy to use both on the web and through scripting and programming. If you don’t need an elegant API, however, you may prefer a less expensive option.

PDFtoExcel.com

This is a freemium platform with paid options. It proved to be the lone free choice that was able to handle our scanned nightmare PDF.

Rating: Good.
Cost: Free in the cloud; $5/month or $49/yr for premium cloud with batch conversions and faster service; desktop software is $35 for 30-day use or $150 lifetime.
Pros: A lot of capability for the free option; works on scanned PDFs; affordable.
Cons: No API or cloud automation (we didn’t test the desktop software); paid option required for batch conversions; split single-row multi-line data into multiple rows.
Bottom line: Nice balance of cost and features. This was most compelling for complex scanned PDFs, but others did better when cell data ran across multiple lines.

Cometdocs

This web-based service is notable for multiple file format conversions: In addition to generating Excel, it can download results as Word, PowerPoint, AutoCAD, HTML, OpenOffice, and others.
Free accounts can convert up to five files per week (30MB each); paid users get an unlimited number of conversions (2GB/day data limit).

Cometdocs is a supporter of public service journalism; the service offers free premium accounts to Investigative Reporters & Editors members (disclosure: I have one).

Rating: Good.
Cost: 5 free conversions/week; otherwise $10/month, $70/yr or $130 “lifetime.”
Pros: Works on scanned PDFs; multiple input and output formats; generally good results; did extremely well on a 2-page PDF with complex table format.
Cons: Not as robust on complex scanned PDFs as some other options; split one row’s multi-line data into multiple rows; no clear script/automation option.
Bottom line: Particularly compelling if you’re interested in multiple format exports and not just Excel.

Microsoft Excel

Many people don’t know that Excel can import PDFs directly, but only if you’ve got a Microsoft 365 or Office 365 subscription on Windows. It was a good choice for the simple file but got more cumbersome to use as PDF complexity rose. It’s also likely to be confusing to people who aren’t familiar with Excel’s Power Query / Get & Transform interface.

How to import a PDF directly into Excel: In the Ribbon toolbar, go to Data > Get Data > From File > From PDF and select your file. For a single table, you’ll likely have one choice to import. Select it and you should see a preview of the table and an option to either load it or transform the data before loading. Click Load and the table will pop into your Excel sheet.

For a single table on one page, this is a quick and fairly simple choice. If you have multiple tables in a multi-page PDF, this also works well, as long as each table is confined to one page.
Things get a bit more complex if you’ve got one table over multiple PDF pages, though, and you’ll need knowledge of Power Query commands.

It’s somewhat unfair to compare Power Query data transformation with the other tools, since the results of any of these other PDF to Excel converters could also be imported into Excel for Power Query wrangling.

Rating: Good.
Cost: Included in a Microsoft 365/Office 365 Windows subscription.
Pros: You don’t have to leave Excel to deal with the file; a lot of built-in data wrangling available for those who know Power Query.
Cons: Complex to use compared with most others on all but the simplest of PDFs; doesn’t work on scanned PDFs; requires a Microsoft 365/Office 365 subscription on Windows.
Bottom line: If you’ve already got Microsoft 365/Office 365 on Windows and you’ve got a simple conversion task, Excel is worth a try. If you already know Power Query, definitely consider this for more PDF conversions! (If you don’t, Power Query is a great skill to learn for Excel users in general.) If your PDF is more challenging and you don’t already use Power Query / Get & Transform, though, you’re probably better off with another option.

PDF to Excel test results

Here’s how the seven tools fared in our four conversion tests:

1. Simple PDF to Excel challenge

Our “simple” task was a single-page app-generated PDF pulled from page 5 of a Boston housing report. It contained one table and some text, but column headers and two data cells did include wrapped text over two lines.

All the platforms we tested handled this one well. However, several broke up the multi-line text into multiple rows. The issue was easy to spot and fix in this example, but it could be difficult in larger files.
For this easy one-pager, though, the PDF to Excel converters that weren’t in first or second place still had very good results. All were worth using for this kind of conversion.

First place: Tie between Adobe and AWS Textract. With Adobe, no data cleanup was needed. The column headers even had the color formatting of the original. Adobe’s conversion included text (with lovely formatting), which is useful if you want to keep written explanations together with the data in Excel. You’d need to delete the text manually if you want data only, but that’s simple enough.

AWS Textract converted data only. No data cleanup was needed.

Close second: Excel. Data only. Excel didn’t break wrapped text into two rows, but it did appear to run text together without a space in multi-line cells. The data was actually correct, though, when you looked at it in the formula bar; it just looked wrong in the overall spreadsheet. This was easily fixed by formatting cells with “wrap text.” However, not everyone might know to do that when looking at their spreadsheet.

Others:

PDFTables: returned data and text. Same issue as Excel, appearing to keep wrapped text on a single line with no space between words. This was also easily fixed by wrapping text, if you knew to do so. This result also would need cleanup of a couple of words from a logo that appeared below the data. Explanatory text outside the logo had no issues, though.

Tabula: data only. Split multi-line cells into multiple rows.

Cometdocs: data and text. Split multi-line cells into multiple rows. Surrounding text was accurate, including logo text.

PDFtoExcel.com: similar performance to Cometdocs.

2. Moderate PDF to Excel challenge

Our moderate PDF challenge was a single app-generated table spanning multiple PDF pages, via the Boston-area Massachusetts Water Resources Authority data monitoring wastewater for Covid-19 traces.

First place: Adobe. One of the few to recognize that all the pages were the same table, so there were no blank rows between pages. Headers were in a single row and spaces between words in the column names were maintained. Data structure was excellent, including keeping the multi-line wrap as is. It even reproduced background and text colors. The 11-page length wasn’t a problem.

Second: AWS Textract. Header row was correct. Each page came back as a separate table, although it would be easy enough to combine them. The one strange issue: There were apostrophes added at the beginning of the cells, possibly due to how I split the PDF, since I needed to create a file with only 10 pages. However, these apostrophes were easy to see and remove with a single search and replace, since the data didn’t include any words with apostrophes. It was easier to get the exact data I needed than with Tabula, but more cumbersome to get the full data set.

Close third: Tabula. No blank rows between pages, data in the correct columns, wrapped cells stayed in one row.
Unfortunately, while the wrapped data appeared properly when you looked at the cell contents in the formula bar, once again the data appeared to merge together in the full spreadsheet, and this wasn’t as easily fixed by formatting with text wrapping as with Excel and PDFTables in the simple PDF.

For example, this was the content of one cell as it appeared in the formula bar:

B.1.1.7
76%

But in the overall spreadsheet, that same cell looked like

B.1.1.776%

I was able to get that to display properly at times by increasing the row height manually, but this was an added step that most people wouldn’t know to do, and it didn’t seem to work consistently.

PDFtoExcel.com: multiple problems. The first few pages were fine except for multi-row headers, but data over two lines in single cells broke into two rows in the data, producing blank rows elsewhere that would need to be fixed. In addition, columns were shifted to the right in one section. This would need cleanup.

PDFTables: multiple problems. All the data came in fine for most of the pages, but toward the end, several cells that should have been in column J got merged with column I in ways that would be more difficult to fix than PDFtoExcel’s. For example, this single cell:

Omicron559 23%

was supposed to be 559 in one cell and Omicron 23% in the next cell.

Cometdocs: failed. Conversion failed on the full PDF and even the 10-page version I uploaded to AWS. It was able to convert a version with just the first five pages, but the full file should have been well below Cometdocs’ account limits.

Excel: It was possible to get the data in a format I wanted, but it required data manipulation in Power Query as well as wrapping text. That’s not a fair comparison with other platforms that were a single upload or command.
Still, results were ultimately excellent. If you’re an Excel/Power Query power user, this is a good choice.

3. Complex PDF to Excel challenge

Local election results are some of my favorite examples of analysis-hostile public data. The app-generated PDF from Framingham, Mass. shown below was only 3 pages but with table formatting that was not designed for ease of data import. Is there a PDF conversion tool that can handle it?

    Page 1 of the PDF showing recent election results for Framingham, Mass.

First place: Tie between Adobe and PDFtoExcel.com. Adobe returned an Excel file in perfect format, complete with original cell colors.

While PDFtoExcel.com’s spreadsheet didn’t have the pretty formatting of Adobe, all the data came in accurately, and it was usable as is.

Others:

AWS Textract: fair. Results came back in 5 tables. In one case, you’d need to copy and paste them together manually and look at the original to make sure you were doing so correctly.

PDFTables: poor. Data came back, but some in the wrong columns, whether I tried to download as multiple sheets or one sheet. This would need manual checking and cleanup.

Tabula: poor. Similar problem as PDFTables with some data in the wrong columns, but at least I didn’t have to pay for it. I tried both the Stream and Lattice extraction methods, and both had some wrong-column issues (although the issues were different).

Cometdocs: conversion failed.

4. Nightmare PDF to Excel challenge

Our nightmare comes courtesy of a presentation at this year’s National Institute for Computer Assisted Reporting conference, as an example of data that would be useful for training students, if it were in a format that could be easily analyzed. It’s a multi-page scanned PDF with four months of data from the federal Refugee Processing Center on refugee arrivals by country of origin and U.S. state of destination.

This PDF’s challenges range from multi-page tables to lots of merged columns. In addition, the table on page 1 proved to be somewhat different from tables on the other pages, at least in terms of how several tools were able to handle them, although they look the same.

I only tested the first 10 pages due to the AWS 10-page limit, to be fair to all the tools.

First place: AWS Textract. By far the best of the group.
This is where sophisticated machine learning is an advantage. Results are downloaded as a zipped file of multiple CSVs, one for each page. Instead of manually importing those files one by one into Excel, though, you can go to Data > Get Data > From File > From Folder and select the folder with those newly unzipped CSVs (after removing a few extra files, such as one with metadata and another with all the text in text format).

I also had to rename file table-10 to table-910, because the import was placing table-10 right after table-1 and before table-2 (sorting the filenames alphabetically rather than placing -10 after -9). While I didn’t check all the data, the rows I spot-checked were all accurate. This was the easiest way to gather all the pages into a single, usable spreadsheet.

Second: Tie between Cometdocs and PDFtoExcel.com. Cometdocs had unexplained “Grand” lines on each of the pages (a single row with just “Grand” in the first column), but otherwise the data looked very good. And it came in a single file.

Including text on each page could be a benefit or drawback, depending on your needs. Context is useful, but combining the data into one table would definitely take more work.

PDFtoExcel.com’s performance was similar to Cometdocs, including the extra “Grand” lines.

Others:

Adobe: fair.
While a lot of the data was fine, results merged some data on the first page and would need manual checking and cleanup.

PDFTables: doesn’t work on scanned PDFs without using OCR software first, although at least it doesn’t charge account credits when tables aren’t detected.

Tabula: doesn’t work on scanned PDFs without using OCR software first.

PDF conversion tools we didn’t test

There are numerous other useful tools out there that require a bit more setup work or some significant coding in order to turn your PDF into data you can analyze. If none of the ones we tested work for you, though, here are a few others:

Excalibur: I received a tip about this one, a web interface built for the Camelot Python library. However, it involves installing several dependencies, and that installation may be a turn-off for people who don’t already have Python experience (even though you don’t need to run Python in order to use it).

Google Cloud Document AI: Google Cloud’s Document AI setup is somewhat more involved than using AWS Textract. Plus, while you can upload a test document of up to 5 pages to see what’s extracted, I didn’t see an obvious way to download the results through the web interface as a CSV or Excel file; the only option was JSON. You’d probably need a programming language like Python or R to use this effectively.

Microsoft Azure Form Recognizer: This is also more complex to set up than AWS Textract, but if you’re already an Azure user, it is probably worth a look. 500 free pages per month.
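
For readers who would rather script the cleanup described in the AWS Textract results than do it by hand, here is a minimal sketch in Python using only the standard library. It addresses the two annoyances noted in the tests: per-page CSVs that sort alphabetically (table-1, table-10, table-2) and cells that arrive with a stray leading apostrophe. The file pattern table-*.csv, the function names, and the assumption that every page repeats the same header row are illustrative guesses, not details confirmed by Textract's documentation, so adjust them to match your own download.

```python
import csv
import re
from pathlib import Path


def natural_key(path):
    """Sort key that compares the digit runs in a file name as numbers."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd-indexed part is a run of digits.
    parts = re.split(r"(\d+)", path.name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def clean_cell(value):
    """Trim whitespace and drop one leading apostrophe, if present."""
    # Assumption: no legitimate value starts with an apostrophe,
    # which matched the data in the test described above.
    value = value.strip()
    return value[1:] if value.startswith("'") else value


def combine_tables(folder, pattern="table-*.csv"):
    """Return the rows of every matching CSV in page order.

    The header row is kept from the first file only (assumes each page
    repeats the same header); completely blank rows are skipped.
    """
    combined = []
    files = sorted(Path(folder).glob(pattern), key=natural_key)
    for index, path in enumerate(files):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = [[clean_cell(c) for c in row] for row in csv.reader(handle)]
        rows = [row for row in rows if any(row)]
        combined.extend(rows if index == 0 else rows[1:])
    return combined
```

Calling combine_tables on the unzipped results folder returns a list of rows that can be written out with csv.writer and opened in Excel as a single sheet. Sorting on the numeric part of each name avoids the need to rename table-10 by hand.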

    Copyright © 2022 IDG Communications, Inc.
