PDF, PPT, DOC, etc. to TEXT

Maybe these should be separate questions, one for each format, but...

What are the most RELIABLE libraries (in any language), binaries (for any platform), or webservices (free or not free) for converting diverse "text-containing" formats into plain text?

By reliable, I mean near 100% ability to extract ALL of the human-readable text while NOT EXTRACTING "code" or "markup".

By text-containing formats, I mean all the most common ones: PDF, PPT, DOC, DOCX, RTF, HTML, ".PAGES", ".KEYNOTE", ODT, etc.

Please suggest both packages/services that support many of these formats as well as those that only support one. In addition, are there software "stacks" that "tie together" many packages/services for the purpose of converting to text?

-------------Problems Reply------------

http://www.filebuzz.com/files/Ascii_Convert/1.html <-- This link takes you to a list of converters that can convert PDF and other file types to ASCII (plain text). For Word documents, you can do this without any extra software: when you click 'Save As', a dialog box opens with a 'Save as Type' drop-down list. Select 'Plain Text *.txt' and your file is saved as plain text. Good luck!

In Java, the Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
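The general shape of such a toolkit (work out the format, then hand the file to a format-specific extractor) can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; it is not Tika's API, and the names `EXTRACTORS`, `extract_text`, `html_to_text` and `plain_to_text` are hypothetical. It picks the extractor from the file extension only and handles just HTML and plain text, dropping tags as well as `<script>` and `<style>` bodies so that no markup or code ends up in the output.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip = 0      # depth of script/style elements we are inside
        self.parts = []     # non-empty text chunks in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw: bytes) -> str:
    """Strip tags, scripts and styles; assumes UTF-8 input."""
    parser = _TextOnly()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return "\n".join(parser.parts)


def plain_to_text(raw: bytes) -> str:
    """Plain text only needs decoding; assumes UTF-8 input."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry: file extension -> extractor function.
EXTRACTORS = {
    ".html": html_to_text,
    ".htm": html_to_text,
    ".txt": plain_to_text,
}


def extract_text(path: str) -> str:
    """Dispatch on the file extension and return plain text."""
    p = Path(path)
    extractor = EXTRACTORS.get(p.suffix.lower())
    if extractor is None:
        raise ValueError(f"no extractor registered for {p.suffix!r}")
    return extractor(p.read_bytes())
```

Supporting another format then means registering one more function in `EXTRACTORS`. A real toolkit would also detect the type from the file contents rather than trusting the extension.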

If you're using Ruby, take a look at Yomu. It's a wrapper for Apache Tika and supports a variety of document formats, including the following:

  • Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
  • OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
  • Apple iWork Formats
  • Rich Text Format (.rtf)
  • Portable Document Format (.pdf)
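
For one of the formats listed above, Office Open XML, simple cases can be handled without any third-party library, because a .docx file is a ZIP archive whose main body is stored as XML in `word/document.xml`. The sketch below is a rough illustration in Python (standard library only), not what Tika or Yomu do internally; `docx_to_text` is a hypothetical name. It reads only the main body, so headers, footers, footnotes and comments (stored in other parts of the archive) are ignored, as are tabs and explicit line breaks. It does not work for the older OLE 2 formats (.doc, .xls, .ppt), which are binary.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path_or_file) -> str:
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):          # each <w:p> is a paragraph
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))            # <w:t> holds the actual text
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the file name is only an example) returns the document body as a single string with no markup.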

You can try Extract Text.

From the description: "Extract text from documents such as PDF and Microsoft Word files. It will save the extracted text in a file. Works with .pdf, .doc, .docx, .xls, .xlsx, .ppt, and many more." Requires Microsoft .NET Framework 4.0.

Category:pdf Views:1 Time:2012-03-02
