Maybe these should be separate questions, one for each format, but...
What are the most RELIABLE libraries (in any language), binaries (for any platform), or webservices (free or not free) for converting diverse "text-containing" formats into plain text?
By reliable, I mean near 100% ability to extract ALL of the human-readable text while NOT EXTRACTING "code" or "markup".
By text-containing formats, I mean all the most common ones: PDF, PPT, DOC, DOCX, RTF, HTML, ".PAGES", ".KEYNOTE", ODT, etc.
Please suggest both packages/services that support many of these formats as well as those that only support one. In addition, are there software "stacks" that "tie together" many packages/services for the purpose of converting to text?
http://www.filebuzz.com/files/Ascii_Convert/1.html lists converters that can turn PDFs and other file types into ASCII (plain text). For Word documents, you don't need any extra software: click 'Save As', open the 'Save as Type' drop-down list in the dialog box, and select 'Plain Text (*.txt)'. Word will then save your file as plain text. Good luck!
In Java, the Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
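If you'd rather not write Java, Tika can also be run as a standalone server and called over HTTP from any language. The snippet below is only a rough sketch in Python using the standard library. It assumes a Tika server is already running locally on its default port (9998); the helper names `build_tika_request` and `extract_text` are made up for illustration and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document_bytes, url=TIKA_URL):
    """Build a PUT request that asks the Tika server for plain text."""
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="PUT",
        headers={
            # Ask for plain text rather than XHTML output.
            "Accept": "text/plain",
            # Generic type, so Tika detects the real format from the bytes.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path, url=TIKA_URL):
    """Send the file at `path` to the Tika server and return its text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.pdf")` (a hypothetical file name) would then return whatever text Tika manages to pull out of that file. How complete the result is still depends on the underlying parser for each format.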
If you're using Ruby, take a look at Yomu. It's a wrapper for Apache Tika and supports a variety of document formats, including:
- Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
- OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
- Apple iWork Formats
- Rich Text Format (.rtf)
- Portable Document Format (.pdf)
You can try Extract Text.
From the description: "Extract text from documents such as PDF and Microsoft Word files. It will save the extracted text in a file. Works with .pdf, .doc, .docx, .xls, .xlsx, .ppt, and many more." It requires Microsoft .NET Framework 4.0.