- PHP 5.2+
- The Zip extension: php_zip.dll enabled on Windows, or PHP built with the --enable-zip option on Linux.
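Before going further, it can help to let PHP report whether both requirements are met. The following is only a minimal sketch; the helper name `check_requirements` is hypothetical and is not part of the original article.

```php
<?php
// Hypothetical helper: returns a list of the requirements that are missing.
function check_requirements()
{
    $missing = array();

    // The article asks for PHP 5.2 or newer
    if (version_compare(PHP_VERSION, '5.2.0', '<')) {
        $missing[] = 'PHP 5.2+';
    }

    // The ZipArchive class is provided by the zip extension
    if (!extension_loaded('zip')) {
        $missing[] = 'zip extension';
    }

    return $missing;
}

$missing = check_requirements();
echo $missing ? 'Missing: ' . implode(', ', $missing) : 'All requirements met';
echo "\n";
```

If the zip extension is reported as missing, enable it as described in the list above and run the check again.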
This technique can be used to build a web crawler that indexes document files by their content. Both formats are ZIP archives: the text is stored in word/document.xml for a DOCX file and in content.xml for an ODT file. To extract the text, all we need to do is read that file from the archive and display its contents after stripping out the XML tags.
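To see what the tag-filtering step does on its own, here is a small sketch run on a hand-written fragment shaped like word/document.xml. The fragment and the variable names are made up for illustration; a real document contains far more markup.

```php
<?php
// Hand-written sample shaped like a tiny word/document.xml (illustrative only)
$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        . '<w:body><w:p>'
        . '<w:r><w:t>Hello</w:t></w:r>'
        . '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
        . '</w:p></w:body>'
        . '</w:document>';

// strip_tags() removes every tag and keeps only the text between them
$plain = strip_tags($sample);

echo $plain, "\n";
```

Because the tags are simply removed, nothing is inserted between paragraphs, so text from adjacent paragraphs runs together in the output.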
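Create a new PHP file named extract.php and add the following code:

    <?php
    // NOTE: the listing was cut off before "$zip->open(...)". Everything above
    // that call is reconstructed from the description and is an assumption.

    // Name of the document file (placeholder value)
    $document = 'document.docx';

    function extracttext($filename)
    {
        // Pick the XML file to read based on the extension (DOCX or ODT)
        $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
        $dataFile = ($ext === 'docx') ? 'word/document.xml' : 'content.xml';

        $zip = new ZipArchive;

        // Open the archive file
        if (true === $zip->open($filename)) {
            // If successful, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // Index found! Now read it to a string
                $text = $zip->getFromIndex($index);
                // Close the archive before returning
                $zip->close();
                // Load XML from a string, ignoring errors and warnings
                $xml = new DOMDocument();
                $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                // Remove XML formatting tags and return the text
                return strip_tags($xml->saveXML());
            }
            // Close the archive file
            $zip->close();
        }
        // In case of failure return a message
        return "File not found";
    }

    echo extracttext($document);
    ?>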
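To try the archive lookup without a real document at hand, the sketch below builds a throwaway ZIP containing a word/document.xml entry and reads it back with the same ZipArchive calls. The temporary file name and the sample XML are assumptions for illustration, and the archive is not a valid DOCX file. For brevity it applies strip_tags() directly and skips the DOMDocument step.

```php
<?php
// Build a throwaway archive that only mimics the DOCX layout
$path = sys_get_temp_dir() . '/extract_demo_' . uniqid() . '.zip';
$zip = new ZipArchive;
$zip->open($path, ZipArchive::CREATE);
$zip->addFromString('word/document.xml', '<w:p><w:t>Sample text</w:t></w:p>');
$zip->close();

// Read the entry back: locate it by name, then fetch it by index
$text = false;
$zip = new ZipArchive;
if ($zip->open($path) === true) {
    $index = $zip->locateName('word/document.xml');
    if ($index !== false) {
        $text = strip_tags($zip->getFromIndex($index));
    }
    $zip->close();
}

// Remove the temporary archive
unlink($path);

echo $text === false ? "File not found" : $text, "\n";
```

Changing the entry name passed to locateName() to content.xml gives the same lookup for an ODT-style archive.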
Source: www.botskool.com