Extract text from DOCX or ODT files using PHP

Required:
- PHP 5.2+
- The zip extension: php_zip.dll on Windows, or PHP built with the --enable-zip parameter on Linux.
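
Before going further, it can help to confirm that the zip extension is actually available. The snippet below is only a minimal sketch: zipSupportAvailable() is a hypothetical helper name that does not come from the original article, and it relies only on the standard extension_loaded() and class_exists() functions.

```php
<?php
// Hypothetical helper (not part of the original script): reports whether
// the zip extension is loaded and the ZipArchive class can be used.
function zipSupportAvailable() {
    return extension_loaded('zip') && class_exists('ZipArchive');
}

if (!zipSupportAvailable()) {
    echo "The zip extension is not loaded; enable it before running the script.\n";
}
```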

This technique can be used to build a web crawler that indexes document files by their content. Both DOCX and ODT files are ZIP archives: the text is stored in word/document.xml for a DOCX file and in content.xml for an ODT file. To extract the text, all we need to do is read the contents of word/document.xml (for a DOCX file) or content.xml (for an ODT file) and strip out the XML tags.
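
To see for yourself where the text lives, you can list the entries of the archive. The following is a rough sketch rather than part of the original script: listArchiveEntries() is a hypothetical helper name, and it uses only the ZipArchive members open(), numFiles, getNameIndex() and close().

```php
<?php
// Hypothetical helper (an assumption for illustration): returns the names
// of all entries in a ZIP-based document, or an empty array if the file
// cannot be opened as an archive.
function listArchiveEntries($filename) {
    $entries = array();
    $zip = new ZipArchive();

    // open() returns true on success and an integer error code on failure
    if ($zip->open($filename) !== true) {
        return $entries;
    }

    // Collect the name of every entry in the archive
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $entries[] = $zip->getNameIndex($i);
    }

    $zip->close();
    return $entries;
}
```

Calling print_r(listArchiveEntries($path)) on a DOCX file should show word/document.xml among the entries, and content.xml for an ODT file; $path here stands for whatever document you want to inspect.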
Create a new PHP file, name it extract.php, and add the following code:

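<?php

// Name of the document file (placeholder; point this at your own DOCX or ODT file)
$document = 'sample.docx';

function extracttext($filename) {
    // Choose the XML part that holds the text, based on the file extension
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    $dataFile = ($ext === 'docx') ? 'word/document.xml' : 'content.xml';

    // Create a new ZIP archive object and open the archive file
    $zip = new ZipArchive;
    if ($zip->open($filename) === true) {
        // If successful, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // Index found! Now read it to a string, then close the archive
            $text = $zip->getFromIndex($index);
            $zip->close();

            // Load XML from a string, ignoring errors and warnings.
            // loadXML() is called on an instance (static calls fail on PHP 8),
            // and entity substitution is left off because files may be untrusted.
            $xml = new DOMDocument();
            $xml->loadXML($text, LIBXML_NOERROR | LIBXML_NOWARNING);

            // Remove XML formatting tags and return the text
            return strip_tags($xml->saveXML());
        }
        // Close the archive file
        $zip->close();
    }
    // In case of failure return a message
    return "File not found";
}

echo extracttext($document);

?>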
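
One side effect of strip_tags() is that paragraphs run together, because the paragraph boundaries only exist as XML tags. If you need line breaks, one possible variation is to collect the text of each paragraph element instead. This is a sketch under stated assumptions, not the original author's method: paragraphsToText() is a hypothetical helper name, and it assumes paragraphs are the elements with the local name "p" (w:p in DOCX, text:p in ODT). ODT headings (text:h) and paragraphs nested inside other paragraphs are not handled.

```php
<?php
// Hypothetical helper (an assumption for illustration): takes the XML
// string read from word/document.xml or content.xml and returns the text
// with one paragraph per line.
function paragraphsToText($xmlString) {
    if ($xmlString === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Suppress parser errors and warnings; loadXML() returns false on failure
    if (!$doc->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    // Match paragraph elements by local name, whatever the namespace prefix
    $xpath = new DOMXPath($doc);
    $lines = array();
    foreach ($xpath->query('//*[local-name()="p"]') as $paragraph) {
        // textContent concatenates all text nodes inside the paragraph
        $lines[] = $paragraph->textContent;
    }

    return implode("\n", $lines);
}
```

Inside extracttext(), the string returned by $zip->getFromIndex($index) could be passed to this function in place of the strip_tags() call.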

Source: www.botskool.com
