Text export in different order...
Posted: Thu Oct 19, 2017 2:00 pm
I am using the following code to export the text in a PDF to a text file.
The issue I have is that the lines in the exported text file are sometimes not in the same order as they appear on the page, please see the screenshot below:
Is there a way to resolve this, or a way that I can maybe use the text line Y position to output a correctly ordered text file?
Thanks in advance
Simon
Code: Select all
' Requires: Imports System.Text, Imports System.Text.RegularExpressions, Imports PDFXCoreAPI
Dim myDoc As PDFXCoreAPI.IPXC_Document = g_Inst.OpenDocumentFromFile(Me.TextBox1.Text, Nothing)
Try
    Dim bHasDoc As Boolean = myDoc IsNot Nothing
    Dim docStringBuilder As New StringBuilder
    If bHasDoc Then
        For pageNum As UInteger = 0 To CUInt(myDoc.Pages.Count - 1)
            Dim curPage As IPXC_Page = myDoc.Pages(pageNum)
            Dim MyPageText As IPXC_PageText
            MyPageText = curPage.GetText(Nothing, False)
            Dim FirstChar As UInteger = 0
            Dim CharCount As UInteger = 0
            ' Lines are appended in the order the page text returns them
            For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
                FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
                CharCount = MyPageText.LineInfo(i).nCharsCount
                ' Collapse runs of two or more spaces into a single space
                Dim pdfWord As String = Regex.Replace(MyPageText.GetChars(FirstChar, CharCount), " {2,}", " ")
                docStringBuilder.AppendLine(pdfWord)
            Next
        Next
    End If
    ' Using closes the file even if the write fails
    Using file As New System.IO.StreamWriter("C:\temp\PDFExport.txt", False)
        file.WriteLine(docStringBuilder.ToString())
    End Using
    docStringBuilder.Clear()
Catch ex As Exception
    Console.WriteLine(ex)
End Try
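To illustrate the Y-position idea from my question, here is a rough, SDK-independent sketch of the ordering I have in mind. PositionedLine, OrderLines and the tolerance parameter are hypothetical names of my own, not part of the PDF-XChange API, and I am assuming that an X/Y coordinate can be obtained for each line (I have not confirmed which field of the line info holds it). It also assumes the usual PDF convention where a larger Y is higher up the page.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module LineOrderSketch

    ' Hypothetical container: one line of text plus the position of its bounding box.
    Class PositionedLine
        Public Property Text As String
        Public Property X As Double
        Public Property Y As Double
    End Class

    ' Orders lines top-to-bottom, then left-to-right.
    ' Y values are bucketed by "tolerance" (must be > 0) so that lines whose
    ' baselines differ only slightly are usually treated as the same row.
    Function OrderLines(lines As IEnumerable(Of PositionedLine), tolerance As Double) As List(Of String)
        Return lines.
            OrderByDescending(Function(l) Math.Round(l.Y / tolerance)).
            ThenBy(Function(l) l.X).
            Select(Function(l) l.Text).
            ToList()
    End Function

    Sub Main()
        ' Made-up sample data, deliberately out of reading order.
        Dim lines As New List(Of PositionedLine) From {
            New PositionedLine With {.Text = "Footer", .X = 50, .Y = 40},
            New PositionedLine With {.Text = "Title", .X = 50, .Y = 780},
            New PositionedLine With {.Text = "Right column", .X = 300, .Y = 700.4},
            New PositionedLine With {.Text = "Left column", .X = 50, .Y = 700}
        }
        For Each s As String In OrderLines(lines, 2.0)
            Console.WriteLine(s)
        Next
    End Sub

End Module
```

With the sample data above this prints Title, Left column, Right column, Footer. The bucketing is crude: two lines on the same row can still fall into different buckets if their Y values sit on either side of a rounding boundary, so the tolerance would need tuning per document.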