Text export in different order...

lidds
User
Posts: 510
Joined: Sat May 16, 2009 1:55 pm

Post by lidds »

I am using the following code to export the text in a PDF to text file.

Code:

' At the top of the file:
Imports System.Text
Imports System.Text.RegularExpressions
Imports PDFXCoreAPI

' Inside the method:
Try
    Dim myDoc As IPXC_Document = g_Inst.OpenDocumentFromFile(Me.TextBox1.Text, Nothing)
    Dim docStringBuilder As New StringBuilder

    If myDoc IsNot Nothing Then
        For pageNum As UInteger = 0 To CUInt(myDoc.Pages.Count - 1)
            Dim curPage As IPXC_Page = myDoc.Pages(pageNum)
            Dim MyPageText As IPXC_PageText = curPage.GetText(Nothing, False)

            ' Skip pages without text; otherwise CUInt(LinesCount - 1) would overflow.
            If MyPageText.LinesCount = 0 Then Continue For

            ' Lines are visited in content order, which may differ from visual order.
            For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
                Dim FirstChar As UInteger = MyPageText.LineInfo(i).nFirstCharIndex
                Dim CharCount As UInteger = MyPageText.LineInfo(i).nCharsCount
                ' Collapse runs of spaces into a single space.
                Dim pdfWord As String = Regex.Replace(MyPageText.GetChars(FirstChar, CharCount), " {2,}", " ")
                docStringBuilder.AppendLine(pdfWord)
            Next
        Next
    End If

    ' Using closes the writer even if the write fails.
    Using file As New System.IO.StreamWriter("C:\temp\PDFExport.txt", False)
        file.Write(docStringBuilder.ToString())
    End Using
Catch ex As Exception
    Console.WriteLine(ex)
End Try
The issue is that the text in the exported file is sometimes out of order. Please see the screenshot below:
TextInWrongOrder.png
Is there a way to resolve this, or a way that I can maybe use the text line Y position to output a correctly ordered text file?

Thanks in advance

Simon

Post by lidds »

I was just wondering if someone from the support team could get back to me on this.

Thanks

Simon
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Post by Sasha - Tracker Dev Team »

Hello Simon,

Sort the lines by their Y position and then do the conversion. The order in which text appears visually on the page can differ from the order in which it is placed in the page content.
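
As a rough illustration of the sorting step, here is a minimal sketch that does not call the SDK. The `Line` record and its sample values are hypothetical stand-ins for text lines whose bounding boxes are already expressed in page coordinates. It also assumes the usual PDF convention that Y grows upward, so a larger `Top` means higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one extracted text line: its text plus the
// top and left edges of its bounding box in page coordinates.
public record Line(string Text, double Top, double Left);

public static class ReadingOrder
{
    // Orders lines top-to-bottom, then left-to-right.
    // Assumes Y grows upward (PDF convention), so larger Top comes first.
    public static List<Line> Sort(IEnumerable<Line> lines) =>
        lines.OrderByDescending(l => l.Top)
             .ThenBy(l => l.Left)
             .ToList();

    public static void Main()
    {
        // Made-up sample values that mimic a content order
        // different from the visual order.
        var lines = new List<Line>
        {
            new("Footer", 40, 72),
            new("Title", 760, 72),
            new("Body", 700, 72),
        };
        foreach (var l in Sort(lines))
            Console.WriteLine(l.Text);
    }
}
```

With the sample values above, the lines come out as Title, Body, Footer. Lines that sit on the same visual row but differ slightly in `Top` would need a tolerance, which this sketch leaves out.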

Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ

Post by lidds »

Hi Alex,

Thanks for the answer.

I'm being a bit lazy as I'm away from my computer, but I want to work on this over the weekend. Do you have an example of how to get the Y position of each line of text?

You know I would normally attempt the code first.

Thanks

Simon

Post by Sasha - Tracker Dev Team »

Hello Simon,

Here's a hint:
https://sdkhelp.pdf-xchange.com/vie ... t_LineInfo

Cheers,
Alex

Post by lidds »

Alex,

I have been trying to solve this issue, but for some reason the line positions do not seem to match the positions on the page. Is this possible?

I am using the below code:

Code:

Dim nowTime As DateTime = DateTime.Now
Console.WriteLine("Start: " & nowTime.ToLongTimeString & ":" & nowTime.Millisecond.ToString)

Try
    Dim myDoc As PDFXCoreAPI.IPXC_Document = g_Inst.OpenDocumentFromFile(Me.TextBox1.Text, Nothing)
    Dim docStringBuilder As New StringBuilder

    If myDoc IsNot Nothing Then
        For pageNum As UInteger = 0 To CUInt(myDoc.Pages.Count - 1)
            Dim curPage As IPXC_Page = myDoc.Pages(pageNum)
            Dim MyPageText As IPXC_PageText = curPage.GetText(Nothing, False)

            ' Skip pages without text; otherwise CUInt(LinesCount - 1) would overflow.
            If MyPageText.LinesCount = 0 Then Continue For

            For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
                ' Read the line info once instead of once per field.
                Dim li As PXC_TextLineInfo = MyPageText.LineInfo(i)
                Dim pdfWord As String = Regex.Replace(MyPageText.GetChars(li.nFirstCharIndex, li.nCharsCount), " {2,}", " ")

                ' Append the raw (untransformed) bounding box for inspection.
                docStringBuilder.AppendLine(pdfWord & " Top: " & li.rcBBox.top.ToString & " Bottom: " & li.rcBBox.bottom.ToString & " Left: " & li.rcBBox.left.ToString & " Right: " & li.rcBBox.right.ToString)
            Next
        Next
    End If

    ' Using closes the writer even if the write fails.
    Using file As New System.IO.StreamWriter("C:\temp\PDFExport.txt", False)
        file.Write(docStringBuilder.ToString())
    End Using
Catch ex As Exception
    Console.WriteLine(ex)
End Try

nowTime = DateTime.Now
Console.WriteLine("End: " & nowTime.ToLongTimeString & ":" & nowTime.Millisecond.ToString)
I have attached the PDF I am using along with the text output file.

Thanks

Simon
Attachments
PDFExport.zip (479 Bytes)
TestScrap.pdf (11.87 KiB)

Post by Sasha - Tracker Dev Team »

Hello Simon,

That is expected in your case: those are the coordinates of the text in the line's own coordinate system. To convert them into visual (page) coordinates, use the line and page matrices:

Code:

IPXC_Page page = pdfCtl.Doc.CoreDoc.Pages[0];
IPXC_PageText text = page.GetText(null, false);
PXC_Matrix pageMatrix = page.Matrix;
for (uint i = 0; i < text.LinesCount; i++)
{
	PXC_TextLineInfo li = text.LineInfo[i];
	// Combine the line matrix with the page matrix.
	PXC_Matrix m = li.Matrix;
	m = auxInst.MathHelper.Matrix_Multiply(ref m, ref pageMatrix);
	// Copy the line-local bounding box so the original stays untouched.
	PXC_Rect rc;
	rc.left = li.rcBBox.left;
	rc.right = li.rcBBox.right;
	rc.top = li.rcBBox.top;
	rc.bottom = li.rcBBox.bottom;
	// After this call, rc holds the line's bounding box in page coordinates.
	auxInst.MathHelper.Rect_Transform(m, ref rc);
}
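
For intuition about what transforming a rectangle by a matrix involves, here is a minimal self-contained sketch. It does not call the SDK: `Matrix` and `Rect` are hypothetical stand-ins, not `PXC_Matrix` or `PXC_Rect`. It assumes the PDF convention for a matrix [a b c d e f], where x' = a*x + c*y + e and y' = b*x + d*y + f. Whether `Rect_Transform` computes exactly this is an assumption.

```csharp
using System;

// Hypothetical stand-ins; these are not the SDK's PXC_Matrix / PXC_Rect.
public record Matrix(double A, double B, double C, double D, double E, double F);
public record Rect(double Left, double Bottom, double Right, double Top);

public static class RectMath
{
    // Maps a point using x' = a*x + c*y + e, y' = b*x + d*y + f.
    static (double X, double Y) Apply(Matrix m, double x, double y) =>
        (m.A * x + m.C * y + m.E, m.B * x + m.D * y + m.F);

    // Transforms all four corners and returns their axis-aligned bounding box.
    public static Rect Transform(Matrix m, Rect r)
    {
        var pts = new[]
        {
            Apply(m, r.Left, r.Bottom), Apply(m, r.Right, r.Bottom),
            Apply(m, r.Left, r.Top), Apply(m, r.Right, r.Top),
        };
        double minX = double.MaxValue, minY = double.MaxValue;
        double maxX = double.MinValue, maxY = double.MinValue;
        foreach (var (x, y) in pts)
        {
            minX = Math.Min(minX, x); maxX = Math.Max(maxX, x);
            minY = Math.Min(minY, y); maxY = Math.Max(maxY, y);
        }
        return new Rect(minX, minY, maxX, maxY);
    }

    public static void Main()
    {
        // A line-local box moved by a pure translation of (72, 700).
        var m = new Matrix(1, 0, 0, 1, 72, 700);
        var r = Transform(m, new Rect(0, -2, 100, 10));
        Console.WriteLine($"{r.Left} {r.Bottom} {r.Right} {r.Top}");
    }
}
```

This is why untransformed values can look unrelated to the page: a line-local box near the origin only lands at its visible position once the matrices are applied. The transformed top and bottom values are the ones to compare when ordering lines by Y.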
Cheers,
Alex