Extract text from area...

PDF-XChange Editor SDK for Developers

lidds
User
Posts: 510
Joined: Sat May 16, 2009 1:55 pm

Post by lidds »

I want to draw a couple of rectangles on a document and then extract the text that lies within their boundaries. Is this possible?

Also, is it possible to name the rectangles, so that I know which rectangle each piece of extracted text came from (see the attached image, extractText.png)?
Thanks

Simon
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Extract text from area...

Post by Tracker Supp-Stefan »

Hello Simon,

With this option selected (see the attached image, Comments.png), whenever you place a rectangle annotation it should try to copy the text underneath it and add that text as a "comment" for the annotation.
You can then use "summarize comments" to export the comments, including the text that was copied into them.
One possible way to "name" an annotation is to change its subject. By default the subject will be e.g. 'Rectangle' for your annotations, but it can be changed manually.

Regards,
Stefan

Re: Extract text from area...

Post by lidds »

Stefan,

Thank you for the information. Is there a way to enable this option from code?

Thanks in advance

Simon

Re: Extract text from area...

Post by Tracker Supp-Stefan »

Hi Simon,

Apologies, I didn't notice that this was posted in the SDK section of the forums.
You should be able to obtain all of the page text using this:
https://sdkhelp.pdf-xchange.com/vie ... C_PageText
You can then use the GetTextQuads method to find which text elements fall within your desired areas.

Regards,
Stefan

Re: Extract text from area...

Post by lidds »

Stefan,

I have been looking at this for some time and am stuck: I cannot get GetTextQuads to work.

The function below is passed the area that the user selected on the PDF, along with the page number, and I want it to return the text contained within that rectangle. However, it throws a NullReferenceException on the GetTextQuads line. I am not sure why, because https://sdkhelp.pdf-xchange.com/vie ... tTextQuads states that pQuads and stBox are output parameters, so I assumed they should be null when passed in.

I am a little confused about how to achieve this. Any help would be appreciated.

Code:

    ' Intended to return the text inside the given area of the given page.
    Private Function getTextInArea(ByVal left As Double, ByVal top As Double, ByVal right As Double, ByVal bottom As Double, ByVal pageNumber As Integer) As String
        Dim curPage As IPXC_Page = myDoc.CoreDoc.Pages(pageNumber)
        Dim MyPageText As IPXC_PageText
        Dim pageMatrix As PXC_Matrix = curPage.Matrix
        Dim stBox As PXC_RectF
        Dim pQuads As IPXC_QuadsF = Nothing

        ' Area selected by the user (not yet used below).
        Dim rcArea As PXC_Rect
        rcArea.left = left
        rcArea.right = right
        rcArea.top = top
        rcArea.bottom = bottom

        Dim auxInst As PDFXEdit.IUIX_Inst = DirectCast(Me.docPreview.Inst.GetExtension("UIX"), PDFXEdit.IUIX_Inst)
        ' Read the text of the whole page.
        MyPageText = curPage.GetText(Nothing, False)

        Dim FirstChar As UInteger = 0
        Dim CharCount As UInteger = 0
        Dim pdfWord As String = Nothing

        ' Walk the page text line by line.
        For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
            FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
            CharCount = MyPageText.LineInfo(i).nCharsCount

            ' The NullReferenceException is thrown here (pQuads is Nothing at this point).
            MyPageText.GetTextQuads(FirstChar, CharCount, pQuads, stBox)
        Next
    End Function

Re: Extract text from area...

Post by Tracker Supp-Stefan »

Hello Simon,

My colleagues told me that I should ask you to take a look here:
https://www.pdf-xchange.com/forum3 ... 42#p117152
The issue discussed is the same as yours - and unfortunately it is VB6 related.

Regards,
Stefan

Re: Extract text from area...

Post by lidds »

Stefan,

Thank you for the link. It does not help me much, as that thread is about cropping pages. It did explain a little about how to create a quadF, but I am still unclear on how to get the text within an area.

I have extended my existing code using the quadF code from your link. The problem is that GetTextQuads expects an IPXC_QuadsF, and an IPXC_QuadsF does not accept "New", so I get the following error:

Value of type 'PDFXEdit.PXC_QuadF' cannot be converted to 'PDFXEdit.IPXC_QuadsF'

Code:

    Private Function getTextInArea(ByVal left As Single, ByVal top As Single, ByVal right As Single, ByVal bottom As Single, ByVal pageNumber As Integer) As String
        Dim curPage As IPXC_Page = myDoc.CoreDoc.Pages(pageNumber)
        Dim MyPageText As IPXC_PageText
        Dim pageMatrix As PXC_Matrix = curPage.Matrix
        Dim stBox As PXC_RectF

        Dim gInst As PDFXEdit.IPXC_Inst = DirectCast(Me.docPreview.Inst.GetExtension("PXC"), PDFXEdit.IPXC_Inst)
        Dim auxInst As PDFXEdit.IUIX_Inst = DirectCast(Me.docPreview.Inst.GetExtension("UIX"), PDFXEdit.IUIX_Inst)
        MyPageText = curPage.GetText(Nothing, False)

        Dim FirstChar As UInteger = 0
        Dim CharCount As UInteger = 0
        Dim pdfWord As String = Nothing

        ' Corner points of the selected area, in the order lb, rb, rt, lt.
        Dim pQuads As New PXC_QuadF()
        pQuads.pt = New PXC_PointF(3) {}
        'top>bottom
        'lb
        pQuads.pt(0).x = left
        pQuads.pt(0).y = bottom
        'rb
        pQuads.pt(1).x = right
        pQuads.pt(1).y = bottom
        'rt
        pQuads.pt(2).x = right
        pQuads.pt(2).y = top
        'lt
        pQuads.pt(3).x = left
        pQuads.pt(3).y = top

        For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
            FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
            CharCount = MyPageText.LineInfo(i).nCharsCount
            ' Does not compile: GetTextQuads expects an IPXC_QuadsF, not a PXC_QuadF.
            MyPageText.GetTextQuads(FirstChar, CharCount, pQuads, stBox)
        Next
    End Function
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Extract text from area...

Post by Sasha - Tracker Dev Team »

Hello lidds,

As Stefan said:
"The issue discussed is the same as yours - and unfortunately it is VB6 related."
This means that you can't use PXC_Quad in VB6.

As for your problem, there is another way to do it: get the page text and run through all of its symbols. Then get https://sdkhelp.pdf-xchange.com/vie ... t_CharRect for each of them and check whether it lies within your rectangle.
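
The per-symbol check could look roughly like the sketch below. It assumes each symbol's rectangle has already been read from the page text, so it makes no SDK calls at all. The Box and Glyph structures and the IsInside and ExtractByArea functions are hypothetical names invented for illustration, not SDK types, and coordinates are assumed to be in page space with top greater than bottom. Keying the areas by name is one possible way to tell which rectangle each piece of text came from.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module AreaTextSketch
    ' Hypothetical stand-in for an SDK rectangle, in page space (Top > Bottom).
    Structure Box
        Public Left As Double
        Public Bottom As Double
        Public Right As Double
        Public Top As Double

        Public Sub New(l As Double, b As Double, r As Double, t As Double)
            Me.Left = l
            Me.Bottom = b
            Me.Right = r
            Me.Top = t
        End Sub
    End Structure

    ' Hypothetical pairing of one symbol with its bounding rectangle.
    Structure Glyph
        Public Ch As Char
        Public Rect As Box

        Public Sub New(c As Char, r As Box)
            Me.Ch = c
            Me.Rect = r
        End Sub
    End Structure

    ' True when the centre of the symbol rectangle lies inside the area.
    ' Testing the centre tolerates symbols that slightly overlap the border.
    Function IsInside(ch As Box, area As Box) As Boolean
        Dim cx As Double = (ch.Left + ch.Right) / 2
        Dim cy As Double = (ch.Bottom + ch.Top) / 2
        Return cx >= area.Left AndAlso cx <= area.Right AndAlso
               cy >= area.Bottom AndAlso cy <= area.Top
    End Function

    ' For every named area, collects the symbols that fall inside it,
    ' in the order in which the symbols were supplied.
    Function ExtractByArea(glyphs As IEnumerable(Of Glyph),
                           areas As IDictionary(Of String, Box)) As Dictionary(Of String, String)
        Dim result As New Dictionary(Of String, String)
        For Each area As KeyValuePair(Of String, Box) In areas
            Dim sb As New StringBuilder()
            For Each g As Glyph In glyphs
                If IsInside(g.Rect, area.Value) Then sb.Append(g.Ch)
            Next
            result(area.Key) = sb.ToString()
        Next
        Return result
    End Function

    Sub Main()
        ' Made-up sample data: two symbols near the top, one near the bottom.
        Dim glyphs As New List(Of Glyph) From {
            New Glyph("H"c, New Box(10, 700, 18, 712)),
            New Glyph("i"c, New Box(18, 700, 22, 712)),
            New Glyph("X"c, New Box(300, 100, 308, 112))
        }
        Dim areas As New Dictionary(Of String, Box) From {
            {"Title", New Box(0, 690, 100, 720)},
            {"Footer", New Box(250, 90, 400, 120)}
        }
        For Each pair As KeyValuePair(Of String, String) In ExtractByArea(glyphs, areas)
            Console.WriteLine(pair.Key & ": " & pair.Value)
        Next
    End Sub
End Module
```

With the sample data above, the "Title" area collects "Hi" and the "Footer" area collects "X". In a real project the glyph list would be filled from the page text, one entry per symbol, before calling ExtractByArea.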

Cheers,
Alex