What I want to do is draw a couple of rectangles on a document and then extract the text that lies within the boundaries of those rectangles. Is this possible?
Also, is it possible to name the rectangles somehow, so that I know which text was extracted from which rectangle (see image)?
Thanks
Simon
Extract text from area...
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Extract text from area...
Hello Simon,
With this option selected: Whenever you place a rectangle annotation - it should try to copy the text that is under it, and add it as a "comment" for that annotation.
You can then "summarize comments" - and it will export the comments as well as the text that was copied in their comments section.
A possible way to "name" annotations would be to e.g. change their subject (by default it will be e.g. 'Rectangle' for your annotations, but can be manually changed).
Regards,
Stefan
Re: Extract text from area...
Stefan,
Thank you for the information. However, is there a way to enable this option from code?
Thanks in advance
Simon
- Tracker Supp-Stefan
Re: Extract text from area...
Hi Simon,
Apologies - I didn't see that this was in the SDK section of the forums.
Well - you should be able to obtain all the page text using this:
https://sdkhelp.pdf-xchange.com/vie ... C_PageText
And then find which text elements fall within your desired areas with the GetTextQuads method.
Regards,
Stefan
Re: Extract text from area...
Stefan,
I have been looking at this for some time and am stuck: I can't get GetTextQuads to work.
I am using the code below. It is passed the area that the user selected on the PDF, along with the page number, and I want to get the text contained within that rectangle. However, the code throws a NullReferenceException on the GetTextQuads line. I'm not sure why, because https://sdkhelp.pdf-xchange.com/vie ... tTextQuads states that pQuads and stBox are output parameters, so they should be allowed to start out as null.
I'm a little confused about how to achieve this. Any help would be appreciated.
Code:
Private Function getTextInArea(ByVal left As Double, ByVal top As Double, ByVal right As Double, ByVal bottom As Double, ByVal pageNumber As Integer) As String
    Dim curPage As IPXC_Page = myDoc.CoreDoc.Pages(pageNumber)
    Dim MyPageText As IPXC_PageText
    Dim pageMatrix As PXC_Matrix = curPage.Matrix
    Dim stBox As PXC_RectF
    Dim pQuads As IPXC_QuadsF = Nothing
    ' Area selected by the user
    Dim rcArea As PXC_Rect
    rcArea.left = left
    rcArea.right = right
    rcArea.top = top
    rcArea.bottom = bottom
    Dim auxInst As PDFXEdit.IUIX_Inst = DirectCast(Me.docPreview.Inst.GetExtension("UIX"), PDFXEdit.IUIX_Inst)
    MyPageText = curPage.GetText(Nothing, False)
    Dim FirstChar As UInteger = 0
    Dim CharCount As UInteger = 0
    Dim pdfWord As String = Nothing
    ' Walk the page text line by line
    For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
        FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
        CharCount = MyPageText.LineInfo(i).nCharsCount
        ' NullReferenceException is thrown on this line
        MyPageText.GetTextQuads(FirstChar, CharCount, pQuads, stBox)
    Next
End Function
- Tracker Supp-Stefan
Re: Extract text from area...
Hello Simon,
My colleagues told me that I should ask you to take a look here:
https://www.pdf-xchange.com/forum3 ... 42#p117152
The issue discussed is the same as yours - and unfortunately it is VB6 related.
Regards,
Stefan
Re: Extract text from area...
Stefan,
Thank you for the link. However, it does not help me much, as that thread is about cropping pages. It did explain a bit about how to create a quadF, but I am still unclear on how to get the text within an area.
I have added some more code to my existing code, trying to use the quadF code from your link. The problem is that GetTextQuads expects an IPXC_QuadsF, and an IPXC_QuadsF does not accept "New", so I get the following error:
Value of type 'PDFXEdit.PXC_QuadF' cannot be converted to 'PDFXEdit.IPXC_QuadsF'
Code:
Private Function getTextInArea(ByVal left As Single, ByVal top As Single, ByVal right As Single, ByVal bottom As Single, ByVal pageNumber As Integer) As String
    Dim curPage As IPXC_Page = myDoc.CoreDoc.Pages(pageNumber)
    Dim MyPageText As IPXC_PageText
    Dim pageMatrix As PXC_Matrix = curPage.Matrix
    Dim stBox As PXC_RectF
    Dim gInst As PDFXEdit.IPXC_Inst = DirectCast(Me.docPreview.Inst.GetExtension("PXC"), PDFXEdit.IPXC_Inst)
    Dim auxInst As PDFXEdit.IUIX_Inst = DirectCast(Me.docPreview.Inst.GetExtension("UIX"), PDFXEdit.IUIX_Inst)
    MyPageText = curPage.GetText(Nothing, False)
    Dim FirstChar As UInteger = 0
    Dim CharCount As UInteger = 0
    Dim pdfWord As String = Nothing
    Dim pQuads As New PXC_QuadF()
    pQuads.pt = New PXC_PointF(3) {}
    'top>bottom
    pQuads.pt(0).x = top
    pQuads.pt(0).y = bottom
    'lb
    pQuads.pt(1).x = left
    pQuads.pt(1).y = bottom
    'rb
    pQuads.pt(2).x = right
    pQuads.pt(2).y = bottom
    'rt
    pQuads.pt(3).x = right
    pQuads.pt(3).y = top
    'lt
    For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
        FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
        CharCount = MyPageText.LineInfo(i).nCharsCount
        MyPageText.GetTextQuads(FirstChar, CharCount, pQuads, stBox)
    Next
End Function
Re: Extract text from area...
Hello lidds,
As Stefan said:
"The issue discussed is the same as yours - and unfortunately it is VB6 related."
Meaning that you can't use the PXC_Quad in VB6.
As for your problem, there is another way of doing that: you can get the page text and run through all of the symbols. Then get https://sdkhelp.pdf-xchange.com/vie ... t_CharRect for each of them and check whether it lies inside your rectangle.
Cheers,
Alex
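For anyone reading later, the per-character check described above can be sketched without touching the SDK at all. The following is only an illustration of the containment logic, not the SDK's API: PageRect, PageChar and ExtractByArea are hypothetical names made up for this sketch. In real code the character boxes would come from the page text (for example via CharRect), and they would need to be in the same coordinate system as the drawn rectangles. The sketch also assumes that a character counts as "inside" when the centre of its box falls within the area, and it keys the areas by a name, which covers the "which rectangle did this text come from" part of the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module AreaTextSketch

    ' Hypothetical stand-in for a rectangle in page space (y grows upward,
    ' so Top >= Bottom). This is NOT an SDK type.
    Public Structure PageRect
        Public Left As Double
        Public Bottom As Double
        Public Right As Double
        Public Top As Double

        Public Sub New(l As Double, b As Double, r As Double, t As Double)
            Left = l
            Bottom = b
            Right = r
            Top = t
        End Sub

        ' True when the centre of 'other' lies inside this rectangle.
        Public Function ContainsCentreOf(other As PageRect) As Boolean
            Dim cx As Double = (other.Left + other.Right) / 2.0
            Dim cy As Double = (other.Bottom + other.Top) / 2.0
            Return cx >= Left AndAlso cx <= Right AndAlso cy >= Bottom AndAlso cy <= Top
        End Function
    End Structure

    ' Hypothetical stand-in for one character of page text and its bounding box.
    Public Structure PageChar
        Public Ch As Char
        Public Box As PageRect

        Public Sub New(c As Char, b As PageRect)
            Ch = c
            Box = b
        End Sub
    End Structure

    ' For each named area, collects the characters whose box centre falls inside it.
    ' Characters are appended in the order they are supplied.
    Public Function ExtractByArea(chars As IEnumerable(Of PageChar),
                                  areas As IDictionary(Of String, PageRect)) As Dictionary(Of String, String)
        Dim builders As New Dictionary(Of String, StringBuilder)
        For Each area As KeyValuePair(Of String, PageRect) In areas
            builders(area.Key) = New StringBuilder()
        Next

        For Each c As PageChar In chars
            For Each area As KeyValuePair(Of String, PageRect) In areas
                If area.Value.ContainsCentreOf(c.Box) Then
                    builders(area.Key).Append(c.Ch)
                End If
            Next
        Next

        Dim result As New Dictionary(Of String, String)
        For Each entry As KeyValuePair(Of String, StringBuilder) In builders
            result(entry.Key) = entry.Value.ToString()
        Next
        Return result
    End Function

    Sub Main()
        ' Made-up sample data: two characters near the top of the page, one near the bottom.
        Dim chars As New List(Of PageChar) From {
            New PageChar("A"c, New PageRect(10, 700, 20, 712)),
            New PageChar("B"c, New PageRect(20, 700, 30, 712)),
            New PageChar("C"c, New PageRect(10, 100, 20, 112))
        }
        Dim areas As New Dictionary(Of String, PageRect) From {
            {"Title", New PageRect(0, 690, 200, 720)},
            {"Footer", New PageRect(0, 90, 200, 120)}
        }

        Dim found As Dictionary(Of String, String) = ExtractByArea(chars, areas)
        For Each entry As KeyValuePair(Of String, String) In found
            Console.WriteLine(entry.Key & ": " & entry.Value)
        Next
    End Sub
End Module
```

With the sample data, found("Title") is "AB" and found("Footer") is "C". Testing the centre of each character box rather than requiring full containment is a design choice: it avoids dropping characters that slightly overlap the edge of a hand-drawn rectangle, at the cost of occasionally picking up a character that is only half inside.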