Get Text Value of Content Item

PDF-XChange Editor SDK for Developers

jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Get Text Value of Content Item

Post by jeffp »

I'm trying to get the text value from an IPXC_ContentItem of type CIT_Text.

I'm trying to do something like the following in Delphi but can't seem to get the conversion right. It just gives me the first character of the text item.

cItem.Text_GetTextLen(ALen);
cITem.Text_GetData(AByte, ALen);
AText := Chr(AByte);

Can you help me out with the right Delphi conversion to get the full text value of the item?

Thanks.

--Jeff
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am
Contact:

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

Hello Jeff,

We'll try to use that method in Delphi and hopefully will answer with the solution asap.

Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

Hello Jeff,

Converting this byte array to a string will not give you reliable text (although in most English-only documents it will hold the ANSI string bytes). For Unicode text, the result is unpredictable.

Code:

  // Assumed declarations: AByte: PByteArray; AnsiText: AnsiString; AText: string.
  doc.CoreDoc.Pages.Get_Item(nPage, APage);
  APage.GetContent(CAccessMode_Readonly, AContent);
  AContent.Get_Items(AContentItems);
  AContentItems.Get_Count(ACount);
  for i := 1 to ACount do
  begin
    AContentItems.Get_ItemType(i - 1, AType);
    if (AType = CIT_Text) then
    begin
      AContentItems.Get_Item(i - 1, ACItem);
      ACItem.Text_GetTextLen(ALen);
      GetMem(AByte, ALen);
      try
        // This only retrieves the raw bytes; use it if raw bytes are what you need.
        ACItem.Text_GetData(AByte[0], ALen);
        // Treating the bytes as ANSI is only meaningful for simple English-only text.
        SetString(AnsiText, PAnsiChar(AByte), ALen);
        AText := string(AnsiText);
      finally
        FreeMem(AByte);
      end;
      Result := Result + ' ' + AText;
    end;
  end;

For correct text extraction this should be used:
https://sdkhelp.pdf-xchange.com/vie ... ge_GetText

Cheers,
Alex
jeffp

Re: Get Text Value of Content Item

Post by jeffp »

Actually, I think I found a better way in Delphi.

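Code:

  var
    ABytes: TBytes;
    ALen: Cardinal;
    AText: String;
  begin
    cItem.Text_GetTextLen(ALen);
    SetLength(ABytes, ALen);
    // Guard against an empty item: ABytes[0] does not exist when ALen = 0.
    if ALen > 0 then
      cItem.Text_GetData(ABytes[0], ALen);
    // StringOf decodes the bytes with the system default (ANSI) encoding.
    AText := StringOf(ABytes);
  end;

One follow-up: how can I get the font name and font size of an IPXC_ContentItem of type CIT_Text?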

--Jeff
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

As I said previously, that is not a valid approach for most Unicode data. That code works only with ANSI-encoded text, which in practice means only English-language text, and even then the result is not guaranteed.
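
To illustrate the point about ANSI versus wider encodings, here is a minimal sketch in Python (used only as a neutral language for the illustration; it is not SDK code). It assumes the raw bytes use either one byte or two bytes per character. The helper `as_ansi` and the sample byte strings are hypothetical, and UTF-16 is only a stand-in for whatever multi-byte encoding a font may actually use.

```python
def as_ansi(raw: bytes) -> str:
    """Interpret raw bytes as Windows-1252 (ANSI), one byte per character."""
    return raw.decode("cp1252", errors="replace")


# One byte per character: the ANSI interpretation happens to be right.
single_byte = b"Hello"

# Two bytes per character (UTF-16 big-endian used here as a stand-in).
two_byte = "Hello".encode("utf-16-be")

print(repr(as_ansi(single_byte)))  # 'Hello'
print(repr(as_ansi(two_byte)))     # '\x00H\x00e\x00l\x00l\x00o'
```

The second result has ten characters instead of five, which is why a byte-to-ANSI conversion cannot be trusted once the text is not plain single-byte data.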
jeffp

Re: Get Text Value of Content Item

Post by jeffp »

Ok. I'll look at GetText.

But how can I get the Font name and Font size?

--Jeff
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

Hello Jeff,

Well, after you use that method you will have this:
https://sdkhelp.pdf-xchange.com/vie ... C_PageText
Then, for each character, you'll have the font's handle, which you can use to find the font you need. You can also get other character properties.

Cheers,
Alex
jeffp

Re: Get Text Value of Content Item

Post by jeffp »

Ok. I realize that I'm limited to AnsiChars if I go the IPXC_ContentItem.Text_GetData() route, which is fine since my documents will only contain AnsiChars. The reason I like this route better is that the IPXC_ContentItem.Get_BBox(ARect) coordinates are much better than those I get using IPXC_PageText.Get_CharRect(i, ARect).

However, I can't seem to figure out a way to get the font size of an IPXC_ContentItem. I can get the font size with IPXC_PageText.Get_CharStyle, but the IPXC_PageText approach gives me less accurate placement coordinates.

So is there a way to get the font size from an IPXC_ContentItem?
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

Hello Jeff,

As I said previously, the output of Text_GetData() is a byte array, and calling it an ANSI string is not entirely correct.
If you want to use this method, then you'll have to use the Current Transformation Matrix (CTM) and Text Matrix (TM) and calculate the font size yourself.
You will need to read these chapters of the PDF Specification:
8.3 Coordinate Systems
8.4 Graphics State
9.4 Text Objects

Using the GetTState method you will get the CurFont, from which you can get the font name (as you asked previously). You will also get the TM from the GetTState method.
The CTM can be obtained from the IPXC_ContentItem.
Then, using the FontSize from the IPXC_TState, the TM and CTM matrices, and the chapters of the PDF Specification mentioned above, you can calculate the real font size.

You'll need to implement all of this yourself if you don't want to use the IPXC_PageText.
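
The calculation described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not SDK code. It assumes matrices are given in the PDF six-number form [a b c d e f] with the row-vector convention from the PDF Specification, and it ignores text rise and horizontal scaling. The names `multiply` and `effective_font_size` and the sample values are hypothetical.

```python
import math

Matrix = tuple[float, float, float, float, float, float]  # PDF order: a b c d e f


def multiply(m1: Matrix, m2: Matrix) -> Matrix:
    """Return m1 x m2 using the PDF row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def effective_font_size(font_size: float, tm: Matrix, ctm: Matrix) -> float:
    """Scale the nominal font size by the vertical stretch of TM x CTM."""
    _, _, c, d, _, _ = multiply(tm, ctm)
    # The text-space unit vector (0, 1) maps to the direction (c, d).
    return font_size * math.hypot(c, d)


# Sample values: nominal size 1, a text matrix scaling by 12, a CTM scaling by 2.
tm = (12.0, 0.0, 0.0, 12.0, 72.0, 700.0)
ctm = (2.0, 0.0, 0.0, 2.0, 0.0, 0.0)
print(effective_font_size(1.0, tm, ctm))  # 24.0
```

Taking the length of (c, d) rather than reading d directly keeps the result correct for rotated text, where d alone can be zero.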

Cheers,
Alex
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

By the way, here's an example of how the bytes in the byte array can differ depending on the encoding. Also, the font size from the IPXC_TState differs from the real font size (it needs to undergo the matrix transformations). Other hidden problems can occur if you use this method rather than our IPXC_PageText, though it's up to you to decide.

Cheers,
Alex
Attachments
test.tmp.pdf
jeffp

Re: Get Text Value of Content Item

Post by jeffp »

Ok. I see your point and agree that it would be better to use the IPXC_PageText method.

However, using your test.tmp.pdf file as an example, here is what each method produced (word plus PXC_Rect coordinates).

Method, Word, Top, Left, Height, Width
IPXC_PageText, "Hello", 707, 40, 56, 115
IPXC_ContentItem, "Fake!", 712, 39, 47, 119

You'll note that there is a pretty big difference in the Top and Height coordinates above. The PXC_Rect for IPXC_ContentItem is a much tighter fit and works better if I then try to place the text back into the pdf using the code below.

Code:

      AText := AWord.Text;
      x := AWord.Left;
      y := I2P(AInchesH) - (AWord.Top + AWord.Height);

      CC := FDoc.CreateContentCreator;
      CC.SetTextRenderMode(TRM_Fill); //TRM_None; //TRM_Fill
      CC.SetFont(AFont);
      CC.SetFontSize(AFontSize);
      CC.SetStrokeColorRGB(RGB(0, 0, 0));
      CC.ShowTextLine(x, y, PChar(AText), -1, STLF_Baseline); //Use Baseline
      CC.Detach(AContent);
      APage.PlaceContent(AContent, PlaceContent_After);
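
The y calculation in the code above can be checked with a small Python sketch (used only for the arithmetic; it is not SDK code). It assumes that I2P converts inches to points at 72 points per inch, that Top is measured downward from the top edge of the page, and that the page is 11 inches high. All three are assumptions, and `box_bottom_y` is a hypothetical name.

```python
POINTS_PER_INCH = 72.0


def box_bottom_y(page_height_in: float, top: float, height: float) -> float:
    """Convert a top-origin box to the PDF bottom-origin y of its lower edge."""
    return page_height_in * POINTS_PER_INCH - (top + height)


# The two rectangles reported above, on an assumed 11-inch-high page.
print(box_bottom_y(11.0, 707, 56))  # 29.0 for the IPXC_PageText rectangle
print(box_bottom_y(11.0, 712, 47))  # 33.0 for the IPXC_ContentItem rectangle
```

Note that this yields the lower edge of the box, which in general is not the same as the text baseline, so text placed with STLF_Baseline at that y may sit slightly off.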

Thoughts?

--Jeff
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: Get Text Value of Content Item

Post by Vasyl-Tracker Dev Team »

Hi, Jeff.

If you want to print new text at the same vertical position as an existing text line (obtained from IPXC_PageText), then the best way is:

// firstCharIdx - the index of the first char of the required word

int lineIdx = pageText.CharLineIndex[firstCharIdx];
// This matrix contains all transformations for that line on the page (offset, rotation, scale, etc.); the line's own baseline is located at [0,0].
PXC_Matrix ln2pgMatrix = pageText.LineInfo[lineIdx].Matrix;
float xPos = pageText.CharExtra[firstCharIdx].xPos;
float yPos = 0.0f; // because the correct absolute transformation will be applied by the line's matrix

CC.SaveState();
// apply the line's matrix
CC.ConcatCS(ln2pgMatrix);
...
CC.SetFont(Font);
CC.SetFontSize(FontSize);
CC.SetStrokeColorRGB(RGB(0, 0, 0));
CC.ShowTextLine(xPos, yPos, "Sample Text", -1, STLF_Baseline); // use baseline
...
CC.RestoreState();
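
To see why yPos can stay at 0, it may help to look at what the line matrix does to a point. The Python sketch below is only an illustration of the arithmetic and is not SDK code. It assumes the matrix is in the PDF six-number form [a b c d e f], and `to_page` and the sample matrices are hypothetical.

```python
Matrix = tuple[float, float, float, float, float, float]  # PDF order: a b c d e f


def to_page(matrix: Matrix, x: float, y: float) -> tuple[float, float]:
    """Map a point from line space to page space."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# An unrotated line whose baseline origin sits at (40, 85) on the page.
plain_line = (1.0, 0.0, 0.0, 1.0, 40.0, 85.0)
print(to_page(plain_line, 12.5, 0.0))  # (52.5, 85.0)

# A line rotated by 90 degrees whose baseline origin sits at (100, 200).
rotated_line = (0.0, 1.0, -1.0, 0.0, 100.0, 200.0)
print(to_page(rotated_line, 10.0, 0.0))  # (100.0, 210.0)
```

In both cases a point with y = 0 lands on the line's baseline, so only the x offset along the line needs to be supplied.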

---

Unfortunately, there is currently no way to get plain Unicode text directly from a content item. We will add this feature in the next build.
Info: in the near future we will provide a special technical build of the SDK for public testing and usage. It will be updated daily, so you will be able to try new features and get fixes without waiting for an official release.

HTH.
Vasyl Yaremyn
Tracker Software Products
Project Developer

DolphinMann
User
Posts: 158
Joined: Mon Aug 04, 2014 7:34 pm

Re: Get Text Value of Content Item

Post by DolphinMann »

Sorry to necro an old thread, but I am attempting to accomplish the same thing. Was the Unicode extraction via the content item ever completed? Do you have any good links on extracting or creating text?
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

Hello DolphinMann,

I will see whether this was implemented and will reply here.

Cheers,
Alex
DolphinMann

Re: Get Text Value of Content Item

Post by DolphinMann »

Another question as well. I see that I can find the characters, lines, blocks, paragraphs, but is it possible to find words?

I did see a "TCF_WordBegin" under the character flags, but no word end. Would I just need to loop through and find the "TCF_WordBegin" markers and then continue looping until I reach a separator of some type? Is there an easier or better way to find "words"?
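
The separator-based loop described above could look roughly like this. It is a Python sketch over characters that have already been extracted, not SDK code. It does not rely on any character flags, it treats every non-alphanumeric character as a separator (a simplifying assumption), and `split_words` is a hypothetical name.

```python
def split_words(chars: str) -> list[tuple[int, int, str]]:
    """Return (first_index, last_index, text) for each run of letters or digits."""
    words = []
    start = None
    for i, ch in enumerate(chars):
        if ch.isalnum():
            if start is None:
                start = i  # a new word begins here
        elif start is not None:
            words.append((start, i - 1, chars[start:i]))
            start = None
    if start is not None:
        # The text ended in the middle of a word.
        words.append((start, len(chars) - 1, chars[start:]))
    return words


print(split_words("Hello, PDF world"))
# [(0, 4, 'Hello'), (7, 9, 'PDF'), (11, 15, 'world')]
```

Keeping the first and last character index of each word makes it possible to look up per-character data, such as rectangles, for the whole word afterwards.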
Sasha - Tracker Dev Team

Re: Get Text Value of Content Item

Post by Sasha - Tracker Dev Team »

Hello DolphinMann,

The TCF_WordBegin flag is not yet implemented. You can use JavaScript to get the number of words and the Nth word on a page:
http://help.adobe.com/en_US/acrobat/acr ... rhsyns=%20

Cheers,
Alex