Get Document text divided by words

A forum for questions or concerns related to the PDF-XChange Core API SDK

quietstorm
User
Posts: 25
Joined: Wed Feb 10, 2010 11:37 am

Get Document text divided by words

Post by quietstorm »

Hi all,
in our application we need the complete list of words in a document, together with the positional information (quads) of each word.

In the old Viewer ActiveX SDK (with Delphi source), we did something like this to get words by index:

	GetProperty('Documents[#'+IntToStr(aDocID)+'].Pages['+IntToStr(page)+'].Text.Words['+IntToStr(wordIdx)+'].String', vDataOut, 0);
	GetProperty('Documents[#'+IntToStr(aDocID)+'].Pages['+IntToStr(page)+'].Text.Words['+IntToStr(wordIdx)+'].Offset', vDataOut, 0);
	GetProperty('Documents[#'+IntToStr(aDocID)+'].Pages['+IntToStr(page)+'].Text.Words['+IntToStr(wordIdx)+'].Length', vDataOut, 0);
	GetProperty('Documents[#'+IntToStr(aDocID)+'].Pages['+IntToStr(page)+'].Text.Words['+IntToStr(wordIdx)+'].Quads.Value', vDataOut, 0);
This gave us the text of each word and its positional information on the page.

I can't find similar functions in the Editor SDK. I have read other answers that point to different solutions, for example inspecting IPXC_PageText and its related sub-structures to get the positional information of the words, and then retrieving the characters with the GetChars method.

That approach is not feasible for us, because we need to pre-analyze the whole PDF document in order to speed up real-time lookups, such as finding the word under the cursor, finding the entire sentence around a given position, and so on.
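
To illustrate the kind of pre-built lookup meant here, below is a minimal sketch of a word-under-cursor search over a list of words collected in advance. All names in it (PointF, QuadF, WordEntry, QuadContains, FindWordAt) are hypothetical and are not SDK types or functions. It assumes that each quad is convex and that its four corners are stored in order around the perimeter, which may not match the corner order the SDK returns.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only (not SDK types).
struct PointF { double x, y; };
struct QuadF { PointF p[4]; };  // assumed: corners in order around the perimeter

struct WordEntry {
    std::wstring text;  // the word itself
    QuadF quad;         // its position on the page
};

// Z component of the cross product (b - a) x (c - a).
static double Cross(const PointF& a, const PointF& b, const PointF& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// True if pt lies inside or on the border of a convex quad:
// the point must be on the same side of all four edges.
bool QuadContains(const QuadF& q, const PointF& pt) {
    bool hasPositive = false;
    bool hasNegative = false;
    for (int i = 0; i < 4; ++i) {
        const double d = Cross(q.p[i], q.p[(i + 1) % 4], pt);
        if (d > 0) hasPositive = true;
        if (d < 0) hasNegative = true;
    }
    return !(hasPositive && hasNegative);
}

// Returns the index of the first word whose quad contains pt, or -1 if none does.
int FindWordAt(const std::vector<WordEntry>& words, const PointF& pt) {
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (QuadContains(words[i].quad, pt))
            return static_cast<int>(i);
    }
    return -1;
}
```

This is a plain linear scan per page. If a page holds very many words, the same list could be bucketed into a grid, but that is beyond this sketch.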

Any suggestions? Thanks in advance.
Fabrizio
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am
Contact:

Re: Get Document text divided by words

Post by Sasha - Tracker Dev Team »

Hello Fabrizio,

Which SDK exactly are you planning to use: the Editor SDK or the Core API?

Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ
quietstorm

Re: Get Document text divided by words

Post by quietstorm »

We're planning to replace the old Viewer ActiveX, so I think it's the Editor SDK.
Sasha - Tracker Dev Team

Re: Get Document text divided by words

Post by Sasha - Tracker Dev Team »

This method should probably work for you:
https://sdkhelp.pdf-xchange.com/vie ... tTextQuads
However, you will need to do the word splitting yourself for now, as that functionality is not yet available.
Basically the code would look like this:

	IPXC_PageText* pIText = nullptr;
	pPage->GetText(nullptr, VARIANT_FALSE, &pIText);
	// GetWordsCount is a placeholder: this should be implemented
	const LONG nWordsCount = GetWordsCount(pIText);
	if (nWordsCount <= 0)
		return TRUE;
	// Flat list of (position, length) pairs, two entries per word
	std::vector<DWORD> ranges;
	ranges.reserve((size_t)nWordsCount * 2);

	for (DWORD i = 0; i < (DWORD)nWordsCount; i++)
	{
		LONG nWordPos = 0;
		LONG nWordLen = 0;
		// GetWordPos is a placeholder: this should be implemented
		if (IS_DS_SUCCESSFUL(GetWordPos(pIText, i, nWordPos, nWordLen)) && nWordLen)
		{
			ranges.push_back(nWordPos);
			ranges.push_back(nWordLen);
		}
	}
	std::vector<PXC_QuadF> quads;
	// This should be implemented (a wrapper around the method linked above);
	// rcBBox is not declared in this snippet and must be provided by the caller
	GetTextQuads(pIText, ranges.data(), ranges.size(), quads, rcBBox);
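
For the word splitting itself, a minimal sketch could look like the following. It is not part of the SDK: WordRange, SplitIntoWords and WordText are hypothetical names, and it assumes the page text is already available as a std::wstring. Words are separated on whitespace only, so punctuation stays attached to the neighbouring word. The resulting (position, length) pairs are the kind of values the GetWordsCount and GetWordPos placeholders above would have to supply.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper type: one word as an (offset, length) pair into the text.
struct WordRange {
    std::size_t pos;
    std::size_t len;
};

// Splits text into words, treating every whitespace character as a separator.
// Assumption: whitespace-only splitting; punctuation is not handled specially.
std::vector<WordRange> SplitIntoWords(const std::wstring& text) {
    std::vector<WordRange> words;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        // Skip the separators in front of the next word.
        while (i < n && std::iswspace(static_cast<wint_t>(text[i])))
            ++i;
        const std::size_t start = i;
        // Consume the word itself.
        while (i < n && !std::iswspace(static_cast<wint_t>(text[i])))
            ++i;
        if (i > start)
            words.push_back({start, i - start});
    }
    return words;
}

// Returns the characters covered by one word range.
std::wstring WordText(const std::wstring& text, const WordRange& w) {
    assert(w.pos + w.len <= text.size());
    return text.substr(w.pos, w.len);
}
```

Each WordRange can then be pushed into the flat ranges vector as a position followed by a length, in the same layout as in the snippet above.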
Cheers,
Alex
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: Get Document text divided by words

Post by Vasyl-Tracker Dev Team »

Hi Fabrizio.

Take a look at this simple example project that reads and highlights text in a PDF; it might be helpful in your case. The example uses the Editor SDK.
The source code is here: https://github.com/tracker-software/PDF ... RoboReader

HTH
Vasyl Yaremyn
Tracker Software Products
Project Developer
