Get full text from PDF
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Get full text from PDF
I have this code:
IPXC_Document MydocSource = MyPXC.OpenDocumentFromFile(lsSourceFile, clbk);
for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(null);
    Docinfo2 = Docinfo2 + " " + MyPageText.GetChars(0, MyPageText.CharsCount);
}
It gets the text from all of the pages. The only problem is that there are no carriage returns at the end of each line.
How can I solve this?
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
Re: Get full text from PDF
Hello Tom,
In your case, the correct way to use IPXC_PageText is to read each character separately:
https://sdkhelp.pdf-xchange.com/vie ... eText_Char
Then check each character for the TCF_LineBegin flag to detect where a new line starts:
https://sdkhelp.pdf-xchange.com/vie ... _CharFlags
Also note that the text can contain null symbols.
Cheers,
Alex
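The per-character approach described in this reply could look roughly like the sketch below. To stay self-contained it does not call the SDK: `CharFlags` and `PageChar` are hypothetical stand-ins for `PXC_TextCharFlags` and for the values returned by `get_CharFlags` and `GetChars`, and the numeric flag values are made up. It also assumes the line-begin flag is one bit in a bitmask, which is why it is tested with a bitwise AND rather than `==`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for PXC_TextCharFlags; the numeric values are invented.
[Flags]
public enum CharFlags : uint
{
    None = 0,
    LineBegin = 1, // plays the role of TCF_LineBegin
    Other = 2      // any further flag that may be set on the same character
}

// Hypothetical stand-in for one character of a page's text and its flags.
public readonly struct PageChar
{
    public char Value { get; }
    public CharFlags Flags { get; }

    public PageChar(char value, CharFlags flags)
    {
        Value = value;
        Flags = flags;
    }
}

public static class LineBreaker
{
    // Concatenates the characters, starting a new line before every
    // character that carries the LineBegin flag (except the very first one).
    public static string JoinWithLineBreaks(IReadOnlyList<PageChar> chars)
    {
        var sb = new StringBuilder();
        for (int j = 0; j < chars.Count; j++)
        {
            // Bitwise test: still true when other flags are set as well.
            bool lineBegin = (chars[j].Flags & CharFlags.LineBegin) != 0;
            if (lineBegin && j > 0)
            {
                sb.Append(Environment.NewLine);
            }
            sb.Append(chars[j].Value);
        }
        return sb.ToString();
    }
}
```

With the real SDK, the loop body would read the flags and the character from the page text object instead of from the stand-in list.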
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: Get full text from PDF
Any sample code?
How can it be so complex just to get the text of a PDF?
-
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: Get full text from PDF
Hi Tom.
"Any sample code?"
You may do it yourself faster. Just add one more loop to get each character and its flags. The first character of each line will have the TCF_LineBegin flag set.
Also please note that text in a PDF file can contain arbitrary character codes, such as null terminators, carriage returns and so on, anywhere in a line, so you need to filter them out too.
"How could it be so complex to just get a text of a PDF??"
This is because of the nature of PDF: it does not contain text lines as you might expect. You may check the specification yourself.
HTH.
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
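The filtering mentioned in this reply can be sketched as follows. This is only one possible approach, not the SDK's own method: `TextCleaner` and `StripControlChars` are hypothetical names, and the sketch treats every character for which `char.IsControl` returns true (NUL, carriage return, line feed, tab and so on) as unwanted. If tabs or other control characters matter in your documents, the condition would need adjusting.

```csharp
using System;
using System.Text;

public static class TextCleaner
{
    // Returns a copy of the input without NUL and other control characters.
    // Line breaks should be added afterwards, based on the line-begin flag,
    // because any CR/LF embedded in the raw text is removed here as well.
    public static string StripControlChars(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            if (!char.IsControl(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Applying it to each character or chunk before appending it to the result keeps stray control codes from breaking the reconstructed lines.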
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: Get full text from PDF
Two other questions concerning the GetText method:
1. Is there a way to control the order in which the characters are returned? It seems to loop from bottom right to top left.
(The IPXC_GetPageTextOptions parameter is not really documented.)
2. Could you help me with the coordinates from get_CharRect? What are the units: pixels, mm?
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
Re: Get full text from PDF
Hello Tom,
1. Please provide a piece of your code so that we can analyze it and assist further.
2. get_CharRect returns the character rectangle in page points.
Cheers,
Alex
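For reference, a point in PDF is by default 1/72 of an inch, so page points can be converted to millimetres directly, and to pixels once a resolution is chosen. The sketch below assumes that default unit (no custom user unit on the page); `PageUnits` and its members are hypothetical helper names, not part of the SDK.

```csharp
using System;

public static class PageUnits
{
    public const double PointsPerInch = 72.0;
    public const double MillimetresPerInch = 25.4;

    // Converts a length in page points to millimetres.
    public static double PointsToMillimetres(double points)
    {
        return points / PointsPerInch * MillimetresPerInch;
    }

    // Converts a length in page points to pixels at the given resolution (dots per inch).
    public static double PointsToPixels(double points, double dpi)
    {
        return points / PointsPerInch * dpi;
    }
}
```

For example, a width of 595 points comes out at a little under 210 mm, which matches an A4 page.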
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: Get full text from PDF
Code:
for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    // Separate consecutive pages with a line break.
    if (i > 0)
    {
        Docinfo2 += System.Environment.NewLine;
    }
    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(IPXC_GetPageTextOptions.);
    for (uint j = 0; j < MyPageText.CharsCount; j++)
    {
        // Test the flag with a bitwise AND: other flags may be set on the same character.
        if (((MyPageText.get_CharFlags(j) & (uint)PXC_TextCharFlags.TCF_LineBegin) != 0) && (j > 1))
        {
            Docinfo2 += System.Environment.NewLine;
        }
        Docinfo2 += MyPageText.GetChars(j, 1);
    }
}
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
Re: Get full text from PDF
Have you tried using the default behavior?
Code:
PDFXEdit.IPXC_PageText pText = page.GetText(null, false);
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: Get full text from PDF
That was the code I was using before, and that is the code that reads the PDF from bottom to top...
I modified the code to check what's inside the parameters.
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
Re: Get full text from PDF
Well, the GetChars method returns the characters in the order in which they were added to the page. It seems the document you are using is structured that way.
Try using these:
https://sdkhelp.pdf-xchange.com/vie ... locksCount
https://sdkhelp.pdf-xchange.com/vie ... _BlockInfo
Once you have the TextBlockInfo, you can get the ParaInfo from it:
https://sdkhelp.pdf-xchange.com/vie ... o_ParaInfo
The paragraph info tells you which lines belong to the paragraph. You can then use this method:
https://sdkhelp.pdf-xchange.com/vie ... t_LineInfo
With the line info, you can get the character indexes of the line and build your resulting string with the GetChars method.
Cheers,
Alex
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: Get full text from PDF
Sorry but the blocks are also in the wrong order...
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
Re: Get full text from PDF
Hello Tom,
We provide the bounding boxes of paragraphs, lines, and characters. My previous post describes how to get them. Judging by the files you are using, you will have to sort the text manually using those coordinates. That way you can get the result you need for any file you come across.
Cheers,
Alex
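One way to do the manual sorting suggested in this reply is sketched below. It is not the SDK's method and rests on two assumptions that should be checked against real data: that the bounding boxes use the usual PDF convention where y grows upward (so a larger `Top` means higher on the page), and that lines whose tops differ by less than a small tolerance belong to the same visual row. `LineBox` and `ReadingOrder` are hypothetical names; `LineBox` stands in for the values read from `get_LineInfo(...).rcBBox`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a line's index and the top/left of its bounding box.
public sealed class LineBox
{
    public uint LineId { get; }
    public double Top { get; }
    public double Left { get; }

    public LineBox(uint lineId, double top, double left)
    {
        LineId = lineId;
        Top = top;
        Left = left;
    }
}

public static class ReadingOrder
{
    // Orders lines top-to-bottom, then left-to-right within a row.
    // Assumes y grows upward; if the coordinates grow downward instead,
    // the first ordering and the row test would have to be reversed.
    public static List<LineBox> Sort(IEnumerable<LineBox> lines, double tolerance)
    {
        var byTop = lines.OrderByDescending(l => l.Top).ToList();
        var result = new List<LineBox>();
        var row = new List<LineBox>();

        foreach (var line in byTop)
        {
            // Start a new row once this line sits clearly below the current row.
            if (row.Count > 0 && row[0].Top - line.Top > tolerance)
            {
                result.AddRange(row.OrderBy(l => l.Left));
                row.Clear();
            }
            row.Add(line);
        }
        result.AddRange(row.OrderBy(l => l.Left));
        return result;
    }
}
```

Comparing tops with a tolerance instead of exact equality matters because floating-point coordinates of lines on the same visual row rarely match exactly. The sketch does not handle multi-column layouts, where a plain top-to-bottom pass would interleave the columns.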
-
- User
- Posts: 83
- Joined: Wed Mar 25, 2015 10:15 am
Re: Get full text from PDF
Sorry, I tried sorting based on the top and left positions of the boxes, lines, ...
But nothing gives a reasonable result.
(I added the PDF as attachment.)
Code:
MyLines = new LineOrder[NbLine];
NbLine = 0;
for (uint j = 0; j < MyPageText.BlocksCount; j++)
{
    for (uint k = 0; k < MyPageText.BlockInfo[j].ParaCount; k++)
    {
        for (uint l = 0; l < MyPageText.BlockInfo[j].ParaInfo[k].nLinesCount; l++)
        {
            MyLines[NbLine].BlockID = j;
            MyLines[NbLine].ParaID = k;
            MyLines[NbLine].LineID = l + MyPageText.BlockInfo[j].ParaInfo[k].nFirstLineIndex;
            MyLines[NbLine].Top = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.top;
            MyLines[NbLine].Left = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.left;
            NbLine++;
        }
    }
}
// array sort:
Array.Sort(MyLines, delegate(LineOrder x, LineOrder y)
{
    if (x.Top == y.Top)
    {
        return x.Left.CompareTo(y.Left);
    }
    return x.Top.CompareTo(y.Top);
});
for (uint j = 0; j < NbLine; j++)
{
    uint FirstChar = MyPageText.get_LineInfo(MyLines[j].LineID).nFirstCharIndex;
    uint CharCount = MyPageText.get_LineInfo(MyLines[j].LineID).nCharsCount;
    Docinfo2 += System.Environment.NewLine;
    Console.WriteLine("j=" + j + " LineID: " + MyLines[j].LineID + " Top: " + MyLines[j].Top + " Left: " + MyLines[j].Left);
    for (uint m = 0; m < CharCount; m++)
    {
        uint currentChar = m + FirstChar;
        if (MyPageText.get_CharFlags(currentChar) == (uint)PXC_TextCharFlags.TCF_LineBegin)
        {
            Docinfo2 += System.Environment.NewLine;
        }
        Docinfo2 += MyPageText.GetChars(currentChar, 1);
    }
}
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
Re: Get full text from PDF
I will experiment with this code and will reply with the results.
Cheers,
Alex