I'm trying to extract text from a content item using the Text_GetData methods. This works for some text but not for other text.
For instance, the attached PDF contains two lines of text. When I extract the text using PageText, it reads like this:
Code:
Footer around this locationThis is a second line.
But when I read the first line from its content item using Text_GetData, I get this:
Code:
"\0'\0P\0P\0U\0F\0S\0\0B\0S\0P\0V\0O\0E\0\0U\0I\0J\0T\0\0M\0P\0D\0B\0U\0J\0P\0O"
The (C#) code I'm currently working on is a simple analysis tool that uses CoreAPI to extract various information from a PDF document. Here is the code I'm using to (try to) get the text from the ContentItem interface:
Code:
using System;
using System.Linq;
using System.Text;

static public string GetContentText(IPXC_ContentItem pSrcContent)
{
    // Simpler code to get the text, but it has the same results as the original code.
    var oData = pSrcContent.Text_GetData2();
    var sSimple = oData.GetString();
    var oFlags = oData.GetStringFlags(); // not used yet

    // Original workaround for the missing Text_GetText method:
    // copy the raw bytes out and widen each byte to a char.
    Array oByteBuffer = null;
    pSrcContent.Text_GetDataSA(out oByteBuffer);
    var oBuilder = new StringBuilder();
    foreach (var b in oByteBuffer.OfType<byte>())
    {
        oBuilder.Append((char)b);
    }

    // Return only one of the two results; appending the loop output
    // to the GetString() result would duplicate the text.
    return oBuilder.ToString();
}
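For what it's worth, the returned bytes look like two-byte big-endian codes rather than Unicode. In this sample, each code is 0x1F lower than the ASCII character that PageText reports (0x27 + 0x1F = 0x46, 'F'). The sketch below only checks that observation. The `ShiftTwoByteCodes` helper and the fixed offset are hypothetical and specific to this file's font; they are an assumption, not a general decoding, and no substitute for whatever character mapping the font or the SDK provides.

```csharp
using System;
using System.Text;

static class GlyphCodeProbe
{
    // Reads the buffer as big-endian 2-byte codes and adds a fixed offset to each.
    // The offset is an observation about this one sample file, not a documented mapping.
    public static string ShiftTwoByteCodes(byte[] raw, int offset)
    {
        var sb = new StringBuilder(raw.Length / 2);
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = (raw[i] << 8) | raw[i + 1];
            sb.Append((char)(code + offset));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // First twelve bytes of the content item data: \0 ' \0 P \0 P \0 U \0 F \0 S
        byte[] sample = { 0x00, 0x27, 0x00, 0x50, 0x00, 0x50, 0x00, 0x55, 0x00, 0x46, 0x00, 0x53 };
        Console.WriteLine(ShiftTwoByteCodes(sample, 0x1F)); // prints "Footer"
    }
}
```

If the same offset does not hold for other fonts or files, that would suggest the data are font-specific character codes that need the font's own mapping to become readable text.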
In case it's at all helpful, here is the output of my tool. The first run lists all the content items in the file, and the second run just dumps the PageText from each page.
Code:
C:\Temp\PDF Editing Sandbox\Active>PDFAnalysis.exe -c
Extracting ContentItems from %d documents
##### Processing file [C:\Temp\PDF Editing Sandbox\Active\ImageTest.pdf]#####################################################################
-------------------- Page: [1] ------------------------------------------------------------------------
--- Content Item 0 -----
Type: [CIT_BeginContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 1 -----
Type: [CIT_XForm]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 2 -----
Type: [CIT_EndContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 3 -----
Type: [CIT_BeginContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 4 -----
Type: [CIT_XForm]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 5 -----
Type: [CIT_EndContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 6 -----
Type: [CIT_Text]
BBox: [45.327, 72.060, 205.524, 61.515]
Value: [ ' P P U F S ☺ B S P V O E ☺ U I J T ☺ M P D B U J P O]
--- Content Item 7 -----
Type: [CIT_Text]
BBox: [30.927, 71.916, 172.752, 47.115]
Value: [ 5 I J T ☺ J T ☺ B ☺ T F D P O E ☺ M J O F ☼]
--- Content Item 8 -----
Type: [CIT_Image]
BBox: [730.488, 428.972, 531.932, 762.888]
Height: [135]
Width: [429]
-------------------- Page: [2] ------------------------------------------------------------------------
--- Content Item 0 -----
Type: [CIT_BeginContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 1 -----
Type: [CIT_XForm]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 2 -----
Type: [CIT_EndContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 3 -----
Type: [CIT_BeginContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 4 -----
Type: [CIT_XForm]
BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 5 -----
Type: [CIT_EndContainer]
BBox: [0.000, 0.000, 0.000, 0.000]
C:\Temp\PDF Editing Sandbox\Active>PDFAnalysis.exe -t
Extracting Text from %d documents
##### Processing file [C:\Temp\PDF Editing Sandbox\Active\ImageTest.pdf]#####################################################################
--------------------[0]------------------------------------------------------------------------
Footer around this locationThis is a second line.
--------------------[1]------------------------------------------------------------------------
C:\Temp\PDF Editing Sandbox\Active>