ContentItem.Text_GetData[2|SA]()

Post by **JesseH** » Wed Nov 09, 2016 7:26 pm

Hi,

I'm trying to extract text from a content item using the Text_GetData methods. It seems to work with some text, but not so much with others.

For instance, in the attached PDF there are two lines of text. When extracting the text using PageText, it reads like this:

Code:

Footer around this locationThis is a second line.

But when either constructing the string byte-by-byte using GetDataSA() or just getting the string using Text_GetData2().GetString(), the first line reads:

Code:

"\0'\0P\0P\0U\0F\0S\0\0B\0S\0P\0V\0O\0E\0\0U\0I\0J\0T\0\0M\0P\0D\0B\0U\0J\0P\0O"

Is there a flag or encoder/decoder that needs to be set here?

The (C#) code I'm currently working on is a simple analysis tool that uses CoreAPI to extract various info from the PDF document. Here is the code I'm using to (try to) get the text from the ContentItem interface:

Code:

public static string GetContentText(IPXC_ContentItem pSrcContent)
{
    var sRet = "";
    Array oByteBuffer = null;

    // Simpler code to get the text, but has same results as original code.
    sRet = pSrcContent.Text_GetData2().GetString();
    var oFlags = pSrcContent.Text_GetData2().GetStringFlags();

    // Original workaround for the missing Text_GetText method.
    pSrcContent.Text_GetDataSA(out oByteBuffer);
    var oByteList = oByteBuffer.OfType<byte>().ToList();
    for (int i = 0; i < oByteList.Count; i++)
    {
        sRet = sRet + (char)oByteList[i];
    }

    return sRet;
}

In case it's at all helpful, here is the output of my tool. The first run lists all the content items in the file, and the second run just dumps the PageText from each page.

Code:

C:\Temp\PDF Editing Sandbox\Active>PDFAnalysis.exe -c
Extracting ContentItems from %d documents
##### Processing file [C:\Temp\PDF Editing Sandbox\Active\ImageTest.pdf]#####################################################################
-------------------- Page: [1] ------------------------------------------------------------------------

--- Content Item 0 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 1 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 2 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 3 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 4 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 5 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 6 -----
  Type: [CIT_Text]
  BBox: [45.327, 72.060, 205.524, 61.515]
  Value: [ ' P P U F S ☺ B S P V O E ☺ U I J T ☺ M P D B U J P O]
--- Content Item 7 -----
  Type: [CIT_Text]
  BBox: [30.927, 71.916, 172.752, 47.115]
  Value: [ 5 I J T ☺ J T ☺ B ☺ T F D P O E ☺ M J O F ☼]
--- Content Item 8 -----
  Type: [CIT_Image]
  BBox: [730.488, 428.972, 531.932, 762.888]
 Height: [135]
 Width:  [429]


-------------------- Page: [2] ------------------------------------------------------------------------

--- Content Item 0 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 1 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 2 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 3 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 4 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 5 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]



C:\Temp\PDF Editing Sandbox\Active>PDFAnalysis.exe -t
Extracting Text from %d documents
##### Processing file [C:\Temp\PDF Editing Sandbox\Active\ImageTest.pdf]#####################################################################
--------------------[0]------------------------------------------------------------------------

Footer around this locationThis is a second line.

--------------------[1]------------------------------------------------------------------------





C:\Temp\PDF Editing Sandbox\Active>

Thu Nov 10, 2016 6:43 am

Hi JesseH.
Actually, text in PDF files is stored as multibyte strings, and how to interpret that data depends on which font is used. You are getting the text through low-level functions and are therefore receiving raw data, which must then be translated to real Unicode values. For now you have two options:
1. Use the higher-level functions for working with text on a page (or in content): the IPXC_Page::GetText (IPXC_Content::GetText) methods.
2. Read section 9, "Text", of the PDF specification, especially subsection 9.10, "Extraction of Text Content", to learn how to interpret the raw data.
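
As a rough illustration of option 2, here is a minimal sketch of the decoding idea, not the SDK's own method. It assumes the font uses 2-byte big-endian character codes, which is what the \0-prefixed bytes in the dump above suggest, and it uses a hand-written lookup table in place of the font's real ToUnicode CMap. The names RawTextDecoder, SplitCodes and Decode are made up for this example. In this particular sample each code happens to equal the Unicode value minus 0x1F (0x27 for "F", 0x50 for "o"), but that is a property of this one font and should not be relied on in general.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class RawTextDecoder
{
    // Splits raw text-item bytes into 2-byte big-endian character codes.
    // Assumption: the font uses a fixed 2-byte encoding; other fonts may
    // use 1-byte or variable-length codes.
    public static List<int> SplitCodes(byte[] raw)
    {
        if (raw.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of bytes.", nameof(raw));

        var codes = new List<int>(raw.Length / 2);
        for (int i = 0; i < raw.Length; i += 2)
            codes.Add((raw[i] << 8) | raw[i + 1]);
        return codes;
    }

    // Maps each code to text through a code-to-Unicode table, the role a
    // font's ToUnicode CMap plays. Unmapped codes become U+FFFD.
    public static string Decode(byte[] raw, IReadOnlyDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (int code in SplitCodes(raw))
            sb.Append(toUnicode.TryGetValue(code, out var s) ? s : "\uFFFD");
        return sb.ToString();
    }

    static void Main()
    {
        // First six codes from the dump above: \0' \0P \0P \0U \0F \0S
        byte[] raw = { 0x00, 0x27, 0x00, 0x50, 0x00, 0x50, 0x00, 0x55, 0x00, 0x46, 0x00, 0x53 };

        // Hypothetical table; a real one is read from the font's ToUnicode CMap.
        var toUnicode = new Dictionary<int, string>
        {
            [0x27] = "F", [0x50] = "o", [0x55] = "t", [0x46] = "e", [0x53] = "r",
        };

        Console.WriteLine(Decode(raw, toUnicode)); // prints "Footer"
    }
}
```

Casting each byte to a char, as in the posted loop, keeps the zero high bytes and never applies the font's mapping, which is why the result looks like "\0'\0P\0P...".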
HTH.
