Reading a PDF file with pyPdf: no contents are captured. Please help

Discussion in 'Python' started by sujan.dasmahapatra, Jul 31, 2013.

  1. sujan.dasmahapatra

    I am trying to read a PDF file using pyPdf and write its text to a text file, but it's not working. The content value in the code below is just u'\n\n\n\n\n'. The PDF file has 5 pages, so there are five newline characters (one per page), with the unicode prefix 'u' at the beginning. What is going wrong? Why is the page content not coming through? Any help is highly appreciated. Thanks, Sujan
    Code:
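    #!/usr/bin/python
    # Python 2 script using the legacy pyPdf package.
    import pyPdf
    
    def getPDFContent(path):
        """Return the text of every page in the PDF at path, whitespace-normalized."""
        content = ""
        # Keep the file open for as long as pyPdf reads from it.
        with open(path, "rb") as p:
            pdf = pyPdf.PdfFileReader(p)
            for i in range(pdf.getNumPages()):
                # One newline is appended per page, even if the page yields no text.
                content += pdf.getPage(i).extractText() + "\n"
        # Replace non-breaking spaces and collapse runs of whitespace.
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
    
    def main():
        pdfl = getPDFContent("test.pdf").encode("ascii", "ignore")
        with open("test.txt", "w") as f:
            f.write(pdfl)
    
    if __name__ == "__main__":
        main()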
    
     
    Last edited: Jul 31, 2013
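
    One way to narrow this down is to check, page by page, whether extraction returns anything other than whitespace. The sketch below is only an illustration under assumptions that are not in the post above: it targets Python 3 and the maintained pypdf package (the successor to pyPdf) rather than the Python 2 setup shown, and the helper names normalize, empty_pages and extract_page_texts are hypothetical. If every page is reported as blank, the PDF may hold scanned images or fonts without a usable text mapping, in which case the script itself is probably not at fault.

```python
def normalize(text):
    """Replace non-breaking spaces and collapse whitespace, like the posted script."""
    return " ".join(text.replace("\xa0", " ").split())


def empty_pages(page_texts):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not normalize(text or "")
    ]


def extract_page_texts(path):
    """Extract text per page. Assumes the third-party pypdf package is installed."""
    from pypdf import PdfReader  # imported lazily so the helpers above need no extras

    reader = PdfReader(path)
    # Treat a missing result as an empty string so every page yields a str.
    return [page.extract_text() or "" for page in reader.pages]


# Example use (not run here, since it needs pypdf and a real file):
#   texts = extract_page_texts("test.pdf")
#   print("blank pages:", empty_pages(texts))
```

    With the result described in the post (five pages, five newlines), empty_pages would list all five pages, which points at the file's contents rather than at the write-to-text step.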
