{"id":6720,"date":"2024-02-15T13:03:43","date_gmt":"2024-02-15T20:03:43","guid":{"rendered":"https:\/\/jeremywhittaker.com\/?p=6720"},"modified":"2024-05-28T15:47:48","modified_gmt":"2024-05-28T22:47:48","slug":"extracting-text-from-a-pdf-using-ocr-and-python","status":"publish","type":"post","link":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/","title":{"rendered":"Extracting text from a PDF using OCR, Python, and Google Colab for free"},"content":{"rendered":"\n<p>If you&#8217;ve ever downloaded a scanned PDF and tried to search it you&#8217;ll quickly realize this isn&#8217;t possible. Here is how you can use Python to extract the text from and PDF file and make it searchable. <\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/JeremyWhittaker\/b74bb72482020f2f41e9d768f7beeeb2.js\"><\/script>\n","protected":false},"excerpt":{"rendered":"<p>If you&#8217;ve ever downloaded a scanned PDF and tried to search it you&#8217;ll quickly realize this isn&#8217;t possible. Here is how you can use Python to extract the&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-6720","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Extracting text from a PDF using OCR, Python, and Google Colab for free - Jeremy Whittaker<\/title>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Extracting text from a PDF using OCR, Python, and Google Colab for free - Jeremy Whittaker\" \/>\n<meta property=\"og:description\" content=\"If you&#8217;ve ever downloaded a scanned PDF and tried to search it you&#8217;ll quickly realize this isn&#8217;t possible. Here is how you can use Python to extract the...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/\" \/>\n<meta property=\"og:site_name\" content=\"Jeremy Whittaker\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/WhittakerJeremy\" \/>\n<meta property=\"article:author\" content=\"https:\/\/www.facebook.com\/WhittakerJeremy\" \/>\n<meta property=\"article:published_time\" content=\"2024-02-15T20:03:43+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-05-28T22:47:48+00:00\" \/>\n<meta name=\"author\" content=\"JeremyWhittaker\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JeremyWhittaker\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/\"},\"author\":{\"name\":\"JeremyWhittaker\",\"@id\":\"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c\"},\"headline\":\"Extracting text from a PDF using OCR, Python, and Google Colab for free\",\"datePublished\":\"2024-02-15T20:03:43+00:00\",\"dateModified\":\"2024-05-28T22:47:48+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/\"},\"wordCount\":53,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/\",\"url\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/\",\"name\":\"Extracting text from a PDF using OCR, Python, and Google Colab for free - Jeremy Whittaker\",\"isPartOf\":{\"@id\":\"https:\/\/new.jeremywhittaker.com\/#website\"},\"datePublished\":\"2024-02-15T20:03:43+00:00\",\"dateModified\":\"2024-05-28T22:47:48+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/new.jeremywhittaker.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Extracting text from a PDF using OCR, Python, and Google Colab for free\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/new.jeremywhittaker.com\/#website\",\"url\":\"https:\/\/new.jeremywhittaker.com\/\",\"name\":\"Jeremy Whittaker\",\"description\":\"Research, software, markets, housing, and energy\",\"publisher\":{\"@id\":\"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/new.jeremywhittaker.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c\",\"name\":\"JeremyWhittaker\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g\",\"caption\":\"JeremyWhittaker\"},\"logo\":{\"@id\":\"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g\"},\"sameAs\":[\"http:\/\/www.jeremywhittaker.com\",\"https:\/\/www.facebook.com\/WhittakerJeremy\",\"https:\/\/www.linkedin.com\/in\/jeremywhittaker\/\"],\"url\":\"https:\/\/new.jeremywhittaker.com\/index.php\/author\/jeremywhittaker\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Extracting text from a PDF using OCR, Python, and Google Colab for free - Jeremy Whittaker","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_US","og_type":"article","og_title":"Extracting text from a PDF using OCR, Python, and Google Colab for free - Jeremy Whittaker","og_description":"If you&#8217;ve ever downloaded a scanned PDF and tried to search it you&#8217;ll quickly realize this isn&#8217;t possible. Here is how you can use Python to extract the...","og_url":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/","og_site_name":"Jeremy Whittaker","article_publisher":"https:\/\/www.facebook.com\/WhittakerJeremy","article_author":"https:\/\/www.facebook.com\/WhittakerJeremy","article_published_time":"2024-02-15T20:03:43+00:00","article_modified_time":"2024-05-28T22:47:48+00:00","author":"JeremyWhittaker","twitter_card":"summary_large_image","twitter_misc":{"Written by":"JeremyWhittaker","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#article","isPartOf":{"@id":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/"},"author":{"name":"JeremyWhittaker","@id":"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c"},"headline":"Extracting text from a PDF using OCR, Python, and Google Colab for free","datePublished":"2024-02-15T20:03:43+00:00","dateModified":"2024-05-28T22:47:48+00:00","mainEntityOfPage":{"@id":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/"},"wordCount":53,"commentCount":0,"publisher":{"@id":"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/","url":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/","name":"Extracting text from a PDF using OCR, Python, and Google Colab for free - Jeremy Whittaker","isPartOf":{"@id":"https:\/\/new.jeremywhittaker.com\/#website"},"datePublished":"2024-02-15T20:03:43+00:00","dateModified":"2024-05-28T22:47:48+00:00","breadcrumb":{"@id":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/new.jeremywhittaker.com\/index.php\/2024\/02\/15\/extracting-text-from-a-pdf-using-ocr-and-python\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/new.jeremywhittaker.com\/"},{"@type":"ListItem","position":2,"name":"Extracting text from a PDF using OCR, Python, and Google Colab for free"}]},{"@type":"WebSite","@id":"https:\/\/new.jeremywhittaker.com\/#website","url":"https:\/\/new.jeremywhittaker.com\/","name":"Jeremy Whittaker","description":"Research, software, markets, housing, and energy","publisher":{"@id":"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/new.jeremywhittaker.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/new.jeremywhittaker.com\/#\/schema\/person\/ed0edfdefb3e180693efef453372980c","name":"JeremyWhittaker","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g","caption":"JeremyWhittaker"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/c8ac20e6dfa86b5f27ce9bffee4851099770cbea5ae7338a274865bfbc8c0218?s=96&d=retro&r=g"},"sameAs":["http:\/\/www.jeremywhittaker.com","https:\/\/www.facebook.com\/WhittakerJeremy","https:\/\/www.linkedin.com\/in\/jeremywhittaker\/"],"url":"https:\/\/new.jeremywhittaker.com\/index.php\/author\/jeremywhittaker\/"}]}},"_links":{"self":[{"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/posts\/6720","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/comments?post=6720"}],"version-history":[{"count":6,"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/posts\/6720\/revisions"}],"predecessor-version":[{"id":7748,"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/posts\/6720\/revisions\/7748"}],"wp:attachment":[{"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/media?parent=6720"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/categories?post=6720"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/new.jeremywhittaker.com\/index.php\/wp-json\/wp\/v2\/tags?post=6720"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}