Why does my script download a PDF that opens corrupted?

The most common cause is treating the attachment body as raw bytes when it is actually base64url-encoded. The Gmail API returns attachment data as a base64url string, so you need to decode with the URL-safe alphabet before writing the bytes. Microsoft Graph differs: its contentBytes field is standard base64, and the /$value endpoint returns raw bytes. In Python, use `base64.urlsafe_b64decode(data + '===')`. In Node, `Buffer.from(data, 'base64url')` works on modern runtimes. A naive `base64.b64decode` call is unsafe because the standard and URL-safe alphabets differ at two characters (- and _). By default Python discards characters outside the standard alphabet, so depending on how many are dropped the call either raises a padding error or silently returns corrupted bytes.

What is the difference between an inline PDF and a linked PDF in an email?

An inline PDF is physically embedded in the MIME body of the email as an attachment you can download via the attachment API. A linked PDF is referenced by a URL in the email body, usually pointing at a vendor portal behind authentication. The email contains no PDF bytes at all. Extractors that only walk attachments quietly miss linked PDFs, which is the default pattern for Amazon Business, Uber for Business, most ad platforms, and many enterprise vendors. Capturing linked PDFs requires either a portal integration or a browser-based tool that can click through with the user's session.
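An extractor cannot download a linked PDF on its own, but it can at least notice that one probably exists and route the message to a portal integration or a review queue. The sketch below uses only the standard library; the keyword list `INVOICE_HINTS` and the function name `candidate_invoice_links` are illustrative assumptions, not part of any vendor's format.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML email body."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Assumed keywords; tune them against your own vendors' emails.
INVOICE_HINTS = ("invoice", "receipt", "billing", "statement")


def candidate_invoice_links(html_body: str) -> list[str]:
    """Return hrefs whose URL or link text hints at an invoice download."""
    parser = LinkCollector()
    parser.feed(html_body)
    parser.close()
    return [
        href
        for href, text in parser.links
        if any(hint in f"{href} {text}".lower() for hint in INVOICE_HINTS)
    ]
```

A message with no PDF attachment but at least one candidate link is a reasonable signal for "linked PDF, needs portal retrieval" rather than "no invoice here".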

How do I handle forwarded invoices that arrive as email attachments?

Outlook and some corporate mail clients forward messages as message/rfc822 attachments rather than inline-quoted text. The outer email has no PDF, but a nested email inside does. Naive extractors that only check top-level attachments miss these entirely. The fix is recursive: for every attachment, check the MIME type, and if it is message/rfc822, treat it as a new email and walk its attachments. In Python, email.message_from_bytes parses the nested message for you, and Message.walk() descends into it. Chains can go two or three levels deep if a client forwarded a forward, so the traversal must follow nesting as deep as it goes in practice, with a depth cap as a guard against pathological messages.
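A minimal sketch of that recursion with an explicit depth cap, using the standard library email package. The function name `iter_pdfs` and the default `max_depth=5` are assumptions chosen for illustration; the test message is built in memory so the example runs without a mailbox.

```python
import email
from email.message import EmailMessage, Message
from typing import Iterator


def iter_pdfs(
    msg: Message, depth: int = 0, max_depth: int = 5
) -> Iterator[tuple[int, str, bytes]]:
    """Yield (nesting_depth, filename, pdf_bytes) for every PDF part."""
    if msg.is_multipart():
        # A message/rfc822 container marks one more level of forwarding.
        is_forward = msg.get_content_type() == "message/rfc822"
        child_depth = depth + 1 if is_forward else depth
        if child_depth > max_depth:
            return  # depth cap: stop descending into suspiciously deep chains
        for sub in msg.get_payload():
            yield from iter_pdfs(sub, child_depth, max_depth)
        return
    filename = msg.get_filename() or ""
    if msg.get_content_type() == "application/pdf" or filename.lower().endswith(".pdf"):
        data = msg.get_payload(decode=True)
        if data:
            yield depth, filename, data


# Build a forward-as-attachment message to exercise the function.
inner = EmailMessage()
inner["Subject"] = "Invoice #1047"
inner.set_content("Invoice attached.")
inner.add_attachment(
    b"%PDF-1.4 demo", maintype="application", subtype="pdf", filename="invoice.pdf"
)

outer = EmailMessage()
outer["Subject"] = "Fwd: Invoice #1047"
outer.set_content("Forwarding for bookkeeping.")
outer.add_attachment(inner)  # attached as a message/rfc822 part

parsed = email.message_from_bytes(outer.as_bytes())
found = list(iter_pdfs(parsed))  # [(1, "invoice.pdf", b"%PDF-1.4 demo")]
```

Reporting the nesting depth alongside each file makes it easy to log how often forwards occur and to spot chains that hit the cap.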

How do I avoid downloading the same invoice twice?

Use idempotency keys. Every Gmail message has a stable id field. Every IMAP message has a UID that is stable within a mailbox as long as UIDVALIDITY does not change. Every Microsoft Graph message has a stable id too. Store that identifier alongside the extracted record in your database with a unique constraint, and re-running the pipeline becomes a no-op for messages already seen. Gmail's History API returns only changes since a given historyId, so after the initial sweep you never re-scan the whole inbox.
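One way to sketch the unique-constraint idea is with SQLite from the standard library. The table name `seen_messages`, its columns, and the `mark_seen` helper are hypothetical; any database with unique constraints supports the same pattern.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS seen_messages (
               provider   TEXT NOT NULL,  -- 'gmail', 'graph', or 'imap'
               mailbox    TEXT NOT NULL,  -- account; for IMAP also folder and UIDVALIDITY
               message_id TEXT NOT NULL,  -- Gmail id, Graph id, or IMAP UID
               UNIQUE (provider, mailbox, message_id)
           )"""
    )
    return db


def mark_seen(db: sqlite3.Connection, provider: str, mailbox: str, message_id: str) -> bool:
    """Record a message; return True if it is new, False if already processed."""
    cur = db.execute(
        "INSERT OR IGNORE INTO seen_messages VALUES (?, ?, ?)",
        (provider, mailbox, message_id),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows means the unique constraint rejected it


db = open_store()
first = mark_seen(db, "gmail", "ap@example.com", "18c2f0a1b2c3d4e5")   # True
second = mark_seen(db, "gmail", "ap@example.com", "18c2f0a1b2c3d4e5")  # False
```

Folding UIDVALIDITY into the mailbox column for IMAP means a UIDVALIDITY change naturally produces new keys instead of colliding with stale UIDs.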

How do I handle password-protected PDF invoices?

Encrypted PDFs fall into two buckets. User-password encryption requires a human-provided password to open the file at all, which is common for bank statements and some payroll providers. Owner-password encryption only restricts printing or copying but lets the file open freely, and most PDF libraries read these without special handling. For user-password files, you either store the password alongside the connection credentials or prompt for it at intake. Tools like qpdf and pikepdf can remove encryption programmatically once the password is known. Never store PDF passwords in plain configuration files or repository code.

Does the Gmail API have rate limits that matter for invoice extraction?

Gmail API uses a per-user quota measured in quota units rather than raw requests. messages.list costs 5 units per call and messages.attachments.get costs 5 units per call, against a default per-user quota of 250 units per second. For a single inbox this is effectively unlimited. For a multi-tenant tool syncing thousands of inboxes, you need exponential backoff on HTTP 429 responses (and on 403 responses whose error reason is a rate limit), and you should batch messages.get requests and use format=metadata when you only need headers. The history.list endpoint is cheaper than repeated list calls once you have a baseline historyId.
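Exponential backoff can be sketched independently of any HTTP client. In the example below, `RateLimited` is a hypothetical exception that your own request layer would raise when it sees a rate-limit response; the attempt count and base delay are assumptions to tune.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical: raised by your HTTP layer on a rate-limit response."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimited, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: a random delay between 0 and base_delay * 2^attempt
            sleep(random.uniform(0, base_delay * 2**attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; the jitter spreads retries out so many tenants do not retry in lockstep.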

When does building a custom extractor make sense versus using a SaaS tool?

Building makes sense when you have unusual filtering requirements, a compliance reason to keep the pipeline on your own infrastructure, or a team that genuinely enjoys owning the codebase. Using a SaaS tool makes sense when your volume is above 50 invoices per month, when you have multiple inboxes to coordinate, when your bookkeeper needs structured data rather than PDFs, or when the engineer who would build it has higher-value work to do. The crossover point is lower than most teams expect because the edge cases and maintenance costs compound quickly after the initial happy path is working.

Extract PDF Invoices from Email: Developer Guide (2026)

Every finance team eventually wants the same thing: a script that logs into an inbox, finds the PDF invoices, pulls them out, and drops them somewhere useful. In the abstract this is a Saturday afternoon project. In practice it becomes a small internal product with its own bugs, maintenance schedule, and failure modes that only surface in production.

This guide covers the actual mechanics: three PDF-in-email patterns developers encounter, the API code paths for Gmail, Microsoft Graph, and raw IMAP, how to deal with encrypted attachments and forwarded-as-attachment chains, the edge cases that eat a weekend, storage and indexing after retrieval, and an honest accounting of when to build versus when to buy a service.

Three PDF-in-email patterns and why they matter

The first thing that surprises engineers new to invoice extraction is that "the PDF is in the email" is not one shape. There are at least three distinct patterns, each requiring different code.

Pattern 1: Direct inline attachment. An email arrives with a PDF physically embedded in its MIME body. The subject is something like "Invoice #1047 from Acme Corp." The attachment shows up in any mail client, and the Gmail API or IMAP FETCH returns the bytes directly. Stripe, PayPal, AWS, Google Workspace, GitHub, Vercel, and most B2B SaaS billing platforms use this pattern. These are the easiest to capture: walk the MIME parts, find entries with a .pdf filename and a non-empty attachment ID, and download.

Pattern 2: Linked PDF behind authentication. The email is a formatted HTML receipt. The PDF invoice lives on the vendor's portal, reachable through an authenticated link embedded in the body. The email itself contains no PDF bytes. Amazon Business, Uber for Business, Meta Ads, Google Ads, and most advertising platforms use this pattern. Naive extractors that only check MIME attachments silently miss every invoice from these vendors. Following the link requires either a vendor-specific portal integration or a browser-based tool that can authenticate and click through. For two of the most common cases, see the Amazon Business portal page and the Stripe portal page, which document the exact billing URLs and what the downloaded PDF looks like.

Pattern 3: Forwarded-as-attachment chains. A bookkeeper receives forwarded emails from clients. Depending on the client's mail client, the forward may arrive as inline-quoted text (the original message appears in the body) or as a message/rfc822 attachment (the original message is a nested email with its own attachments). The outer email has no PDF directly. The PDF is one or two levels inside the forward chain. A script that only walks top-level MIME parts misses every invoice inside these chains, which is a consistent failure mode for accounting firms handling client inboxes.

Build your extractor to handle all three, or budget for the silent gaps.

API code paths: Gmail, Microsoft Graph, and imaplib

Gmail API

The canonical Gmail extraction flow has four steps.

Authenticate with the right scope. The scope you want is https://www.googleapis.com/auth/gmail.readonly. This grants read access to messages and attachments without the ability to send, delete, or modify. Any tool requesting gmail.modify or broader access for extraction work is over-provisioning. Google classifies gmail.readonly as a restricted scope, so a production app that requests it generally has to pass OAuth verification, including a CASA third-party security assessment, before it can serve users outside testing mode.

List candidate messages. Call users.messages.list with a query like has:attachment filename:pdf subject:(invoice OR receipt OR billing) newer_than:90d. This uses the same operator syntax as the Gmail web UI. Walk nextPageToken until exhausted; do not assume the first page is complete. Gmail search is eventually consistent: a message arriving at 09:00 may not match a query until 09:01, so push-triggered extraction needs retries with backoff.
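The pagination loop might look like the sketch below. It assumes `service` is a Gmail client built with google-api-python-client (for example via `build("gmail", "v1", credentials=...)`); the helper name `iter_message_ids` is illustrative.

```python
from typing import Any, Iterator


def iter_message_ids(service: Any, query: str, user_id: str = "me") -> Iterator[str]:
    """Yield every message ID matching `query`, following nextPageToken."""
    page_token = None
    while True:
        response = (
            service.users()
            .messages()
            .list(userId=user_id, q=query, pageToken=page_token)
            .execute()
        )
        # An empty result set omits the "messages" key entirely.
        for ref in response.get("messages", []):
            yield ref["id"]
        page_token = response.get("nextPageToken")
        if not page_token:
            break
```

Because the function is a generator, callers can start fetching attachments from the first page while later pages are still unrequested.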

Fetch attachment bytes. For each candidate message, call users.messages.get with format=full to retrieve the complete MIME tree. Walk payload.parts recursively (multipart containers nest their children in their own parts arrays) and collect entries with a filename ending in .pdf and a non-empty body.attachmentId. Then call users.messages.attachments.get with the message ID and attachment ID.
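Walking the MIME tree is a short recursive function over the payload dictionary. The sample payload below is hand-written to mirror the shape of a format=full response; the attachment ID is a made-up placeholder.

```python
from typing import Any, Iterator


def iter_pdf_parts(part: dict[str, Any]) -> Iterator[tuple[str, str]]:
    """Yield (filename, attachmentId) for PDF parts anywhere in the tree."""
    filename = part.get("filename") or ""
    attachment_id = part.get("body", {}).get("attachmentId")
    if filename.lower().endswith(".pdf") and attachment_id:
        yield filename, attachment_id
    # Multipart containers carry their children in a nested "parts" list.
    for child in part.get("parts", []):
        yield from iter_pdf_parts(child)


payload = {
    "mimeType": "multipart/mixed",
    "filename": "",
    "body": {"size": 0},
    "parts": [
        {
            "mimeType": "multipart/alternative",
            "filename": "",
            "body": {"size": 0},
            "parts": [
                {"mimeType": "text/plain", "filename": "", "body": {"size": 5, "data": "SGVsbG8"}},
            ],
        },
        {
            "mimeType": "application/pdf",
            "filename": "Invoice-1047.PDF",
            "body": {"attachmentId": "demo-attachment-id", "size": 48213},
        },
    ],
}

pdf_parts = list(iter_pdf_parts(payload))
# [("Invoice-1047.PDF", "demo-attachment-id")]
```

Call it with `message["payload"]` as the root. Parts that carry their content inline in body.data rather than through an attachmentId are skipped by this sketch, so check for that case if you see small PDFs going missing.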

The byte-decoding detail that burns most first-time integrators: the data field is base64url-encoded, not standard base64. See the Gmail API attachment reference for the payload shape. In Python:

import base64

def decode_attachment(data: str) -> bytes:
    # Gmail strips padding; add it back before decoding
    padded = data + "==="
    return base64.urlsafe_b64decode(padded)

In Node.js on modern runtimes:

const bytes = Buffer.from(data, 'base64url');

A naive base64.b64decode call is unreliable here. By default it discards characters outside the standard alphabet, so wherever the URL-safe characters (- and _) appear it either raises a padding error or, worse, returns corrupted bytes that look normal until a PDF parser tries to read them.
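The failure is easy to reproduce without a mailbox. The sketch below encodes six bytes chosen so that the base64url form contains both URL-safe characters, then decodes them both ways. This variant of `decode_attachment` computes the exact padding instead of appending a fixed `===`; both approaches decode the same input.

```python
import base64


def decode_attachment(data: str) -> bytes:
    # Restore any missing padding, then decode with the URL-safe alphabet.
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


payload = b"PDF" + b"\xfb\xff\xfe"  # the last three bytes encode to "-__-"
encoded = base64.urlsafe_b64encode(payload).decode("ascii").rstrip("=")
# encoded == "UERG-__-"

naive = base64.b64decode(encoded)     # b"PDF": the "-" and "_" were dropped
correct = decode_attachment(encoded)  # b"PDF\xfb\xff\xfe"
```

The naive call returns without an error here, which is the dangerous case: three bytes vanish and nothing signals a problem until the file is opened.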

Sync incrementally. Store the historyId from the last successful run, then call users.history.list with that ID to fetch only messages added since. This avoids re-scanning the whole inbox on every execution. For latency-sensitive pipelines, use users.watch to receive Pub/Sub push notifications on every mailbox change; this is how production-quality tools achieve near-real-time extraction.

Microsoft Graph for Outlook and Microsoft 365

The Microsoft Graph mail API follows the same structure with different naming. The scope you need is Mail.Read. Mail.ReadBasic exists for lighter permissions but omits message bodies and is not sufficient for extraction.

Fetch attachment bytes via:

GET /me/messages/{message-id}/attachments/{attachment-id}/$value

The $value suffix returns raw bytes directly, no base64 decoding needed. Without it you get JSON metadata. Attachments of type fileAttachment hold inline files. Attachments of type itemAttachment hold nested email items (the Outlook equivalent of message/rfc822 forwards); recurse into their attachments the same way you handle the Gmail nested-email case.
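A minimal sketch of the raw-bytes fetch using only the standard library. The helper names are illustrative, and the code assumes you already hold a valid access token with Mail.Read; percent-encoding the IDs is a precaution, since Graph IDs can contain characters such as = and /.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def attachment_value_url(message_id: str, attachment_id: str) -> str:
    """Build the /$value URL, percent-encoding both IDs for use in a path."""
    return (
        f"{GRAPH_BASE}/me/messages/{quote(message_id, safe='')}"
        f"/attachments/{quote(attachment_id, safe='')}/$value"
    )


def fetch_attachment_bytes(message_id: str, attachment_id: str, access_token: str) -> bytes:
    """Download the attachment as raw bytes (no base64 decoding needed)."""
    request = Request(
        attachment_value_url(message_id, attachment_id),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urlopen(request, timeout=30) as response:
        return response.read()
```

In production you would wrap `fetch_attachment_bytes` in retry logic and check the attachment's type first, since the sketch assumes a fileAttachment.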

For incremental sync, use delta query:

GET /me/mailFolders/inbox/messages/delta

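Store the returned delta token and pass it on subsequent calls to receive only changes since the last sync. Neither provider's sync token lasts indefinitely. Gmail documents a historyId as typically valid for at least a week (occasionally only a few hours), and Microsoft does not publish a fixed lifetime for Outlook delta tokens, which are evicted as the service's token cache fills. Treat expiry as a normal condition and fall back to a full resync when a stored token is rejected. Keep this in mind for inboxes that go quiet for weeks.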
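A delta round consists of following @odata.nextLink pages until the response carries an @odata.deltaLink, which you store for the next run. The sketch below takes the HTTP call as a parameter (`get_json`, a hypothetical callable that performs an authenticated GET and returns parsed JSON), so the paging logic can be read and tested on its own.

```python
from typing import Any, Callable

DELTA_START = "https://graph.microsoft.com/v1.0/me/mailFolders/inbox/messages/delta"


def run_delta_round(
    get_json: Callable[[str], dict[str, Any]], start_url: str = DELTA_START
) -> tuple[list[dict[str, Any]], str]:
    """Return (changed messages, delta link to store for the next round)."""
    changes: list[dict[str, Any]] = []
    url = start_url
    while True:
        page = get_json(url)
        changes.extend(page.get("value", []))
        if "@odata.nextLink" in page:
            url = page["@odata.nextLink"]  # more pages in this round
        else:
            # The final page carries the link that resumes from this point.
            return changes, page["@odata.deltaLink"]
```

On the first run pass the default start URL; on later runs pass the stored delta link as `start_url`. If the service rejects a stored link, discard it and start again from `DELTA_START`.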

Raw IMAP with Python's imaplib

Gmail and Microsoft Graph cover most modern business inboxes. A long tail of self-hosted, privacy-focused, and legacy mail hosts only speak IMAP.

import imaplib
import email
from email.message import Message
from typing import Iterator

def connect_imap(host: str, user: str, password: str) -> imaplib.IMAP4_SSL:
    conn = imaplib.IMAP4_SSL(host, 993)
    conn.login(user, password)
    return conn

def iter_pdf_attachments(
    conn: imaplib.IMAP4_SSL, query: str = '(SUBJECT "invoice")'
) -> Iterator[tuple[str, bytes]]:
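    conn.select("INBOX", readonly=True)  # read-only, so fetching does not mark mail as read
    status, data = conn.uid("search", None, query)
    if status != "OK" or not data or not data[0]:
        return
    for uid in data[0].split():
        status, msg_data = conn.uid("fetch", uid, "(RFC822)")
        # A message deleted between search and fetch comes back without a body tuple
        if status != "OK" or not msg_data or not isinstance(msg_data[0], tuple):
            continue
        msg = email.message_from_bytes(msg_data[0][1])
        yield from extract_pdfs_from_message(msg)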

def extract_pdfs_from_message(msg: Message) -> Iterator[tuple[str, bytes]]:
    # Message.walk() already descends into message/rfc822 parts (forwarded-as-
    # attachment emails), so nested PDFs are reached without manual recursion.
    # Recursing here as well would yield every nested PDF more than once.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers (multipart/*, message/rfc822) hold no file bytes
        content_type = part.get_content_type()
        filename = part.get_filename() or ""
        if content_type == "application/pdf" or filename.lower().endswith(".pdf"):
            payload = part.get_payload(decode=True)
            if payload:
                yield filename, payload

Three IMAP pitfalls that scripts hit repeatedly. First, UIDVALIDITY: UIDs can become invalid if the server's UIDVALIDITY value changes on migration. Always check UIDVALIDITY on connect and rebuild your index if it differs. Second, IMAP IDLE (push notification) is unreliable on many servers even when advertised; use polling with a 60 to 300 second interval for billing inboxes. Third, UIDs are per-mailbox, not per-account: deduplication across INBOX and Sent requires the Message-ID header, which is globally unique when present.
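The UIDVALIDITY check can be sketched with imaplib's `response` method, which returns the untagged response data recorded during the last command. The helper names and the idea of keeping the previous value in your own store are assumptions; `conn` is expected to be an authenticated `imaplib.IMAP4_SSL` connection.

```python
from typing import Any, Optional


def current_uidvalidity(conn: Any, mailbox: str = "INBOX") -> int:
    """Select the mailbox read-only and return the server's UIDVALIDITY."""
    status, _ = conn.select(mailbox, readonly=True)
    if status != "OK":
        raise RuntimeError(f"could not select {mailbox}")
    _, data = conn.response("UIDVALIDITY")
    if not data or data[0] is None:
        raise RuntimeError("server did not report UIDVALIDITY")
    return int(data[0])


def index_is_stale(conn: Any, stored_uidvalidity: Optional[int]) -> bool:
    """True when stored UIDs can no longer be trusted and need a rebuild."""
    return stored_uidvalidity is None or stored_uidvalidity != current_uidvalidity(conn)
```

Run `index_is_stale` once per connection before any UID-based fetch, and persist the new value only after the index has been rebuilt.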

Encrypted PDFs and password-protected invoices

Once you have the bytes, some PDFs will not open. There are two distinct encryption types.

User-password encryption requires a password to open the file at all. Banks, payroll providers, and some insurance companies send statements this way. Without the password, the PDF is opaque. The common open-source libraries (pdfplumber, pypdf (formerly PyPDF2), pdfminer.six) raise an error or return empty output on first touch. pikepdf can open the file once the password is supplied:

import pikepdf

def open_encrypted_pdf(path: str, password: str) -> pikepdf.Pdf:
    return pikepdf.open(path, password=password)

For a team product, the password cannot live in a config file or source control. Store it in the same encrypted credential store as OAuth tokens, one record per source. Our AI processing feature page covers how Inbox Ledger handles per-source credential storage.

Owner-password encryption only restricts printing or copying but allows the file to open freely. Most PDF libraries handle these without any extra steps. If you are seeing open errors on seemingly readable files, check whether the file is owner-locked rather than user-locked; pikepdf can often remove owner restrictions outright.

One other failure mode: PDFs signed with a cryptographic signature (common in EU-compliant invoice systems that enforce qualified electronic signatures). Some libraries refuse to parse these out of caution. Try a different parsing library or strip the signature layer with qpdf before extracting text.

Multi-PDF emails: picking the right attachment

Some vendors attach two or three PDFs to a single email: an invoice, a packing slip, and a bank transfer instruction, all in one message. A script that takes only the first attachment by index loses real data. A script that takes all attachments without classifying them lands non-invoice files in the invoice table and corrupts totals.

The correct approach is to pull every PDF and classify each at the content level before deciding which to ingest. Classification does not have to be elaborate. A lightweight pre-classification step using file content (not just filename) can distinguish invoices from packing slips and from cover pages with reasonable accuracy. For the Inbox Ledger pipeline, every PDF goes through a content classification step before storage, and only files that score as financial documents are archived and extracted. Non-financial PDFs (marketing materials, product catalogs, signed NDAs) are dropped at the gate.

Filename heuristics help but are not sufficient. A file named invoice_2026_03.pdf is almost certainly an invoice. A file named document.pdf attached to an invoice email could be either. A file named statement_march.pdf could be a bank statement, a vendor statement, or a cover letter. Filename plus content classification beats either alone.
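To make "filename plus content" concrete, here is a deliberately small scoring sketch. The term lists, weights, and thresholds are all assumptions for illustration, and `text` is assumed to be text already extracted from the PDF by a library of your choice. A production classifier would be trained or tuned on your own corpus.

```python
# Assumed signal terms; extend them from your own vendors' documents.
INVOICE_TERMS = ("invoice number", "invoice no", "amount due", "total due", "tax invoice")
NON_INVOICE_TERMS = ("packing slip", "delivery note", "remittance advice")


def classify_pdf(filename: str, text: str) -> str:
    """Return 'invoice', 'not_invoice', or 'review' from filename and content."""
    name = filename.lower()
    body = text.lower()
    score = 0
    if "invoice" in name:
        score += 2  # a strong filename hint
    score += sum(1 for term in INVOICE_TERMS if term in body)
    score -= 2 * sum(1 for term in NON_INVOICE_TERMS if term in body)
    if score >= 2:
        return "invoice"
    if score <= -1:
        return "not_invoice"
    return "review"  # ambiguous: send to a human queue
```

The three-way result matters more than the scoring details: ambiguous files such as statement_march.pdf should land in a review queue rather than being forced into either bucket.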

Edge cases that break naive scripts

Inline images disguised as attachments

Some mail clients attach a small inline image (a company logo or signature graphic) with a Content-Disposition: attachment header, which looks identical to a real attachment at the MIME level. Scripts that filter by Content-Disposition: attachment and file extension collect these. The fix is to also filter by Content-Type: only application/pdf and its aliases (application/x-pdf, application/acrobat) indicate a real PDF. image/png with a .pdf extension is not a PDF.
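A sketch of that filter follows. The content-type set comes from the paragraph above; the additional check for the `%PDF-` header near the start of the file is an extra safeguard assumed here, not something the MIME headers provide.

```python
PDF_CONTENT_TYPES = {"application/pdf", "application/x-pdf", "application/acrobat"}


def looks_like_pdf(content_type: str, payload: bytes) -> bool:
    """Accept a part only if its declared type and its bytes both say PDF."""
    # Strip parameters such as '; name="inv.pdf"' before comparing.
    declared = content_type.split(";", 1)[0].strip().lower()
    if declared not in PDF_CONTENT_TYPES:
        return False
    # PDF files carry a "%PDF-" header; allow some leading junk bytes.
    return b"%PDF-" in payload[:1024]
```

If your vendors send real PDFs as application/octet-stream, you may want to relax the content-type check and rely on the header bytes alone; this sketch follows the stricter rule.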

HTML receipt plus separate PDF

A common vendor pattern: the email body is a fully styled HTML receipt showing amount, line items, and payment confirmation. Attached to the same email is a "formal invoice" PDF that contains the same data in a printable layout. For personal expense tracking, the HTML body is often enough. For VAT reclaim or GST credits in most jurisdictions, only the PDF qualifies as the tax invoice because the HTML email does not carry a sequential invoice number or vendor tax ID in a machine-verifiable way.

An extractor that only processes attachments and ignores the HTML body handles this correctly for tax purposes, since it takes the PDF. An extractor that only processes the HTML body handles it incorrectly, since it captures a receipt not an invoice. Know which your pipeline does.

Body-only receipts with no PDF

A third variant: the email body is the only artifact. No PDF, no portal link, just formatted HTML showing the charge. This is common for smaller SaaS tools, marketplace sellers, and consumer services that have not built a proper billing system. For personal tracking, scraping the HTML body gives you the amounts. For audit-grade bookkeeping, body-only receipts are often insufficient and should be flagged for a "request proper invoice" workflow rather than auto-ingested into the accounting system.

Your extractor should distinguish this case explicitly and route it to a review queue rather than treating it the same as a PDF invoice.

Storage and indexing after retrieval

Getting the bytes is half the job. What you do with them determines whether the archive is useful.

Immutable storage. Auditors want proof that a file has not been silently altered since retrieval. Object storage with versioning enabled, S3 Object Lock (WORM mode), Azure Immutable Blob Storage, or a cryptographic hash stored alongside the file in a tamper-evident log all qualify. A Dropbox folder does not. For critical records, consider archiving in PDF/A format, the ISO standard designed specifically for long-term archival that guarantees consistent rendering decades later.

Structural indexing. A folder of PDFs is not queryable. An archive is only useful when you can answer "what did we pay to SaaS vendors in Q1 2026" or "find all invoices from this vendor for this date range." That requires extracted structured fields: vendor name, invoice number, issue date, due date, subtotal, tax by rate, total, currency, and line items, all written to a database table with proper indexes.
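As a rough illustration of "a database table with proper indexes", the SQLite sketch below stores a subset of those fields and answers the Q1 question. The schema is an assumption: `vendor_category` is an invented column that makes the example query possible, amounts are stored in integer cents, and tax-by-rate and line items are left out (they would normally live in child tables).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE invoices (
        id              INTEGER PRIMARY KEY,
        vendor_name     TEXT NOT NULL,
        vendor_category TEXT,              -- assumed column, e.g. 'saas'
        invoice_number  TEXT NOT NULL,
        issue_date      TEXT NOT NULL,     -- ISO 8601, so text comparison sorts by date
        due_date        TEXT,
        subtotal_cents  INTEGER NOT NULL,
        tax_cents       INTEGER NOT NULL,
        total_cents     INTEGER NOT NULL,
        currency        TEXT NOT NULL,
        UNIQUE (vendor_name, invoice_number)
    );
    CREATE INDEX idx_invoices_vendor_date ON invoices (vendor_name, issue_date);
    CREATE INDEX idx_invoices_category_date ON invoices (vendor_category, issue_date);
    """
)

db.executemany(
    "INSERT INTO invoices (vendor_name, vendor_category, invoice_number, issue_date,"
    " due_date, subtotal_cents, tax_cents, total_cents, currency)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Acme Corp", "saas", "1047", "2026-01-15", None, 10000, 2000, 12000, "USD"),
        ("Acme Corp", "saas", "1102", "2026-03-02", None, 5000, 1000, 6000, "USD"),
        ("Freight Co", "logistics", "F-9", "2026-02-10", None, 3000, 0, 3000, "USD"),
        ("Acme Corp", "saas", "1201", "2026-04-03", None, 7000, 0, 7000, "USD"),
    ],
)

# "What did we pay to SaaS vendors in Q1 2026?"
q1_saas = db.execute(
    "SELECT currency, SUM(total_cents) FROM invoices"
    " WHERE vendor_category = 'saas'"
    " AND issue_date >= '2026-01-01' AND issue_date < '2026-04-01'"
    " GROUP BY currency"
).fetchall()
# [("USD", 18000)]
```

Grouping by currency avoids silently summing amounts in different currencies, which is an easy mistake once more than one vendor region is involved.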

Deduplication. PDF attachments do not carry guaranteed unique identifiers. Two ways to deduplicate: store the email provider's message ID (stable and unique per provider), or compute a SHA-256 hash of the PDF bytes and use that as a content-addressable key. The hash approach catches the case where the same PDF arrives twice via different paths (one direct, one forwarded).
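The content-hash approach in miniature, using an in-memory dictionary as a stand-in for real storage. The names `content_key`, `ingest`, and `archive` are illustrative.

```python
import hashlib

# Stand-in for object storage plus a database row per distinct file.
archive: dict[str, dict] = {}


def content_key(pdf_bytes: bytes) -> str:
    """SHA-256 of the file bytes, usable as a content-addressable key."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def ingest(pdf_bytes: bytes, message_id: str) -> bool:
    """Store a PDF once per distinct content; return True if it was new."""
    key = content_key(pdf_bytes)
    if key in archive:
        # Same bytes seen again: remember the extra source, store nothing new.
        archive[key]["seen_in"].append(message_id)
        return False
    archive[key] = {"size": len(pdf_bytes), "seen_in": [message_id]}
    return True


pdf = b"%PDF-1.4 demo invoice"
was_new = ingest(pdf, "direct-message-1")         # True
was_new_again = ingest(pdf, "forwarded-message-2")  # False
```

This only catches byte-identical copies. A vendor that regenerates the PDF on each download produces a different hash for the same invoice, so pair the hash with a check on vendor and invoice number.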

Retention policy. IRS records retention is generally three years from filing, extended to six years for material underreporting (IRS Publication 583). HMRC requires six years from the end of the last company financial year. EU VAT retention is five to ten years depending on jurisdiction. Your storage layer needs a retention policy attached, not just "keep everything until someone deletes it."

For the full argument about why Gmail alone is not a sufficient archive, our Gmail invoice extraction complete guide covers retention, immutability, and what happens when an inbox is deleted or offboarded.

Build vs buy: when your script becomes a product

Every few months a small finance team decides to write an extraction script. Here is an honest cost accounting of where that project ends up.

The initial sprint: 2 days. A minimal Python extractor handling Gmail, text-layer PDFs from a handful of known vendors, and output to a Google Sheet is 300 to 600 lines of code. A developer can draft it in two days. It works on the happy path.

Edge cases: 1 to 2 weeks. Add forwarded-as-attachment recursion, encrypted PDFs, multi-PDF classification, vendors with linked rather than inline PDFs, and an IMAP fallback for clients not on Gmail. Budget another 400 to 800 lines and one to two weeks of development.

Observability: 1 week. Logs for every message processed, retry metadata for failures, an alerting path for when the script stops working, a queue for manual review of low-confidence extractions. Without this, you do not know the script is broken until someone asks for a missing invoice.

Credential management: 2 to 5 days. OAuth tokens expire. Refresh them on schedule, store them encrypted, handle revocation mid-run. For multi-client inboxes, per-client credential isolation.

Vendor template coverage: ongoing. Either write regex-based extractors per vendor (hours per vendor, updated every time a vendor changes their PDF layout) or integrate a model-based extraction step (external cost, bounded maintenance). Mixing both creates drift. Stripe has changed their invoice PDF template three times in the last two years.

Testing: 1 week. A corpus of real invoice PDFs from your actual vendors, with sensitive fields masked, is the only way to verify that changes do not regress. Building it properly takes a week.

By the time the script is production-worthy, a two-day sprint has become six weeks of engineering time. At $200 per hour, that is roughly $48,000 of loaded engineering cost for a tool that still has a long tail of uncovered vendors.
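The arithmetic behind those figures can be checked in a few lines. The phase durations are the ranges quoted in the headings above; the 8-hour day and 5-day week are assumptions, and the ongoing vendor template work is not included in the sum.

```python
HOURS_PER_DAY = 8   # assumption
DAYS_PER_WEEK = 5   # assumption
RATE_USD = 200      # hourly rate from the text

# (low, high) estimates in working days for the one-off phases
phases_days = {
    "initial sprint": (2, 2),
    "edge cases": (5, 10),
    "observability": (5, 5),
    "credential management": (2, 5),
    "testing": (5, 5),
}

low_days = sum(low for low, _ in phases_days.values())     # 19
high_days = sum(high for _, high in phases_days.values())  # 27

six_week_cost = 6 * DAYS_PER_WEEK * HOURS_PER_DAY * RATE_USD  # 48000
```

The one-off phases alone come to 19 to 27 working days, roughly four to five and a half weeks, so the six-week figure leaves only a small allowance for the open-ended vendor template work.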

For a solo engineer building this for fun or for personal use, that math is irrelevant. For a finance team that wanted invoice automation, the math usually lands on using a purpose-built tool. The crossover point is around 50 invoices per month, where the time saved by automation exceeds the subscription cost of a SaaS extractor within two billing cycles.

The comparison is not "my script" versus "a product someone else wrote." It is "my script, which I also own as a product, including debugging calls at quarter close" versus "a vendor's product that they own."

For teams already evaluating the landscape, our email extraction tools comparison covers the tradeoffs across self-hosted, semi-hosted, and fully managed options. For a deeper view of how the AI extraction pipeline works under the hood, see email invoice OCR: how it works.

Closing: pick the seams that match your volume

The right extraction setup depends on three variables: invoice volume, team technical appetite, and whether you care about maintenance costs compounding over time.

Below 30 invoices per month from five predictable vendors: a 40-line Google Apps Script saves PDFs to Drive and covers the job. No paid tooling needed.

Above 50 invoices per month, multiple inboxes, or a bookkeeper who needs structured data: the script grows into a product you did not plan to build. At that point, connecting a purpose-built extractor is faster to stand up, more accurate across vendor templates, and bounded in ongoing cost.

The thing to avoid is the middle: a half-built script that covers 80 percent of your volume and quietly misses the other 20. That is the scenario where an audit surfaces a missing quarter of invoices and no one can explain why. Either commit to the script as a real internal product with real testing and alerting, or use a tool where that work is already done.

If you want to see what full extraction looks like on your actual inbox, connect Gmail or Outlook with a read-only scope and let the service pull your last 30 days. For vendor-specific deep dives, the guides on downloading Stripe invoices automatically and getting Amazon Business invoices cover the two most common individual integrations.