Tags: Python, Playwright, RPA, Automation, Productivity

Automating PDF Retrieval from a Legacy Financial System with Python + Playwright

The Problem

At my day job, staff had to manually pull voucher PDFs from the university's financial system — 3,000 times a year.

The steps were always the same:

  1. Log in to the financial system
  2. Enter a budget code and search
  3. Click the voucher number to open the detail screen
  4. Click "Print Voucher" → "Preview"
  5. Save the PDF

2 minutes × 3,000 cases = ~100 hours of repetitive work per year.

The system was a legacy ASP.NET app with no API, no export feature, and strict security policies. Commercial RPA tools either failed or required expensive licenses to cover edge cases. So I built a custom automation with Python and Playwright.


Technical Choices

Challenge               | Solution       | Reason
------------------------|----------------|------------------------------------------------
Browser automation      | Playwright     | Strong multi-window and async support
Avoiding login friction | CDP connection | Reuse an existing authenticated browser session
PDF processing          | PyMuPDF        | Easy page-level extraction
Distribution            | PyInstaller    | Runs on machines without Python installed

Implementation

1. Connect to an Existing Browser Session via CDP

Automating login on a security-hardened system is painful. Instead, I had users launch Edge in debug mode using a batch file, log in manually, and then connected Playwright to that session via Chrome DevTools Protocol (CDP).

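launch-edge.bat:

@echo off
rem Start Edge with remote debugging enabled and a dedicated profile directory
start "" "C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" ^
  --remote-debugging-port=9222 ^
  --user-data-dir="C:\edge-debug-profile"

Python side (inside an async_playwright() block, where p is the Playwright instance):

# Attach to the Edge instance started by the batch file
browser = await p.chromium.connect_over_cdp("http://localhost:9222")
# The first context is the user's already logged-in session
context = browser.contexts[0]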

Users just double-click the batch file, log in once, and then run the automation. No credential storage, no login automation risk.
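One wrinkle with this setup is that the script can be started before Edge is actually listening on the debug port. Below is a minimal sketch of how the attach step might guard against that, assuming the port 9222 from the batch file. The names wait_for_debug_port and attach_to_logged_in_session are illustrative and not taken from the original tool.

```python
import asyncio
import socket
import time


def wait_for_debug_port(host="localhost", port=9222, timeout=30.0, interval=0.5):
    """Return True once something accepts TCP connections on host:port, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the debug browser is up
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


async def attach_to_logged_in_session(port=9222):
    # Imported lazily so the port check above works even without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(f"http://localhost:{port}")
        context = browser.contexts[0]
        # Reuse the tab the user logged in with, or open one if none exists
        page = context.pages[0] if context.pages else await context.new_page()
        print("Attached to:", page.url)


async def main():
    if wait_for_debug_port():
        await attach_to_logged_in_session()
    else:
        print("Edge debug port not reachable; run launch-edge.bat first.")

# To run: asyncio.run(main())
```

The port check uses only the standard library, so it can report a clear message to the operator before Playwright is involved at all.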

2. Force-Fill Hidden Inputs with JavaScript

In ASP.NET forms, some inputs are positioned off-screen and Playwright's standard fill() silently does nothing. The fix: inject the value directly via JavaScript and fire a change event to trigger the form's own handlers.

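async def js_fill(page, selector, value):
    """Set an input's value from inside the page and fire a change event."""
    # Pass selector and value as arguments instead of formatting them into the
    # script, so quotes or backslashes in either cannot break the JavaScript.
    await page.evaluate(
        """([selector, value]) => {
            const el = document.querySelector(selector);
            if (!el) throw new Error('No element matches ' + selector);
            // Call the native value setter directly, bypassing any per-element override
            const setter = Object.getOwnPropertyDescriptor(
                window.HTMLInputElement.prototype, 'value').set;
            setter.call(el, value);
            el.dispatchEvent(new Event('change', { bubbles: true }));
        }""",
        [selector, value],
    )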

3. Extract Only the Relevant Pages with PyMuPDF

The downloaded PDFs contain multiple vouchers combined into one file. I only need pages that contain the target budget code, so I extract those and discard the rest.

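import fitz  # PyMuPDF


def extract_pages_with_code(pdf_bytes, budget_code, save_path):
    """Save only the pages whose text contains budget_code; return the page count."""
    with fitz.open(stream=pdf_bytes, filetype="pdf") as src, fitz.open() as dst:
        for i in range(len(src)):
            if budget_code in src[i].get_text():
                # Copy page i (from_page and to_page are both inclusive)
                dst.insert_pdf(src, from_page=i, to_page=i)
        # PyMuPDF refuses to save a document with zero pages, so skip the write
        if len(dst) > 0:
            dst.save(save_path)
        return len(dst)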
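The function above copies one page per insert_pdf call. When a voucher spans several consecutive pages, a possible variation is to group matching pages into runs first and copy each run in one call. The helper below is a sketch of that grouping step only; matching_page_ranges is a hypothetical name, and the page texts are made up for illustration.

```python
def matching_page_ranges(page_texts, budget_code):
    """Group consecutive pages whose text contains budget_code into (first, last) ranges."""
    ranges = []
    for i, text in enumerate(page_texts):
        if budget_code not in text:
            continue
        if ranges and ranges[-1][1] == i - 1:
            ranges[-1] = (ranges[-1][0], i)  # extend the current run
        else:
            ranges.append((i, i))            # start a new run
    return ranges


# Made-up page texts; in the real tool these would come from
# [page.get_text() for page in src].
texts = ["Voucher A  code 1234", "Voucher B  code 9999", "code 1234", "code 1234 (cont.)"]
print(matching_page_ranges(texts, "1234"))  # [(0, 0), (2, 3)]
```

Each (first, last) pair can then be passed to dst.insert_pdf(src, from_page=first, to_page=last), since both bounds are inclusive.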

4. Excel-Based Config for Non-Technical Users

The tool is operated by admin staff, not developers. I kept the interface as familiar as possible: edit an Excel file, run the .exe.

📁 Distribution folder
  ├── launch-edge.bat         ← double-click to open debug browser
  ├── voucher-downloader.exe  ← the automation
  └── input.xlsx              ← the only file users touch

The spreadsheet is minimal:

Cell       | Content
-----------|--------------------
B3         | Budget code
A6 onwards | Voucher number list
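Reading that layout could look roughly like the sketch below. It assumes openpyxl as the Excel library (the post does not say which one the tool uses) and that the values sit on the workbook's active sheet; read_config and clean_voucher_numbers are illustrative names. The cleaning step is there because Excel often hands numeric-looking cells back as floats.

```python
def clean_voucher_numbers(values):
    """Normalize raw cell values to unique, non-empty strings, keeping their order."""
    cleaned = []
    for v in values:
        if v is None:
            continue
        if isinstance(v, float) and v.is_integer():
            v = int(v)  # 1001.0 read from a numeric cell becomes 1001
        s = str(v).strip()
        if s and s not in cleaned:
            cleaned.append(s)
    return cleaned


def read_config(xlsx_path):
    """Return (budget_code, voucher_numbers) from B3 and A6 onwards."""
    # Imported lazily so the cleaning helper can be used without openpyxl installed
    from openpyxl import load_workbook

    wb = load_workbook(xlsx_path, data_only=True)
    ws = wb.active  # assumption: the config lives on the active sheet
    codes = clean_voucher_numbers([ws["B3"].value])
    if not codes:
        raise ValueError("Budget code cell B3 is empty")
    column_a = [row[0] for row in ws.iter_rows(min_row=6, max_col=1, values_only=True)]
    return codes[0], clean_voucher_numbers(column_a)


print(clean_voucher_numbers([1001.0, " 1002 ", None, 1001, "A-17"]))
# ['1001', '1002', 'A-17']
```

Failing early on an empty B3 gives the operator a clear message instead of a search that silently returns nothing.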

Results

Metric                    | Before   | After
--------------------------|----------|-------------
Time per case             | ~2 min   | ~10 sec
Annual time (3,000 cases) | ~100 hrs | ~8 hrs
Time saved                |          | ▲92 hrs/year
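For anyone who wants to check the figures, the arithmetic behind the table works out as follows, using the per-case times and the yearly case count stated above:

```python
cases_per_year = 3000
manual_seconds = 2 * 60      # ~2 min per case by hand
automated_seconds = 10       # ~10 sec per case with the tool

manual_hours = cases_per_year * manual_seconds / 3600
automated_hours = cases_per_year * automated_seconds / 3600
saved_hours = manual_hours - automated_hours

print(f"{manual_hours:.0f} h -> {automated_hours:.1f} h, saved {saved_hours:.0f} h")
# 100 h -> 8.3 h, saved 92 h
```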

Distribution Notes

PyInstaller bundles the Python runtime, so the .exe runs on machines without Python. The one remaining friction point: Playwright's browser driver needs a separate install step.

pip install playwright
playwright install msedge

This is currently the biggest barrier when distributing to non-technical staff. I'm evaluating two options: packaging the driver inside the bundle, or switching to a lighter HTTP-based approach for simpler cases.
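A smaller distribution detail is that a frozen .exe has to locate input.xlsx next to itself, not next to a script file. PyInstaller sets sys.frozen on the bundled executable, so a sketch along these lines would cover both the packaged and the development case. The names app_dir and config_path are illustrative; input.xlsx is the file name from the distribution folder above.

```python
import sys
from pathlib import Path


def app_dir():
    """Directory holding the running .exe (frozen) or the launching script (development)."""
    if getattr(sys, "frozen", False):
        # PyInstaller build: sys.executable is the .exe itself
        return Path(sys.executable).resolve().parent
    # Plain Python run: fall back to the location of the script being executed
    return Path(sys.argv[0]).resolve().parent


def config_path(filename="input.xlsx"):
    """Path of the config spreadsheet expected to sit beside the executable."""
    return app_dir() / filename


print(config_path())
```

Resolving the path this way means users can move the whole distribution folder anywhere without breaking the tool.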


Comparing with Power Automate Desktop

Python + Playwright isn't always the right answer. For broader rollouts, I'm also considering Power Automate Desktop (PAD):

Aspect        | Python + Playwright  | Power Automate Desktop
--------------|----------------------|------------------------------------
Setup barrier | Moderate             | Low (GUI-based)
Flexibility   | High                 | Medium
Distribution  | .exe via PyInstaller | Share flows
Complex logic | Strong               | Can struggle
Cost          | Free                 | Included with Microsoft 365 (often)

I plan to reimplement this same workflow in PAD and compare the results. I'll write that up when it's done.


Takeaways

  • CDP connection is a clean way to reuse an authenticated browser session without automating login
  • JS injection is necessary when Playwright's native methods can't reach off-screen ASP.NET inputs
  • PyMuPDF makes page-level PDF extraction straightforward
  • Excel as config is still the most practical interface for non-technical operators
  • PyInstaller distribution works, but browser driver packaging remains an open problem

If you're dealing with a legacy system that no commercial RPA tool handles well, Python + Playwright is worth the setup cost.
