Automating PDF Retrieval from a Legacy Financial System with Python + Playwright

The Problem

At my day job, staff had to manually pull voucher PDFs from the university's financial system about 3,000 times a year.

The steps were always the same:

1. Log in to the financial system
2. Enter a budget code and search
3. Click the voucher number to open the detail screen
4. Click "Print Voucher" → "Preview"
5. Save the PDF

2 minutes × 3,000 cases = ~100 hours of repetitive work per year.

The system was a legacy ASP.NET app with no API, no export feature, and strict security policies. Commercial RPA tools either failed or required expensive licenses to cover edge cases. So I built a custom automation with Python and Playwright.

Technical Choices

Challenge	Solution	Reason
Browser automation	Playwright	Strong multi-window and async support
Avoiding login friction	CDP connection	Reuse an existing authenticated browser session
PDF processing	PyMuPDF	Easy page-level extraction
Distribution	PyInstaller	Runs on machines without Python installed

Implementation

1. Connect to an Existing Browser Session via CDP

Automating login on a security-hardened system is painful. Instead, I had users launch Edge with remote debugging enabled from a batch file and log in manually. Playwright then connects to that session over the Chrome DevTools Protocol (CDP).

@echo off
start "" "C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" ^
  --remote-debugging-port=9222 ^
  --user-data-dir="C:\edge-debug-profile"

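# Runs inside `async with async_playwright() as p:` (from playwright.async_api)
browser = await p.chromium.connect_over_cdp("http://localhost:9222")
# Reuse the existing, already logged-in context instead of creating a new one
context = browser.contexts[0]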

Users just double-click the batch file, log in once, and then run the automation. No credential storage, no login automation risk.
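
Before connecting, it can help to confirm that the debug browser is actually listening, so users see a clear message rather than a stack trace. The following is a minimal sketch, assuming the port 9222 from the batch file above. The helper name debug_endpoint_info is hypothetical and not part of Playwright. It queries /json/version, an HTTP endpoint that Chromium-based browsers expose when remote debugging is enabled.

```python
import json
import urllib.request


def debug_endpoint_info(port=9222, timeout=2.0):
    """Return the browser's /json/version payload, or None if unreachable.

    Hypothetical helper: not part of Playwright or the original tool.
    """
    url = f"http://localhost:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return None


if __name__ == "__main__":
    info = debug_endpoint_info()
    if info is None:
        print("Debug browser not found. Run launch-edge.bat and log in first.")
    else:
        print("Debug browser found:", info.get("Browser"))
```

Running this check first means the automation can stop with a plain instruction when the batch file has not been launched yet.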

2. Force-Fill Hidden Inputs with JavaScript

Some inputs in the ASP.NET forms are positioned off-screen, and Playwright's standard fill() silently did nothing on them. The fix: set the value directly via JavaScript and dispatch a change event so the form's own handlers run.

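async def js_fill(page, selector, value):
    # Pass selector and value as arguments rather than interpolating them
    # into the script, so quotes or backslashes cannot break the JavaScript.
    await page.evaluate(
        """([selector, value]) => {
            const el = document.querySelector(selector);
            if (!el) throw new Error('No element matches ' + selector);
            // Call the native HTMLInputElement value setter directly.
            const setter = Object.getOwnPropertyDescriptor(
                window.HTMLInputElement.prototype, 'value').set;
            setter.call(el, value);
            // Fire change so the form's own handlers run.
            el.dispatchEvent(new Event('change', { bubbles: true }));
        }""",
        [selector, value],
    )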

3. Extract Only the Relevant Pages with PyMuPDF

The downloaded PDFs contain multiple vouchers combined into one file. I only need pages that contain the target budget code, so I extract those and discard the rest.

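import fitz  # PyMuPDF

def extract_pages_with_code(pdf_bytes, budget_code, save_path):
    src = fitz.open(stream=pdf_bytes, filetype="pdf")
    dst = fitz.open()  # new, empty PDF that collects the matching pages
    for i in range(len(src)):
        if budget_code in src[i].get_text():
            dst.insert_pdf(src, from_page=i, to_page=i)
    matched = len(dst)
    if matched:  # PyMuPDF refuses to save a document with zero pages
        dst.save(save_path)
    dst.close()
    src.close()
    return matched  # number of pages written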
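
A plain substring test can over-match when one budget code is a prefix of another (for example, 1234 inside 12345). Whether that can happen depends on the code format, which is not specified here, so treat the following as a sketch under that assumption. The helper page_has_code is hypothetical. It could stand in for the `in` check on the text returned by get_text().

```python
import re


def page_has_code(page_text, budget_code):
    """True if budget_code appears as a whole token in page_text.

    Hypothetical helper. "Whole token" here means the match is not directly
    preceded or followed by another ASCII letter or digit.
    """
    pattern = r"(?<![0-9A-Za-z])" + re.escape(budget_code) + r"(?![0-9A-Za-z])"
    return re.search(pattern, page_text) is not None


print(page_has_code("Budget code: 1234 / Dept A", "1234"))   # True
print(page_has_code("Budget code: 12345 / Dept A", "1234"))  # False
```

If budget codes are fixed-width and never overlap, the simpler substring check is enough.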

4. Excel-Based Config for Non-Technical Users

The tool is operated by admin staff, not developers. I kept the interface as familiar as possible: edit an Excel file, run the .exe.

📁 Distribution folder
  ├── launch-edge.bat         ← double-click to open debug browser
  ├── voucher-downloader.exe  ← the automation
  └── input.xlsx              ← the only file users touch

The spreadsheet is minimal:

Cell	Content
B3	Budget code
A6 onwards	Voucher number list
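
Reading that layout could look like the sketch below. It assumes openpyxl as the Excel library and the first (active) worksheet as the config sheet. Neither is stated above, and both function names are hypothetical. Excel cells may come back as numbers rather than text, so the helper normalizes values such as 1001.0 to the string "1001".

```python
def clean_voucher_numbers(values):
    """Normalize raw cell values into a list of voucher-number strings."""
    cleaned = []
    for value in values:
        if value is None:
            continue  # empty cell
        if isinstance(value, float) and value.is_integer():
            value = int(value)  # 1001.0 -> 1001
        text = str(value).strip()
        if text:  # skip cells that contain only whitespace
            cleaned.append(text)
    return cleaned


def load_config(path="input.xlsx"):
    """Return (budget_code, voucher_numbers) read from B3 and A6 downwards."""
    # Imported here so clean_voucher_numbers works without openpyxl installed.
    from openpyxl import load_workbook

    ws = load_workbook(path, data_only=True).active
    raw_code = ws["B3"].value
    if raw_code is None:
        raise ValueError("Cell B3 (budget code) is empty")
    # Column A from row 6 down; each row arrives as a 1-tuple of values.
    column_a = [row[0] for row in ws.iter_rows(min_row=6, max_col=1, values_only=True)]
    return str(raw_code).strip(), clean_voucher_numbers(column_a)


print(clean_voucher_numbers([1001.0, " 1002 ", None, 1003]))
# ['1001', '1002', '1003']
```

Keeping the cleanup separate from the file access makes it easy to check against awkward spreadsheet input without opening a workbook.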

Results

	Before	After
Time per case	~2 min	~10 sec
Annual time (3,000 cases)	~100 hrs	~8 hrs
Time saved		~92 hrs/year
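
The annual figures follow directly from the per-case timings. As a quick check, using the rounded inputs quoted above (3,000 cases, about 2 minutes before, about 10 seconds after):

```python
CASES_PER_YEAR = 3000
SECONDS_BEFORE = 2 * 60   # ~2 min per case
SECONDS_AFTER = 10        # ~10 sec per case

hours_before = CASES_PER_YEAR * SECONDS_BEFORE / 3600
hours_after = CASES_PER_YEAR * SECONDS_AFTER / 3600

print(f"before: {hours_before:.1f} h")                # before: 100.0 h
print(f"after:  {hours_after:.1f} h")                 # after:  8.3 h
print(f"saved:  {hours_before - hours_after:.1f} h")  # saved:  91.7 h
```

These round to the ~8 and ~92 hours shown in the table.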

Distribution Notes

PyInstaller bundles the Python runtime, so the .exe runs on machines without Python. The one remaining friction point: Playwright's browser driver needs a separate install step.

pip install playwright
playwright install msedge

This is currently the biggest barrier when distributing to non-technical staff. I'm evaluating options — packaging the driver inside the bundle or switching to a lighter HTTP-based approach for simpler cases.

Comparing with Power Automate Desktop

Python + Playwright isn't always the right answer. For broader rollouts, I'm also considering Power Automate Desktop (PAD):

	Python + Playwright	Power Automate Desktop
Setup barrier	Moderate	Low (GUI-based)
Flexibility	High	Medium
Distribution	.exe via PyInstaller	Share flows
Complex logic	Strong	Can struggle
Cost	Free	Often included with Microsoft 365

I plan to reimplement this same workflow in PAD and compare the results. I'll write that up when it's done.

Takeaways

CDP connection is a clean way to reuse an authenticated browser session without automating login
JS injection is necessary when Playwright's native methods can't reach off-screen ASP.NET inputs
PyMuPDF makes page-level PDF extraction straightforward
Excel as config is still the most practical interface for non-technical operators
PyInstaller distribution works, but browser driver packaging remains an open problem

If you're dealing with a legacy system that no commercial RPA tool handles well, Python + Playwright is worth the setup cost.