Pandas Excel Reading Failure Case Analysis
- Published on
- ...
- Authors

- Name
- Huashan
- @herohuashan
Date: 2025-10-16 Tags: #Python #Pandas #Excel #Debug #Experience Summary
🔍 Problem Description
When processing an Excel file (AP_LF 300 in DI Water 50 1Hz_1_20251016_100357上午.xlsx), an anomaly occurred when reading column B data from the Oscillation worksheet using pandas' read_excel() method:
- Visible to the eye: Column B rows 2-15 in Excel have data (17 values like 2.64, -0.07, 2.64, etc.)
- Pandas reading: Only 3 values read (B16-B18), first 15 rows are all NaN
- Actual situation: B1 is empty, B2-B18 all have data
🔬 Troubleshooting Process
1. Initial Attempt (Failed ❌)
df = pd.read_excel(file, sheet_name='Oscillation', header=None)
b_column = df[1]
2. Try Different Pandas Parameters (All Failed ❌)
header=None- Failedheader=0- Column B named__EMPTY, but data still NaNskiprows=1- Failedengine='openpyxl'- Failed
3. Direct Reading with openpyxl (Success ✅)
import openpyxl
wb = openpyxl.load_workbook(file, data_only=True)
ws = wb['Oscillation']
for row in range(1, 19):
value = ws[f'B{row}'].value
print(f"B{row}: {value}")
4. Verify Data Exists
Using SheetJS (JavaScript) to check raw Excel file structure:
const cell = oscillationSheet['B2'];
console.log(cell.v); // 2.64
console.log(cell.t); // 'n' (number type)
✅ Confirmed data actually exists in the Excel file.
🎯 Root Cause Analysis
Pandas' "Smart" Recognition Mechanism Failure
1. Incorrect Data Area Judgment
- Excel internal
dimensiontag defines worksheet range asA1:C18 - Cell B1 is empty
- Pandas may incorrectly judge B2-B15 are not in valid data area during parsing due to B1 being empty + some internal marker anomalies
2. Difference Between pandas vs openpyxl Reading
pandas.read_excel():
└─ Calls openpyxl
└─ Smart data area recognition (may fail)
└─ Data type conversion and optimization
└─ Skip "invalid" areas
openpyxl.iter_rows():
└─ Iterate directly by row
└─ Read all cells
└─ No smart judgment (more reliable)
3. Possible File Format Issues
- Excel file may have special internal markers
- First 15 rows of column B may be marked as "blank" rather than NULL
- Pandas' parser cannot properly handle this edge case
✅ Solution
Best Solution: Bypass pandas for Direct Reading
import pandas as pd
import openpyxl
def read_excel_robust(filepath, sheet_name):
"""
Use openpyxl for direct reading, bypassing pandas' smart recognition
Suitable for "strange" Excel files where pandas reading fails
"""
wb = openpyxl.load_workbook(filepath, data_only=True)
ws = wb[sheet_name]
# Iterate all rows directly without assumptions
raw_rows = [row for row in ws.iter_rows(values_only=True)]
wb.close()
# Manually build DataFrame
df = pd.DataFrame(raw_rows)
return df
# Usage example
df = read_excel_robust('file.xlsx', 'Oscillation')
b_column = df[1] # Column B data (index starts from 0)
Solution Advantages
✅ Completely Reliable - Does not depend on pandas' data area judgment ✅ Raw Data - Reads actual Excel cell content ✅ Strong Adaptability - Suitable for various "abnormal" Excel files ✅ Transparent and Controllable - Clear understanding of every step
📚 Key Parameter Description
openpyxl.load_workbook(
filename,
data_only=True, # ⭐ Read calculated formula values, not formulas themselves
read_only=False # Set to True for read-only to improve performance
)
🚀 Production Environment Best Practices
def read_oscillation_columns(filepath: str, retries: int = 2):
"""
Production-grade Excel reading function
Includes retry mechanism, error handling, resource management
"""
last_err = None
# Retry mechanism (handle file being occupied)
for attempt in range(retries + 1):
try:
wb = openpyxl.load_workbook(filepath, data_only=True)
break
except PermissionError as e:
last_err = e
if attempt < retries:
import time
time.sleep(0.4)
except Exception as e:
return None
else:
return None
try:
if 'Oscillation' not in wb.sheetnames:
return None
ws = wb['Oscillation']
raw_rows = [r for r in ws.iter_rows(values_only=True)]
# Ensure resource release
wb.close()
if not raw_rows:
return None
df = pd.DataFrame(raw_rows)
return df
except Exception:
wb.close()
return None
💡 Experience Summary
Core Lessons
Don't Blindly Trust Pandas' "Smart" Features
- Pandas' smart recognition works well in most cases
- But may fail with specially formatted Excel files
- Direct reading is more reliable for critical data processing
Layered Diagnosis
Pandas reading failed ↓ Try different pandas parameters ↓ Use underlying library (openpyxl) for direct reading ↓ Check raw file structure (SheetJS/openpyxl)Robust Design for Production Environment
- Retry mechanism (handle file occupation)
- Comprehensive error handling
- Resource management (ensure wb.close())
- Logging (facilitate debugging)
Applicable Scenarios
This solution is especially suitable for:
- ❌ Pandas reading returns abnormal data
- ❌ Incorrect data area judgment
- ❌ Specially formatted Excel files
- ❌ Scenarios requiring precise control over reading process
Performance Considerations
- Direct openpyxl reading is slightly slower than optimized pandas reading
- But for "strange" files, Reliability > Performance
- For large files, consider:
read_only=Trueparameter- Read only specific needed ranges
- Use
iter_rows(min_row=x, max_row=y)to limit range
🔗 Related Resources
- openpyxl Official Documentation
- pandas read_excel Documentation
- Project code:
oscillation_aggregator.py
🔗 Related Notes
- [Python Data Processing Best Practices](/Python Data Processing Best Practices)
- [Production Environment Error Handling Strategy](/Production Environment Error Handling Strategy)
- [Handling Corrupted Excel Files](/Handling Corrupted Excel Files)
Last Updated: 2025-10-16 Author: Your Name
Related Posts
Complete Guide to Migrating from GitHub Pages to Cloudflare Pages
Detailed documentation of the complete migration process from GitHub Pages to Cloudflare Pages, including global CDN acceleration, DNS configuration, and build settings optimization.
Fixing NoMachine Menu Unresponsive Issue on macOS 26
Solving the bug in NoMachine 9.2.14 on macOS 26 (Tahoe) where secondary menus cannot be clicked, completely resolving menu unresponsiveness by modifying configuration files.