The 12 JSON-to-CSV Edge Cases That Will Ruin Your Data Pipeline

# The 12 JSON-to-CSV Edge Cases That Will Ruin Your Data Pipeline

💡 Key Takeaways

When Nested Arrays Cost a Fintech $1.2M in Reporting Errors
Building Pipelines That Process 2 Billion Events Monthly
How Shopify's Webhook Nearly Destroyed a Client's Inventory System
Comparing JSON-to-CSV Libraries: What Actually Breaks in Production

When Nested Arrays Cost a Fintech $1.2M in Reporting Errors

Last March, I got a panicked Slack message at 11 PM from a fintech client's CFO. Their quarterly board report showed revenue numbers that didn't match their internal dashboards. The discrepancy? $1.2 million. For six weeks, their automated JSON-to-CSV conversion had been silently mangling nested transaction arrays, and nobody caught it until the board deck was already in investors' hands.

The root cause was deceptively simple: their payment processor's webhook delivered transaction data as nested JSON arrays—one transaction could have multiple line items, each with its own tax calculation. When their ETL pipeline flattened this to CSV, it duplicated parent records for each child item but failed to deduplicate the transaction totals. Every multi-item transaction got counted multiple times.

This wasn't a junior engineer's mistake. This was a senior team using a popular open-source library that had 50,000+ GitHub stars. The library worked perfectly for simple JSON structures. But real-world SaaS data is never simple. It's nested, inconsistent, and full of edge cases that only reveal themselves in production—usually at the worst possible moment.

I've spent eight years building ETL pipelines for SaaS companies, from early-stage startups to public companies processing billions of events monthly. I've debugged data corruption at 3 AM, rebuilt pipelines that silently failed for months, and learned that JSON-to-CSV conversion is where most data integrity issues hide. Not in the database. Not in the API. In that seemingly trivial transformation step that everyone assumes "just works."

Building Pipelines That Process 2 Billion Events Monthly

My background is in data engineering for high-volume SaaS platforms. I've built ingestion pipelines for marketing automation tools processing 500M+ events daily, analytics platforms handling clickstream data from 100M+ users, and financial systems where a single corrupted record can trigger regulatory violations.

The pattern I see repeatedly: teams focus on scaling their databases and optimizing their APIs, but they treat data transformation as an afterthought. They'll spend weeks optimizing a Postgres query that saves 50ms, then use a naive JSON-to-CSV converter that silently corrupts 0.01% of records. That 0.01% compounds over time until you're explaining to your board why your metrics don't match reality.

JSON-to-CSV conversion sits at a critical junction in most data pipelines. It's where structured, hierarchical data gets flattened into tabular format for analytics tools, spreadsheets, and legacy systems. This transformation is lossy by nature—CSV can't represent nested structures—but most implementations handle this loss poorly. They make implicit decisions about how to flatten data without documenting those decisions or validating the results.

The tools don't help. Most JSON-to-CSV libraries are built for simple use cases: flat objects with consistent schemas. They break down when you feed them real-world API responses with optional fields, nested arrays, polymorphic types, and inconsistent structures. And they break silently. No errors. No warnings. Just subtly corrupted data that looks fine until someone runs a financial report.

I've learned to treat JSON-to-CSV conversion as a critical system component that needs the same rigor as database migrations or API contracts. That means comprehensive testing, explicit handling of edge cases, and validation at every step. Here's what that looks like in practice.

How Shopify's Webhook Nearly Destroyed a Client's Inventory System

Three years ago, I worked with an e-commerce analytics company that aggregated data from multiple platforms. They had a seemingly simple pipeline: ingest webhooks from Shopify, Stripe, and other services, convert to CSV, load into their data warehouse, and generate reports for merchants.

One Monday morning, their support team got flooded with tickets. Merchants were seeing impossible inventory numbers—negative stock counts, products showing as "in stock" when they were sold out, and order quantities that didn't match their actual sales. The analytics platform's data was completely wrong, and it had been wrong for three days before anyone noticed.

The culprit was Shopify's variant structure. In Shopify's API, a product can have multiple variants (size, color, etc.), and each variant has its own inventory count. The JSON structure looks like this:

```json

{

"product_id": "12345",

🛠 Explore Our Tools

Excel to CSV Converter — Free, Online, Preserves Data → JSON to XML Converter — Free, Instant → How to Clean CSV Data — Free Guide →

"title": "T-Shirt",

"variants": [

{"id": "v1", "size": "S", "inventory": 10},

{"id": "v2", "size": "M", "inventory": 15},

{"id": "v3", "size": "L", "inventory": 8}

]

}

```

Their CSV converter flattened this by creating one row per variant, which seems reasonable. But here's the edge case: when a variant sold out and Shopify removed it from the variants array, the converter didn't create a row for it. The downstream system interpreted "missing row" as "no change" rather than "inventory now zero." Products showed as in-stock when they'd actually sold out.

The fix required explicit handling: when flattening variants, they needed to maintain a reference list of all known variant IDs and explicitly write zero-inventory rows for variants that disappeared from the JSON. This turned a "simple" conversion into a stateful operation that required tracking historical data.

The most dangerous data bugs are the ones that produce plausible-looking output. If the conversion had crashed or produced obviously wrong data, they would have caught it immediately. Instead, it produced CSV files that looked perfectly normal—they just happened to be missing critical information.

This pattern repeats across every SaaS integration I've built. The edge cases aren't exotic scenarios—they're normal variations in how APIs represent data. But most conversion tools treat them as exotic, which means they fail silently.

Comparing JSON-to-CSV Libraries: What Actually Breaks in Production

I've tested every major JSON-to-CSV library across Node.js, Python, and Go. Here's what breaks when you feed them real-world SaaS data:

Library	Nested Arrays	Missing Fields	Type Inconsistency	Null Handling	Production Ready
json2csv (Node)	Duplicates parent	Empty string	Coerces to string	Empty string	⚠️ With config
pandas (Python)	Fails or truncates	NaN	Preserves type	NaN or empty	⚠️ Requires normalization
csvkit (Python)	Stringifies	Empty string	Coerces to string	Empty string	❌ Too lossy
encoding/csv (Go)	Manual handling	Manual handling	Manual handling	Manual handling	✅ Full control
Custom solution	Explicit strategy	Explicit strategy	Explicit strategy	Explicit strategy	✅ Recommended

The "Production Ready" column reflects whether I'd trust this library with financial data without extensive testing and validation. Most libraries work fine for simple cases but make implicit decisions about edge cases that may not match your requirements.

The key insight: there's no "correct" way to flatten nested JSON to CSV. It depends on your use case. Are you preserving data for archival? Preparing data for analytics? Generating reports for end users? Each use case requires different handling of nested structures, missing fields, and type inconsistencies.

For the fintech client I mentioned earlier, we ended up building a custom converter that explicitly handled their transaction structure. It took three days to build and test, but it saved them from future data corruption issues. The generic library would have taken 30 minutes to integrate—and cost them millions in corrupted reports.

Why "Just Use Pandas" Is Terrible Advice for Production Pipelines

Every time I mention JSON-to-CSV conversion challenges, someone suggests: "Just use pandas.json_normalize()." This advice is well-meaning but dangerous for production systems.

Pandas is an excellent tool for data analysis and exploration. It's not designed for production ETL pipelines. Here's why:

Memory consumption is unpredictable. Pandas loads entire datasets into memory. For a 100MB JSON file, pandas might consume 500MB-1GB of RAM depending on the data structure. In a production pipeline processing thousands of files daily, this memory overhead becomes a scaling bottleneck. I've seen pandas-based pipelines that worked fine in development (processing sample files) but crashed in production when they hit larger datasets. Error handling is analysis-focused, not pipeline-focused. When pandas encounters malformed JSON or unexpected data types, it often coerces values silently or raises exceptions that are hard to handle gracefully. In a production pipeline, you need fine-grained control over error handling: log the error, quarantine the bad record, continue processing, and alert the team. Pandas makes this difficult. Type inference causes subtle bugs. Pandas tries to infer data types automatically, which sounds helpful but causes problems. A field that's always been numeric suddenly gets a string value? Pandas might convert the entire column to strings, changing how downstream systems interpret the data. Or it might fail with a cryptic error. Either way, you're debugging at 2 AM.

The difference between analysis tools and production tools is how they handle edge cases. Analysis tools optimize for the common case and let you manually fix problems. Production tools must handle every edge case automatically and predictably, even if that means more upfront configuration.

I've replaced three pandas-based pipelines with custom solutions in the past two years. Each time, the team initially resisted—"Why rebuild something that works?"—until they saw the production metrics. The custom solutions used 80% less memory, handled errors gracefully, and processed data 3-5x faster. More importantly, they were predictable. When something went wrong, we knew exactly what happened and why.

Pandas has its place. I use it constantly for data exploration and one-off analysis. But for production ETL pipelines that need to run reliably at scale, you need tools designed for that purpose.

Twelve Edge Cases That Will Corrupt Your Data

Here are the specific scenarios that break most JSON-to-CSV converters. I've encountered every one of these in production systems:

1. Nested arrays with inconsistent lengths. When a JSON object contains arrays of different lengths, most converters either duplicate the parent record for each array element (causing overcounting) or truncate to the shortest array (losing data). Example: a customer record with 3 addresses but 5 phone numbers. How many CSV rows should this produce? 2. Null vs. missing vs. empty string. JSON distinguishes between `null`, missing fields, and empty strings. CSV doesn't. Most converters treat all three identically, losing semantic information. This matters when `null` means "unknown" but empty string means "intentionally blank." 3. Boolean values in inconsistent formats. Some APIs return booleans as `true/false`, others as `1/0`, others as `"true"/"false"` strings. When converting to CSV, you need consistent representation. I've seen pipelines where boolean fields ended up with five different values: `true`, `false`, `1`, `0`, and empty string. 4. Numeric strings that should stay strings. Phone numbers, ZIP codes, and account IDs often look like numbers but must be treated as strings. Most converters see "00123" and output "123", losing the leading zeros. This breaks phone numbers, postal codes, and any identifier with semantic leading zeros. 5. Arrays of primitives vs. arrays of objects. An array like `["red", "blue", "green"]` needs different handling than `[{"color": "red"}, {"color": "blue"}]`. Most converters handle one or the other well, but not both. And some APIs return both formats for the same field depending on the number of items. 6. Deeply nested objects that exceed CSV's flat structure. JSON can nest arbitrarily deep: `user.profile.settings.notifications.email.frequency`. CSV needs flat columns. Most converters either flatten with dot notation (`user.profile.settings.notifications.email.frequency`) or give up and stringify the nested object. Neither approach handles schema changes well. 7. Polymorphic fields that change type. A field that's usually a string but occasionally an object. Example: an `address` field that's normally a string but becomes an object with `street`, `city`, `state` when the user provides detailed information. Most converters crash or coerce everything to strings. 8. Unicode and special characters in field names. JSON allows any Unicode string as a key. CSV has restrictions on column names depending on the tool consuming it. Field names with commas, quotes, newlines, or non-ASCII characters break many CSV parsers. I've seen pipelines fail because an API returned a field named `"user's name"` with a curly quote. 9. Timestamp formats that vary by record. APIs sometimes return timestamps in multiple formats: ISO 8601, Unix epoch, or custom formats. A converter that doesn't normalize these produces CSV files where the same column has incompatible date formats. Excel interprets some as dates and others as strings, breaking sorting and filtering. 10. Large numbers that exceed CSV precision. JavaScript's Number type loses precision above 2^53. If your JSON contains 64-bit integers (common in database IDs), they may get corrupted during conversion. I've seen order IDs and transaction IDs get rounded, causing lookups to fail. 11. Circular references in nested objects. Some APIs return JSON with circular references (object A references object B, which references object A). Most converters either crash or infinite loop. Even if you detect the cycle, how do you represent it in CSV? 12. Schema changes between records. The first JSON object has fields A, B, C. The second has fields A, B, D. Most converters create columns based on the first record, then either ignore field D or crash when they encounter it. Real-world APIs evolve, and your converter needs to handle schema drift gracefully.

Each of these edge cases seems minor in isolation. But in a production pipeline processing millions of records, even a 0.01% failure rate means thousands of corrupted records. And these failures compound—a record that hits two edge cases is more likely to be completely corrupted than one that hits only one.

Testing Strategies That Actually Catch These Issues

Most teams test JSON-to-CSV conversion with happy-path data: well-formed JSON with consistent schemas. This catches obvious bugs but misses the edge cases that cause production issues.

Here's my testing strategy for production pipelines:

Generate adversarial test cases programmatically. Don't just test with real data samples. Create synthetic test cases that specifically target edge cases. I maintain a test suite with 200+ JSON files that cover every edge case I've encountered. Each file is designed to break naive converters in a specific way. Use property-based testing. Tools like Hypothesis (Python) or fast-check (JavaScript) generate random test inputs and verify invariants. For JSON-to-CSV conversion, useful invariants include: "every JSON object produces at least one CSV row," "no data is silently dropped," and "the conversion is deterministic." Test with production data samples. Synthetic tests catch known edge cases, but production data reveals unknown ones. I regularly sample 10,000 random records from production, run them through the converter, and validate the output. This catches issues like "field X is usually a string but was an object in 0.01% of records." Validate round-trip conversion when possible. If your use case allows, convert JSON to CSV and back to JSON, then compare the results. This won't work for all use cases (CSV is lossy), but it catches many issues. Any data that changes during round-trip conversion is a potential bug. Monitor conversion metrics in production. Track: records processed, records skipped, fields with null values, fields with type mismatches, and conversion time. Set up alerts for anomalies. If the null rate for a field suddenly jumps from 1% to 10%, something changed in the source data.

Testing data pipelines is fundamentally different from testing application code. You're not testing "does this function return the correct output for this input?" You're testing "does this function handle the infinite variety of real-world data without corrupting it?" The latter requires different techniques.

I've caught more bugs with property-based testing and production sampling than with traditional unit tests. Unit tests verify that your code does what you think it should do. Property-based testing verifies that your code handles inputs you didn't think about.

Architectural Patterns for Reliable Conversion

Beyond testing, the architecture of your conversion pipeline matters. Here's how I structure production pipelines to minimize data corruption:

Separate parsing from transformation from validation. Most converters do all three in one step, which makes debugging impossible. I split the pipeline into stages: (1) parse JSON and validate schema, (2) transform to intermediate representation, (3) validate transformation, (4) write CSV. Each stage can be tested and monitored independently. Use explicit schemas, not inference. Don't let the converter infer the schema from the first record. Define the expected schema explicitly: field names, types, nullability, and how to handle nested structures. This makes the conversion predictable and catches schema drift early. Implement quarantine for bad records. When a record fails conversion, don't crash the entire pipeline. Write the bad record to a quarantine location (S3 bucket, database table, etc.) with metadata about why it failed. This lets you continue processing good records while investigating failures. Version your conversion logic. When you change how the converter handles an edge case, version the change. This lets you reprocess historical data with the new logic and compare results. I've caught bugs by reprocessing old data with new conversion logic and finding discrepancies. Make conversion idempotent. Running the converter twice on the same input should produce identical output. This seems obvious but is hard to achieve when the converter maintains state (like tracking variant IDs). Idempotency makes it safe to reprocess data when you discover bugs.

The fintech client I mentioned earlier now has a conversion pipeline with these characteristics. It's more complex than their original solution—about 500 lines of code vs. 50—but it's caught dozens of data quality issues before they reached production. The investment paid for itself in the first month.

The Validation Script I Run Before Every Import

Here's the validation script I run before importing any CSV file generated from JSON. This catches most corruption issues before they contaminate your data warehouse:

1. Schema validation: Verify column names and count. Compare the CSV columns against your expected schema. Are all required columns present? Are there unexpected columns? Column count mismatches often indicate the converter failed to handle nested structures correctly. 2. Type validation: Check that each column contains the expected data type. For numeric columns, verify all values are valid numbers. For date columns, verify all values parse as dates. For enum columns, verify all values are in the allowed set. Type mismatches indicate the converter coerced data incorrectly. 3. Null rate validation: Compare null rates against historical baselines. If a column that's usually 99% populated is suddenly 50% null, something went wrong. This catches issues where the converter silently dropped data or failed to extract nested fields. 4. Uniqueness validation: Verify that ID columns are unique. If your JSON has unique IDs but your CSV has duplicates, the converter probably duplicated records when flattening nested arrays. This is the bug that cost my fintech client $1.2M. 5. Referential integrity validation: Check that foreign keys exist. If your CSV has order_id and customer_id columns, verify that every order_id appears in your orders table and every customer_id appears in your customers table. Broken references indicate data corruption. 6. Range validation: Verify numeric values are within expected ranges. Negative inventory counts, dates in the future, percentages over 100—these indicate conversion bugs. I've caught issues where the converter parsed "1,234" as "1" because it didn't handle thousand separators. 7. Consistency validation: Check that related fields are consistent. If you have both `total_amount` and `line_items` with individual amounts, verify they sum correctly. If you have `created_at` and `updated_at`, verify `updated_at >= created_at`. Inconsistencies indicate the converter mixed up data from different records. 8. Sample validation: Manually inspect random samples. Automated validation catches known issues, but manual inspection catches unknown ones. I randomly sample 100 records, look up the original JSON, and verify the CSV matches. This is tedious but catches subtle bugs. 9. Comparison validation: Compare against previous imports. If you're importing daily data, compare today's import against yesterday's. Are the distributions similar? Are there sudden spikes or drops? This catches issues where the source API changed format without warning. 10. Round-trip validation: Convert CSV back to JSON and compare. This only works if your conversion is reversible, but when it does, it's the most reliable validation. Any differences between the original JSON and the round-tripped JSON indicate data loss.

I run these validations in a pre-import script that takes about 30 seconds for a typical dataset. If any validation fails, the script blocks the import and alerts the team. This has saved me from importing corrupted data dozens of times.

The script is simple—mostly SQL queries and Python assertions—but it's the difference between catching data corruption before it spreads and explaining to your CEO why your quarterly numbers are wrong. In data engineering, boring validation scripts are more valuable than clever algorithms.

---

JSON-to-CSV conversion seems trivial until you encounter real-world data. Then it becomes a minefield of edge cases, each capable of silently corrupting your pipeline. The solution isn't better libraries—it's treating conversion as a critical system component that deserves rigorous testing, explicit error handling, and continuous validation.

Every production pipeline I've built includes comprehensive validation at the conversion step. It's not glamorous work, but it's the work that keeps your data trustworthy. And in a world where data drives decisions, trustworthy data is everything.

Disclaimer: This article is for informational purposes only. While we strive for accuracy, technology evolves rapidly. Always verify critical information from official sources. Some links may be affiliate links.