Data Formatting Best Practices for Analysis with LLMs

How you structure your data is just as important as the data itself.

Large Language Models (LLMs) like Claude, ChatGPT, and others have revolutionized the way we analyze data by generating code on demand—but their effectiveness depends heavily on how well your data is formatted.

This guide explores essential data formatting practices that optimize your datasets for AI-powered analysis. By following these principles, you'll enable AI assistants to generate more accurate, efficient, and useful code with fewer iterations and clarifications. Well-formatted data allows AI models to:

  • Understand your dataset's structure more quickly
  • Generate more accurate code on the first attempt
  • Produce analyses that actually answer your questions
  • Minimize preprocessing steps and error handling

Let's dive into the critical formatting practices that set your data up for successful and accurate analysis with LLMs.

Use Tabular Data Formats

LLMs work more efficiently with straightforward tabular data formats:

  1. CSV and Excel files are directly parsable
  2. Parquet files offer efficient columnar storage
  3. Avoid nested formats when simple tables will suffice

Tabular Example: Daily Temperature Readings (CSV)

date,city,temperature_c,humidity_pct
2024-01-01,New York,3.2,65
2024-01-01,San Diego,15.8,50
2024-01-02,New York,2.4,70
2024-01-02,San Diego,16.1,55
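As a quick illustration, here is a sketch (assuming pandas is installed; the CSV is inlined via `io.StringIO` rather than read from disk) of how little code this flat layout needs before analysis can begin:

```python
import io

import pandas as pd

# The flat CSV loads in a single call, with no reshaping needed.
csv_text = """date,city,temperature_c,humidity_pct
2024-01-01,New York,3.2,65
2024-01-01,San Diego,15.8,50
2024-01-02,New York,2.4,70
2024-01-02,San Diego,16.1,55
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Analysis is immediate: for example, mean temperature per city.
mean_temp = df.groupby("city")["temperature_c"].mean()
```

One `read_csv` call and one `groupby` is the entire program; that is the kind of first-attempt code you want an LLM to be able to generate.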

❌ Poorly Formatted Example: Nested JSON Structure

{
  "readings": [
    {
      "date": "2024-01-01",
      "locations": {
        "New York": {
          "temperature_c": 3.2,
          "humidity_pct": 65
        },
        "San Diego": {
          "temperature_c": 15.8,
          "humidity_pct": 50
        }
      }
    },
    {
      "date": "2024-01-02",
      "locations": {
        "New York": {
          "temperature_c": 2.4,
          "humidity_pct": 70
        },
        "San Diego": {
          "temperature_c": 16.1,
          "humidity_pct": 55
        }
      }
    }
  ]
}

Why it's bad:

  • Deeply nested structures require complex transformations.
  • LLMs need to write more code to extract and analyze data.
  • Harder to visualize the full dataset at a glance.
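To make the contrast concrete, here is a sketch (assuming pandas) of the extra unnesting code an LLM has to write for the JSON structure before any analysis can start:

```python
import pandas as pd

nested = {
    "readings": [
        {"date": "2024-01-01",
         "locations": {"New York": {"temperature_c": 3.2, "humidity_pct": 65},
                       "San Diego": {"temperature_c": 15.8, "humidity_pct": 50}}},
        {"date": "2024-01-02",
         "locations": {"New York": {"temperature_c": 2.4, "humidity_pct": 70},
                       "San Diego": {"temperature_c": 16.1, "humidity_pct": 55}}},
    ]
}

# Manual unnesting: loop over dates, then over cities, to build flat rows.
rows = [
    {"date": reading["date"], "city": city, **values}
    for reading in nested["readings"]
    for city, values in reading["locations"].items()
]
df = pd.DataFrame(rows)
```

Only after this transformation does the data match the flat CSV above; a flat file skips this step entirely.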

Avoid Empty Rows and Columns

Position your data in the top-left corner and eliminate unnecessary blank spaces:

  1. Start your table in row 1, column 1 (cell A1 in spreadsheets)
  2. Remove completely empty rows and columns
  3. Fill missing values appropriately or use standard null indicators
  4. Avoid hidden rows, columns, or worksheets

❌ Poorly Formatted Example: Misplaced Table with Empty Areas

The table begins in cell D4 instead of A1, with empty rows and columns:


    A   B   C   D            E           F              G              H
1
2
3
4               date         city        temperature_c  humidity_pct
5               2024-01-01   New York    3.2            65
6               2024-01-01   San Diego   15.8           50
7               2024-01-02   New York    2.4            70
8
9               2024-01-02   San Diego                  55

Why it's bad:

  • Table doesn't start in cell A1, making it harder for code to locate and process
  • Three empty rows at the top create confusion for automatic parsing
  • Three empty columns before the data waste space and complicate analysis
  • Empty row in the middle (row 8) disrupts data continuity
  • Missing temperature value for San Diego on Jan 2 requires special handling
  • LLMs must first locate the actual data boundaries before analysis can begin
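The workaround code looks something like this sketch (assuming pandas; the misplaced sheet is simulated as an inline CSV export): the reader must discover the right `skiprows` and `usecols` values before the data can even be loaded.

```python
import io

import pandas as pd

# Simulated export of the misplaced sheet: three blank rows and three
# blank columns before the table, plus an empty row in the middle.
raw = """,,,,,,,
,,,,,,,
,,,,,,,
,,,date,city,temperature_c,humidity_pct,
,,,2024-01-01,New York,3.2,65,
,,,2024-01-01,San Diego,15.8,50,
,,,2024-01-02,New York,2.4,70,
,,,,,,,
,,,2024-01-02,San Diego,,55,
"""

# Extra work caused by the layout: skip the blank leading rows, select
# only the populated columns, then drop the empty row in the middle.
df = pd.read_csv(io.StringIO(raw), skiprows=3, usecols=[3, 4, 5, 6])
df = df.dropna(how="all")
```

Had the table started at A1 with no gaps, the `skiprows`, `usecols`, and `dropna` steps would all be unnecessary (the genuinely missing temperature value still needs handling either way).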

Avoid Extraneous Information in the Same File

Keep your data focused and avoid mixing different types of information:

  1. Separate metadata from actual data
  2. Avoid explanatory text or instructions within data files
  3. Use separate files for different data domains

❌ Poorly Formatted Example: Spreadsheet with Mixed Content Types

This file mixes data tables with text explanations and notes:

     A                                           B           C              D
 1   DAILY TEMPERATURE AND HUMIDITY READINGS
 2   Collected by Weather Monitoring Station
 3   Contact: weather@example.com
 4
 5   date                                        city        temperature_c  humidity_pct
 6   2024-01-01                                  New York    3.2            65
 7   2024-01-01                                  San Diego   15.8           50
 8
 9   NOTES:
10   - New York had light precipitation on Jan 1
11   - San Diego measurements taken at coastal station
12
13   date                                        city        temperature_c  humidity_pct
14   2024-01-02                                  New York    2.4            70
15   2024-01-02                                  San Diego   16.1           55

Why it's bad:

  • Header text and notes interrupt the data structure
  • Data is split into two separate tables that should be one continuous dataset
  • Text content cannot be parsed as tabular data
  • Requires manual preprocessing to extract and combine the actual data
  • LLMs must determine which parts are actual data versus supplementary information
  • Makes automated analysis significantly more complex and error-prone
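A sketch of the salvage code this layout forces (assuming pandas, with the sheet simulated as an inline CSV export): the script has to guess which rows are data by pattern-matching on dates, a brittle heuristic that clean files never need.

```python
import io
import re

import pandas as pd

# Simulated export of the mixed sheet: titles, notes, and two table
# fragments all in one file.
raw = """DAILY TEMPERATURE AND HUMIDITY READINGS,,,
Collected by Weather Monitoring Station,,,
Contact: weather@example.com,,,
,,,
date,city,temperature_c,humidity_pct
2024-01-01,New York,3.2,65
2024-01-01,San Diego,15.8,50
,,,
NOTES:,,,
- New York had light precipitation on Jan 1,,,
- San Diego measurements taken at coastal station,,,
,,,
date,city,temperature_c,humidity_pct
2024-01-02,New York,2.4,70
2024-01-02,San Diego,16.1,55
"""

# Salvage step: keep only lines that start with an ISO date, then
# re-attach a single header so the fragments merge into one table.
data_lines = [ln for ln in raw.splitlines() if re.match(r"\d{4}-\d{2}-\d{2}", ln)]
csv_text = "date,city,temperature_c,humidity_pct\n" + "\n".join(data_lines)
df = pd.read_csv(io.StringIO(csv_text))
```

Keeping the metadata and notes in a separate file (or a README) would have reduced this to a single `read_csv` call.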

Include Meaningful Column Names or Headers

Clear, descriptive column headers improve code generation accuracy:

  1. Use concise yet descriptive names
  2. Avoid spaces (use underscores or camelCase instead)
  3. Be consistent with naming conventions
  4. Include units of measurement when relevant

Good Example: Descriptive Column Headers

measurement_date  city_name  temperature_celsius  relative_humidity_percent
2024-01-01        New York   3.2                  65
2024-01-01        San Diego  15.8                 50
2024-01-02        New York   2.4                  70
2024-01-02        San Diego  16.1                 55

❌ Poorly Formatted Example: Ambiguous or Missing Column Headers

dt     loc  tmp   rh  p_mb
1/1    NY   3.2   65  1013  12  Light morning frost
1/1    SD   15.8  50  1012  8   Mild coastal breeze
2-Jan  NY   2.4   70  1010  15  Overcast conditions
2-Jan  SD   16.1  55  1011  10  Partly cloudy, humid

Why it's bad:

  • Abbreviated headers (dt, tmp, rh) are ambiguous.
  • Missing headers for the last two columns.
  • Inconsistent date formats (1/1 vs 2-Jan).
  • Units of measurement are missing from column names.
  • No indication that p_mb means pressure in millibars.
  • Hard for LLMs to understand column meanings without context.
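Cleaning this table up requires guesswork, as this sketch shows (assuming pandas; the renamed `extra_reading` and `conditions` columns are guesses, since nothing in the file says what those columns mean):

```python
import io

import pandas as pd

# The ambiguous table, with two headerless trailing columns.
raw = """dt,loc,tmp,rh,p_mb,,
1/1,NY,3.2,65,1013,12,Light morning frost
1/1,SD,15.8,50,1012,8,Mild coastal breeze
2-Jan,NY,2.4,70,1010,15,Overcast conditions
2-Jan,SD,16.1,55,1011,10,"Partly cloudy, humid"
"""

df = pd.read_csv(io.StringIO(raw))
# The two headerless columns come back as 'Unnamed: 5' and 'Unnamed: 6'.
df = df.rename(columns={
    "dt": "date",
    "loc": "city",
    "tmp": "temperature_c",
    "rh": "humidity_pct",
    "p_mb": "pressure_mb",
    "Unnamed: 5": "extra_reading",  # meaning unknown without a data dictionary
    "Unnamed: 6": "conditions",     # a guess based on the values
})
# The two inconsistent date formats have to be normalized by hand.
df["date"] = df["date"].replace({"1/1": "2024-01-01", "2-Jan": "2024-01-02"})
```

With descriptive headers and consistent dates, none of this renaming or normalization would be needed.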

Embrace Tidy Data Principles

LLMs perform best with well-structured, tidy data that follows these principles:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy Example: Daily Temperature Readings

date        city       temperature_c
2024-01-01  New York   3.2
2024-01-01  San Diego  15.8
2024-01-02  New York   2.4
2024-01-02  San Diego  16.1

❌ Poorly Formatted Example: Column Headers as Values

date        New York  San Diego
2024-01-01  3.2       15.8
2024-01-02  2.4       16.1

Why it’s bad:

  • City names are column headers, not values in a column.
  • Makes it hard to filter or aggregate by city.
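Recovering a tidy shape from this layout takes a `melt` step, sketched here (assuming pandas):

```python
import pandas as pd

# Wide table with city names as column headers (the untidy layout above).
wide = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "New York": [3.2, 2.4],
    "San Diego": [15.8, 16.1],
})

# melt() turns the city headers back into values of a 'city' column.
tidy = wide.melt(id_vars="date", var_name="city", value_name="temperature_c")
```

After melting, filtering by city becomes an ordinary row selection, e.g. `tidy[tidy["city"] == "New York"]`.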

❌ Poorly Formatted Example: Repeating Column Names for Each Variable

date        New York Temp  San Diego Temp  New York Humidity  San Diego Humidity
2024-01-01  3.2            15.8            65%                50%
2024-01-02  2.4            16.1            70%                55%

Why it’s bad:

  • Column names encode multiple dimensions (city, variable).
  • Hard to reshape, and filtering by city or variable requires parsing column names.
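Untangling the two encoded dimensions takes a melt, a string split, and a pivot, as this sketch shows (assuming pandas):

```python
import pandas as pd

wide = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "New York Temp": [3.2, 2.4],
    "San Diego Temp": [15.8, 16.1],
    "New York Humidity": ["65%", "70%"],
    "San Diego Humidity": ["50%", "55%"],
})

# Step 1: melt everything into (date, combined header, value) triples.
long = wide.melt(id_vars="date", var_name="city_variable")
# Step 2: split the combined header into its two dimensions.
long[["city", "variable"]] = long["city_variable"].str.rsplit(" ", n=1, expand=True)
# Step 3: pivot back so each variable gets its own column.
tidy = long.pivot(index=["date", "city"], columns="variable",
                  values="value").reset_index()
```

Three reshaping steps, all avoidable if the data had shipped with `city`, `temperature`, and `humidity` as separate columns.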

❌ Poorly Formatted Example: Redundant Rows and Mixed Observations

date        city       temperature_c  comment
2024-01-01  New York   3.2            Cold morning
2024-01-01  New York   3.2            Windy
2024-01-01  San Diego  15.8           Warm and sunny

Why it’s bad:

  • Each measurement is duplicated across multiple rows, one per comment.
  • Comments are a different type of observational unit and belong in their own table.
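Splitting the two observational units apart looks like this sketch (assuming pandas):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01"] * 3,
    "city": ["New York", "New York", "San Diego"],
    "temperature_c": [3.2, 3.2, 15.8],
    "comment": ["Cold morning", "Windy", "Warm and sunny"],
})

# One table of unique measurements per (date, city)...
measurements = df[["date", "city", "temperature_c"]].drop_duplicates()
# ...and one table of comments keyed by the same (date, city) pair.
comments = df[["date", "city", "comment"]]
```

Each table now holds one type of observational unit, and the two can still be joined on `date` and `city` when needed.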

❌ Poorly Formatted Example: Transposed Layout

variable    New York  San Diego
2024-01-01  3.2       15.8
2024-01-02  2.4       16.1

Why it’s bad:

  • Dates appear as values under a generic "variable" column instead of in a proper date column.
  • Difficult to filter by date or apply time-based operations.
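Recovering tidy data here means first relabeling the generic "variable" column as dates and then melting, as in this sketch (assuming pandas):

```python
import pandas as pd

wide = pd.DataFrame({
    "variable": ["2024-01-01", "2024-01-02"],
    "New York": [3.2, 2.4],
    "San Diego": [15.8, 16.1],
})

# Rename the mislabeled column, then melt the city headers into rows.
tidy = (wide.rename(columns={"variable": "date"})
            .melt(id_vars="date", var_name="city", value_name="temperature_c"))
# Only now can time-based operations work on a real datetime column.
tidy["date"] = pd.to_datetime(tidy["date"])
```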

❌ Poorly Formatted Example: Multiple Variables in One Column

date        location_temp
2024-01-01  New York:3.2
2024-01-01  San Diego:15.8
2024-01-02  New York:2.4

Why it’s bad:

  • The location_temp column combines two variables.
  • Requires string parsing to separate city and temperature.
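The string surgery this layout forces, sketched with pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "location_temp": ["New York:3.2", "San Diego:15.8", "New York:2.4"],
})

# Split the combined column on ':' into city and temperature parts,
# then convert the temperature strings back into numbers.
df[["city", "temperature_c"]] = df["location_temp"].str.split(":", expand=True)
df["temperature_c"] = df["temperature_c"].astype(float)
df = df.drop(columns="location_temp")
```

Storing `city` and `temperature_c` as separate columns from the start eliminates both the split and the type conversion.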