Data Formatting Tips for Analysis with LLMs

Embrace Tidy Data Principles

LLMs perform best with well-structured, tidy data that follows these principles:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy Example: Daily Temperature Readings

date        city       temperature_c
2024-01-01  New York   3.2
2024-01-01  San Diego  15.8
2024-01-02  New York   2.4
2024-01-02  San Diego  16.1
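
To illustrate why this layout is convenient, here is a minimal sketch using only the Python standard library. The variable names (`readings`, `new_york`, `mean_temp`) are illustrative choices, not part of the original example; the values are copied from the table above.

```python
from collections import defaultdict
from statistics import mean

# Tidy rows copied from the example table: one observation per row.
readings = [
    {"date": "2024-01-01", "city": "New York", "temperature_c": 3.2},
    {"date": "2024-01-01", "city": "San Diego", "temperature_c": 15.8},
    {"date": "2024-01-02", "city": "New York", "temperature_c": 2.4},
    {"date": "2024-01-02", "city": "San Diego", "temperature_c": 16.1},
]

# Filtering is a simple test on a single column.
new_york = [r for r in readings if r["city"] == "New York"]

# Aggregating is a group-by on a single column.
by_city = defaultdict(list)
for r in readings:
    by_city[r["city"]].append(r["temperature_c"])
mean_temp = {city: mean(temps) for city, temps in by_city.items()}
```

Because each variable has its own column, neither step needs to know how many cities or dates the data contains.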

❌ Poorly Formatted Example 1: Column Headers as Values

date        New York  San Diego
2024-01-01  3.2       15.8
2024-01-02  2.4       16.1

Why it’s bad:

  • City names are column headers, not values in a column.
  • Makes it hard to filter or aggregate by city.
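
One way to repair this layout is to "melt" the city columns into rows. The sketch below is a hand-written, standard-library version under the assumption that every column other than `date` holds a city; the helper name `melt` and its parameters are hypothetical and chosen for this example.

```python
# Wide table from the example above: one column per city.
wide = [
    {"date": "2024-01-01", "New York": 3.2, "San Diego": 15.8},
    {"date": "2024-01-02", "New York": 2.4, "San Diego": 16.1},
]

def melt(rows, id_col, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the id column is carried along, not melted
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return long_rows

tidy = melt(wide, id_col="date", var_name="city", value_name="temperature_c")
```

After this step the city is an ordinary value, so filtering by city no longer depends on knowing the column names in advance.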

❌ Poorly Formatted Example 2: Repeating Column Names for Each Variable

date        New York Temp  San Diego Temp  New York Humidity  San Diego Humidity
2024-01-01  3.2            15.8            65%                50%
2024-01-02  2.4            16.1            70%                55%

Why it’s bad:

  • Column names encode multiple dimensions (city, variable).
  • Hard to reshape or to filter by city and date.
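
A possible cleanup, sketched with the standard library, splits each header into its city and variable parts and collects one row per date and city. It assumes the variable is always the last word of the header and that humidity is stored as a string ending in `%`; the output names `temperature_c` and `humidity_pct` and the helper `tidy_up` are hypothetical.

```python
# Wide table from the example above: headers encode both city and variable.
wide = [
    {"date": "2024-01-01", "New York Temp": 3.2, "San Diego Temp": 15.8,
     "New York Humidity": "65%", "San Diego Humidity": "50%"},
    {"date": "2024-01-02", "New York Temp": 2.4, "San Diego Temp": 16.1,
     "New York Humidity": "70%", "San Diego Humidity": "55%"},
]

# Hypothetical tidy column names chosen for this sketch.
TIDY_NAMES = {"Temp": "temperature_c", "Humidity": "humidity_pct"}

def tidy_up(rows):
    merged = {}  # (date, city) -> tidy row
    for row in rows:
        for header, value in row.items():
            if header == "date":
                continue
            # Assumes the variable is the last word, e.g. "New York Temp".
            city, variable = header.rsplit(" ", 1)
            if isinstance(value, str):
                value = float(value.rstrip("%"))  # "65%" -> 65.0
            record = merged.setdefault(
                (row["date"], city), {"date": row["date"], "city": city}
            )
            record[TIDY_NAMES[variable]] = value
    return list(merged.values())

tidy = tidy_up(wide)
```

The result has one row per date and city, with temperature and humidity as separate numeric columns.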

❌ Poorly Formatted Example 3: Redundant Rows and Mixed Observations

date        city       temperature_c  comment
2024-01-01  New York   3.2            Cold morning
2024-01-01  New York   3.2            Windy
2024-01-01  San Diego  15.8           Warm and sunny

Why it’s bad:

  • The same measurement is repeated across multiple rows for a single observation.
  • Comments are a different type of observation and belong in a separate table.
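
One way to separate the two kinds of observation is to build a measurements table with one row per date and city, plus a comments table that refers back to it by the same keys. This is a sketch under the assumption that `(date, city)` uniquely identifies a measurement; the table names are illustrative.

```python
# Mixed rows from the example above.
rows = [
    {"date": "2024-01-01", "city": "New York", "temperature_c": 3.2, "comment": "Cold morning"},
    {"date": "2024-01-01", "city": "New York", "temperature_c": 3.2, "comment": "Windy"},
    {"date": "2024-01-01", "city": "San Diego", "temperature_c": 15.8, "comment": "Warm and sunny"},
]

# Measurements: keep the first row seen for each (date, city) key.
measurements = []
seen = set()
for r in rows:
    key = (r["date"], r["city"])
    if key not in seen:
        seen.add(key)
        measurements.append(
            {"date": r["date"], "city": r["city"], "temperature_c": r["temperature_c"]}
        )

# Comments: one row per comment, linked to a measurement by date and city.
comments = [
    {"date": r["date"], "city": r["city"], "comment": r["comment"]} for r in rows
]
```

Averages computed from `measurements` are no longer skewed by how many comments a reading happens to have.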

❌ Poorly Formatted Example 4: Transposed Layout

variable    New York  San Diego
2024-01-01  3.2       15.8
2024-01-02  2.4       16.1

Why it’s bad:

  • Dates sit in a column generically labeled variable rather than in a properly named date column.
  • Difficult to filter by date or apply time-based operations.
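
A possible repair, sketched with the standard library, gives the dates a real `date` column and parses them into date values while moving the cities into rows. It assumes the `variable` column holds ISO-formatted dates, as in the example; the names `tidy` and `from_jan_2` are illustrative.

```python
from datetime import date

# Table from the example above: dates hide under a generic "variable" header.
table = [
    {"variable": "2024-01-01", "New York": 3.2, "San Diego": 15.8},
    {"variable": "2024-01-02", "New York": 2.4, "San Diego": 16.1},
]

# Rename "variable" to "date", parse it, and emit one row per date and city.
tidy = [
    {"date": date.fromisoformat(row["variable"]), "city": city, "temperature_c": value}
    for row in table
    for city, value in row.items()
    if city != "variable"
]

# With real date values, time-based filters become ordinary comparisons.
from_jan_2 = [r for r in tidy if r["date"] >= date(2024, 1, 2)]
```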

❌ Poorly Formatted Example 5: Multiple Variables in One Column

date        location_temp
2024-01-01  New York:3.2
2024-01-01  San Diego:15.8
2024-01-02  New York:2.4

Why it’s bad:

  • The location_temp column combines two variables.
  • Requires string parsing to separate city and temperature.
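
The string parsing mentioned above might look like the following sketch. It assumes the temperature always follows the last colon in `location_temp`; the output column names `city` and `temperature_c` are borrowed from the tidy example.

```python
# Rows from the example above: city and temperature share one column.
rows = [
    {"date": "2024-01-01", "location_temp": "New York:3.2"},
    {"date": "2024-01-01", "location_temp": "San Diego:15.8"},
    {"date": "2024-01-02", "location_temp": "New York:2.4"},
]

tidy = []
for r in rows:
    # Split on the last colon so a colon inside a city name would not break parsing.
    city, temp = r["location_temp"].rsplit(":", 1)
    tidy.append(
        {"date": r["date"], "city": city.strip(), "temperature_c": float(temp)}
    )
```

Doing this split once, before analysis, gives a numeric temperature column that can be averaged or compared directly.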