How you structure your data is just as important as the data itself.
Large language models (LLMs) such as Claude and ChatGPT can generate analysis code on demand, but how well that code works depends heavily on how your data is formatted.
This guide explores essential data formatting practices that optimize your datasets for AI-powered analysis. By following these principles, you'll enable AI assistants to generate more accurate, efficient, and useful code with fewer iterations and clarifications. Well-formatted data allows AI models to:
Let's dive into the critical formatting practices that set your data up for successful and accurate analysis with LLMs.
LLMs work more efficiently with straightforward tabular data formats:
date | city | temperature_c | humidity_pct |
---|---|---|---|
2024-01-01 | New York | 3.2 | 65 |
2024-01-01 | San Diego | 15.8 | 50 |
2024-01-02 | New York | 2.4 | 70 |
2024-01-02 | San Diego | 16.1 | 55 |
{
"readings": [
{
"date": "2024-01-01",
"locations": {
"New York": {
"temperature_c": 3.2,
"humidity_pct": 65
},
"San Diego": {
"temperature_c": 15.8,
"humidity_pct": 50
}
}
},
{
"date": "2024-01-02",
"locations": {
"New York": {
"temperature_c": 2.4,
"humidity_pct": 70
},
"San Diego": {
"temperature_c": 16.1,
"humidity_pct": 55
}
}
}
]
}
Why it's bad:
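A nested file like the JSON above can be flattened into the one-row-per-reading layout before it is shared with an assistant. The following is a minimal sketch using only the Python standard library; the function name `flatten_readings` is illustrative and not part of the original guide.

```python
import json

# Trimmed copy of the nested JSON shown above (first date only).
nested_json = """
{
  "readings": [
    {
      "date": "2024-01-01",
      "locations": {
        "New York": {"temperature_c": 3.2, "humidity_pct": 65},
        "San Diego": {"temperature_c": 15.8, "humidity_pct": 50}
      }
    }
  ]
}
"""


def flatten_readings(doc):
    """Return one flat dict per (date, city) pair."""
    rows = []
    for reading in doc["readings"]:
        for city, values in reading["locations"].items():
            # Promote the nesting levels (date, city) to ordinary columns.
            rows.append({"date": reading["date"], "city": city, **values})
    return rows


rows = flatten_readings(json.loads(nested_json))
for row in rows:
    print(row)
```

Each resulting dict has the same four keys as the tabular example, so it can be written straight to CSV with `csv.DictWriter`.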
Position your data in the top-left corner and eliminate unnecessary blank spaces:
The table begins in cell D4 instead of A1, with empty rows and columns:
![Visual representation of an Excel spreadsheet with data not starting in cell A1]
 | A | B | C | D | E | F | G | H |
---|---|---|---|---|---|---|---|---|
1 | | | | | | | | |
2 | | | | | | | | |
3 | | | | | | | | |
4 | | | | date | city | temperature_c | humidity_pct | |
5 | | | | 2024-01-01 | New York | 3.2 | 65 | |
6 | | | | 2024-01-01 | San Diego | 15.8 | 50 | |
7 | | | | 2024-01-02 | New York | 2.4 | 70 | |
8 | | | | | | | | |
9 | | | | 2024-01-02 | San Diego | | 55 | |
Why it's bad:
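If a sheet has already been exported with empty margins, the blank rows and columns can be stripped before the data is shared. This is a rough sketch that assumes the sheet has been read into a list of rows, with `None` for empty cells; `trim_blank_margins` and the sample `sheet` are hypothetical names used only for illustration.

```python
def trim_blank_margins(grid):
    """Drop rows and columns that contain no values at all."""

    def is_blank(cell):
        return cell is None or str(cell).strip() == ""

    # Remove rows in which every cell is empty.
    rows = [list(row) for row in grid if not all(is_blank(c) for c in row)]
    if not rows:
        return []

    # Pad ragged rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [None] * (width - len(row)) for row in rows]

    # Keep only columns that hold a value in at least one row.
    keep = [i for i in range(width) if not all(is_blank(row[i]) for row in rows)]
    return [[row[i] for i in keep] for row in rows]


# Hypothetical sheet: the table starts in D4 and has a blank row inside it.
blank = [None] * 8
sheet = [
    blank,
    blank,
    blank,
    [None, None, None, "date", "city", "temperature_c", "humidity_pct", None],
    [None, None, None, "2024-01-01", "New York", 3.2, 65, None],
    blank,
    [None, None, None, "2024-01-01", "San Diego", 15.8, 50, None],
]

table = trim_blank_margins(sheet)
for row in table:
    print(row)
```

A column that is empty in only some rows is kept, so genuinely missing values are preserved rather than silently shifted.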
Keep your data focused and avoid mixing different types of information:
This file mixes data tables with text explanations and notes:
 | A | B | C | D |
---|---|---|---|---|
1 | DAILY TEMPERATURE AND HUMIDITY READINGS | |||
2 | Collected by Weather Monitoring Station | |||
3 | Contact: weather@example.com | |||
4 | ||||
5 | date | city | temperature_c | humidity_pct |
6 | 2024-01-01 | New York | 3.2 | 65 |
7 | 2024-01-01 | San Diego | 15.8 | 50 |
8 | ||||
9 | NOTES: | |||
10 | - New York had light precipitation on Jan 1 | |||
11 | - San Diego measurements taken at coastal station | |||
12 | ||||
13 | date | city | temperature_c | humidity_pct |
14 | 2024-01-02 | New York | 2.4 | 70 |
15 | 2024-01-02 | San Diego | 16.1 | 55 |
Why it's bad:
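One way to recover a clean table from a mixed file like the one above is to keep only the rows that look like data. The sketch below assumes the sheet was exported as CSV and that every data row starts with an ISO date; `extract_records` and `is_iso_date` are illustrative helpers, not functions from any library.

```python
import csv
import io
from datetime import date

# CSV export of the mixed sheet shown above.
raw = """DAILY TEMPERATURE AND HUMIDITY READINGS,,,
Collected by Weather Monitoring Station,,,
Contact: weather@example.com,,,
,,,
date,city,temperature_c,humidity_pct
2024-01-01,New York,3.2,65
2024-01-01,San Diego,15.8,50
,,,
NOTES:,,,
- New York had light precipitation on Jan 1,,,
- San Diego measurements taken at coastal station,,,
,,,
date,city,temperature_c,humidity_pct
2024-01-02,New York,2.4,70
2024-01-02,San Diego,16.1,55
"""

COLUMNS = ["date", "city", "temperature_c", "humidity_pct"]


def is_iso_date(text):
    """Return True if text parses as YYYY-MM-DD."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def extract_records(text, columns):
    """Keep rows whose first cell is a date; skip titles, notes, repeated headers."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if row and is_iso_date(row[0].strip()):
            records.append(dict(zip(columns, row)))
    return records


records = extract_records(raw, COLUMNS)
for record in records:
    print(record)
```

The values stay as strings here; the titles and notes are better moved to a separate README or metadata file than discarded.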
Clear, descriptive column headers improve code generation accuracy:
measurement_date | city_name | temperature_celsius | relative_humidity_percent |
---|---|---|---|
2024-01-01 | New York | 3.2 | 65 |
2024-01-01 | San Diego | 15.8 | 50 |
2024-01-02 | New York | 2.4 | 70 |
2024-01-02 | San Diego | 16.1 | 55 |
dt | loc | tmp | rh | p_mb | ||
---|---|---|---|---|---|---|
1/1 | NY | 3.2 | 65 | 1013 | 12 | Light morning frost |
1/1 | SD | 15.8 | 50 | 1012 | 8 | Mild coastal breeze |
2-Jan | NY | 2.4 | 70 | 1010 | 15 | Overcast conditions |
2-Jan | SD | 16.1 | 55 | 1011 | 10 | Partly cloudy, humid |
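Headers and dates like those in the table above can be repaired with an explicit mapping before the file is shared. The sketch below is one way to do that with the standard library. The target name `pressure_millibars`, the assumed year 2024, and the reading of `1/1` as month/day are assumptions made for illustration, and the two unlabeled columns are left out because the table does not say what they hold.

```python
from datetime import datetime

# Abbreviated header -> descriptive header (names follow the clearer table above).
HEADER_MAP = {
    "dt": "measurement_date",
    "loc": "city_name",
    "tmp": "temperature_celsius",
    "rh": "relative_humidity_percent",
    "p_mb": "pressure_millibars",  # assumed name
}
CITY_NAMES = {"NY": "New York", "SD": "San Diego"}
DATE_FORMATS = ("%m/%d", "%d-%b")  # assumes 1/1 is month/day


def normalize_date(text, year):
    """Parse either date style and return an ISO 8601 string."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(f"{text} {year}", f"{fmt} %Y")
            return parsed.date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean_row(row, year=2024):
    """Rename headers, expand city codes, and normalize the date."""
    cleaned = {HEADER_MAP[key]: value for key, value in row.items()}
    cleaned["measurement_date"] = normalize_date(cleaned["measurement_date"], year)
    cleaned["city_name"] = CITY_NAMES.get(cleaned["city_name"], cleaned["city_name"])
    return cleaned


raw_rows = [
    {"dt": "1/1", "loc": "NY", "tmp": 3.2, "rh": 65, "p_mb": 1013},
    {"dt": "2-Jan", "loc": "SD", "tmp": 16.1, "rh": 55, "p_mb": 1011},
]
clean_rows = [clean_row(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```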
Why it's bad:
- Abbreviated headers (`dt`, `tmp`, `rh`) are ambiguous.
- Date formats are inconsistent (`1/1` vs `2-Jan`).
- Nothing indicates that `p_mb` means pressure in millibars.

LLMs perform best with well-structured, tidy data that follows these principles:
date | city | temperature_c |
---|---|---|
2024-01-01 | New York | 3.2 |
2024-01-01 | San Diego | 15.8 |
2024-01-02 | New York | 2.4 |
2024-01-02 | San Diego | 16.1 |
date | New York | San Diego |
---|---|---|
2024-01-01 | 3.2 | 15.8 |
2024-01-02 | 2.4 | 16.1 |
Why it’s bad:
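A wide table like this one, with a column per city, can be reshaped into the long layout shown earlier. The following is a small standard-library sketch of that reshaping; `wide_to_long` is a hypothetical helper name, and in pandas the same idea is usually expressed with `DataFrame.melt`.

```python
def wide_to_long(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows


# The wide table above, one dict per row.
wide = [
    {"date": "2024-01-01", "New York": 3.2, "San Diego": 15.8},
    {"date": "2024-01-02", "New York": 2.4, "San Diego": 16.1},
]

tidy = wide_to_long(wide, id_field="date", var_name="city", value_name="temperature_c")
for row in tidy:
    print(row)
```

After reshaping, the city is a value in its own column, so adding a third city adds rows rather than changing the table's shape.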
date | New York Temp | San Diego Temp | New York Humidity | San Diego Humidity |
---|---|---|---|---|
2024-01-01 | 3.2 | 15.8 | 65% | 50% |
2024-01-02 | 2.4 | 16.1 | 70% | 55% |
Why it’s bad:
- Column headers encode two variables at once (`city`, `variable`).

date | city | temperature_c | comment |
---|---|---|---|
2024-01-01 | New York | 3.2 | Cold morning |
2024-01-01 | New York | 3.2 | Windy |
2024-01-01 | San Diego | 15.8 | Warm and sunny |
Why it’s bad:
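Repeated observations like the two New York rows above can be detected by grouping on the columns that should identify a row. The sketch below assumes `date` and `city` form that key and, as one possible repair only, joins the comments into a single row; whether merging is appropriate depends on what the duplicates mean in your data.

```python
from collections import defaultdict

rows = [
    {"date": "2024-01-01", "city": "New York", "temperature_c": 3.2, "comment": "Cold morning"},
    {"date": "2024-01-01", "city": "New York", "temperature_c": 3.2, "comment": "Windy"},
    {"date": "2024-01-01", "city": "San Diego", "temperature_c": 15.8, "comment": "Warm and sunny"},
]


def group_by_key(rows, key_fields):
    """Group rows that share the same values in key_fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[field] for field in key_fields)].append(row)
    return groups


groups = group_by_key(rows, ("date", "city"))

# Keys that occur more than once point at repeated observations.
duplicate_keys = [key for key, group in groups.items() if len(group) > 1]

# Possible repair: keep the first row of each group and join its comments.
# This assumes the measured values are identical within a group.
merged = [
    {**group[0], "comment": "; ".join(row["comment"] for row in group)}
    for group in groups.values()
]

print(duplicate_keys)
for row in merged:
    print(row)
```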
variable | New York | San Diego |
---|---|---|
2024-01-01 | 3.2 | 15.8 |
2024-01-02 | 2.4 | 16.1 |
Why it’s bad:
date | location_temp |
---|---|
2024-01-01 | New York:3.2 |
2024-01-01 | San Diego:15.8 |
2024-01-02 | New York:2.4 |
Why it’s bad:
- The `location_temp` column combines two variables.
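A combined column like this can be split back into one column per variable. This is a minimal sketch that assumes the city and temperature are always separated by the last colon in the cell; `split_location_temp` is an illustrative name.

```python
def split_location_temp(row):
    """Split 'City:temperature' into separate city and temperature_c fields."""
    # rpartition splits on the last colon, so a colon inside a city name is safe.
    city, _, temperature = row["location_temp"].rpartition(":")
    return {
        "date": row["date"],
        "city": city.strip(),
        "temperature_c": float(temperature),
    }


combined = [
    {"date": "2024-01-01", "location_temp": "New York:3.2"},
    {"date": "2024-01-01", "location_temp": "San Diego:15.8"},
    {"date": "2024-01-02", "location_temp": "New York:2.4"},
]

separated = [split_location_temp(row) for row in combined]
for row in separated:
    print(row)
```

With the temperature stored as a number in its own column, filtering and aggregation no longer require string parsing.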