[{"data":1,"prerenderedAt":703},["ShallowReactive",2],{"content-query-4i5XsAwiiy":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"heading":10,"prompt":11,"tags":15,"files":17,"nav":17,"presets":18,"gallery":36,"body":38,"_type":696,"_id":697,"_source":698,"_file":699,"_stem":700,"_extension":701,"sitemap":702},"/tools/pca","tools",false,"","PCA Calculator for Excel & CSV","Run principal component analysis online from Excel or CSV data. Reduce dimensions, visualize clusters, and inspect loadings with AI.","PCA",{"prefix":12,"label":13,"placeholder":14},"Run a PCA","Describe the PCA you want to run","e.g. PCA of country development indicators, color by income group, show biplot and scree plot",[16],"statistics",true,[19,25,30],{"label":20,"prompt":21,"dataset_url":22,"dataset_title":23,"dataset_citation":24},"Country development indicators","PCA of country development indicators: log GDP per capita, life expectancy, CO2 emissions, education index; standardize all variables; biplot colored by world region; scree plot; list top loadings for PC1 and PC2","https://ourworldindata.org/grapher/life-expectancy-vs-gdp-per-capita.csv","Life expectancy vs. GDP per capita","Our World in Data",
{"label":26,"prompt":27,"dataset_url":28,"dataset_title":29,"dataset_citation":24},"Energy mix by country","PCA on share of electricity from fossil fuels, nuclear, solar, wind, hydro by country; standardize; biplot colored by continent; show which energy sources drive PC1 and PC2; scree plot with 80% variance line","https://ourworldindata.org/grapher/share-of-electricity-production-by-source.csv","Share of electricity production by source",{"label":31,"prompt":32,"dataset_url":33,"dataset_title":34,"dataset_citation":35},"World Bank economic indicators","PCA on GDP growth, inflation, trade openness, government expenditure, and current account balance by country; standardize; biplot colored by income group; identify outlier countries; show loadings heatmap","https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.KD.ZG?downloadformat=excel","GDP growth (annual %)","World Bank",[37],"/img/tools/pca.png",{"type":39,"children":40,"toc":686},"root",[41,50,69,102,107,113,171,177,365,371,479,485,539,545,574,580,611,628,652,662],{"type":42,"tag":43,"props":44,"children":46},"element","h2",{"id":45},"what-is-pca",[47],{"type":48,"value":49},"text","What Is PCA?",{"type":42,"tag":51,"props":52,"children":53},"p",{},[54,60,62,67],{"type":42,"tag":55,"props":56,"children":57},"strong",{},[58],{"type":48,"value":59},"Principal Component Analysis (PCA)",{"type":48,"value":61}," is a dimensionality reduction technique that transforms a dataset with many correlated variables into a smaller set of ",{"type":42,"tag":55,"props":63,"children":64},{},[65],{"type":48,"value":66},"uncorrelated principal components",{"type":48,"value":68}," that capture the maximum possible variance. The first principal component (PC1) is the direction in the data that explains the most variance; the second (PC2) is perpendicular to PC1 and explains the next most variance; and so on. The result is a new coordinate system where each axis is a linear combination of the original variables, ordered by how much information they carry.",
{"type":42,"tag":51,"props":70,"children":71},{},[72,74,79,81,86,88,93,95,100],{"type":48,"value":73},"The main uses of PCA are ",{"type":42,"tag":55,"props":75,"children":76},{},[77],{"type":48,"value":78},"visualization",{"type":48,"value":80},", ",{"type":42,"tag":55,"props":82,"children":83},{},[84],{"type":48,"value":85},"noise reduction",{"type":48,"value":87},", and ",{"type":42,"tag":55,"props":89,"children":90},{},[91],{"type":48,"value":92},"feature engineering",{"type":48,"value":94},". For visualization: if you have 10 measurements per country (GDP, emissions, life expectancy, education, etc.), you cannot plot them all at once. PCA compresses them to 2 or 3 components that you can scatter-plot, and the resulting clusters reveal which countries are similar across all dimensions simultaneously. The ",{"type":42,"tag":55,"props":96,"children":97},{},[98],{"type":48,"value":99},"biplot",{"type":48,"value":101}," overlays loading arrows on this scatter — an arrow pointing right along PC1 means that variable contributes positively to the first component, and its length shows how much. For noise reduction: keeping only the top few components that explain 80–90% of the variance removes low-signal dimensions. For feature engineering: PCA components can replace the original variables as inputs to a regression or clustering model.",{"type":42,"tag":51,"props":103,"children":104},{},[105],{"type":48,"value":106},"PCA is used in genomics (reducing thousands of gene expression values to a handful of components before clustering samples), economics (building composite development indices from many indicators), neuroscience (finding dominant patterns in brain activity recordings), and computer vision (eigenfaces — representing faces as combinations of a small number of prototypical face patterns). In all cases, the goal is the same: find a compact representation that captures most of the structure in the data.",
{"type":42,"tag":43,"props":108,"children":110},{"id":109},"how-it-works",[111],{"type":48,"value":112},"How It Works",{"type":42,"tag":114,"props":115,"children":116},"ol",{},[117,128,144],{"type":42,"tag":118,"props":119,"children":120},"li",{},[121,126],{"type":42,"tag":55,"props":122,"children":123},{},[124],{"type":48,"value":125},"Upload your data",{"type":48,"value":127}," — provide a CSV or Excel file with multiple numeric columns. One row per observation (country, patient, sample, etc.). The AI will standardize the variables automatically.",{"type":42,"tag":118,"props":129,"children":130},{},[131,136,138],{"type":42,"tag":55,"props":132,"children":133},{},[134],{"type":48,"value":135},"Describe the analysis",{"type":48,"value":137}," — e.g. ",{"type":42,"tag":139,"props":140,"children":141},"em",{},[142],{"type":48,"value":143},"\"PCA on all numeric columns, color observations by region, show biplot and scree plot, list the top 3 loadings for PC1\"",{"type":42,"tag":118,"props":145,"children":146},{},[147,152,154,161,163,169],{"type":42,"tag":55,"props":148,"children":149},{},[150],{"type":48,"value":151},"Get full results",{"type":48,"value":153}," — the AI writes Python code using ",{"type":42,"tag":155,"props":156,"children":158},"a",{"href":157},"https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html",[159],{"type":48,"value":160},"scikit-learn",{"type":48,"value":162}," for PCA and ",{"type":42,"tag":155,"props":164,"children":166},{"href":165},"https://plotly.com/python/",[167],{"type":48,"value":168},"Plotly",{"type":48,"value":170}," to produce the biplot, scree plot, and loadings table.",{"type":42,"tag":43,"props":172,"children":174},{"id":173},"interpreting-the-results",[175],{"type":48,"value":176},"Interpreting the Results",
{"type":42,"tag":178,"props":179,"children":180},"table",{},[181,200],{"type":42,"tag":182,"props":183,"children":184},"thead",{},[185],{"type":42,"tag":186,"props":187,"children":188},"tr",{},[189,195],{"type":42,"tag":190,"props":191,"children":192},"th",{},[193],{"type":48,"value":194},"Output",{"type":42,"tag":190,"props":196,"children":197},{},[198],{"type":48,"value":199},"What it means",{"type":42,"tag":201,"props":202,"children":203},"tbody",{},[204,221,237,253,269,285,301,317,333,349],{"type":42,"tag":186,"props":205,"children":206},{},[207,216],{"type":42,"tag":208,"props":209,"children":210},"td",{},[211],{"type":42,"tag":55,"props":212,"children":213},{},[214],{"type":48,"value":215},"PC1, PC2 scatter (biplot)",{"type":42,"tag":208,"props":217,"children":218},{},[219],{"type":48,"value":220},"Each point is an observation projected onto the two most important directions",{"type":42,"tag":186,"props":222,"children":223},{},[224,232],{"type":42,"tag":208,"props":225,"children":226},{},[227],{"type":42,"tag":55,"props":228,"children":229},{},[230],{"type":48,"value":231},"Clusters in biplot",{"type":42,"tag":208,"props":233,"children":234},{},[235],{"type":48,"value":236},"Observations that are similar across all original variables",{"type":42,"tag":186,"props":238,"children":239},{},[240,248],{"type":42,"tag":208,"props":241,"children":242},{},[243],{"type":42,"tag":55,"props":244,"children":245},{},[246],{"type":48,"value":247},"Loading arrow",{"type":42,"tag":208,"props":249,"children":250},{},[251],{"type":48,"value":252},"How much a variable contributes to that principal component",{"type":42,"tag":186,"props":254,"children":255},{},[256,264],{"type":42,"tag":208,"props":257,"children":258},{},[259],{"type":42,"tag":55,"props":260,"children":261},{},[262],{"type":48,"value":263},"Long arrow parallel to PC1",{"type":42,"tag":208,"props":265,"children":266},{},[267],{"type":48,"value":268},"This variable drives the main dimension of variation",
{"type":42,"tag":186,"props":270,"children":271},{},[272,280],{"type":42,"tag":208,"props":273,"children":274},{},[275],{"type":42,"tag":55,"props":276,"children":277},{},[278],{"type":48,"value":279},"Short arrow",{"type":42,"tag":208,"props":281,"children":282},{},[283],{"type":48,"value":284},"Variable contributes little to the top components",{"type":42,"tag":186,"props":286,"children":287},{},[288,296],{"type":42,"tag":208,"props":289,"children":290},{},[291],{"type":42,"tag":55,"props":292,"children":293},{},[294],{"type":48,"value":295},"Two arrows pointing same direction",{"type":42,"tag":208,"props":297,"children":298},{},[299],{"type":48,"value":300},"Those variables are positively correlated",{"type":42,"tag":186,"props":302,"children":303},{},[304,312],{"type":42,"tag":208,"props":305,"children":306},{},[307],{"type":42,"tag":55,"props":308,"children":309},{},[310],{"type":48,"value":311},"Two arrows pointing opposite directions",{"type":42,"tag":208,"props":313,"children":314},{},[315],{"type":48,"value":316},"Those variables are negatively correlated",{"type":42,"tag":186,"props":318,"children":319},{},[320,328],{"type":42,"tag":208,"props":321,"children":322},{},[323],{"type":42,"tag":55,"props":324,"children":325},{},[326],{"type":48,"value":327},"Scree plot bar height",{"type":42,"tag":208,"props":329,"children":330},{},[331],{"type":48,"value":332},"% of total variance explained by each component",{"type":42,"tag":186,"props":334,"children":335},{},[336,344],{"type":42,"tag":208,"props":337,"children":338},{},[339],{"type":42,"tag":55,"props":340,"children":341},{},[342],{"type":48,"value":343},"Cumulative variance line",{"type":42,"tag":208,"props":345,"children":346},{},[347],{"type":48,"value":348},"How many components are needed to explain 80% / 90% of variance",
{"type":42,"tag":186,"props":350,"children":351},{},[352,360],{"type":42,"tag":208,"props":353,"children":354},{},[355],{"type":42,"tag":55,"props":356,"children":357},{},[358],{"type":48,"value":359},"Elbow in scree plot",{"type":42,"tag":208,"props":361,"children":362},{},[363],{"type":48,"value":364},"Natural cutoff — components after the elbow explain little additional variance",{"type":42,"tag":43,"props":366,"children":368},{"id":367},"example-prompts",[369],{"type":48,"value":370},"Example Prompts",{"type":42,"tag":178,"props":372,"children":373},{},[374,390],{"type":42,"tag":182,"props":375,"children":376},{},[377],{"type":42,"tag":186,"props":378,"children":379},{},[380,385],{"type":42,"tag":190,"props":381,"children":382},{},[383],{"type":48,"value":384},"Scenario",{"type":42,"tag":190,"props":386,"children":387},{},[388],{"type":48,"value":389},"What to type",{"type":42,"tag":201,"props":391,"children":392},{},[393,411,428,445,462],{"type":42,"tag":186,"props":394,"children":395},{},[396,401],{"type":42,"tag":208,"props":397,"children":398},{},[399],{"type":48,"value":400},"Country comparison",{"type":42,"tag":208,"props":402,"children":403},{},[404],{"type":42,"tag":405,"props":406,"children":408},"code",{"className":407},[],[409],{"type":48,"value":410},"PCA of GDP, life expectancy, CO2, education by country; biplot colored by continent",{"type":42,"tag":186,"props":412,"children":413},{},[414,419],{"type":42,"tag":208,"props":415,"children":416},{},[417],{"type":48,"value":418},"Genomics",{"type":42,"tag":208,"props":420,"children":421},{},[422],{"type":42,"tag":405,"props":423,"children":425},{"className":424},[],[426],{"type":48,"value":427},"PCA of gene expression data; color samples by tissue type; show top 20 gene loadings",{"type":42,"tag":186,"props":429,"children":430},{},[431,436],{"type":42,"tag":208,"props":432,"children":433},{},[434],{"type":48,"value":435},"Survey data",
{"type":42,"tag":208,"props":437,"children":438},{},[439],{"type":42,"tag":405,"props":440,"children":442},{"className":441},[],[443],{"type":48,"value":444},"PCA of all Likert-scale survey responses; scree plot; name components by top loadings",{"type":42,"tag":186,"props":446,"children":447},{},[448,453],{"type":42,"tag":208,"props":449,"children":450},{},[451],{"type":48,"value":452},"Feature reduction",{"type":42,"tag":208,"props":454,"children":455},{},[456],{"type":42,"tag":405,"props":457,"children":459},{"className":458},[],[460],{"type":48,"value":461},"PCA on all numeric features, keep enough components for 90% variance, show loading heatmap",{"type":42,"tag":186,"props":463,"children":464},{},[465,470],{"type":42,"tag":208,"props":466,"children":467},{},[468],{"type":48,"value":469},"Time series",{"type":42,"tag":208,"props":471,"children":472},{},[473],{"type":42,"tag":405,"props":474,"children":476},{"className":475},[],[477],{"type":48,"value":478},"PCA of monthly economic indicators by country, trace how countries moved over 20 years",{"type":42,"tag":43,"props":480,"children":482},{"id":481},"assumptions-to-check",[483],{"type":48,"value":484},"Assumptions to Check",{"type":42,"tag":486,"props":487,"children":488},"ul",{},[489,499,509,519,529],{"type":42,"tag":118,"props":490,"children":491},{},[492,497],{"type":42,"tag":55,"props":493,"children":494},{},[495],{"type":48,"value":496},"Numeric variables",{"type":48,"value":498}," — PCA requires all input columns to be numeric; categorical variables must be encoded or excluded",{"type":42,"tag":118,"props":500,"children":501},{},[502,507],{"type":42,"tag":55,"props":503,"children":504},{},[505],{"type":48,"value":506},"Standardization",{"type":48,"value":508}," — variables on different scales (e.g. GDP in thousands vs. rate in 0–1) must be standardized (mean=0, std=1) before PCA; the AI does this automatically unless you ask otherwise",
{"type":42,"tag":118,"props":510,"children":511},{},[512,517],{"type":42,"tag":55,"props":513,"children":514},{},[515],{"type":48,"value":516},"Linear relationships",{"type":48,"value":518}," — PCA captures linear structure; if your variables have non-linear relationships, consider kernel PCA or UMAP",{"type":42,"tag":118,"props":520,"children":521},{},[522,527],{"type":42,"tag":55,"props":523,"children":524},{},[525],{"type":48,"value":526},"Sufficient sample size",{"type":48,"value":528}," — as a rule of thumb, at least 5–10 observations per variable; PCA on a 200-variable dataset with 20 rows is unreliable",{"type":42,"tag":118,"props":530,"children":531},{},[532,537],{"type":42,"tag":55,"props":533,"children":534},{},[535],{"type":48,"value":536},"No extreme outliers",{"type":48,"value":538}," — a few very extreme observations can dominate the first principal component; ask the AI to check for outliers before running PCA",{"type":42,"tag":43,"props":540,"children":542},{"id":541},"related-tools",[543],{"type":48,"value":544},"Related Tools",{"type":42,"tag":51,"props":546,"children":547},{},[548,550,556,558,564,566,572],{"type":48,"value":549},"Use the ",{"type":42,"tag":155,"props":551,"children":553},{"href":552},"/tools/pair-plot",[554],{"type":48,"value":555},"Pair Plot Generator",{"type":48,"value":557}," to visually inspect pairwise correlations between variables before running PCA — heavily correlated variable pairs are where PCA adds the most value. Use the ",{"type":42,"tag":155,"props":559,"children":561},{"href":560},"/tools/exploratory-data-analysis-ai",[562],{"type":48,"value":563},"Exploratory Data Analysis tool",{"type":48,"value":565}," to get summary statistics and a correlation matrix to understand the data before PCA. Use the ",
{"type":42,"tag":155,"props":567,"children":569},{"href":568},"/tools/ai-heatmap",[570],{"type":48,"value":571},"AI Heatmap Generator",{"type":48,"value":573}," to visualize the full loadings matrix (all variables × all components) as a color-coded grid.",{"type":42,"tag":43,"props":575,"children":577},{"id":576},"frequently-asked-questions",[578],{"type":48,"value":579},"Frequently Asked Questions",{"type":42,"tag":51,"props":581,"children":582},{},[583,588,590,595,597,602,604,609],{"type":42,"tag":55,"props":584,"children":585},{},[586],{"type":48,"value":587},"How many principal components should I keep?",{"type":48,"value":589},"\nThe standard approaches: (1) keep components until you've explained ",{"type":42,"tag":55,"props":591,"children":592},{},[593],{"type":48,"value":594},"80–90% of the variance",{"type":48,"value":596}," (read off the cumulative scree plot), (2) keep components before the ",{"type":42,"tag":55,"props":598,"children":599},{},[600],{"type":48,"value":601},"elbow",{"type":48,"value":603}," in the scree plot where the curve flattens, or (3) keep components with ",{"type":42,"tag":55,"props":605,"children":606},{},[607],{"type":48,"value":608},"eigenvalue > 1",{"type":48,"value":610}," (Kaiser criterion). For visualization, the first 2 components are plotted regardless of variance explained — just note how much of the total variance the biplot represents.",{"type":42,"tag":51,"props":612,"children":613},{},[614,619,621,626],{"type":42,"tag":55,"props":615,"children":616},{},[617],{"type":48,"value":618},"Do I need to standardize my variables before PCA?",{"type":48,"value":620},"\nYes, almost always. If variables have different scales (e.g. GDP in billions vs. a 0–100 index), the high-variance variable will dominate PC1 purely because of its scale, not because it's more important. Standardizing (subtracting mean, dividing by std) puts all variables on equal footing. The AI standardizes by default; mention ",
{"type":42,"tag":139,"props":622,"children":623},{},[624],{"type":48,"value":625},"\"use raw values without standardizing\"",{"type":48,"value":627}," only if your variables are already on the same scale and you want variance to drive the components.",{"type":42,"tag":51,"props":629,"children":630},{},[631,636,638,643,645,650],{"type":42,"tag":55,"props":632,"children":633},{},[634],{"type":48,"value":635},"What is the difference between PCA and t-SNE or UMAP?",{"type":48,"value":637},"\nPCA is a ",{"type":42,"tag":55,"props":639,"children":640},{},[641],{"type":48,"value":642},"linear",{"type":48,"value":644}," method — it finds straight-line combinations of variables. It's fast, interpretable (you can read loadings), and preserves global structure (distances between distant clusters). t-SNE and UMAP are ",{"type":42,"tag":55,"props":646,"children":647},{},[648],{"type":48,"value":649},"non-linear",{"type":48,"value":651}," — they can unroll complex curved manifolds and reveal local cluster structure that PCA misses, but they distort global distances and their axes have no interpretable meaning. Start with PCA; switch to t-SNE/UMAP if PCA biplots show overlapping clusters that you suspect are actually distinct.",{"type":42,"tag":51,"props":653,"children":654},{},[655,660],{"type":42,"tag":55,"props":656,"children":657},{},[658],{"type":48,"value":659},"My PC1 explains 90%+ of variance — is that normal?",{"type":48,"value":661},"\nIt depends on the data. For highly correlated variables (like development indicators that all trend together), one component can dominate. This isn't wrong — it means the data mostly varies along one direction. Check the loadings: if all variables load strongly on PC1 in the same direction, it's a \"general level\" component (richer countries score higher on everything). PC2 then captures deviations from this pattern (e.g. high CO₂ relative to their income level).",
{"type":42,"tag":51,"props":663,"children":664},{},[665,670,672,677,679,684],{"type":42,"tag":55,"props":666,"children":667},{},[668],{"type":48,"value":669},"Can I use PCA scores as inputs to another model?",{"type":48,"value":671},"\nYes — this is called ",{"type":42,"tag":55,"props":673,"children":674},{},[675],{"type":48,"value":676},"PCA preprocessing",{"type":48,"value":678}," or ",{"type":42,"tag":55,"props":680,"children":681},{},[682],{"type":48,"value":683},"PCA + regression",{"type":48,"value":685},". After running PCA, the AI can output the component scores as new columns that you can use as features in a regression, classification, or clustering model. This reduces multicollinearity (PCA components are orthogonal) and can improve model stability when you have many correlated predictors.",{"title":7,"searchDepth":687,"depth":687,"links":688},2,[689,690,691,692,693,694,695],{"id":45,"depth":687,"text":49},{"id":109,"depth":687,"text":112},{"id":173,"depth":687,"text":176},{"id":367,"depth":687,"text":370},{"id":481,"depth":687,"text":484},{"id":541,"depth":687,"text":544},{"id":576,"depth":687,"text":579},"markdown","content:tools:030.pca.md","content","tools/030.pca.md","tools/030.pca","md",{"loc":4},1775502468196]