🧠📊 Data Analyst Interview Questions with Answers: Part-4

31. What is the ETL process? 🔄
ETL stands for Extract, Transform, Load.
- Extract: Pulling data from sources (databases, APIs, files) 📤
- Transform: Cleaning, formatting, and applying business logic 🛠️
- Load: Saving the transformed data into a data warehouse or other target system 📥
It helps consolidate data for reporting and analysis.

32. What are some challenges in data cleaning? 🚫
- Missing values 🤷
- Duplicates 👯
- Inconsistent formats (e.g., date formats, units) 🧩
- Outliers 📈
- Incorrect or incomplete data ❌
- Merging data from multiple sources 🤝
Cleaning is time-consuming but critical for accurate analysis.

33. What is data wrangling? 🧹
Also known as data munging, it’s the process of transforming raw data into a usable format. It includes:
- Cleaning ✨
- Reshaping 📐
- Combining datasets 🔗
- Dealing with missing values or outliers 🗑️

34. How do you handle missing data? ❓
- Remove rows (if only a few are affected) or columns (if missingness is high) ✂️
- Imputation (mean, median, mode) 🔢
- Forward/backward fill (suited to ordered data such as time series) ➡️⬅️
- Using models (KNN, regression) 🤖
Always analyze why data is missing before deciding.

35. What is data normalization in Python? ⚖️
Normalization scales numerical data to a common range (e.g., 0 to 1). A common method is min-max scaling with scikit-learn:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()  # default output range is 0 to 1
normalized_data = scaler.fit_transform(data)  # data: 2D array or DataFrame of numeric columns
Useful for scale-sensitive ML models (e.g., KNN), where features with larger values would otherwise dominate.

36. Difference between .loc and .iloc in Pandas 📍🔢
- .loc[]: Label-based indexing
df.loc[2]  # Row with index label 2
df.loc[:, 'age']  # All rows, 'age' column
- .iloc[]: Integer position-based indexing
df.iloc[2]  # Third row
df.iloc[:, 1]  # All rows, second column
Slices with .loc include the end label; slices with .iloc exclude the end position.

37. How do you merge dataframes in Pandas? 🤝
Using merge() or concat():
pd.merge(df1, df2, on='id', how='inner')  # SQL-style join on the 'id' column
pd.concat([df1, df2], axis=0)  # Stack rows
Choose keys and join types (inner, left, right, outer) based on the data structure and which rows must be kept.

38. Explain groupby() in Pandas 📊
Used to group data and apply aggregation.
df.groupby('category')['sales'].sum()  # Total sales per category
Steps:
1. Split data into groups 🧩
2. Apply function (sum, mean, count) 🧮
3. Combine results 📈

39. What are NumPy arrays? ➕
N-dimensional arrays used for fast numeric computation. They are faster than Python lists for numeric work and support vectorized operations.
import numpy as np
a = np.array([1, 2, 3])

40. How do you handle large datasets efficiently? 🚀
- Use chunking (read_csv(..., chunksize=10000))
- Use NumPy for faster in-memory ops, or Dask for parallel and larger-than-memory work
- Drop unnecessary columns early (e.g., usecols in read_csv)
- Use vectorized operations instead of loops
- Work with cloud data tools (BigQuery, Spark)

💬 Tap ❤️ if this was helpful!
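
Bonus for question 40: a minimal sketch of how chunking, early column filtering, and vectorized aggregation might fit together. The sample data, the column names (category, sales, notes), and the helper total_sales_by_category are hypothetical and only there for illustration. An in-memory string stands in for a large CSV file.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a large CSV file on disk.
csv_text = "category,sales,notes\n" + "".join(
    f"{'A' if i % 2 == 0 else 'B'},{i},x\n" for i in range(10)
)

def total_sales_by_category(source, chunksize=4):
    """Sum 'sales' per 'category' while holding only one chunk in memory."""
    partials = []
    # usecols skips the unneeded 'notes' column at read time;
    # chunksize makes read_csv yield DataFrames of at most that many rows.
    reader = pd.read_csv(source, usecols=["category", "sales"], chunksize=chunksize)
    for chunk in reader:
        # Vectorized aggregation per chunk, no Python loop over rows.
        partials.append(chunk.groupby("category")["sales"].sum())
    # Combine the per-chunk sums into one total per category.
    return pd.concat(partials).groupby(level=0).sum()

result = total_sales_by_category(io.StringIO(csv_text))
print(result)  # A: 20, B: 25
```

With a real file, you would pass its path instead of the StringIO object and pick a chunk size that fits comfortably in memory. This pattern works for aggregations that can be combined from partial results (sum, count, min, max); a mean would need sums and counts tracked separately.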