The science of data collection, data manipulation, and data analysis to produce a result that supports decision-making.
There is no decision-making without a result ("evidence"). This is the basis of evidence-based medicine and evidence-based practice.
"Drug A is better than Drug B to treat disease X." → Why? How did you know? There must be a prior, valid, sound study (a manuscript) backing the claim.
A value with no variability.
e.g. number of eyes, ears, fingers. Not useful for research — nothing varies to study.
The thing research is built on. We ask: ① Is there variability? ② What is the pattern? ③ What are the causes? ④ What are the consequences?
Variables expressed in the form of a number.
e.g. age, height, BMI, blood glucose level, cholesterol level, insulin level.
We ask the patient and the answer is usually a "word" description.
e.g. blood group, diseases.
Accepts the decimal number.
e.g. age, weight, height, blood pressure.
Does not accept decimals — it is the whole value. "Number of…": beds in a hospital, family members, drugs received, cigarettes smoked per day, hospital admissions, RBCs.
Cannot be put in order.
e.g. gender, residence, job title, blood group, disease.
Responses can be put in a definite / documented order.
e.g. grade (A+, A, B+, B), degree of burns, degree of pain, cancer stage.
In research, everything is modifiable except the methodology / analysis; that part shouldn't be changed (modifying the methodology is an error and leads to rejection).
Question: how do we present data in an organized way?
Data with no organization — you can't extract any information from it; it's just records. For a small sample (e.g. 5 patients, asking each 3 questions) raw data may be acceptable, but research data is usually large.
For samples of 300 and above, use professional methods of data presentation — present data in the smallest space while extracting the most information.
The most famous and oldest method. Any table is built on 3 rows and 3 columns as its basic structure. The intersection of a row and a column is called a cell.
| Names of the variable (units) | Frequency (from sample) | Percent / relative frequency |
|---|---|---|
| Categories of variable | count per category | % per category |
| Total | n | 100% |

Worked example (blood groups, n = 80):

| Blood group | Frequency | Percent |
|---|---|---|
| A | 10 | (10 × 100) / 80 |
| B | 15 | (15 × 100) / 80 |
| AB | 25 | … |
| O | 30 | … |
| Total | 80 | 100% |
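The percent column follows one formula, frequency × 100 / total, which can be sketched in Python. This is a minimal illustration; the names `counts` and `relative_frequencies` are made up for the example and do not come from the lesson.

```python
def relative_frequencies(counts):
    """Return {category: percent}, where percent = frequency * 100 / total."""
    total = sum(counts.values())
    return {category: freq * 100 / total for category, freq in counts.items()}

# Blood-group frequencies from the table (n = 80).
counts = {"A": 10, "B": 15, "AB": 25, "O": 30}
percents = relative_frequencies(counts)

for group, pct in percents.items():
    print(f"{group}: frequency {counts[group]}, percent {pct:.2f}%")
```

For a single-response variable like blood group, the percents always add up to 100%.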
Accepts only one answer; selecting one prevents selecting any other.
e.g. gender, blood group, job title.
Allows more than one answer by choice.
e.g. preferred food.
We have 100 patients; ask each about DM, asthma, and hypertension. If all 100 have all 3 diseases, the percentages are 100% + 100% + 100% = 300% — perfectly possible here.
A good title is self-explanatory and answers four questions: what? who? where? when?
"Frequency distribution table for the age (what) of 20 patients (who) in Najran main hospital (where), 2025 (when)."
Whether categories are specific depends on the variable type:
Categories are specific. e.g. gender → F, M.
Not specific. e.g. age — if you ask 1800 patients their age you can't list 1800 rows, so you convert to intervals.
18–29, 30–39, 40–49 … each band is an interval.
If the variable has a documented cut-off point, use it (e.g. measuring patient BMI or blood pressure — classify on that established basis). Otherwise choose carefully, because:
Used only to present data measured over a time period — a certain variable (y) against a time unit (x). It shows the trend of data over time.
y = rainfall (per hectare); x = time. Temperature chart: peak at 8 pm.
Used for qualitative data, in both simple and complex tables. (Complex tables add F & M side by side, with a legend / key.)
Class has 5 females (4 pass) and 6 males. "Who is best by frequency?" → answer "Female" is a mistake, because group sizes differ. "Who is best by percent?" → Male = 80%, Female = 60%. Percent is usually correct; frequency is correct only in a simple table.
Blood group A → 2, O → 50. Comparing D⁺/D⁻: A → 1 = 50%, 1 = 50%; O → 20 = 40%, 30 = 60%. Prevalence of disease is "more in" group A? → that's a mistake: 50% of A is just one person, so the percent here is misleading. Watch out for tiny samples.
Used for qualitative data in a simple table. Each category is a sector — a specific angle of the circle (total = 360°).
The higher the frequency of a category, the larger its sector; the lower the frequency, the smaller the sector.
Male angle = (5 × 360) / 20 = 90°
Female angle = (15 × 360) / 20 = 270°
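The two angles above come from the same rule, angle = frequency × 360 / total. A small Python sketch of that rule follows; the function name `sector_angle` is illustrative only.

```python
def sector_angle(frequency, total):
    """Angle (in degrees) of a pie-chart sector for one category."""
    return frequency * 360 / total

# Sample from the example: 5 males and 15 females (n = 20).
sample = {"Male": 5, "Female": 15}
total = sum(sample.values())
angles = {category: sector_angle(freq, total) for category, freq in sample.items()}

print(angles)
```

The angles of all sectors must sum to 360°, which is a quick check on the arithmetic.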
Used in correlation analysis. x = independent variable y = dependent variable.
x = exercise hrs, study hrs, BMI | y = weight loss, exam mark, blood pressure.
One representative value for the whole data. Other name: measure of central tendency. Goal: summarize the data into one typical value and help in comparison.
Comparing male vs female GPA averages (M = 3.6, F = 4.1) lets us conclude females have the higher average — even though we only had grouped percentages. The average condenses everything into one comparable value.
MR = (3 + 10) / 2 = 13 / 2 = 6.5 kg
Data 3,2,7,10,5 → MR = (2+10)/2 = 6 (reasonable).
Data 3,2,7,10,50 → MR = (2+50)/2 = 26 (biased by the outlier 50). An "extreme value / outlier" is a value out of trend, e.g. a hormonal level of 105 among 7,8,10,11,13.
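The mid-range uses only the smallest and largest values, which is why a single outlier moves it so much. A minimal Python sketch of the two data sets above (the function name `mid_range` is an illustrative choice):

```python
def mid_range(values):
    """Mid-range = (minimum + maximum) / 2."""
    return (min(values) + max(values)) / 2

clean = [3, 2, 7, 10, 5]
with_outlier = [3, 2, 7, 10, 50]  # same data, but 5 replaced by the outlier 50

print(mid_range(clean))         # uses 2 and 10
print(mid_range(with_outlier))  # uses 2 and 50, so the outlier dominates
```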
The most frequent observation.
The middle value — observations below = observations above (50% each side).
Example: 2,7,1,4,5,8,2 → ordered 1,2,2,4,5,7,8 → n = 7 → rank = (n + 1)/2 = 8/2 = 4ᵗʰ → Median = 4
Example: ordered 1,2,3,5,7,10,12,15 → n = 8 → ranks 4ᵗʰ & 5ᵗʰ = 5,7 → (5+7)/2 = Median = 6
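The two cases (odd n and even n) can be sketched in Python as below. The function is hand-written for illustration; the even-n data set is an eight-value list whose 4th and 5th ordered values are 5 and 7, matching the ranks used in the example.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at rank (n + 1) / 2.
        return ordered[mid]
    # Even n: average the values at ranks n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = [2, 7, 1, 4, 5, 8, 2]            # n = 7
even_case = [1, 2, 3, 5, 7, 10, 12, 15]     # n = 8

print(median(odd_case), median(even_case))
```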
The "measure of justice" — the balance point of the data.
3,3,4,8,10 → x̄ = (3+3+4+8+10)/5 = 5.6
20,25,13,7,2 → x̄ = (20+25+13+7+2)/5 = 13.4
Qualitative data → Mode. Quantitative data → if it has extreme values use the Median; if no extreme values use the Mean.
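The selection rule can be written as a small decision function. The sketch below uses Python's standard `statistics` module for the arithmetic; the function name `choose_average` and its string labels are assumptions made for the example.

```python
import statistics

def choose_average(data_type, has_extreme_values=False):
    """Pick the average recommended by the rule of thumb in these notes."""
    if data_type == "qualitative":
        return "mode"
    if data_type == "quantitative":
        return "median" if has_extreme_values else "mean"
    raise ValueError("data_type must be 'qualitative' or 'quantitative'")

# The two mean examples from the notes.
mean_1 = statistics.mean([3, 3, 4, 8, 10])
mean_2 = statistics.mean([20, 25, 13, 7, 2])
# Mode of the first data set: the most frequent observation.
mode_1 = statistics.mode([3, 3, 4, 8, 10])

print(mean_1, mean_2, mode_1)
print(choose_average("quantitative", has_extreme_values=True))
```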
Does an average alone describe data accurately? Not always.
Males scored 0 and 100; Females scored 40 and 60. Both have an average of 50%. But are they the same? No — the males have high variability. So an average alone is not enough; we report Average ± Dispersion.
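The same point can be checked numerically. This sketch compares the two groups using the range and the sample standard deviation (the n − 1 form computed by `statistics.stdev`, which is an assumption; the notes do not say which form they use).

```python
import statistics

males = [0, 100]
females = [40, 60]

for name, scores in (("Males", males), ("Females", females)):
    mean = statistics.mean(scores)
    spread = max(scores) - min(scores)   # range
    sd = statistics.stdev(scores)        # sample SD (n - 1 denominator)
    print(f"{name}: mean = {mean}, range = {spread}, SD = {sd:.1f}")

male_sd = statistics.stdev(males)
female_sd = statistics.stdev(females)
```

Both means are 50, but every dispersion measure is far larger for the males.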
Dispersion (variability) measures how spread out the data is around its average; these are the measures of dispersion.
3,5,7,2,10 → Range = 10 − 2 = 8 (paired with an average).
Children's weights 5,3,2,7,1 kg → Range = 7 − 1 = 6 kg.
Used with the mean, reported as x̄ ± SD.
x̄ ± SD = 60 ± 10 kg → most cluster around 60, ranging 50 to 70.
7 ± 2 → most around 7, ranging 5 to 9 (e.g. hospital stay in days).
With the Mean (x̄) → use SD. With the Median → use the Range.
3,2,1,7,5 → ordered 1,2,3,5,7 → Median = 3, Range = 7 − 1 = 6 → reported as 3 (6).
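A short Python sketch of the two reporting styles follows. The helper name `summarize` and the `skewed` flag are illustrative, and the SD is the sample SD (n − 1) from `statistics.stdev`, which is an assumption.

```python
import statistics

def summarize(values, skewed):
    """Report 'median (range)' for skewed data, 'mean ± SD' otherwise."""
    if skewed:
        med = statistics.median(values)
        data_range = max(values) - min(values)
        return f"{med} ({data_range})"
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    return f"{mean:.1f} ± {sd:.1f}"

data = [3, 2, 1, 7, 5]
print(summarize(data, skewed=True))   # median with range
print(summarize(data, skewed=False))  # mean with SD
```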
Shape applies only to quantitative data. The presence of an extreme value changes the shape, which changes how we handle the data statistically.
When you collect natural data from people — e.g. ages 1,2,3,4,5 → x̄ = 3, Median = 3 → a normal distribution.
In a normal distribution: Mode = Mean (x̄) = Median. (More than one mode → bimodal/multimodal → not normal.)
68% (x̄ ± 1 SD): 110 − 10 = 100 → 110 + 10 = 120 → 100 to 120
95% (x̄ ± 2 SD): 110 − 20 = 90 → 110 + 20 = 130 → 90 to 130
99.7% (x̄ ± 3 SD): 110 − 30 = 80 → 110 + 30 = 140 → 80 to 140
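The three intervals are x̄ ± k·SD for k = 1, 2, 3. A minimal sketch with the same numbers (x̄ = 110, SD = 10); the function name is illustrative.

```python
def sd_intervals(mean, sd):
    """Return {k: (lower, upper)} for k = 1, 2, 3 standard deviations."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# In a normal distribution these bands hold roughly 68%, 95% and 99.7% of values.
intervals = sd_intervals(mean=110, sd=10)

for k, (low, high) in intervals.items():
    print(f"mean ± {k} SD: {low} to {high}")
```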
If there's an extreme value, the data is not normal — it's skewed (not bell-shaped, not symmetric). The data has a tail; the tail's direction names the skew.
Long right tail; large extreme values at the upper limit. Most of the sample is in low values, a small portion in high.
e.g. income.
Long left tail; lower extreme values. Large sample in older age (>50), small sample in early age.
e.g. retirement age.
Normal: x̄ = Median = Mode
Right skewed: x̄ > Median > Mode (Mode < Median < x̄)
Left skewed: x̄ < Median < Mode
BMI data: x̄ = 35, Median = 20 → x̄ > Median → right skewed.
BMI data: x̄ = 26, Median = 31 → Median > x̄ → left skewed.
Difference > 10% → considerable → skewed.
Difference ≤ 10% → negligible → treat as normal.
Hormonal level x̄ = 315, Median = 370 → (370 − 315)/315 × 100 = 17.5% > 10% → considerable → skewed. Median > x̄ → left skewed.
x̄ = 40, Median = 35 → (40 − 35)/35 × 100 = 14.3% > 10% → skewed. x̄ > Median → right skewed.
x̄ = 20, Median = 19 → (20 − 19)/19 × 100 = 5.3% < 10% → negligible → normal.
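The 10% rule can be sketched as a function. One assumption is made explicit here: the worked examples divide the difference by the smaller of the two values (315, 35, 19), so the sketch does the same. The function name is hypothetical.

```python
def skew_by_10_percent_rule(mean, median, threshold=10.0):
    """Return (percent difference, verdict) using the mean-vs-median rule."""
    smaller = min(mean, median)  # assumption: divide by the smaller value
    diff_pct = abs(mean - median) / smaller * 100
    if diff_pct <= threshold:
        verdict = "normal"
    elif mean > median:
        verdict = "right skewed"
    else:
        verdict = "left skewed"
    return diff_pct, verdict

for mean, median in ((315, 370), (40, 35), (20, 19)):
    pct, verdict = skew_by_10_percent_rule(mean, median)
    print(f"mean = {mean}, median = {median}: {pct:.1f}% -> {verdict}")
```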
| Method | Type | Notes |
|---|---|---|
| ① Graphical (Histogram) | Subjective | Depends on opinion; shows normal vs skewed visually. |
| ② Comparison of x̄ and Median | Objective | The 10% rule above. |
| ③ Box-and-whisker diagram | Objective | Whiskers equal → normal; right longer → right skewed; left longer → left skewed. |
| ④ Statistical test (Kolmogorov–Smirnov, K-S test) | Statistical | Used first to check if data is normal. |
Analytical (inferential) statistics looks at relation, association, and comparison — and lets us draw conclusions about a population from a sample.
The group that the research targets (its target group).
Researching prevalence of osteoporosis among Saudi females → target population = all Saudi females.
Do we study every single one of them? No — we take a representative sample: a sub-image of the community that saves time, money, and effort (the "triad of sample").
The result from the sample is inferred onto the community (generalization).
The hypothesis of no difference / equality.
= no difference, no relation.
The researcher's hypothesis — of difference / no equality.
= there is a difference, there is a relation.
H₀: x̄ blood pressure in Drug A = x̄ blood pressure in Drug B.
H₀: prevalence of lung cancer in smokers = prevalence of lung cancer in non-smokers.
How do we know which statistical test to use? It comes down to a sequence of decisions.
Expressed in the form of averages → analysed with quantitative-data statistical methods.
e.g. compare the mean age.
Expressed in the form of frequency or percentage → analysed with qualitative-data statistical methods.
e.g. compare the prevalence of cancer between groups.
Use Mean (x̄ ± SD) → analysed by parametric statistics.
Use Median & Median Rank → analysed by non-parametric statistics.
Compare the mean BMI in nulliparous and multiparous females.
Hypothesis of equality.
The researcher's hypothesis.
H₀: x̄ blood pressure in A = x̄ blood pressure in B (no difference).
Hₐ: x̄A ≠ x̄B (not equal — there is a difference).
H₀: prevalence of cancer in smoker = prevalence in non-smoker.
Hₐ: prevalence in smoker ≠ prevalence in non-smoker.
H₀: remission rate in drug B ≤ remission rate in drug A (new drug equals the old or less).
Hₐ: remission rate in drug B > remission rate in drug A.
H₀: blood glucose level in A ≥ blood glucose in B.
Hₐ: blood glucose level in A < blood glucose in B (the new drug lowers glucose more — the researcher's hypothesis).
The accepted probability of error in the research decision — commonly 10% or 5%.
| Research says ↓ / Truth → | No difference | Difference |
|---|---|---|
| No difference | ✓ correct | Type II error (β) |
| Difference | Type I error (α) | ✓ correct |
Socio-science: α = 10% to 5%. Medical field: α = 1% to 5% (0.01–0.05). Pharmaceutical: α = 0.01% to 0.001% (needs a very small error because it's given to humans).
Covered in detail in Lesson 9.
A real difference. e.g. comparing drug A vs B, A is genuinely better — repeating the trial would show the same.
A chance difference. e.g. one time A looks better, another time B does — so the difference is not real.
The p-value measures the role of chance. It is a number from 0 to 1.
p = 0.03 → significant. p = 0.912 → insignificant. p = 0.051 → just above the line.
If p ≤ α → Significant → Reject H₀ (the difference is unlikely to be due to chance alone, so the evidence supports Hₐ).
If p > α → Insignificant → Fail to reject H₀ (not enough evidence; H₀ stands).
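The decision rule is a single comparison of p against α. A minimal sketch, assuming α = 0.05 as the default; the function name and return strings are illustrative.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule: p <= alpha is significant, otherwise insignificant."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "significant: reject H0"
    return "insignificant: fail to reject H0"

for p in (0.03, 0.912, 0.051):
    print(p, "->", decide(p))
```

Note that p = 0.051 is treated exactly like p = 0.912 under this rule: both are above α = 0.05.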
The required test is decided by four questions, in order.
One group: e.g. x̄ blood pressure of a single group. Two groups: e.g. compare BMI between men and women. More than two: e.g. compare depression score between single, married, and divorced women.
The most common. e.g. M vs F; single/married/divorced/widow.
① Matched design (to control a confounder, e.g. lung cancer & male gender). ② Pre–post assessment (e.g. BMI before and after intervention in the same people). ③ Repeated measures (same people measured repeatedly).
| Groups | Independent | Dependent |
|---|---|---|
| 2 groups | Independent t-test | Paired t-test |
| > 2 groups | One-way ANOVA | Repeated-measures ANOVA |
Compare BMI in male vs female. BMI = quantitative, normal; M & F = 2 independent groups → independent t-test.
Measure BMI in obese people before & after dietary intervention. Same people = dependent → paired t-test.
Compare BMI between Abha, Albaha, and Najran. Quantitative, normal, >2 independent groups → one-way ANOVA.
Measure BMI in pregnant women in the 1st, 2nd, and 3rd trimester. Quantitative, normal, >2 dependent → repeated-measures ANOVA.
| Groups | Independent | Dependent |
|---|---|---|
| 2 groups | Mann–Whitney test | Wilcoxon test |
| > 2 groups | Kruskal–Wallis test | Friedman test |

For qualitative data:

| Relation | Test |
|---|---|
| Independent | Pearson's Chi-Square test (χ²) — risk factor & outcome |
| Dependent | McNemar test |
Nora's study: relation between blood group and risk of obesity. Type = qualitative, independent → Pearson Chi-Square test.
Lana's study: high-knowledge prevalence before and after a course (high/low). Qualitative, two groups, dependent → McNemar test.
Compares the mean Hb level before and after iron for 2 months. ① Quantitative, normal ② 2 groups ③ dependent → paired t-test.
Assesses patient awareness score about HBV before, 1 month after, and 2 months after a health-education program, expressing the median per phase. ① Quantitative, skewed (median) ② >2 groups ③ dependent → Friedman test.
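The three tables can be folded into one lookup function. This is a sketch of the decision sequence as presented in these notes, not a complete guide to test selection; the function name, argument names, and the handling of the one-group case (an error, since no one-group test is listed here) are assumptions.

```python
def choose_test(data_type, groups, dependent, normal=True):
    """Pick a test from the tables: data type, distribution, groups, relation."""
    if data_type == "qualitative":
        return "McNemar test" if dependent else "Pearson's chi-square test"
    if groups < 2:
        raise ValueError("the tables in these notes cover 2 or more groups")
    two_groups = groups == 2
    if normal:  # parametric tests
        table = {
            (True, False): "independent t-test",
            (True, True): "paired t-test",
            (False, False): "one-way ANOVA",
            (False, True): "repeated-measures ANOVA",
        }
    else:       # non-parametric tests for skewed data
        table = {
            (True, False): "Mann-Whitney test",
            (True, True): "Wilcoxon test",
            (False, False): "Kruskal-Wallis test",
            (False, True): "Friedman test",
        }
    return table[(two_groups, dependent)]

# HBV awareness score: skewed, 3 phases, same patients.
print(choose_test("quantitative", groups=3, dependent=True, normal=False))
```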
Correlation (relation / association) analysis tests for the correlation between two variables — independent (x) and dependent (y) — for the same individual. (Correlation and Regression are the two analyses of relation.)
+ve (direct): both x & y change in the same direction (x↑ y↑, x↓ y↓).
−ve (indirect/inverse): both change in opposite directions (x↑ y↓, x↓ y↑).
Weak · Intermediate · Strong. Combined with nature gives six categories: +ve weak/intermediate/strong and −ve weak/intermediate/strong.
x = study hours (independent), y = exam score (dependent). An upward cloud = +ve / direct correlation (if x↑ then y↑).
Nature is read from the direction; strength from how tightly the points hug the line: (a) weak, (b) intermediate, (c) strong. The best/strongest is a sharp upward band.
The correlation coefficient runs from −1 to +1. The sign tells the nature; the value tells the strength.
| Strength | Negative | Positive |
|---|---|---|
| Weak | −0.25 < r ≤ −0.01 | 0.01 ≤ r < 0.25 |
| Intermediate | −0.75 < r ≤ −0.25 | 0.25 ≤ r < 0.75 |
| Strong | −1 < r ≤ −0.75 | 0.75 ≤ r < 1 |
| Special | 0 = no correlation · ±1 = perfect | |
r = 0.23 → +ve weak: positive weak correlation between x & y.
r = −0.81 → −ve strong: negative strong correlation.
r = 0.28 → +ve intermediate.
r = −1.2 → invalid (cannot be below −1).
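Reading r is a two-step lookup: the sign gives the nature and the absolute value gives the strength. A minimal sketch using the cut-offs 0.25 and 0.75 from the table; the function name is illustrative, and values between 0 and 0.01 are treated as weak here, which is an assumption.

```python
def describe_r(r):
    """Describe a correlation coefficient by nature (sign) and strength (size)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no correlation"
    nature = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        return f"perfect {nature}"
    if size < 0.25:
        strength = "weak"
    elif size < 0.75:
        strength = "intermediate"
    else:
        strength = "strong"
    return f"{nature} {strength}"

for r in (0.23, -0.81, 0.28):
    print(r, "->", describe_r(r))
```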
Use when both variables are quantitative and normally distributed.
Use when: both variables skewed; one skewed & one normal; both qualitative ordinal; or mixed (one quantitative, one qualitative ordinal).
One risk (x) → one disease (y).
More than one risk (x₁, x₂, x₃) → one disease (y). Expressed as a matrix.
In a matrix the diagonal line is all 1's (each variable correlated with itself), and values mirror across it. Don't read the r-values alone — check the p-value first to know whether the relation is real.
x₁–y₁: r = 0.01, p = 0.02 (<0.05) → significant.
x₂–y₂: r = 0.99, p = 0.214 (>0.05) → insignificant (the high r is not trustworthy without significance).
Cholesterol level: r = 0.26 (+ve intermediate). BMI: r = 0.74 (+ve intermediate). Both land in "intermediate," yet their effects differ — and we still can't predict blood pressure at, say, BMI = 35.
e.g. blood pressure (mmHg) & BMI.
e.g. BMI & hypertensive (yes/no). (See Lesson 12.)
The value of y when x = 0. e.g. birth weight; or salary on joining work (experience = 0). Any change in x doesn't change a.
The amount of change in y per one-unit change in x. The bigger b, the stronger the effect.
Neonatal age (x) & weight (y): a = birth weight. Work experience (x) & salary (y): a = starting salary.
Cholesterol & blood pressure, if b = 5 → one unit (1 gram) rise in cholesterol raises blood pressure by 5 mmHg. Study hours & exam score: 1 unit (1 hour) ↑ raises the mark by 5.
Weight (x) & cholesterol (y), expected cholesterol = 3 + 2 × weight. For weight = 75 kg → 3 + 2×75 = 153 mg/dl. (Here a = 3 is the theoretical cholesterol when weight = 0; the predictor x can be quantitative or qualitative.)
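Prediction with a regression equation is plain substitution into ŷ = a + b·x. A minimal sketch with the numbers above (a = 3, b = 2); the function name `predict` is illustrative.

```python
def predict(a, b, x):
    """Predicted y from the regression line: y_hat = a + b * x."""
    return a + b * x

# Expected cholesterol = 3 + 2 * weight.
print(predict(a=3, b=2, x=75))  # weight = 75 kg
print(predict(a=3, b=2, x=0))   # x = 0 returns the constant a
```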
Regression chooses the line with the least squared difference (deviation) — the smallest gap between actual values and predicted values. That line is the best-fit line.
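For one predictor, the least-squares line has a closed form: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The sketch below applies it to a made-up data set of study hours and exam scores; the data and the function name `fit_line` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y_hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope: change in y per one-unit change in x
    a = mean_y - b * mean_x    # constant: value of y when x = 0
    return a, b

# Hypothetical data: study hours (x) and exam score (y).
hours = [1, 2, 3, 4]
scores = [55, 60, 65, 70]
a, b = fit_line(hours, scores)
print(f"y_hat = {a} + {b} * x")
```

With these made-up points lying exactly on a line, each extra study hour raises the predicted score by b = 5.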
One x → y. ŷ = a + b·x
Several risk factors → y. ŷ = a + b₁x₁ + b₂x₂ + b₃x₃ …
Relation between study hours, stress level, sleep hours, and lecture attendance (4 predictors) and exam score. "Significant" means at least one of them genuinely affects the exam mark.
A type of regression for a qualitative outcome (y).
Outcome has two categories. e.g. yes/no, diseased/non-diseased, gender F/M. → Binary logistic regression.
Outcome has more than two categories. e.g. marital status (single/married/divorced/widow); blood groups A, B, AB, O.
A linear function ŷ = a + b·x gives a quantitative outcome from −∞ to +∞. But binary logistic regression measures the probability of having the outcome, which must stay between 0 and 1.
Logistic regression uses the Sigmoid function (the logistic function, which is the inverse of the Logit; sometimes called the Z-function) to squeeze the output into 0 to 1.
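The sigmoid maps any real number z to a value between 0 and 1 through 1 / (1 + e^(−z)). A minimal sketch follows; the coefficients passed to `predicted_probability` are hypothetical and only show how the linear part a + b·x feeds into the sigmoid.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predicted_probability(a, b, x):
    """Probability of the outcome for one predictor: sigmoid(a + b * x)."""
    return sigmoid(a + b * x)

for z in (-10, 0, 10):
    print(z, "->", round(sigmoid(z), 4))

# Hypothetical coefficients, for illustration only.
print(predicted_probability(a=-2.0, b=0.5, x=6))
```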
Probabilities of depression: P1 = 0.72 → have; P2 = 0.68 → have; P3 = 0.43 → not; P4 = 0.44 → not; P5 = 0.51 → have; P6 = 0.32 → not. (Patient 6 was observed "No" and the model also predicts "Not". The model still isn't perfect: predicting 5 of 6 correctly is about 83% accurate.)
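Turning probabilities into labels needs a cut-off. The labels in the example are consistent with a cut-off of 0.5 (0.51 → have, 0.44 → not), so the sketch below assumes that value; the notes do not state it explicitly.

```python
CUT_OFF = 0.5  # assumption inferred from the example

def classify(probability, cut_off=CUT_OFF):
    """Label a predicted probability as 'have' or 'not'."""
    return "have" if probability >= cut_off else "not"

probabilities = [0.72, 0.68, 0.43, 0.44, 0.51, 0.32]  # patients P1..P6
labels = [classify(p) for p in probabilities]

for i, (p, label) in enumerate(zip(probabilities, labels), start=1):
    print(f"P{i} = {p} -> {label}")
```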
Predictors of depression among 300 DM patients (200 with depression, 100 without). Risk factors (x): age, gender, type of DM, duration of DM, complications, adherence to treatment → what is the probability of having the outcome? e.g. a 25-y patient, DM-I for 15 years, with complications, non-adherent → probability of depression ≈ 75%.
| OR value | Meaning | Example |
|---|---|---|
| OR = 1 | No risk | — |
| OR > 1 | Risk factor | Smoking & lung cancer |
| OR < 1 | Protective | Physical activity & MI |
OR = 5 → exposed showed 5× more risk than non-exposed. Smoking & lung cancer OR = 7 → smoker showed 7× more risk than non-smoker. Physical activity & MI OR = 0.5 → physically active showed half the risk of inactive.
Old age OR = 2 (2× risk vs young — risky). Male gender OR = 3 (3× risk vs female — risky). Physical activity OR = 0.1 (protective: active showed 0.1× the risk of inactive).
To find which factor is the most dangerous, compare like with like. Physical activity is protective (OR = 0.1), so to read it as a risk factor we invert it: OR for physically inactive = 1 / 0.1 = 10 — making inactivity the strongest risk factor here. (Software often reports OR via the eᵇ form.)
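The inversion step can be sketched in Python: protective ORs (below 1) are inverted so that every factor is expressed on the same "risk" scale before comparing. The function name and dictionary are illustrative; the OR values are the ones from the example.

```python
def as_risk(odds_ratio):
    """Express an OR on the risk scale: invert it when it is protective (< 1)."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return odds_ratio if odds_ratio >= 1 else 1 / odds_ratio

odds_ratios = {"old age": 2, "male gender": 3, "physical activity": 0.1}
risk_scale = {factor: as_risk(value) for factor, value in odds_ratios.items()}
strongest = max(risk_scale, key=risk_scale.get)

print(risk_scale)
print("strongest factor:", strongest)
```

After inversion, physical activity reads as "inactivity" with an OR of about 10, which is larger than the OR for old age or male gender.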