Biostatistics — A+ Research Study Guide

Basic Concept & Types of Data

Definitions · Constants vs Variables · Data classification

Definition · Biostatistics

The science of data collection, data manipulation, and data analysis to produce a result that supports decision-making.

There is no decision-making without a result ("evidence"). This is the basis of evidence-based medicine and evidence-based practice.

Clarification: A decision must rest on evidence and a reference: prior research is turned into evidence, and from it comes a result backed by information.

Example

"Drug A is better than Drug B to treat disease X." → Why? How did you know? There must be a prior, valid, sound study (a manuscript) backing the claim.

Types of Data

Constant

A value with no variability.
e.g. number of eyes, ears, fingers. Not useful for research — nothing varies to study.

Variable

The thing research is built on. We ask: ① Is there variability? ② What is the pattern? ③ What are the causes? ④ What are the consequences?

Variables = Variable Data

Quantitative data

Variables expressed in the form of a number.
e.g. age, height, BMI, blood glucose level, cholesterol level, insulin level.

Qualitative data

We ask the patient and the answer is usually a "word" description.
e.g. blood group, diseases.

Quantitative splits into:

Continuous

Accepts decimal values.
e.g. age, weight, height, blood pressure.

Discrete

Does not accept decimals — it is the whole value. "Number of…": beds in a hospital, family members, drugs received, cigarettes smoked per day, hospital admissions, RBCs.

Example of Discrete: "a complete whole number". For number of cigarettes, even a quarter of a cigarette smoked counts as one whole cigarette; a hospital admission counts as one full admission even if the patient stayed only an hour.

Qualitative splits into:

Nominal

Cannot be put in order.
e.g. gender, residence, job title, blood group, disease.

Ordinal

Responses can be put in a definite / documented order.
e.g. grade (A+, A, B+, B), degree of burns, degree of pain, cancer stage.

Exam Note

In research, everything is modifiable except the methodology / analysis: that part shouldn't be changed (modifying the methodology is an error and gets the work rejected).

DATA

Constant: no variability

→

Quantitative: continuous · discrete

Qualitative: nominal · ordinal

Methods of Data Presentation — Tables

Raw data · Tabular presentation · Research questions

Question: how do we present data in an organized way?

Raw Data

Data with no organization — you can't extract any information from it; it's just records. For a small sample (e.g. 5 patients, asking each 3 questions) raw data may be acceptable, but research data is usually large.

Threshold

For samples of 300 and above, use professional methods of data presentation — present data in the smallest space while extracting the most information.

Tabular Presentation

The most famous and oldest method. Any table is built on 3 rows and 3 columns as its basic structure. The intersection of a row and a column is called a cell.

Percent of a category: Percent = (Frequency of category × 100) / Total

Anatomy of a frequency table
Names of the variable (units)	Frequency (from sample)	Percent / relative frequency
Categories of variable	count per category	% per category
Total	n	100%

Worked example — blood group

Frequency distribution of blood group
Blood group	Frequency	Percent
A	10	(10 × 100) / 80
B	15	(15 × 100) / 80
AB	25	…
O	30	…
Total	80	100%
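The percent column can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the original notes: the function name `frequency_table` is an assumption, and the counts are the blood-group frequencies from the table above.

```python
def frequency_table(counts):
    """Return {category: (frequency, percent)} for a dict of category counts."""
    total = sum(counts.values())
    # Percent = (frequency of category x 100) / total
    return {cat: (freq, freq * 100 / total) for cat, freq in counts.items()}


blood_groups = {"A": 10, "B": 15, "AB": 25, "O": 30}  # counts from the table
table = frequency_table(blood_groups)

for category, (freq, pct) in table.items():
    print(f"{category}\t{freq}\t{pct:.1f}%")

# For a one-answer question the percents should add up to 100 (up to rounding).
total_pct = sum(pct for _, pct in table.values())
print(f"Total\t{sum(blood_groups.values())}\t{total_pct:.0f}%")
```

Keeping the frequency next to the percent mirrors the three-column layout of the table.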

Common question: Can the total be less than 100%? No, but it can come out close to 100 (e.g. 98.9% or 99.9%) because of rounding, and that is acceptable.

Research Questions — two types

Mutually Exclusive (One-answer Q)

Accepts only one answer; selecting one prevents selecting any other.
e.g. gender, blood group, job title.

Multiple-response Q

Allows more than one answer by choice.
e.g. preferred food.

Example — multiple response can exceed 100%

We have 100 patients; ask each about DM, asthma, and hypertension. If all 100 have all 3 diseases, the percentages are 100% + 100% + 100% = 300% — perfectly possible here.

Every table needs a Title

A good title is self-explanatory and answers four questions: what? who? where? when?

Example title

"Frequency distribution table for the age (what) of 20 patients (who) in Najran main hospital (where), 2025 (when)."

The Comment on a table

Whether categories are specific depends on the variable type:

Qualitative

Categories are specific. e.g. gender → F, M.

Quantitative

Not specific. e.g. age — if you ask 1800 patients their age you can't list 1800 rows, so you convert to intervals.

Intervals example

18–29, 30–39, 40–49 … each band is an interval.

How to select the cut-off point?

If the variable has a documented cut-off point, use it (e.g. measuring patient BMI or blood pressure — classify on that established basis). Otherwise choose carefully, because:

Interval too wide

Over-summarization of data → you lose variability.

Interval too narrow

Lack of summarization → the table becomes huge again.
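Converting a quantitative variable into intervals can be sketched as follows. The interval limits follow the 18–29, 30–39, 40–49 example; the list of ages and the helper name `to_interval` are made up for illustration and do not come from the notes.

```python
from collections import Counter


def to_interval(age, edges):
    """Return the label of the interval that contains `age`, or None."""
    for low, high in edges:
        if low <= age <= high:
            return f"{low}-{high}"
    return None  # the age falls outside every listed interval


edges = [(18, 29), (30, 39), (40, 49)]       # cut-off points from the example
ages = [18, 22, 29, 31, 35, 38, 39, 41, 47]  # hypothetical sample

counts = Counter(to_interval(age, edges) for age in ages)
for low, high in edges:
    label = f"{low}-{high}"
    percent = counts[label] * 100 / len(ages)
    print(f"{label}\t{counts[label]}\t{percent:.1f}%")
```

Widening the intervals in `edges` merges rows (over-summarization); narrowing them multiplies rows (lack of summarization).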

Methods of Data Presentation — Graphs

Line · Bar · Pie · Scatter

Graphical Presentation — key points

Graphs are usually constructed from a table.
Graphs are NOT more informative than a table (a table carries ~90–95% more information).
Journals often dislike graphs — they take more space and cost more.
Graphs are mostly used for reports or presentations — they make presenting easier and idea-extraction faster.

Reason: Graphs are the easy way to present data and draw the conclusion, because they engage more of the senses (perception), so the audience understands better during a presentation.

① Line Graph

Use

Used only to present data measured over a time period — a certain variable (y) against a time unit (x). It shows the trend of data over time.

Example axes

y = rainfall (per hectare); x = time. Temperature chart: peak at 8 pm.

Trend shapes

Upward trend
Downward trend
Stationary trend
Oscillating trend

② Bar Graph

Use

Used for qualitative data, in both simple and complex tables. (Complex tables add F & M side by side, with a legend / key.)

Criteria of a bar

The width of the bar is equal across bars.
The space between bars equals the width (or ½ the width).
The height is proportionate to the category frequency or percent.

Example — best by frequency vs percent

A class has 5 males (4 pass) and 10 females (6 pass). "Who is best by frequency?" → answering "Female" (6 passes vs 4) is a mistake, because the group sizes differ. "Who is best by percent?" → Male = 4/5 = 80%, Female = 6/10 = 60%. Percent is usually correct; frequency is correct only in a simple table.

Why is percent more correct? Because groups of different sizes must be compared by proportion, not by absolute count; otherwise the result is misleading.

Small-sample caution + prevalence

Blood group A has 2 people, O has 50. Comparing diseased (D⁺) vs non-diseased (D⁻): A → 1 D⁺ (50%), 1 D⁻ (50%); O → 20 D⁺ (40%), 30 D⁻ (60%). Concluding that disease prevalence is higher in group A is a mistake: 50% of A is just one person, so the percent here is misleading. Watch out for tiny samples.

③ Pie Chart

Use

Used for qualitative data in a simple table. Each category is a sector — a specific angle of the circle (total = 360°).

The higher the frequency of a category, the larger its sector; the lower the frequency, the smaller the sector.

Sector angle: Angle = (Frequency × 360°) / Total

Worked example — gender (M=5, F=15, Total=20)

Male angle = (5 × 360) / 20 = 90°
Female angle = (15 × 360) / 20 = 270°
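The sector-angle formula translates directly into code. A minimal sketch, with `sector_angles` as a hypothetical helper name, applied to the gender example:

```python
def sector_angles(counts):
    """Return {category: angle in degrees}; the angles add up to 360."""
    total = sum(counts.values())
    # Angle = (frequency x 360) / total
    return {cat: freq * 360 / total for cat, freq in counts.items()}


angles = sector_angles({"Male": 5, "Female": 15})
print(angles)  # {'Male': 90.0, 'Female': 270.0}
```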

④ Scatter Diagram

Use

Used in correlation analysis. x = independent variable; y = dependent variable.

Variable examples

x = exercise hrs, study hrs, BMI | y = weight loss, exam mark, blood pressure.

Correlation analysis — two things it tells us

Strength (between 2 variables)

Weak: if x increases, y increases weakly.
Intermediate: moderate increase. (e.g. calories & weight gain)
Strong: strong increase. (e.g. exercise hrs & weight loss)

Nature (of the relation)

+ve (direct): if x increases, y increases.
−ve (inverse): if x increases, y decreases. (e.g. stress & exam score)
No correlation: no pattern between the points.

Measures of Central Tendency (Averages)

Mid-range · Mode · Median · Mean

Definition · Average

One representative value for the whole data. Other name: measure of central tendency. Goal: summarize the data into one typical value and help in comparison.

Example — GPA by gender

Comparing male vs female GPA averages (M = 3.6, F = 4.1) lets us conclude females have the higher average — even though we only had grouped percentages. The average condenses everything into one comparable value.

① Mid-range (MR)

Mid-range: MR = (smallest observation + largest observation) / 2

Example — weights of children: 5, 7, 8, 3, 10 kg

MR = (3 + 10) / 2 = 13 / 2 = 6.5 kg

Advantage

Easy to calculate.

Disadvantages

Doesn't consider all values.
Not suitable for qualitative data.
Affected by extreme values (outliers).

Outlier sensitivity

Data 3,2,7,10,5 → MR = (2+10)/2 = 6 (reasonable).
Data 3,2,7,10,50 → MR = (2+50)/2 = 26 (biased by the outlier 50). An "extreme value / outlier" is a value out of trend, e.g. a hormonal level of 105 among 7,8,10,11,13.
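A short Python sketch makes the outlier sensitivity easy to check. The function name `mid_range` is an assumption; the data are the three examples from this section.

```python
def mid_range(values):
    """Mid-range = (smallest observation + largest observation) / 2."""
    return (min(values) + max(values)) / 2


print(mid_range([5, 7, 8, 3, 10]))   # 6.5  (children's weights, kg)
print(mid_range([3, 2, 7, 10, 5]))   # 6.0  (no outlier)
print(mid_range([3, 2, 7, 10, 50]))  # 26.0 (pulled up by the outlier 50)
```

Only the two extreme observations enter the calculation, which is why a single outlier moves the result so far.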

② Mode

Definition

The most frequent observation.

3,2,1,4,6,7 → no mode present
3,2,1,3,4,6 → Mode = 3 (unimodal)
3,2,3,1,4,6,4 → Mode = 3,4 (bimodal)
3,2,3,1,2,4,4 → Mode = 2,3,4 (multimodal)
3,2,1,2,3,1,4,4 → no mode (all equally frequent)
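The five cases above can be reproduced with a small helper. This is a sketch under one stated assumption taken from the examples: when every value occurs equally often, the data are reported as having no mode. The name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; empty when all values are equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if all(count == highest for count in counts.values()):
        return []  # no value stands out, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([3, 2, 1, 4, 6, 7]))          # []        no mode
print(modes([3, 2, 1, 3, 4, 6]))          # [3]       unimodal
print(modes([3, 2, 3, 1, 4, 6, 4]))       # [3, 4]    bimodal
print(modes([3, 2, 3, 1, 2, 4, 4]))       # [2, 3, 4] multimodal
print(modes([3, 2, 1, 2, 3, 1, 4, 4]))    # []        all equally frequent
print(modes(["M", "F", "M", "M", "F"]))   # ['M']     works for qualitative data
```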

Advantages

Easy to calculate.
Used with qualitative data (e.g. M,F,M,M,F → Mode = M).
Not affected by extreme values (e.g. 1,2,3,1,7,500 → Mode = 1).

Disadvantages

Data may have no mode.
Data may have more than one mode.
Doesn't consider all values.

③ Median

Definition

The middle value — observations below = observations above (50% each side).

Steps

Order the data ascending or descending.
Find the median rank, then read off the value.

Odd number (n)

Median rank = (n + 1) / 2

Example: 2,7,1,4,5,8,2 → ordered 1,2,2,4,5,7,8 → n = 7 → rank = 8/2 = 4ᵗʰ → Median = 4

Even number (n)

rank = (n/2) and (n/2)+1

Example: ordered 1,2,3,5,7,10,12,15 → n = 8 → ranks 4ᵗʰ & 5ᵗʰ = 5,7 → Median = (5+7)/2 = 6
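The two rank rules can be sketched in Python as follows. The function name `median_by_rank` is an assumption, and the four-value list in the second call is a made-up illustration rather than data from the notes.

```python
def median_by_rank(values):
    """Order the data, find the median rank(s), then read off the value."""
    ordered = sorted(values)               # step 1: order the data
    n = len(ordered)
    if n % 2 == 1:
        rank = (n + 1) // 2                # odd n: one middle rank
        return ordered[rank - 1]           # ranks are 1-based, indexes 0-based
    lower, upper = n // 2, n // 2 + 1      # even n: two middle ranks
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


print(median_by_rank([2, 7, 1, 4, 5, 8, 2]))  # 4   (n = 7, rank 4)
print(median_by_rank([4, 9, 1, 7]))           # 5.5 (n = 4, ranks 2 and 3)
```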

Advantages

Easy to calculate.
Not affected by extreme values.

Disadvantages

Not suitable for qualitative data.
Doesn't consider all values.

④ Mean (Arithmetic Mean, x̄)

Definition

The "measure of justice" — the balance point of the data.

Mean: x̄ = Σx / n, where Σx = sum of observations and n = number of observations

Examples

3,3,4,8,10 → x̄ = (3+3+4+8+10)/5 = 5.6
20,25,13,7,2 → x̄ = (20+25+13+7+2)/5 = 13.4

Advantages

Considers all values → most accurate.
Easy to calculate.

Disadvantages

Not suitable for qualitative data.
Affected by extreme values: 1,2,3,4,5 → x̄ = 3, but 1,2,3,4,100 → x̄ = 22.
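The examples in this section can be checked with a short sketch. The helper name `mean` is an assumption; `statistics.median` from the standard library is used only to contrast how the two averages react to the extreme value 100.

```python
import statistics


def mean(values):
    """Arithmetic mean: sum of observations / number of observations."""
    return sum(values) / len(values)


print(mean([3, 3, 4, 8, 10]))     # 5.6
print(mean([20, 25, 13, 7, 2]))   # 13.4

# One extreme value drags the mean but leaves the median where it was.
print(mean([1, 2, 3, 4, 5]), statistics.median([1, 2, 3, 4, 5]))      # 3.0 3
print(mean([1, 2, 3, 4, 100]), statistics.median([1, 2, 3, 4, 100]))  # 22.0 3
```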

Decision Rule — which average to use

Qualitative data → Mode. Quantitative data → if it has extreme values use the Median; if no extreme values use the Mean.

DATA → which average?

Qualitative → Mode

Quant. + extreme value → Median

Quant. + no extreme value → Mean

Measures of Dispersion

Why averages aren't enough · Range · Standard deviation

Does an average alone describe data accurately? Not always.

Motivating example — exam out of 100

Males scored 0 and 100; Females scored 40 and 60. Both have an average of 50%. But are they the same? No — the males have high variability. So an average alone is not enough; we report Average ± Dispersion.

Reporting form: x̄_M = 50 ± 50 (ranges 0 to 100) | x̄_F = 50 ± 10 (ranges 40 to 60)
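The motivating example can be reproduced numerically. The notes do not give a formula for the ± value, so this sketch assumes the population standard deviation (`statistics.pstdev`), chosen because it returns exactly the 50 and 10 quoted above; the sample standard deviation (`statistics.stdev`) would give larger figures.

```python
import statistics

males = [0, 100]     # exam scores from the example
females = [40, 60]

for label, scores in (("Males", males), ("Females", females)):
    average = statistics.mean(scores)
    sd = statistics.pstdev(scores)       # population SD: an assumption, see above
    spread = max(scores) - min(scores)   # range = largest - smallest
    print(f"{label}: {average} ± {sd} (range {spread})")
```

Both groups share the same average, yet the dispersion figures differ fivefold, which is the point of reporting average ± dispersion.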

Definition

Dispersion (variability) measures how spread out the data is around its average.

① Range

Range = largest observation − smallest observation

Examples

3,5,7,2,10 → Range = 10 − 2 = 8 (paired with an average).
Children's weights 5,3,2,7,1 kg → Range = 7 − 1 = 6 kg.

② Standard Deviation (SD)

Definition

Used with the mean, reported as x̄ ± SD.

Examples

x̄ ± SD = 60 ± 10 kg → most cluster around 60, ranging 50 to 70.
7 ± 2 → most around 7, ranging 5 to 9 (e.g. hospital stay in days).

Pairing Rule

With the Mean (x̄) → use SD. With the Median → use the Range.

Median + Range example

3,2,1,7,5 → ordered 1,2,3,5,7 → Median = 3, Range = 7 − 1 = 6 → reported as 3 (6).

Shape of Data Distribution

Normal · Skewed · Mean–Median rule · Testing shape

Context

Shape applies only to quantitative data. The presence of an extreme value changes the shape, which changes how we handle the data statistically.

① Normal Distribution

When it occurs

When you collect natural data from people — e.g. ages 1,2,3,4,5 → x̄ = 3, Median = 3 → a normal distribution.

Four features of a normal distribution

Bell-shaped — frequency high in the middle, low at the young and old ends.
Symmetric — a mirror image around the center.
Unimodal — one mode (one peak).
Area under curve = 100% of cases.

Key identity

In a normal distribution: Mode = Mean (x̄) = Median. (More than one mode → bimodal/multimodal → not normal.)

The empirical rule (SD ranges)

Coverage: x̄ ± 1 SD = 68% · x̄ ± 2 SD = 95% · x̄ ± 3 SD = 99.7% (often rounded to 99%)

Worked example — blood glucose, mean = 110 mg/dL, SD = 10

68% (x̄ ± 1 SD): 110 − 10 = 100 → 110 + 10 = 120 → 100 to 120
95% (x̄ ± 2 SD): 110 − 20 = 90 → 110 + 20 = 130 → 90 to 130
99% (x̄ ± 3 SD): 110 − 30 = 80 → 110 + 30 = 140 → 80 to 140
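The three ranges follow mechanically from the mean and SD, as this sketch shows. The function name `empirical_ranges` is hypothetical; the keys 1, 2 and 3 are the number of SDs on each side of the mean, covering roughly 68%, 95% and over 99% of cases.

```python
def empirical_ranges(mean, sd):
    """Return {k: (low, high)} for mean ± k SD, with k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


ranges = empirical_ranges(mean=110, sd=10)  # blood glucose example, mg/dL
for k, (low, high) in ranges.items():
    print(f"mean ± {k} SD: {low} to {high}")
```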

② Skewed Distributions

If there's an extreme value, the data is not normal — it's skewed (not bell-shaped, not symmetric). The data has a tail; the tail's direction names the skew.

Right (Positive) Skewed

Long right tail; large extreme values at the upper limit. Most of the sample is in low values, a small portion in high.
e.g. income.

Left (Negative) Skewed

Long left tail; small extreme values at the lower limit. Most of the sample is in high values (e.g. older age, >50), a small portion in low values (early age).
e.g. retirement age.

Position rule (memorize)

Normal: x̄ = Median = Mode
Right skewed: x̄ > Median > Mode (Mode < Median < x̄)
Left skewed: x̄ < Median < Mode

Examples

BMI data: x̄ = 35, Median = 20 → x̄ > Median → right skewed.
BMI data: x̄ = 26, Median = 31 → Median > x̄ → left skewed.

The 10% Rule (Mean vs Median)

Difference (%) = (large value − small value) / small value × 100

Interpretation

Difference > 10% → considerable → skewed.
Difference ≤ 10% → negligible → treat as normal.

Worked examples

Hormonal level x̄ = 315, Median = 370 → (370 − 315)/315 × 100 = 17.5% > 10% → considerable → skewed. Median > x̄ → left skewed.
x̄ = 40, Median = 35 → (40 − 35)/35 × 100 = 14.3% > 10% → skewed. x̄ > Median → right skewed.
x̄ = 20, Median = 19 → (20 − 19)/19 × 100 = 5.3% < 10% → negligible → normal.
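The 10% rule and the position rule combine into one small function. This is a sketch, assuming positive values for the mean and median (so the division is safe); the name `shape_by_10_percent_rule` is hypothetical.

```python
def shape_by_10_percent_rule(mean, median, threshold=10.0):
    """Return (difference in %, shape label) from the mean and the median."""
    larger, smaller = max(mean, median), min(mean, median)
    diff_pct = (larger - smaller) / smaller * 100
    if diff_pct <= threshold:
        return diff_pct, "normal"            # negligible difference
    if mean > median:
        return diff_pct, "right skewed"      # mean pulled up by high values
    return diff_pct, "left skewed"           # mean pulled down by low values


for mean, median in ((315, 370), (40, 35), (20, 19)):
    diff, shape = shape_by_10_percent_rule(mean, median)
    print(f"mean={mean}, median={median}: {diff:.1f}% -> {shape}")
```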

How to test the shape of distribution

Methods for testing distribution shape
Method	Type	Notes
① Graphical (Histogram)	Subjective	Depends on opinion; shows normal vs skewed visually.
② Comparison of x̄ and Median	Objective	The 10% rule above.
③ Box-and-whisker diagram	Objective	Whiskers equal → normal; right longer → right skewed; left longer → left skewed.
④ Statistical test (Kolmogorov–Smirnov, K-S test)	Statistical	Used first to check if data is normal.

QUANTITATIVE DATA

Normal: no extreme value

Right skewed: extreme value (upper)

Left skewed: extreme value (lower)

Concept of Analytical Statistics

Populations & samples · Hypotheses

Analytical (inferential) statistics looks at relation, association, and comparison — and lets us draw conclusions about a population from a sample.

Target Population

The group the research targets.

Example

Researching prevalence of osteoporosis among Saudi females → target population = all Saudi females.

Do we study every single one of them? No — we take a representative sample: a sub-image of the community that saves time, money, and effort (the "triad of sample").

Inference / Generalization

The result from the sample is inferred onto the community (generalization).

Defining the target population precisely

Depression among diabetic patients → target = diabetic patients.
Depression among type-2 diabetics → target = type-2 diabetics.
Depression among uncontrolled type-2 diabetics → target = uncontrolled type-2 DM patients.
Risk factors for stillbirth → target = mothers who had a stillbirth.

Hypothesis

Null Hypothesis (H₀)

The hypothesis of no difference / equality.
= no difference, no relation.

Alternative Hypothesis (Hₐ or H₁)

The researcher's hypothesis: one of difference / inequality.
= there is a difference, there is a relation.

Example — Drug A (new) vs Drug B (old) for hypertension

H₀: x̄ blood pressure in Drug A = x̄ blood pressure in Drug B.

Example — smoking and lung cancer

H₀: prevalence of lung cancer in smokers = prevalence of lung cancer in non-smokers.

Note: The researcher works on the null hypothesis, trying to reject it in order to support the alternative hypothesis.

How to Analyse Data

Choosing tests · Hypotheses · Significance level · p-value

How do we know which statistical test to use? It comes down to a sequence of decisions.

① Know the type of data

Quantitative

Expressed in the form of averages → analysed with quantitative-data statistical methods.
e.g. compare the mean age.

Qualitative

Expressed in the form of frequency or percentage → analysed with qualitative-data statistical methods.
e.g. compare the prevalence of cancer between groups.

② If Quantitative — is it Normal or Skewed?

If Normal

Use Mean (x̄ ± SD) → analysed by parametric statistics.

If Skewed

Use Median & Median Rank → analysed by non-parametric statistics.

Example

Compare the mean BMI in nulliparous and multiparous females.

③ Research Hypothesis

Null Hypothesis (H₀)

Hypothesis of equality.

Alternative Hypothesis (Hₐ)

The researcher's hypothesis.

Example — mean blood pressure, Group A (new drug) vs Group B (old drug)

H₀: x̄ blood pressure in A = x̄ blood pressure in B (no difference).
Hₐ: x̄_A ≠ x̄_B (not equal — there is a difference).

Example — lung cancer in smoker vs non-smoker (qualitative)

H₀: prevalence of cancer in smoker = prevalence in non-smoker.
Hₐ: prevalence in smoker ≠ prevalence in non-smoker.

Example — one-sided hypotheses (autism drug: A = old/high cost, B = new/low cost)

H₀: remission rate in drug B ≤ remission rate in drug A (new drug equals the old or less).
Hₐ: remission rate in drug B > remission rate in drug A.

Example — DM drug (A = new, B = old)

H₀: blood glucose level in A ≥ blood glucose in B.
Hₐ: blood glucose level in A < blood glucose in B (the new drug lowers glucose more — the researcher's hypothesis).

④ Level of significance (Alpha, α)

Definition

The accepted probability of error in the research decision — commonly 10% or 5%.

Truth vs Research — the two error types

Decision outcomes vs the truth
Research says ↓ / Truth →	No difference	Difference
No difference	✓ correct	Type II error (β)
Difference	Type I error (α)	✓ correct

Type I error (α = 0.05)

We say there is a difference, but actually there is no difference.
This is the more dangerous error.

Type II error (β)

We say there is no difference, but actually there is a difference.

Relation to sample size: The larger the sample, the smaller the alpha; the smaller the sample, the larger the alpha.

α by field

Social sciences: α = 5% to 10%. Medical field: α = 1% to 5% (0.01–0.05). Pharmaceutical: α = 0.001% to 0.01% (needs a very small error because the product is given to humans).

⑤ Selecting the appropriate statistical test

Covered in detail in Lesson 9.

⑥ Statistical decision

Significant

A real difference. e.g. comparing drug A vs B, A is genuinely better — repeating the trial would show the same.

Insignificant

A chance difference. e.g. one time A looks better, another time B does — so the difference is not real.

The p-value — how we tell the difference

Definition

The p-value measures the role of chance. It is a number from 0 to 1.

p-value high

→ Insignificant (closer to 1, more chance).

p-value low

→ Significant (closer to 0, real difference).

Examples

p = 0.03 → significant. p = 0.912 → insignificant. p = 0.051 → just above the line.

Decision Rule (objective method)

If p ≤ α → Significant → Reject H₀ (the difference is taken as real, which supports Hₐ).
If p > α → Insignificant → Fail to reject H₀ (not enough evidence; H₀ stands).
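The decision rule is a single comparison, sketched here with a hypothetical function name `statistical_decision` and α defaulting to 0.05 (an assumption; pass the α chosen for the study).

```python
def statistical_decision(p_value, alpha=0.05):
    """Apply the objective rule: compare the p-value with alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "significant: reject H0"
    return "insignificant: fail to reject H0"


for p in (0.03, 0.912, 0.051):
    print(p, "->", statistical_decision(p))
```

With α = 0.05, p = 0.051 falls on the insignificant side even though it sits just above the line.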

Selecting the Right Statistical Test

Data type → distribution → number of groups → relation

The required test is decided by four questions, in order.

Type of data — quantitative or qualitative?
Distribution — normal or skewed? (If quantitative & normal → parametric, Mean ± SD. If skewed → non-parametric, Median & Median Rank.)
Number of groups — one, two, or more than two?
Relation between groups — independent or dependent?

Number of groups — examples

One group: e.g. x̄ blood pressure of a single group. Two groups: e.g. compare BMI between men and women. More than two: e.g. compare depression score between single, married, and divorced women.

Relation between groups

Independent (unrelated)

The most common. e.g. M vs F; single/married/divorced/widow.

Dependent (related) — 3 situations

① Matched design (to control a confounder, e.g. lung cancer & male gender). ② Pre–post assessment (e.g. BMI before and after intervention in the same people). ③ Repeated measures (same people measured repeatedly).

Confounder & Matched: A confounder is an extraneous factor that affects the relation between two variables, so we use matching to cancel its effect. That is why the groups become dependent (e.g. case & control).

Parametric tests (Quantitative · Normal)

Quantitative, normally distributed → parametric methods
Groups	Independent	Dependent
2 groups	Independent t-test	Paired t-test
> 2 groups	One-way ANOVA	Repeated-measures ANOVA

Independent t-test

Compare BMI in male vs female. BMI = quantitative, normal; M & F = 2 independent groups → independent t-test.

Paired t-test

Measure BMI in obese people before & after dietary intervention. Same people = dependent → paired t-test.

One-way ANOVA

Compare BMI between Abha, Albaha, and Najran. Quantitative, normal, >2 independent groups → one-way ANOVA.

Repeated-measures ANOVA

Measure BMI in pregnant women in the 1st, 2nd, and 3rd trimester. Quantitative, normal, >2 dependent → repeated-measures ANOVA.

Non-parametric tests (Quantitative · Skewed)

Quantitative, skewed → non-parametric methods
Groups	Independent	Dependent
2 groups	Mann–Whitney test	Wilcoxon test
> 2 groups	Kruskal–Wallis test	Friedman test

Qualitative data tests

Qualitative data
Relation	Test
Independent	Pearson's Chi-Square test (χ²) — risk factor & outcome
Dependent	McNemar test

Pearson Chi-Square

Nora's study: relation between blood group and risk of obesity. Type = qualitative, independent → Pearson Chi-Square test.

McNemar

Lana's study: high-knowledge prevalence before and after a course (high/low). Qualitative, two groups, dependent → McNemar test.

Worked case questions

Case 1 — Ahmed

Compares the mean Hb level before and after iron for 2 months. ① Quantitative, normal ② 2 groups ③ dependent → paired t-test.

Case 2 — Lora

Assesses patient awareness score about HBV before, 1 month after, and 2 months after a health-education program, expressing the median per phase. ① Quantitative, skewed (median) ② >2 groups ③ dependent → Friedman test.

CHOOSING A TEST

1. Data type: quant / qual

2. Distribution: normal / skewed

3. Groups: 1 / 2 / >2

4. Relation: independent / dependent
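The four questions can be encoded as a lookup over the three tables of this lesson. This is a sketch only: the function name `select_test` and the string labels are assumptions, and the one-group case is left out because the tables here do not assign it a test.

```python
# (distribution, number of groups, relation) -> test, copied from the tables above
QUANTITATIVE_TESTS = {
    ("normal", "2", "independent"): "Independent t-test",
    ("normal", "2", "dependent"): "Paired t-test",
    ("normal", ">2", "independent"): "One-way ANOVA",
    ("normal", ">2", "dependent"): "Repeated-measures ANOVA",
    ("skewed", "2", "independent"): "Mann-Whitney test",
    ("skewed", "2", "dependent"): "Wilcoxon test",
    ("skewed", ">2", "independent"): "Kruskal-Wallis test",
    ("skewed", ">2", "dependent"): "Friedman test",
}

QUALITATIVE_TESTS = {
    "independent": "Pearson's Chi-Square test",
    "dependent": "McNemar test",
}


def select_test(data_type, relation, distribution=None, groups=2):
    """Walk the four questions: data type, distribution, groups, relation."""
    if data_type == "qualitative":
        return QUALITATIVE_TESTS[relation]
    if groups < 2:
        raise ValueError("the tables in this lesson cover two or more groups")
    size = "2" if groups == 2 else ">2"
    return QUANTITATIVE_TESTS[(distribution, size, relation)]


# Case 1 (Ahmed): Hb before and after iron, quantitative and normal
print(select_test("quantitative", "dependent", "normal", groups=2))
# Case 2 (Lora): awareness score at three time points, reported as a median
print(select_test("quantitative", "dependent", "skewed", groups=3))
```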

Correlation Analysis

Nature & strength · Scatter diagram · Correlation coefficient

Definition

Correlation (relation / association) analysis tests for the correlation between two variables — independent (x) and dependent (y) — for the same individual. (Correlation and Regression are the two analyses of relation.)

Test of correlation — two things

Nature

+ve (direct): both x & y change in the same direction (x↑ y↑, x↓ y↓).
−ve (indirect/inverse): both change in opposite directions (x↑ y↓, x↓ y↑).

Strength

Weak · Intermediate · Strong. Combined with nature gives six categories: +ve weak/intermediate/strong and −ve weak/intermediate/strong.

① Graphical method — scatter diagram (subjective)

Example axes

x = study hours (independent), y = exam score (dependent). An upward cloud = +ve / direct correlation (if x↑ then y↑).

Reading the scatter

Nature is read from the direction; strength from how tightly the points hug the line: (a) weak, (b) intermediate, (c) strong. The best/strongest is a sharp upward band.

② Correlation coefficient (mathematical · objective)

Range

The correlation coefficient runs from −1 to +1. The sign tells the nature; the value tells the strength.

Interpretation of the correlation coefficient
Strength	Negative	Positive
Weak	−0.01 to <−0.25	0.01 to <0.25
Intermediate	−0.25 to <−0.75	0.25 to <0.75
Strong	−0.75 to <−1	0.75 to <1
Special	0 = no correlation · ±1 = perfect

Reading values

r = 0.23 → +ve weak: positive weak correlation between x & y.
r = −0.81 → −ve strong: negative strong correlation.
r = 0.28 → +ve intermediate.
r = −1.2 → invalid (cannot be below −1).
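The interpretation table can be written as a small function. A minimal sketch with a hypothetical name `interpret_r`; it treats any non-zero value below 0.25 in magnitude as weak, which is an assumption for values between 0 and 0.01 that the table does not list.

```python
def interpret_r(r):
    """Turn a correlation coefficient into 'nature strength' wording."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if r == 0:
        return "no correlation"
    nature = "positive" if r > 0 else "negative"   # the sign gives the nature
    size = abs(r)                                  # the value gives the strength
    if size == 1:
        return f"{nature} perfect"
    if size < 0.25:
        return f"{nature} weak"
    if size < 0.75:
        return f"{nature} intermediate"
    return f"{nature} strong"


for r in (0.23, -0.81, 0.28):
    print(r, "->", interpret_r(r))
```

Calling `interpret_r(-1.2)` raises `ValueError`, matching the "invalid" reading above.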

Types of correlation coefficient

Pearson (r) · parametric

Use when both variables are quantitative and normally distributed.

Spearman (rho) · non-parametric

Use when: both variables skewed; one skewed & one normal; both qualitative ordinal; or mixed (one quantitative, one qualitative ordinal).

Simple linear correlation

One risk (x) → one disease (y).

Multiple linear correlation

More than one risk (x₁, x₂, x₃) → one disease (y). Expressed as a matrix.

The correlation matrix & p-value

In a matrix the diagonal line is all 1's (each variable correlated with itself), and values mirror across it. Don't read the r-values alone — check the p-value first to know whether the relation is real.

Example

x₁–y₁: r = 0.01, p = 0.02 (<0.05) → significant.
x₂–y₂: r = 0.99, p = 0.214 (>0.05) → insignificant (the high r is not trustworthy without significance).

Regression Analysis

Why beyond correlation · Linear regression · Coefficients · Model checks

Why correlation isn't enough

Cholesterol level: r = 0.26 (+ve intermediate). BMI: r = 0.74 (+ve intermediate). Both land in "intermediate," yet their effects differ — and we still can't predict blood pressure at, say, BMI = 35.

Disadvantages of the correlation coefficient

Due to its range interpretation, two variables with different effects may share the same label.
Can't predict y from a given x.

Regression — two advantages

Gives the exact relation between x and y.
Predicts y for a given x.

Types of regression — depend on the outcome (y)

Outcome Quantitative → Linear regression

e.g. BMI (x) & blood pressure in mmHg (y).

Outcome Qualitative → Logistic regression

e.g. BMI & hypertensive (yes/no). (See Lesson 12.)

Linear regression equation

Simple linear regression: ŷ = a + b·x

a = constant (intercept)

The value of y when x = 0. e.g. birth weight; or salary on joining work (experience = 0). Any change in x doesn't change a.

b = regression coefficient (slope/angle)

The amount of change in y per one-unit change in x. The bigger b, the stronger the effect.

Examples of a and b

Neonatal age (x) & weight (y): a = birth weight. Work experience (x) & salary (y): a = starting salary.
Cholesterol & blood pressure, if b = 5 → one unit (1 gram) rise in cholesterol raises blood pressure by 5 mmHg. Study hours & exam score: 1 unit (1 hour) ↑ raises the mark by 5.

Prediction example

Weight (x) & cholesterol (y), expected cholesterol = 3 + 2 × weight. For weight = 75 kg → 3 + 2×75 = 153 g/dl. (Here a = 3 is the theoretical cholesterol when weight = 0; the predictor x can be quantitative or qualitative.)
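The prediction is a direct evaluation of ŷ = a + b·x. A minimal sketch with a hypothetical function name `predict`, using the coefficients from the example:

```python
def predict(x, a, b):
    """Simple linear regression prediction: y-hat = a + b * x."""
    return a + b * x


print(predict(75, a=3, b=2))                         # 153
# b is the change in y for a one-unit change in x:
print(predict(76, a=3, b=2) - predict(75, a=3, b=2)) # 2
# a is the predicted y when x = 0:
print(predict(0, a=3, b=2))                          # 3
```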

Best-fit line

Definition

Regression chooses the line with the least squared difference (deviation) — the smallest gap between actual values and predicted values. That line is the best-fit line.
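How the best-fit line is found can be sketched with the standard least-squares formulas for one predictor, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. These formulas are not stated in the notes, and the study-hours data below are made up for illustration; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y-hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b


def squared_error(xs, ys, a, b):
    """Sum of squared gaps between actual and predicted values."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


hours = [1, 2, 3, 4, 5]         # hypothetical study hours
marks = [52, 58, 61, 70, 74]    # hypothetical exam marks

a, b = fit_line(hours, marks)
print(f"a = {a:.1f}, b = {b:.1f}")
# Any other line leaves a larger total squared gap than the fitted one.
print(squared_error(hours, marks, a, b) < squared_error(hours, marks, 46, 5.5))
```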

Simple vs Multiple linear regression

Simple

One x → y. ŷ = a + b·x

Multiple

Several risk factors → y. ŷ = a + b₁x₁ + b₂x₂ + b₃x₃ …

Three checks for a regression model

Significance of model (p-value): significant → at least one significant predictor for the outcome; insignificant → no real relation.
Model fit (R²): the amount of variance in y explained by x. 30% and above → model accepted; <30% → weak fit.
Individual effect of each x (from b): larger b = stronger. e.g. x₁ b=7, x₂ b=10 (strongest), x₃ b=2 (weak).

Example

Relation between study hours, stress level, sleep hours, and lecture attendance (4 predictors) and exam score. "Significant" means at least one of them genuinely affects the exam mark.

Logistic Regression

Qualitative outcomes · Probability · Odds Ratio

When it's used

A type of regression for a qualitative outcome (y).

Dichotomous / Binary

Outcome has two categories. e.g. yes/no, diseased/non-diseased, gender F/M. → Binary logistic regression.

Multichotomous

Outcome has more than two categories. e.g. marital status (single/married/divorced/widow); blood groups A, B, AB, O.

Why a special function?

A linear function ŷ = a + b·x gives a quantitative outcome from −∞ to +∞. But binary logistic regression measures the probability of having the outcome, which must stay between 0 and 1.

The fix

Logistic regression uses the sigmoid (logistic) function, the inverse of the logit, applied to z = a + b·x (hence also called the Z-function), to squeeze the output into 0 to 1.
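The squeezing step can be sketched in a few lines. The sigmoid formula 1 / (1 + e^(−z)) is the standard definition and is not spelled out in the notes; the coefficients `a` and `b` in the usage lines are hypothetical values chosen only to show the shape.

```python
import math


def sigmoid(z):
    """Map any real number z to a probability between 0 and 1."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)            # this form avoids overflow for very negative z
    return ez / (1 + ez)


a, b = -4.0, 0.1                # hypothetical intercept and coefficient
for x in (0, 40, 80):
    z = a + b * x               # the linear part can be any real number
    print(x, round(sigmoid(z), 3))
```

At z = 0 the function returns exactly 0.5, the usual cut-off between the two outcome categories.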

Reading the probability

Probability < 50% (<0.5)

→ −ve → outcome D⁻ (does not have it).

Probability ≥ 50% (≥0.5)

→ +ve → outcome D⁺ (has it).

Worked example — 6 DM patients, cut-off = 0.5

Probabilities of depression: P1 = 0.72 → have; P2 = 0.68 → have; P3 = 0.43 → not; P4 = 0.44 → not; P5 = 0.51 → have; P6 = 0.32 → not. (Patient 6 was observed "No" and the model also predicts "Not". The model isn't perfect, though: predicting 5 of 6 correctly makes it about 83% accurate.)
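Applying the cut-off and counting correct predictions can be sketched as follows. The six probabilities are those of the example; the `observed` statuses are hypothetical, invented only so that one prediction disagrees, because the notes do not list what was observed for each patient.

```python
def classify(probability, cutoff=0.5):
    """Label a predicted probability using the cut-off."""
    return "have" if probability >= cutoff else "not"


def accuracy(predicted, observed):
    """Share of patients whose predicted label matches the observed one."""
    correct = sum(predicted[p] == observed[p] for p in predicted)
    return correct / len(predicted)


probabilities = {"P1": 0.72, "P2": 0.68, "P3": 0.43,
                 "P4": 0.44, "P5": 0.51, "P6": 0.32}
predicted = {patient: classify(p) for patient, p in probabilities.items()}
print(predicted)

# Hypothetical observed statuses (not from the notes): P4 disagrees with the model.
observed = {"P1": "have", "P2": "have", "P3": "not",
            "P4": "have", "P5": "have", "P6": "not"}
print(f"{accuracy(predicted, observed):.0%} classified correctly")
```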

Predictor example

Predictors of depression among 300 DM patients (200 with depression, 100 without). Risk factors (x): age, gender, type of DM, duration of DM, complications, adherence to treatment → what is the probability of having the outcome? e.g. a 25-y patient, DM-I for 15 years, with complications, non-adherent → probability of depression ≈ 75%.

Three checks for a logistic model

Model significance (p-value): p ≤ 0.05 significant → at least one included risk factor is a true risk factor; p > 0.05 insignificant.
Model fit (classification accuracy): how many cases the model predicts correctly. Cut-off 30% and above accepted.
Individual effect (Odds Ratio): how many times an exposed person is more or less likely to have the disease.

Odds Ratio (OR) — interpretation

Reading the Odds Ratio
OR value	Meaning	Example
OR = 1	No risk	—
OR > 1	Risk factor	Smoking & lung cancer
OR < 1	Protective	Physical activity & MI

Comment examples

OR = 5 → exposed showed 5× more risk than non-exposed. Smoking & lung cancer OR = 7 → smoker showed 7× more risk than non-smoker. Physical activity & MI OR = 0.5 → physically active showed half the risk of inactive.

Risk factors of MI

Old age OR = 2 (2× risk vs young — risky). Male gender OR = 3 (3× risk vs female — risky). Physical activity OR = 0.1 (protective: active showed 0.1× the risk of inactive).

Flipping a protective factor

To find which factor is the most dangerous, compare like with like. Physical activity is protective (OR = 0.1), so to read it as a risk factor we invert it: OR for physically inactive = 1 / 0.1 = 10 — making inactivity the strongest risk factor here. (Software often reports OR via the eᵇ form.)
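The inversion step and the eᵇ form can be sketched together. The function names are hypothetical; the odds ratios are the three MI risk factors listed above, and `math.exp(b)` reflects the standard relation OR = e^b for a logistic coefficient b.

```python
import math


def odds_ratio_from_b(b):
    """Odds ratio from a logistic regression coefficient: OR = e^b."""
    return math.exp(b)


def as_risk_scale(odds_ratio):
    """Express an OR on the risk side: protective values (< 1) are inverted."""
    return odds_ratio if odds_ratio >= 1 else 1 / odds_ratio


factors = {"old age": 2, "male gender": 3, "physical activity": 0.1}
comparable = {name: as_risk_scale(value) for name, value in factors.items()}
print(comparable)

# After inversion, physical activity reads as inactivity with OR = 10.
strongest = max(comparable, key=comparable.get)
print("strongest factor:", strongest)

print(odds_ratio_from_b(0))  # 1.0, i.e. b = 0 means no effect
```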