The study took place in Botswana with 4,550 households. We compare our sample to national-level indicators and find that the final sample has characteristics that match those of a nationally representative sample, as described in the sections below. Supplementary Fig. 1 shows a heat map of the location of the children’s schools to demonstrate the distribution of participants across the country. Supplementary Fig. 2 provides a timeline of each step, from initial phone number collection through piloting and training, programme implementation and waves of data collection. Supplementary Fig. 3 provides an overview of the experimental design. Of working phone numbers, 71% were reachable and gave consent to participate in the study.
We randomized the 4,550 phone numbers into three groups of equal size: a weekly SMS message followed by a phone call, a weekly SMS message only and a control group. We further cross-randomized 2,250 numbers for a midline assessment, and approximately 1,600 of these were randomly selected to receive targeted instruction customized to their learning level using the data collected at midline. The initial randomization to SMS, phone calls and SMS, or the control group was stratified on whether at least one child in the household had previously participated in school-based educational programming, a proxy for having recently made substantial learning gains. Each phone number belongs to a caregiver and household.
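As an illustration of the assignment procedure described above, the following is a minimal sketch of stratified randomization into three arms of near-equal size. It is not the study's code: the function name `stratified_assign`, the arm labels, the seed and the synthetic household list (including the share of households with prior programme participation) are assumptions made for the example.

```python
import random
from collections import Counter

# Arm labels are illustrative names, not identifiers from the study.
ARMS = ("phone_and_sms", "sms_only", "control")


def stratified_assign(households, seed=0):
    """Assign each phone number to an arm, balancing arms within each stratum.

    households: list of (phone_id, prior_participation) pairs, where
    prior_participation is True if at least one child in the household
    previously took part in school-based programming (the stratifier).
    """
    rng = random.Random(seed)

    # Group phone numbers by stratum.
    strata = {}
    for phone_id, prior in households:
        strata.setdefault(prior, []).append(phone_id)

    assignment = {}
    for ids in strata.values():
        rng.shuffle(ids)           # random order within the stratum
        arms = list(ARMS)
        rng.shuffle(arms)          # so no arm systematically gets the remainder
        for i, phone_id in enumerate(ids):
            assignment[phone_id] = arms[i % len(arms)]
    return assignment


# Synthetic data: 4,550 phone numbers, a quarter flagged as prior participants
# (this share is made up for the example).
households = [(i, i % 4 == 0) for i in range(4550)]
assignment = stratified_assign(households)
arm_counts = Counter(assignment.values())
print(arm_counts)
```

Dealing arms out in rotation after shuffling keeps each stratum balanced to within one household per arm, which is why the overall arm sizes differ by at most a couple of units.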
Sample characteristics and representativeness
We include a few descriptive statistics to describe how our sample, which represents around 15% of all primary schools in Botswana, compares to characteristics of nationally representative samples. Botswana has nine regions in total and our sample covers eight of them, including the most remote and low-literacy regions.
Extended Data Fig. 1 compares study sample characteristics with national indicators for a subset of indicators. We find a similar gender split of between 50% and 52% in our sample and nationwide. We also find a similar ratio of rural students in our sample to the national average of 29%. We find similar distributions of learning: the percentage of students who score an A, B and C is 16%, 21% and 41% in study schools, respectively, and 14%, 17% and 36% for all primary schools in the nation.
In addition, we collect simple descriptive data on child age, grade and gender in surveys. Around 50% of our sample is female; the average age of students is 9.7; 28.5% of students are in grade 3, 39.1% in grade 4 and 32.4% in grade 5. The average age of caregivers participating in the randomized trial was 35, and 68% of them were female. Our data show that in the control group the median caregiver (48.5%) spends just 1–2 h on educational activities with their child per week. We asked households to nominate the best person to provide educational support to their child during school disruption: 81% of nominated caregivers were parents, 7.6% were grandparents, 7.8% aunts or uncles, and 2.8% siblings. Additional details on the primary caregiver are in Supplementary Information ‘Section A: Sample Characteristics’.
For a subsample of parents (n = 209), we also measure parental education level and additional characteristics. This subset is not necessarily representative of the entire sample. However, these were the most responsive parents, suggesting that they probably represent an upper bound on parental literacy in the sample. In this subsample, 29% had completed schooling beyond secondary school, compared with a national average of 26% based on data from the World Bank. These measures suggest that the sample of parents has similar education levels to the national average. Moreover, the sample in the study has moderate literacy rates similar to other low- and middle-income countries. While the average secondary schooling completion rate in Europe and Central Asia is over 90%, average completion rates in lower middle-income countries are only just above 70%47.
For our two main learning outcomes focused on foundational numeracy skills—average level and place value—Fig. 1 (see also Table 1) shows large, statistically significant learning differences between treatment and control groups. For the combined phone and SMS group, there was a 0.121 standard deviation (95% CI 0.031, 0.210; P = 0.008) increase in the average numerical operation. The learning gains for the combined phone and SMS intervention also translate to other foundational skill competencies, such as gains in place value of 0.114 standard deviations (95% CI 0.028, 0.200; P = 0.009). For households that participated in all sessions, instrumental variables analysis in Extended Data Fig. 2 shows learning gains of 0.167 standard deviations (95% CI 0.046, 0.289; P = 0.007). As we show later, these results are robust to several validity checks. We find no statistically significant effects on average for the SMS-only intervention across all three learning proficiencies—average level, place value and fractions (P = 0.602, 0.837 and 0.309, respectively).
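The reported confidence intervals and P values above can be cross-checked against one another. The sketch below backs out the implied standard error from each 95% CI and recomputes a two-sided P value. It assumes a symmetric interval and a normal approximation; the study's own inference may instead rest on a t distribution or adjusted standard errors, so only approximate agreement should be expected. The function names are illustrative.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def se_from_ci(lower, upper, z_crit=Z_95):
    """Standard error implied by a symmetric confidence interval."""
    return (upper - lower) / (2 * z_crit)


def two_sided_p(estimate, se):
    """Two-sided P value under a normal approximation: 2 * (1 - Phi(|z|))."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))


# (label, estimate, CI lower, CI upper, reported P) as quoted in the text.
results = [
    ("average level, phone + SMS", 0.121, 0.031, 0.210, 0.008),
    ("place value, phone + SMS", 0.114, 0.028, 0.200, 0.009),
    ("average level, IV estimate", 0.167, 0.046, 0.289, 0.007),
]

checks = {}
for label, est, lo, hi, reported in results:
    implied = two_sided_p(est, se_from_ci(lo, hi))
    checks[label] = (implied, reported)
    print(f"{label}: implied P = {implied:.4f}, reported P = {reported}")
```

Under these assumptions, the implied P values should agree with the reported ones up to rounding of the published estimates and interval bounds.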
These results reveal that combined phone and SMS low-tech interventions can generate substantial learning gains, and that SMS messages alone are not as effective (P = 0.033). This suggests that SMS messages on their own might not be as effective as direct instruction; instead, they might be best placed as a complement to direct instruction through phone calls, as in this study, or as an accountability nudge for education systems, for example, as reminders for parents to monitor their child’s academic progress12.
To put the effect sizes of the joint phone and SMS treatment in context, ref. 48 provides benchmarks based on a review of 1,942 effect sizes from 747 randomized controlled trials (RCTs) evaluating education interventions with standardized test outcomes. In this review, 0.10 is the median effect size. A review in ref. 49 also finds a median effect size of 0.10 across 130 RCTs in low- and middle-income countries. Our findings show effect sizes that are around or above the median effect size, with a relatively cheap and scalable intervention. We further include non-standardized effect sizes in Extended Data Fig. 3. We find a 31% reduction in absolute innumeracy (students who cannot do any numerical operations) and an average level gain on the ASER assessment of 0.15 levels (95% CI 0.039, 0.262; P = 0.008). As a benchmark, a highly effective in-school educational programme, Teaching at the Right Level, achieved average improvement in math ASER levels of 0.09 to 0.13 in Bihar, India15. Furthermore, the learning gains observed were achieved in a total dosage of just 3 h of direct instruction spread over 8 weeks. If effects persist at the same rate with a higher dosage, up to a 1–2 ASER level gain could potentially be achieved with 20–40 h of instruction, a typical educational programme dosage. Note that the gains observed relative to the control group might reflect new learning, reduced learning loss or a combination of both.
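The dosage extrapolation above is simple proportional arithmetic, sketched below. It assumes that gains scale linearly with hours of instruction, which the data cannot confirm; the variable names are illustrative.

```python
# Figures quoted in the text.
observed_gain_levels = 0.15   # average ASER level gain
observed_hours = 3.0          # total direct instruction over 8 weeks

# Assumption: the gain per hour stays constant at higher dosages.
gain_per_hour = observed_gain_levels / observed_hours

projected = {hours: gain_per_hour * hours for hours in (20, 40)}
for hours, gain in projected.items():
    print(f"{hours} h of instruction -> about {gain:.1f} ASER levels")
```

At 0.05 levels per hour, 20 h and 40 h of instruction correspond to roughly 1 and 2 ASER levels, respectively, matching the range given in the text. Diminishing returns at higher dosages would make these figures upper bounds.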
In Extended Data Fig. 4, we explore heterogeneous treatment effects along three dimensions: student gender, student grade and baseline school exam performance. These variables are typical predictors of learning and were available at baseline. We find limited evidence of heterogeneity along any of these margins, with no statistically significant interaction effects (see figure for fully reported results). This suggests that the programme works similarly well across these subpopulations. One possible explanation for the striking lack of heterogeneity in treatment effects is the focus of the intervention on foundational concepts, which applied to nearly all students. Moreover, since the phone calls were a one-on-one interaction, this ensured that no student was left behind.
We run a series of validity checks for our remote assessments and treatment effects. First, we randomize problems that test the same proficiency, a version of a reliability test used in the psychometric literature17. We randomize five problems for each proficiency including for addition, subtraction, multiplication, division and fractions (Table 2). We find that each random problem across all proficiencies is not statistically significantly different compared with a base random problem. Relatedly, we find no difference in treatment effects by the random question received for each proficiency. These tests reveal that the phone-based learning assessment has a high level of internal reliability. Details of statistical results, including P values, standard errors and F-tests are shown in Table 2.
We further disentangle cognitive skills gains from effort effects, which have been shown to affect test scores18. In our context, where learning outcomes are measured remotely in the household, effort might be particularly important. We test this hypothesis with a real-effort task that requires the student to spend time thinking about the question and to exert effort or motivation to answer it, beyond simple numerical proficiency (see Methods). As shown in column 1 of Extended Data Fig. 5, around 29% of students could answer this question in the control group, and we find no statistically significant changes in effort as a result of any of the interventions (β = 0.016, 95% CI −0.026, 0.058; P = 0.448 and β = 0.021, 95% CI −0.021, 0.063; P = 0.335). Column 2 shows the effect on average level as a reference. These results indicate that learning gains due to the intervention are largely a function of cognitive skill rather than effort on the test.
It is also possible that learning gains are a matter of familiarity with the content in the intervention groups which received exposure to similar material as on the endline assessment. The familiarity hypothesis is partially tested by randomizing problems of the same proficiency, since this exogenously varies the question asked to minimize overlap with any particular question asked during the intervention itself; this does not change our results. We also test the familiarity hypothesis by including content not covered during the intervention, but which is related, such as place values; as noted earlier, we find that in the phone and SMS group, learning gains can translate to this skill.
We further explore a psychometric validity assessment known as the known-groups method. This approach quantifies whether test scores detect signal across groups that are known to differ50. We explore differences in learning level by student age and grade in the control group, two of the factors known to most affect differences in cognitive skills in the status quo. We find in Extended Data Fig. 6 that the assessment detects large and statistically significant differences across both dimensions. For each grade, students score around half an ASER level higher (P < 0.001), demonstrating the assessments’ ability to differentiate among known groups (β = 0.476, 95% CI 0.377, 0.576).
We include a series of additional robustness checks in Supplementary Tables 1 and 2, including P values using randomization inference51 and a joint test of significance for key foundational numeracy learning outcomes. We find small differences in P values overall, and that overall results hold, probably because of the large study sample size, which reduces the likelihood of these P values differing substantially (see Supplementary Tables 1 and 2 for full statistical results).
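For readers unfamiliar with randomization inference, the following is a minimal sketch of the general idea on synthetic data: treatment labels are repeatedly reshuffled, and the observed difference in means is compared with the distribution of differences under reshuffling. It is not the procedure used for Supplementary Tables 1 and 2. In particular, it ignores the stratification and any covariates, and the function name, sample size, effect size and number of draws are assumptions made for the example.

```python
import random
import statistics


def randomization_p_value(outcomes, treated, n_draws=1000, seed=0):
    """Two-sided randomization-inference P value for a difference in means."""
    rng = random.Random(seed)

    def diff_in_means(labels):
        t = [y for y, d in zip(outcomes, labels) if d]
        c = [y for y, d in zip(outcomes, labels) if not d]
        return statistics.fmean(t) - statistics.fmean(c)

    observed = diff_in_means(treated)
    labels = list(treated)
    extreme = 0
    for _ in range(n_draws):
        rng.shuffle(labels)  # re-draw the assignment, keeping arm sizes fixed
        if abs(diff_in_means(labels)) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the P value is never exactly zero.
    return observed, (extreme + 1) / (n_draws + 1)


# Synthetic data: 400 units, half treated, with a made-up 0.5 s.d. effect.
data_rng = random.Random(1)
treated = [i % 2 == 0 for i in range(400)]
outcomes = [data_rng.gauss(0.5 if d else 0.0, 1.0) for d in treated]

observed, p_value = randomization_p_value(outcomes, treated)
print(f"observed difference = {observed:.3f}, randomization P = {p_value:.4f}")
```

With large samples, P values obtained this way tend to be close to those from conventional asymptotic tests, which is consistent with the small differences reported above.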
Lastly, we explore how effects vary on the basis of whether instruction is targeted to the learner’s learning level. As seen in Table 1, we find an effect on average level for targeted content of β = 0.076 (95% CI −0.014, 0.165; P = 0.097) and β = 0.070 on average level for non-targeted content (95% CI −0.021, 0.160; P = 0.130). The direct comparison between targeted and non-targeted instruction has a P value of 0.896. Targeted instruction translates to increased learning when compared with the control, improving understanding of place values by 0.098 standard deviations (95% CI 0.012, 0.185; P = 0.026). Targeted instruction also benefits learning of higher-order competencies such as understanding fractions, with gains of 0.093 standard deviations against the control (95% CI 0.004, 0.182; P = 0.041). There were no significant effects on learning for non-targeted instruction against the control. The difference between targeted and non-targeted instruction is not statistically significant (see Table 1).
We explore parental demand and engagement mechanisms. Parental engagement in both interventions is high, with column 1 of Extended Data Fig. 7 showing 92.1% of parents reporting that their child attempted to solve any of the problems in the SMS-only group (95% CI 0.903, 0.938; P < 0.001), and slightly higher engagement of 95.2% in the phone call plus SMS group (95% CI 0.939, 0.966; P < 0.001). Table 3 column 3 also shows significant increases in parents’ self-efficacy and perceptions as a result of both interventions. Parents report 4.9 (95% CI 0.7, 9.1; P = 0.023) and 8.6 (95% CI 4.4, 12.9; P < 0.001) percentage points greater self-efficacy in supporting their child’s learning in the SMS-only group and the phone and SMS group, respectively. We also find that parents’ confidence that their child made progress on their learning increases by between 6.6 (95% CI 2.4, 10.9; P = 0.002) and 10.5 (95% CI 6.2, 14.8; P < 0.001) percentage points. Moreover, parents of children in the phone call plus SMS group update their beliefs about their child’s learning level in tandem with their child’s learning progress (see Table 3 columns 1 and 2). These results reveal that parents are engaged in the intervention and notice their child’s progress.
Parents’ engagement in their child’s math learning might displace other educational activities and non-educational activities, such as returning to work when lockdowns were lifted. In column 6 of Table 3, we find no statistically significant educational crowd-out for either intervention, with no reduction in educational engagement overall (β = −0.001, 95% CI −0.019, 0.017; P = 0.933 and β = −0.002, 95% CI −0.020, 0.016; P = 0.809). In column 5, we find no evidence that parental engagement crowds out non-educational activities such as return to work, with no statistically significant increase in unemployment in the SMS plus phone intervention (β = −2.9, 95% CI −6.3, 0.5; P = 0.092).
Altogether, these results show that remote instruction can change parental beliefs and investments, which play an important role in their child’s learning. The Supplementary Information contains details on each of the mechanisms mentioned here, as well as details on other robustness checks performed.