GCSE Statistics
GCSE statistics coursework, and dealing with multi-dimensional data
by James Nicholson, Belfast Royal Academy
Preamble
This page provides electronic copies of the data
sets used in the article, and live links to web sites providing access to other
data sources. The graphs and tables in Excel in the article are all pasted in as
Excel objects rather than pictures, meaning that you will be able to click on
them and open up
the whole spreadsheet to explore them if you want to.
Introduction
One of the things which seems to be causing some problems in the introduction
of the statistics coursework is that pupils are being expected to deal with
complex data sets, where multiple variables are to be considered.
There has been little tradition of data interpretation within the mathematics
curriculum. The main emphasis has been on the mechanical skills of constructing
specified graphs correctly, or ‘reading the graph’ to extract detailed
information.
Indeed, I think it is arguable as to whether we have identified the hierarchical
structure which might allow us to develop interpretative skills in pupils most
effectively. I suggest that exposure to reasonably large numbers of data sets
and graphs that have something to say which is both accessible and of substance
is the way forward.
The advent of new technology makes this practical, provided the teaching resources
are available when the hardware reaches the classroom. In particular,
the requirement now is to make comparisons, whether of the way a
population has changed over time or of parallel populations, e.g. comparing boys
and girls, different age groups, nationalities, word lengths from different
newspapers, prices of different types and ages of cars.
You may recognize some of the contexts your examination board recommend for
the GCSE coursework
project. Access to data sets which are suitable for students to analyse on a
smaller scale has been limited – they can take up a lot of space in a
traditional textbook, and be tedious to work with even just in terms of data
entry into electronic form. However, pupils need to learn some of the techniques
that will be useful for the larger scale project in a controlled and focussed
context.
The emphasis of the statistical methods we have taught the pupils up to this
point has not been on relationships between variables, or how to describe differences
between groups by considering the distributions of a common variable for the
groups.
I will briefly consider some of the software issues for schools today. Then
I will illustrate some options for
representing data in ways that allow relationships between and within groups
to be visualized. While these graphs are not required components of coursework,
pupils find it difficult to articulate abstract ideas, and I would classify
distribution comparison as abstract for pupils at this age.
If ‘a picture paints a thousand words’ they may find it easier
to talk about comparisons between groups when they have graphs they can refer
to. For example:
- stacked or compound bar charts
- multiple (overlaid) cumulative frequency curves
- comparative bar charts
- comparative pie charts
- scatter graphs using different symbols to represent different groups within the full population
- comparative box plots.
One of the important requirements in statistics coursework is that calculations
and graphs should be appropriate – if candidates calculate all possible
statistics (such as mean, median, mode, range, interquartile range and standard
deviation) and plot multiple graphical representations they should actually
lose marks! What is expected is that they choose which to use, offer an interpretation
of what is revealed by the statistic or graph, and if possible offer an explanation
of why that representation was chosen. This is not to say that more than one
graph of a set of data should not be used – the example of death rates
below shows that some of the graphs reveal distinctly different aspects of the
data, but others are really alternatives.
In the last section I will give some suggestions for finding data sets which
are accessible.
Software
If pupils are to make use of ICT in data handling, what software is appropriate?
Excel has the advantage that it is universally available in UK schools. It has
disadvantages though in that it may draw inappropriate graphs, and some of the
algorithms it uses are not robust. For example, in plotting time series, it will
plot the moving averages at the last time used in the average instead of in
the middle of the time period.
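The difference in placement can be made concrete with a short Python sketch. The figures below are invented purely for illustration, and the function names are hypothetical. The same 3-point moving averages are paired first with the last time in each window (the placement described above) and then with the middle time.

```python
def moving_averages(values, window):
    """Return the simple moving averages of `values` for the given window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def place_at_end(times, values, window):
    """Pair each average with the LAST time in its window."""
    return list(zip(times[window - 1:], moving_averages(values, window)))


def place_at_centre(times, values, window):
    """Pair each average with the MIDDLE time in its window (odd window only)."""
    half = window // 2
    return list(zip(times[half:len(times) - half], moving_averages(values, window)))


# Invented illustrative data: five time points and their values.
times = [1, 2, 3, 4, 5]
values = [10, 20, 30, 40, 50]

print(place_at_end(times, values, 3))     # averages attached to times 3, 4, 5
print(place_at_centre(times, values, 3))  # same averages attached to times 2, 3, 4
```

This sketch only handles an odd number of points; an even window (such as four quarters) falls between two times and would need a further averaging step to centre it.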
It can also be extremely awkward to get titles for the graph and axes well
positioned within Excel without the actual graph distorting or becoming very
small in relation to the size of the whole chart object. In this case it may
be more efficient to annotate the printout with these details rather than spend
a long time on the cosmetic details – in coursework it is important that
those details are present, but not that they have to be electronically produced.
Figures 2 and 3 show examples of such a situation.
Minitab is a specialist statistics package and if schools
already have it for use with sixth form statistics courses, it would be worth
considering for use with the statistics coursework with able groups. Details
can be found at http://www.minitab.com/.
The same would apply to any other specialist packages already in use at school
level.
Autograph is a much more versatile piece of software which
can be used for graph plotting and dynamic geometry as well as statistics, and
that versatility makes it well worth considering. The current version will store
a number of datasets, but each has to be treated as a separate entity, rather
than the spreadsheet style of Excel and Minitab.
Proper histograms, with variable width intervals, can be drawn, and comparative
box plots can be constructed. However, the other comparative graphs illustrated
below are not possible at the moment, nor any form of sampling where you want
to take all the variables for the cases in the sample.
This means that it is really not a suitable vehicle for undertaking an analysis
of the types of data files provided by the Boards for coursework – for
example the Mayfield High School data set provided by Edexcel has 27 variables
for about 1200 students altogether. However, the next version of Autograph
is planned to incorporate spreadsheet style entry, and this should make it much
more flexible in its use.
Details can be found at www.autograph-maths.com.
Graphs showing comparative aspects of multi-variable data sets are shown in
Figure 1.
These stacked or compound bar charts illustrate vividly the changes which have
taken place over the past hundred years in terms of life expectancy.
Fig. 1 Stacked bar charts showing death rates over time

A hundred years ago there were very large numbers of infants who did not reach
their first birthday, while this is very much the exception now. (Teachers should
note that this sort of data would not be used in public examinations, where
an individual might have experienced the death of an infant sibling. Awareness
of the circumstances of pupils in their care would be a consideration if thinking
of using such a data set as a teaching resource.)
This data set is taken from Fact File (2002), published by Carel both in paper
form and on CD-ROM; see http://www.carelpress.co.uk/
for details of the 2003 version. All
the data sets in the paper file are available in electronic form on the CD-ROM,
though they are not always in the most accessible form.
Carel also have sample pages from the 2002 version online at
http://www.carelpress.co.uk/
The full data set has 13 age bands, which I have collapsed into six bands in
the tables in order to simplify the graphical picture. While the age bands are
not of equal size, they broadly represent infant, pre-school, child, young adult,
middle age and old age in today’s society (Table 1).
Other graphical representations can highlight different aspects of the data,
as shown in Figure 2.
By grouping the data by age band, it is very easy to see in Figure 2
what the changes over time have been for each band, and whether the changes in
the first half and second half of the century follow similar patterns –
so we see that early deaths have almost disappeared, while the numbers living
to old age have increased dramatically during the 20th century.
The only group where the change in the second half of the century is not in
the same direction as in the first
half is the 45–74 age range, where the number of deaths increased and then decreased.
Fig. 2 Comparative bar charts showing the different time periods together
for each of the age bands (without titles for graph and axes)
If the age bands are grouped together for each of the three time periods, as
in Figure 3, we see more immediately how the pattern of ages
at death has changed. Obviously, all the same information is contained in both
charts, but presenting 6 groups of 3 bars, or 3 groups of 6 bars draws attention
to different aspects as the most prominent feature.
Fig. 3 Comparative bar charts showing the different age bands together
for the three periods (without titles for graph and axes)

The stacked bar chart is generally appropriate where the data is in order,
so that as well as comparing the size of each section, the top of each section
in Figure 1 can be compared across the different periods to
see the cumulative comparisons.
Equivalent pie charts could be drawn for the other time periods. Where the
classes are ordered as in Figure 4, I think the stacked bar
chart is superior, because the order is more obvious than in a pie chart, which
starts at 12 o’clock and then shows the groups in order clockwise.
Fig. 4 Pie chart for the number of deaths in each age band in 1900–02
The other advantage of the stacked bar chart is that you have an option to
show groups as %’s, so the total height of the three groups would be equal
(and this is directly equivalent to what is shown by a pie chart) or, as I did
above, to show the total frequencies.
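As a minimal sketch of the percentage option, the Python below rescales each bar's frequencies so that every bar totals 100. The frequencies are invented for illustration and the function name is hypothetical.

```python
def to_percentages(counts):
    """Convert a list of frequencies into percentages of their total."""
    total = sum(counts)
    return [100 * c / total for c in counts]


# Invented illustrative frequencies for three groups in two periods.
period_a = [50, 30, 20]
period_b = [10, 30, 160]

print(to_percentages(period_a))  # sections of the first bar, summing to 100
print(to_percentages(period_b))  # sections of the second bar, summing to 100
```

Plotting the percentages makes the bars the same height, so only the shares can be compared; plotting the raw frequencies keeps the difference in totals visible.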
Table 1 is appropriate for drawing the stacked bar chart,
and the comparative bar charts, but to draw a cumulative frequency graph it
needs amendment, as shown in Table 2. In particular a decision
needs to be made as to an appropriate upper limit for the ‘over 75’
group.
Table 1 Numbers of deaths per year for different age bands at three
time periods

I have chosen 100 as a reasonable value.
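The amendment needed for a cumulative frequency graph can be sketched in Python as follows. Only the cap of 100 and the boundary at 75 come from the text; the other band limits are assumptions, and the frequencies are invented rather than taken from Fact File.

```python
from itertools import accumulate

# Upper age limit of each band. Only 75 (end of the 45-74 band) and the
# chosen cap of 100 come from the article; the other limits are assumed.
upper_limits = [1, 5, 15, 45, 75, 100]

# Invented illustrative frequencies, NOT the Fact File figures.
frequencies = [140, 60, 40, 160, 300, 300]

# Running total of the frequencies, band by band.
cumulative = list(accumulate(frequencies))

# Points to plot: start at (0, 0), then (upper limit, cumulative frequency).
points = [(0, 0)] + list(zip(upper_limits, cumulative))
print(points)
```

Each cumulative total is plotted against the upper limit of its band, which is why the open-ended top band needs a chosen end point.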
Figure 5 shows the cumulative number of deaths by age at the
three time periods, and it is evident that there are substantial differences.
However, I think comparative cumulative frequency graphs are really quite difficult
to interpret correctly, even for adults, and students find it very difficult
to be confident that they have the direction of comparisons the right way round.
From these cumulative frequency graphs, or from the raw data, we can
draw box plots, which are very much easier to interpret.
Fig. 5 Overlaid cumulative frequency graphs of the
ages at death in the three time periods
The graph in Figure 6 is produced in Autograph, and the cumulative
frequency curves were used to get the three values needed for plotting the median
and quartiles. The trend towards people living longer is immediately obvious
from the comparative box plots in Figure 6. Box plots give
a minimum of information, but where you are trying to identify trends this reduction
in detail can be a great benefit.
Fig. 6 Comparative box plots for the age at death in the three time
periods
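Reading the median and quartiles from a cumulative frequency graph can be approximated in Python. This is a sketch under stated assumptions: it joins the plotted points with straight lines, whereas the article read the values from smoothed curves, and the points below are invented rather than the article's data. The function name is hypothetical.

```python
def value_at(points, target):
    """Linearly interpolate the value at which the cumulative frequency
    reaches `target`, given (upper limit, cumulative frequency) points."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("target lies outside the cumulative range")


# Invented illustrative cumulative frequency points, not the article's data.
points = [(0, 0), (20, 100), (60, 300), (100, 400)]
total = points[-1][1]

# The quartiles and median sit at one quarter, one half and
# three quarters of the total frequency.
lower_quartile = value_at(points, total / 4)
median = value_at(points, total / 2)
upper_quartile = value_at(points, 3 * total / 4)
print(lower_quartile, median, upper_quartile)
```

These three values, together with the smallest and largest values, are what a box plot displays.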
The cumulative frequency graphs in Figure 5 are produced in
Excel by plotting a scatter graph with data points joined by a smoothed line (one
of the options available in the chart wizard for scatter graphs). The data for
it is in a rectangular array (Table 2) because all three time
periods had the data collected to the same age.
Table 2 Data in Table 1 in a form suitable for a cumulative frequency
graph
If you have data belonging to two or more groups which you would like to draw
on a scatter graph, this can be illustrated neatly by putting the different
groups into different columns, so that Excel treats them as different series,
and will use different symbols to plot them.
The data in Table 3 originated in three columns, with fuel
type, age and mileage. To construct Table 3 the data was sorted
by fuel type, and then the mileage values for the diesel cars were highlighted,
then cut and pasted into the block of cells in the next column. When drawing
the graph highlight just the values, but not the row of labels, so that you
can choose what the series identifiers will be.
Table 3 Age and mileage of petrol and diesel cars
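The rearrangement described above (sort by fuel type, then move one group's mileages into its own column) can be sketched in Python. The records are invented for illustration and are not the values in Table 3.

```python
# Invented illustrative records: (fuel type, age in years, mileage).
cars = [
    ("petrol", 2, 18000),
    ("diesel", 3, 45000),
    ("petrol", 5, 52000),
    ("diesel", 1, 16000),
]

# One row per car: (age, petrol mileage, diesel mileage). The mileage goes
# in the column for that car's fuel type and the other column is left
# empty (None), mirroring the cut-and-paste step in the spreadsheet.
rows = [
    (age, mileage if fuel == "petrol" else None, mileage if fuel == "diesel" else None)
    for fuel, age, mileage in sorted(cars)
]

for row in rows:
    print(row)
```

With the two mileage columns separated in this way, a charting tool can treat each column as its own series and give it its own symbol.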

The facility to observe whether or not groups display different or similar
characteristics goes to the heart of what candidates are being expected to do
in the coursework – and again a graph like Figure 7 may
make it easier for pupils to start talking about comparisons.
Fig. 7 Scatter graph showing petrol and diesel cars separately

Generating the data to display such graphs would be incredibly time consuming
if you had to compile the
comparative statistics ‘by hand’. However, Excel’s Pivot Table
facility offers the opportunity to generate two way tables for large data sets
with a lot of flexibility.
The examples use the CensusAtSchool data for the UK.
Mathematics in School, November 2003. The MA web site: www.m-a.org.uk
While in the spreadsheet, go to Data > PivotTable Report. Different versions
of Excel use slightly different terminology, and produce slightly different
looking dialog boxes at this stage, but that shown in Figure 8
is for Excel 97.
Fig. 8 Pivot Table dialog box in Excel 97

It is possible to put more than one ‘field’ or variable in the
columns or rows, and get subdivisions into more categories, but the simplest
structure is to use just one in each case. The default is for the field being
analysed to be summarized by ‘Sum of ...’, but you can choose other
summary statistics such as a count for categorical variables, the average or
min/ max or standard deviation, etc.
In this data set, a question as to whether the pupil owned a mobile phone was
coded with 0 for No and 1 for Yes. So ‘average’ will actually report
the proportion of pupils owning a mobile phone. Figure 9 shows
the table produced for boys and girls separately in each year group, and the
cells in the table have been formatted to show the results as a percentage to
1 decimal place.
Fig. 9 Pivot Table showing mobile phone ownership rates
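The reason that the average of a 0/1 coded variable gives a proportion can be seen in a small Python sketch of the same two-way summary. The records are invented for illustration and are not CensusAtSchool data.

```python
from collections import defaultdict

# Invented illustrative records: (year group, gender, owns mobile phone 0/1).
pupils = [
    (7, "boy", 1), (7, "boy", 0), (7, "girl", 1), (7, "girl", 1),
    (8, "boy", 1), (8, "boy", 1), (8, "girl", 0), (8, "girl", 1),
]

# Collect the 0/1 codes for each (year group, gender) cell of the table.
cells = defaultdict(list)
for year, gender, owns_phone in pupils:
    cells[(year, gender)].append(owns_phone)

# The sum of 0/1 codes counts the 1s, so sum / count is the proportion of 1s.
ownership = {cell: sum(codes) / len(codes) for cell, codes in cells.items()}

for (year, gender), proportion in sorted(ownership.items()):
    print(f"Year {year} {gender}s: {proportion:.1%}")
```

Changing the summary from the average to `len(codes)` would give the count in each cell instead, which corresponds to choosing a different summary statistic in the Pivot Table.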

Figure 10 shows the comparative bar chart which will allow
comparisons between year groups, and also between boys and girls to be seen.
Fig. 10 Chart of the Pivot Table summarizing the mobile phone ownership
rates

Sources of Data
I will talk about three places where useful data sets might be accessed:
- Libraries
- Other curriculum areas
- The Internet
The main libraries in most major towns have substantial reference sections,
and generally the staff are more than keen to assist visitors to understand
the range of material they store. Whatever your interests are, they are likely
to be able to locate data relevant to it, and if you start by looking at data
related to your own interests you are more likely to be able to shape it into
something interesting for your classroom.
Academic subject areas such as biology, geography, home economics and physical
education all make use of numerical data with some of the multi-variable characteristics
we want to explore. Discussions with teachers from other subject areas may uncover
opportunities for cross-curricular developments which could benefit learning
in both subject areas. For example, a biology coursework investigation examined
the
effectiveness of an enzyme in breaking down protein at different temperatures.
Protein stains such as tomato sauce, blood, etc. are difficult to get out of
clothes, and ‘biological’ washing powders use enzymes to break down
the protein structure to help the washing powder clean clothes of such stains.
The problem is made rather more complex by the other factors affecting the
effectiveness which result in
substantial variability – things such as the material which was stained,
how old the stain was, etc. The comparative box plots shown in Figure
11, produced in MINITAB®, show the % mass loss
when the enzymes were applied at a variety of temperatures, and it is clear
that there is an optimum temperature, above which the enzyme is less effective
– the greater the % mass loss, the more effective the enzyme is in breaking
down the protein.
Fig. 11 Box plots illustrating the effectiveness of enzymes in breaking
down protein at different temperatures

The data used here is from a student at Belfast Royal Academy, who did this
experiment as an A level biology investigation. For the biologists, the focus
of interest is on explaining why this should be, whereas the statistician’s
interest is in establishing that it happens, and advising the manufacturer
as to where the optimum performance can be obtained.
Examples of the sort of questions which could be posed to pupils with this
data picture would be:
- What was the median % mass loss at 30°C?
- At what temperature was the enzyme most effective in
breaking down protein?
- At what temperature was the mass loss most variable?
- Describe how the % mass lost changes as the temperature
increases.
- Sharon says that she will have to put Jonny’s sweatshirt
into a very hot wash because he has spilt so much
tomato sauce on it. Is she correct?
From the graph the answers would be:
- About 12%.
- 60°C (where the box plot is generally higher than any
other, although it overlaps both the 40° and 50°).
- 50°C (the box plot is longer from top to bottom
here than anywhere else).
- It seems to increase as the temperature increases, but
also gets more variable, up to 50° then at 60° it is at its
highest, and is quite consistently high, but above 60° it
becomes much less efficient quite
quickly.
- No. We tend to think ‘hotter will be better’ for washing,
but the evidence here is that the enzyme actually loses
its effectiveness in very hot water, and so a medium
temperature is better as well as less expensive!
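The comparisons behind the second and third answers can also be made numerically. The Python sketch below uses invented readings, not the student's measurements, and takes the interquartile range as one possible measure of variability (the answer above judges it by the full length of the box plot).

```python
from statistics import median, quantiles

# Invented illustrative % mass loss readings, NOT the student's measurements.
readings = {
    30: [8, 10, 12, 14, 16],
    50: [5, 12, 20, 28, 35],
    60: [22, 24, 25, 26, 28],
}

# Median and interquartile range for each temperature group.
summary = {}
for temperature, values in readings.items():
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    summary[temperature] = {"median": median(values), "iqr": q3 - q1}

# Highest median suggests the most effective temperature;
# largest interquartile range suggests the most variable one.
most_effective = max(summary, key=lambda t: summary[t]["median"])
most_variable = max(summary, key=lambda t: summary[t]["iqr"])
print(most_effective, most_variable)
```

Pupils would still need to judge overlap between groups from the plot itself; the summary figures only support that discussion.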
Apart from data from other academic subjects we should be helping students
to understand how data can inform our understanding of the world we live in.
The death rates data set is one example of this, and there is also a lot of
data available on a country by country basis – in almanacs, and from web
sites such as The World Bank at http://www.worldbank.org/data/.
Here data can be accessed by country, or by topic, and the screen shot in Figure
12 shows some of the range of data which is stored.
Fig. 12 Screen shot of World Bank’s web site

The comparative box plots in Figure 13 were produced in MINITAB®
again, but graphics calculators will also produce them, and software routines
are becoming available to enable such graphs, and proper (variable width!) histograms,
to be drawn in Excel and elsewhere.
Fig. 13 Infant death rates in different regions of the world

Fig. 14 National wealth in different regions of the world

The infant death rates are the number of deaths per 1000 births in the country.
Comparing the two sets of box plots in Figures 13 and 14, we
see that there are big differences between regions for both the national wealth
and the infant death rates. Figure 15 plots a scatter diagram
of the two variables against one
another, for samples from each region.
Fig. 15 Infant death rate v national wealth, by region

By using different symbols to identify the region, we see some startling evidence
about the divisions existing between different parts of the world. In the climate
of international relations existing in the world today, it seems reasonable
that we might consider some of the big citizenship issues by analysing
data sources such as these.
The most striking impressions are that wealthy countries always have low infant
death rates and that poor countries have very variable rates. Obviously there
are other important factors apart from wealth, and the clustering of symbols
in areas of the graph suggests avenues which could be explored.
There is now an amazing amount of data available on the Internet, although
a lot of it still needs a considerable amount of work to get it into a form
which is usable in the classroom. Some places where data is available already
in usable form are:
CensusAtSchool:
http://censusatschool.ntu.ac.uk/index.html
has a lot of data at an individual level, and some aggregated data, as well
as activities based on the resources which include comparative data with South
Africa, New Zealand and Australia.
The Royal Statistical Society Centre for Statistical Education has just published
a Toolkit in Data Handling for Projects, which includes a specialist random
data selector.
Commenting on it, the Centre said:
“This toolkit shows how you can enable students to learn data handling
skills through collecting data using the international CensusAtSchool project,
web site and resources. It helps them to produce useful projects with meaningful
conclusions.” See www.censusatschool.ntu.ac.uk for more information about
the project, its data and the Toolkit.
There are also collections of web data source references available. For example:
http://www.mis.coventry.ac.uk/%7Estyrrell/data.php
adopts a different approach. There is a discussion of the range of things you
might be interested in, with web links embedded in the text which is very easy
to read and follow.
http://www.argonet.co.uk/oundlesch/mlink.html#stats
is another collection of data sites by Oundle School, and the site has a lot
of useful mathematics links also.
Kranat et al.
(2001) pp. 262–266 list some principles for guidance in using the Internet
to find data, as well as sites under different headings. It is available to
download on the web version, and is reproduced by kind permission of Oxford
University Press.
One of the main sources now available is the National Statistics site at: http://www.statistics.gov.uk.
The front page will vary in detail as it shows the new data sets becoming available,
but the screen shot in Figure 16 shows what the structure is,
with the ‘Select Theme’ drop down menu highlighted so you can see
the range of themes which are available.
Fig. 16 National Statistics web site

There is a wealth of data available from this site, on almost every topic in
which government has some
involvement. You can also use this site to view or download local area statistics,
in England and Wales only, for your ward or local authority on a wide range
of subjects including population, crime, health and housing.
A lot of the data needs to be manipulated to make it usable in the classroom,
but over time the data available here will provide a rich resource.
Summary
There is a scarcity of materials where pupils get the opportunity to reason
with multi-variable sets of data. We also need to provide guidance as to the
sort of language that is appropriate in interpretation. For example where a
scatter graph shows strong correlation, we should caution against assuming a
direct causal relationship without contextual reasons.
Where the scatter graph shows only moderate correlation, we know there must
be other factors at work also. There is an opportunity now to use new technology
to help overcome some of these problems. Many of these representations reveal
themselves properly only in colour. Viewed in black and white, or even the two
colour printing now in many textbooks, the relationships are much harder to
visualize.
The increasing availability of projection facilities in classrooms means pupils
should be able to experience a much wider variety of these displays, and respond
to focussed questions about them. This might act as preparation for working
in the more open-ended environment of a statistics project where they decide
which graphical representations would be most appropriate.
These comparative graphs are not requirements of the new coursework, but they
offer pupils a way into articulating comparisons which may be useful. However,
all of this should not just be seen as a response
to the demands of the new coursework. I believe that these skills are
inherently worthwhile: if we want our young people to grow into
intelligent and critical consumers of data, we need to develop resources that
offer them the opportunity to develop the necessary skills.
Acknowledgements
Portions of MINITAB Statistical Software output contained in this article are
printed with permission of Minitab Inc.
MINITAB® is a registered trademark of Minitab Inc.
The author would like to thank Neville Davies and Douglas Butler for their constructive
comments on a draft of this article.
References
Kranat, J., Housden, B. and Nicholson, J.R. 2001 Statistics GCSE for AQA, Oxford
University Press, Oxford.
Keywords: Data interpretation; Graphs; Statistics coursework.
Author
James Nicholson, Belfast Royal Academy, Cliftonville Road, Belfast BT14 6JL.
e-mail: j.r.nicholson@ntlworld.com
Use of the Internet
The Internet is a vast resource available, mostly free of charge, to everyone,
but its very size can be very frustrating – how do you find the information
you want, and in a useable form?
- If your search is too specific, and doesn’t use exactly the keywords
the site creators
thought of, then you may not turn up relevant sites.
- If your search is too general you can turn up literally millions of sites,
and you can’t
find the most relevant ones.
Since the Internet is literally growing and changing day by day, there is no
guarantee that a site that is here today, will still be here tomorrow.
*You will find lots of data which you might like to work with,
which will take a lot of effort to get into a form you can use.
For that reason, this section will give some principles to guide you in
using the Internet, as well as identifying the ‘best sites’ currently
available.
Really, the opportunities are endless, but it will help if you have a good
idea
of what information you want before you start.
The information here is given in four categories:
- Established organisations
- Data warehouses
- Specialised sites
- Miscellaneous
1 Established organisations
*If the actual link is not still live you can always start at the
home site.
If someone is doing the hard work of periodically updating information in a
field you are interested in, it can save you a huge amount of time, and frustrated
effort. Here are some examples which are likely to exist for a long time:
2 Data warehouses
- Data warehouses collect a lot of information on related matters together
in one place. Here are some examples:
- The World Bank has a lot of data which is relevant to
a number of curriculum study areas which use data, such as Geography,
Economics, Sociology etc. The country tables provide data on more than 50
indicators for 206 of the world's countries. Other links take you to multiple
data sets relating to issues such as AIDS and climate change. A good starting
point for this currently is: http://www.worldbank.org/html/schools/data.htm
- Unicef have another large collection of data, specifically about women and
children in countries around the world. Go to http://www.unicef.org/statistics/index.html
- The Broadcasters’ Audience Research Board stores data on TV viewing
figures, which are available on a monthly and weekly basis, and
on a national or a regional basis. Go to http://www.barb.co.uk
Currently you have to ‘register’ by entering some contact details, but
there is no fee for accessing the data on their site.
* Barb do have much more information which they will sell to businesses
wishing to analyse viewing habits in order to target advertising efficiently,
or schedule programming.
- The Journal of Statistics Education is an online journal which is published
free. The archives are
available freely and one of its regular features is datasets, which come with
explanations of the context. Go to http://www.amstat.org/publications/jse
3 Specialised sites
These sites give information about a specific subject, or a particular company.
- The Automobile Association. Go to http://www.theaa.com
for information about all sorts of things related to motoring – you
can access information about prices of second hand cars, insurance and so
on.
- The National Lottery. Go to http://lottery.merseyworld.com
to get information about the winning numbers, the number of jackpot winners
in each draw, the ticket sales etc. – all sorts of related data sets
which provide opportunities for statistical investigations.
- Polling and market research organisations are in the data business, so as
you might expect their web sites offer a lot of data! Go to http://www.gallup.com
or http://www.mori.com or
http://www.nop.co.uk for
archived data on surveys on just about any topic.
- The Government. Go to http://www.statistics.gov.uk
This gives you access to a lot of national and regional statistics on themes
such as crime and justice, health and care, the economy, education, transport
and many others. There are datasets here in Excel and in csv format, either
of which you can save on disk. The csv format is useful if you want to use
a spreadsheet other than Excel.
*To open a csv file in Excel:
In the Type of files menu, select All Files in the folder you have saved the
file into, and then select the file.
The screenshot shows what the Excel sheet looks like when it opens.
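A csv file can also be read outside a spreadsheet. The following is a minimal Python sketch using the standard `csv` module; the column names and values are invented for illustration and do not come from the National Statistics site.

```python
import csv
import io

# Invented illustrative file contents; a real download would be opened
# with open("filename.csv", newline="") instead of io.StringIO.
csv_text = "region,population\nNorth,1200\nSouth,3400\n"

# DictReader uses the first line as column headings.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Values arrive as text, so convert numeric columns before calculating.
total = sum(int(row["population"]) for row in rows)
print(rows[0]["region"], total)
```

Because csv is plain text with one record per line, the same file opens in any spreadsheet or statistics package.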

- The Schools Census. Go to http://censusatschool.ntu.ac.uk/default.asp
and a screen like this should appear:

This is a site with a lot of information collected about school pupils in the
United Kingdom, linked to the (adult) National Census in April 2001. It has
datasets available for download already in Excel.
Downloading the Town Excel file gives you access to this information:

4 Miscellaneous
You can search the web for hobbies and sports clubs – use a search engine
to find the sites of your favourite teams, or the sites of the national organisation
for your favourite sport, and explore the links available. Match programmes,
or advertising material will also carry the web address.
- The Guinness Book of Records has an online version available at http://www.guinnessworldrecords.com/home.asp
where you can get statistical information about the smallest, biggest, fastest
everything in the world.
- Another similar site is at http://www.dartmouth.edu/~chance/
then follow link to Chance News, which contains archived material back to
1992. The material here tends to be US orientated but the content is very
good, if you are interested in bigger statistical questions and issues.
- This site takes you through setting up your own opinion poll step by step:
http://www.opinionpower.com