The Mathematical Association - supporting mathematics in education



GCSE Statistics

GCSE statistics coursework, and dealing with multi-dimensional data

by James Nicholson, Belfast Royal Academy

Word version of this page (with embedded Excel files) (large 1.2 MB file)
Pdf version of this page (544 KB file)

Right click and choose "Save Target As" to download the above files. Download Adobe Acrobat Reader for free here

Preamble

This page provides electronic copies of the data sets used in the article, and live links to web sites providing access to other data sources. The graphs and tables in Excel in the article are all pasted in as Excel objects rather than pictures, meaning that you will be able to click on them and open up the whole spreadsheet to explore them if you want to.

Introduction

One of the things which seems to be causing some problems in the introduction of the statistics coursework is that pupils are being expected to deal with complex data sets, where multiple variables are to be considered.

There has been little tradition of data interpretation within the mathematics curriculum. The main emphasis has been on the mechanical skills of constructing specified graphs correctly, or ‘reading the graph’ to extract detailed information.

Indeed, I think it is arguable as to whether we have identified the hierarchical structure which might allow us to develop interpretative skills in pupils most effectively. I suggest that exposure to reasonably large numbers of data sets and graphs that have something to say which is both accessible and of substance is the way forward.

The advent of the new technology makes it practical, if the teaching resources are available when the hardware becomes available in the classrooms. In particular, the requirement now is to make comparisons – whether this is the way a population has changed over time, or parallel populations, e.g. comparing boys and girls, different age groups, nationalities, word lengths from different newspapers, prices of different types and ages of cars.

You may recognize some of the contexts your examination board recommends for the GCSE coursework project. Access to data sets which are suitable for students to analyse on a smaller scale has been limited – they can take up a lot of space in a traditional textbook, and be tedious to work with even just in terms of data entry into electronic form. However, pupils need to learn some of the techniques that will be useful for the larger scale project in a controlled and focussed context.

The emphasis of the statistical methods we have taught the pupils up to this point has not been on relationships between variables, or how to describe differences between groups by considering the distributions of a common variable for the groups.

I will briefly consider some of the software issues for schools today. Then I will illustrate some options for representing data in ways that allow relationships between and within groups to be visualized. While these graphs are not required components of coursework, pupils find it difficult to articulate abstract ideas, and I would classify distribution comparison as abstract for pupils at this age.

If ‘a picture paints a thousand words’ they may find it easier to talk about comparisons between groups when they have graphs they can refer to. For example:

  • stacked or compound bar charts
  • multiple (overlaid) cumulative frequency curves
  • comparative bar charts
  • comparative pie charts
  • scatter graphs using different symbols to represent different groups within the full population
  • comparative box plots.

One of the important requirements in statistics coursework is that calculations and graphs should be appropriate – if candidates calculate all possible statistics (such as mean, median, mode, range, interquartile range and standard deviation) and plot multiple graphical representations they should actually lose marks! What is expected is that they choose which to use, offer an interpretation of what is revealed by the statistic or graph, and if possible offer an explanation of why that representation was chosen. This is not to say that more than one graph of a set of data should not be used – the example of death rates below shows that some of the graphs reveal distinctly different aspects of the data, but others are really alternatives.

In the last section I will give some suggestions for finding data sets which are accessible.

Software

If pupils are to make use of ICT in data handling, what software is appropriate?
Excel has the advantage that it is universally available in UK schools. It has disadvantages though in that it may draw inappropriate graphs, and some of the algorithms it uses are not robust. For example, in plotting time series, it will plot the moving averages at the last time used in the average instead of in the middle of the time period.
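
The point about where a moving average is plotted can be illustrated with a minimal Python sketch. This is not Excel's own algorithm: the figures and the function name `moving_average` are invented for illustration. It computes a 3-point moving average once, then pairs each value either with the last time in its window or with the middle time.

```python
def moving_average(values, window):
    """Return the simple moving averages of `values` over `window` points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical time series, invented for illustration only.
times = [1, 2, 3, 4, 5, 6]
values = [10, 14, 12, 16, 20, 18]
window = 3  # an odd window has a single middle time point

averages = moving_average(values, window)

# Trailing placement: each average is plotted at the LAST time in its window.
trailing = list(zip(times[window - 1:], averages))

# Centred placement: each average is plotted at the MIDDLE time in its window.
half = window // 2
centred = list(zip(times[half:len(times) - half], averages))

print("trailing:", trailing)
print("centred: ", centred)
```

The averages are identical in both lists; only the time each one is attached to changes, which is why a trailing plot appears to lag behind the data.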

It can also be extremely awkward to get titles for the graph and axes well positioned within Excel without the actual graph distorting or becoming very small in relation to the size of the whole chart object. In this case it may be more efficient to annotate the printout with these details rather than spend a long time on the cosmetic details – in coursework it is important that those details are present, but not that they have to be electronically produced. Figures 2 and 3 show examples of such a situation.

Minitab is a specialist statistics package and if schools already have it for use with sixth form statistics courses, it would be worth considering for use with the statistics coursework with able groups. Details can be found at http://www.minitab.com/.
The same would apply to any other specialist packages already in use at school level.

Autograph is a much more versatile piece of software which can be used for graph plotting and dynamic geometry as well as statistics, and that versatility makes it well worth considering. The current version will store a number of datasets, but each has to be treated as a separate entity, rather than the spreadsheet style of Excel and Minitab.

Proper histograms, with variable width intervals, can be drawn, and comparative box plots can be constructed. However, the other comparative graphs illustrated below are not possible at the moment, nor any form of sampling where you want to take all the variables for the cases in the sample.

This means that it is really not a suitable vehicle for undertaking an analysis of the types of data files provided by the Boards for coursework – for example the Mayfield High School data set provided by Edexcel has 27 variables for about 1200 students altogether. However, the next version of Autograph is planned to incorporate spreadsheet style entry, and this should make it much more flexible in its use.

Details can be found at www.autograph-maths.com.
Graphs showing comparative aspects of multi-variable data sets are shown in Figure 1.
These stacked or compound bar charts illustrate vividly the changes which have taken place over the past hundred years in terms of life expectancy.

Fig. 1 Stacked bar charts showing death rates over time

* You can download the original working MS Excel document here (30kb)

A hundred years ago there were very large numbers of infants who did not reach their first birthday, while this is very much the exception now. (Teachers should note that this sort of data would not be used in public examinations, where an individual might have experienced the death of an infant sibling. Awareness of the circumstances of pupils in their care would be a consideration if thinking of using such a data set as a teaching resource.)

This data set is taken from Fact File (2002) published by Carel, both in paper form and on CD-ROM; see http://www.carelpress.co.uk/ for details of the 2003 version. All the data sets in the paper file are available in electronic form on the CD-ROM, though they are not always in the most accessible form.

Carel also have sample pages from the 2002 version online at http://www.carelpress.co.uk/

The full data set has 13 age bands, which I have collapsed into six bands in the tables in order to simplify the graphical picture. While the age bands are not of equal size, they broadly represent infant, pre-school, child, young adult, middle age and old age in today’s society (Table 1).

Other graphical representations can highlight different aspects of the data, as shown in Figure 2.
By grouping the data by age band, it is very easy to see in Figure 2 what the changes over time have been for each band, and whether the changes in the first half and second half of the century follow similar patterns – so we see that early deaths have almost disappeared, while the numbers living to old age have increased dramatically during the 20th century.
The only group where the change in the second half of the century is not in the same direction as in the first half is the 45–74 age range, where the number increased and then decreased.

Fig. 2 Comparative bar charts showing the different time periods together for each of the age bands (without titles for graph and axes)

* You can download the original working MS Excel document here (15kb)

If the age bands are grouped together for each of the three time periods, as in Figure 3, we see more immediately how the pattern of ages at death has changed. Obviously, all the same information is contained in both charts, but presenting 6 groups of 3 bars, or 3 groups of 6 bars draws attention to different aspects as the most prominent feature.

Fig. 3 Comparative bar charts showing the different age bands together for the three periods (without titles for graph and axes)

* You can download the original working MS Excel document here (15kb)

The stacked bar chart is generally appropriate where the data is in order, so that as well as comparing the size of each section, the top of each section in Figure 1 can be compared across the different periods to see the cumulative comparisons.

Equivalent pie charts could be drawn for the other time periods. Where the classes are ordered as in Figure 4, I think the stacked bar chart is superior, because the order is more apparent there than in a pie chart, where the reader has to start at 12 o’clock and then follow the groups in order clockwise.

Fig. 4 Pie chart for the number of deaths in each age band in 1900–02

* You can download the original working MS Excel document here (25kb)

The other advantage of the stacked bar chart is that you have an option to show groups as percentages, so the total height of the three groups would be equal (and this is directly equivalent to what is shown by a pie chart) or, as I did above, to show the total frequencies.
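
The percentage option amounts to dividing each count by the total for its period. A minimal Python sketch of that arithmetic follows; the counts, band names and period labels are invented for illustration and are not the values in Table 1.

```python
# Hypothetical counts of deaths in three age bands for two periods.
# These numbers are invented; they are NOT the values in Table 1.
counts = {
    "Period A": {"young": 300, "middle": 200, "old": 500},
    "Period B": {"young": 50, "middle": 150, "old": 600},
}

def as_percentages(band_counts):
    """Convert a dict of band -> count into band -> percentage of the total."""
    total = sum(band_counts.values())
    return {band: 100 * n / total for band, n in band_counts.items()}

percentages = {period: as_percentages(bands) for period, bands in counts.items()}

for period, bands in percentages.items():
    print(period, {band: round(p, 1) for band, p in bands.items()})
```

Each period's percentages add up to 100, so the stacked bars all reach the same height and only the shape of the distribution is compared, not the totals.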

Table 1 is appropriate for drawing the stacked bar chart, and the comparative bar charts, but to draw a cumulative frequency graph it needs amendment, as shown in Table 2. In particular a decision needs to be made as to an appropriate upper limit for the ‘over 75’ group.

Table 1 Numbers of deaths per year for different age bands at three time periods

* You can download the original working table above in an  MS Excel document here (30kb)

I have chosen 100 as a reasonable value.
Figure 5 shows the cumulative number of deaths by age at the three time periods, and it is evident that there are substantial differences. However, I think comparative cumulative frequency graphs are really quite difficult to interpret correctly, even for adults, and students find it very difficult to be confident that they have the direction of comparisons the right way round. From these cumulative frequency graphs, or from the raw data, we can nevertheless draw box plots, which are very much easier to interpret.
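
As a rough sketch of the amendment needed for a cumulative frequency graph, the Python below turns banded frequencies into running totals plotted against the upper limit of each band. The band boundaries and frequencies are assumptions made up for illustration (they are not the contents of Table 1 or Table 2); only the upper limit of 100 for the open-ended top band follows the choice described above.

```python
from itertools import accumulate

# Assumed upper limit of each age band; 100 closes the open-ended top band.
upper_limits = [1, 5, 15, 45, 75, 100]

# Hypothetical numbers of deaths in each band (invented, not Table 1).
frequencies = [120, 60, 40, 180, 350, 250]

# Running totals: deaths at or below each upper limit.
cumulative = list(accumulate(frequencies))

# Points for the graph, starting from zero deaths at age 0.
points = [(0, 0)] + list(zip(upper_limits, cumulative))

for age, total in points:
    print(f"up to age {age:3d}: {total}")
```

If a different upper limit were chosen for the top band, only the position of the final point would move; the running totals themselves would be unchanged.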

Fig. 5 Overlaid cumulative frequency graphs of the ages at death in the three time periods

* You can download the original working MS Excel document here (34kb)

The graph in Figure 6 is produced in Autograph, and the cumulative frequency curves were used to get the three values needed for plotting the median and quartiles. The trend towards people living longer is immediately obvious from the comparative box plots in Figure 6. Box plots give a minimum of information, but where you are trying to identify trends this reduction in detail can be a great benefit.

Fig. 6 Comparative box plots for the age at death in the three time periods
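
The median and quartiles for a box plot can also be estimated numerically from cumulative frequencies. The sketch below uses straight-line interpolation between the plotted points, which is a simplification swapped in here and not the smoothed curves used for Figure 5, so the values it gives may differ slightly from readings taken off a curve. The points and the function name `value_at` are hypothetical.

```python
def value_at(points, target):
    """Estimate the age at which the cumulative frequency reaches `target`,
    using straight-line interpolation between neighbouring points."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("target lies outside the cumulative frequency range")

# Hypothetical (upper age limit, cumulative deaths) points, invented for illustration.
points = [(0, 0), (1, 120), (5, 180), (15, 220), (45, 400), (75, 750), (100, 1000)]
total = points[-1][1]

lower_quartile = value_at(points, total / 4)
median = value_at(points, total / 2)
upper_quartile = value_at(points, 3 * total / 4)

print(lower_quartile, round(median, 1), upper_quartile)
```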

The cumulative frequency graphs in Figure 5 are produced in Excel by plotting a scatter graph with data points joined by a smoothed line (one of the options available in the chart wizard for scatter graphs). The data for it is in a rectangular array (Table 2) because all three time periods had the data collected to the same age.

Table 2 Data in Table 1 in a form suitable for a cumulative frequency graph

* You can also download the original working table in an MS Excel document here (31kb)

If you have data belonging to two or more groups which you would like to draw on a scatter graph, this can be illustrated neatly by putting the different groups into different columns, so that Excel treats them as different series, and will use different symbols to plot them.

The data in Table 3 originated in three columns, with fuel type, age and mileage. To construct Table 3 the data was sorted by fuel type, and then the mileage values for the diesel cars were highlighted, then cut and pasted into the block of cells in the next column. When drawing the graph highlight just the values, but not the row of labels, so that you can choose what the series identifiers will be.
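
The same separation into one series per group can be sketched outside Excel. In the Python below the car records are invented for illustration (they are not the contents of Table 3); each fuel type ends up with its own list of (age, mileage) points, which plays the role of a separate column and so a separate plotting symbol.

```python
from collections import defaultdict

# Hypothetical (fuel type, age in years, mileage) records, invented for illustration.
cars = [
    ("petrol", 2, 21000),
    ("diesel", 3, 48000),
    ("petrol", 5, 52000),
    ("diesel", 1, 17000),
    ("petrol", 4, 39000),
]

# One list of (age, mileage) points per fuel type, like one column per series.
series = defaultdict(list)
for fuel, age, mileage in cars:
    series[fuel].append((age, mileage))

for fuel, points in sorted(series.items()):
    print(fuel, points)
```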

Table 3 Age and mileage of petrol and diesel cars

* You can also download the original working table in an  MS Excel document here  (11kb)

The facility to observe whether or not groups display different or similar characteristics goes to the heart of what candidates are being expected to do in the coursework – and again a graph like Figure 7 may make it easier for pupils to start talking about comparisons.

Fig. 7 Scatter graph showing petrol and diesel cars separately


* You can download the original working MS Excel document here (15kb)

Generating the data to display such graphs would be incredibly time consuming if you had to compile the comparative statistics ‘by hand’. However, Excel’s Pivot Table facility offers the opportunity to generate two way tables for large data sets with a lot of flexibility.
The examples use the CensusAtSchool data for the UK.

Mathematics in School, November 2003. The MA web site: www.m-a.org.uk

While in the spreadsheet, go to Data > PivotTable Report. Different versions of Excel use slightly different terminology, and produce slightly different looking dialog boxes at this stage; the one shown in Figure 8 is from Excel 97.

Fig. 8 Pivot Table dialog box in Excel 97

It is possible to put more than one ‘field’ or variable in the columns or rows, and get subdivisions into more categories, but the simplest structure is to use just one in each case. The default is for the field being analysed to be summarized by ‘Sum of ...’, but you can choose other summary statistics such as a count for categorical variables, the average or min/max or standard deviation, etc.

In this data set, a question as to whether the pupil owned a mobile phone was coded with 0 for No and 1 for Yes. So ‘average’ will actually report the proportion of pupils owning a mobile phone. Figure 9 shows the table produced for boys and girls separately in each year group, and the cells in the table have been formatted to show the results as a percentage to 1 decimal place.
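
To see why the average of a 0/1 code is a proportion, here is a minimal plain-Python stand-in for the two way table. It is not how Excel builds a Pivot Table, and the records are invented for illustration rather than taken from CensusAtSchool.

```python
from collections import defaultdict

# Hypothetical survey records: (year group, sex, owns_mobile coded 0 = No, 1 = Yes).
# Invented for illustration; these are not CensusAtSchool responses.
records = [
    (7, "boy", 1), (7, "boy", 0), (7, "girl", 1), (7, "girl", 1),
    (8, "boy", 1), (8, "boy", 1), (8, "girl", 0), (8, "girl", 1),
]

# Collect the 0/1 codes for each (year group, sex) cell of the two way table.
cells = defaultdict(list)
for year, sex, owns in records:
    cells[(year, sex)].append(owns)

# The mean of 0/1 codes is the proportion of 1s, i.e. the ownership rate.
rates = {cell: sum(codes) / len(codes) for cell, codes in cells.items()}

for (year, sex), rate in sorted(rates.items()):
    print(f"Year {year} {sex}s: {rate:.1%}")
```

Summing the codes counts the pupils who answered Yes, and dividing by the number of pupils in the cell gives the rate, which is then shown as a percentage to 1 decimal place.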

Fig. 9 Pivot Table showing mobile phone ownership rates

Figure 10 shows the comparative bar chart which will allow comparisons between year groups, and also between boys and girls to be seen.

Fig. 10 Chart of the Pivot Table summarizing the mobile phone ownership rates

* You can download the original working MS Excel document here (578kb)

Sources of Data

I will talk about three places where useful data sets might be accessed:
Libraries
Other curriculum areas
The Internet

The main libraries in most major towns have substantial reference sections, and generally the staff are more than keen to assist visitors to understand the range of material they store. Whatever your interests are, they are likely to be able to locate data relevant to it, and if you start by looking at data related to your own interests you are more likely to be able to shape it into something interesting for your classroom.

Academic subject areas such as biology, geography, home economics and physical education all make use of numerical data with some of the multi-variable characteristics we want to explore. Discussions with teachers from other subject areas may uncover opportunities for cross-curricular developments which could benefit learning in both subject areas. For example, a biology coursework investigation examined the effectiveness of an enzyme in breaking down protein at different temperatures. Protein stains such as tomato sauce, blood, etc. are difficult to get out of clothes, and ‘biological’ washing powders use enzymes to break down the protein structure to help the washing powder clean clothes of such stains.

The problem is made rather more complex by the other factors affecting the effectiveness which result in substantial variability – things such as the material which was stained, how old the stain was, etc. The comparative box plots shown in Figure 11, produced in MINITAB®, show the % mass loss when the enzymes were applied at a variety of temperatures, and it is clear that there is an optimum temperature, above which the enzyme is less effective – the greater the % mass loss, the more effective the enzyme is in breaking down the protein.

Fig. 11 Box plots illustrating the effectiveness of enzymes in breaking down protein at different temperatures

The data used here is from a student at Belfast Royal Academy, who did this experiment as an A level biology investigation. For the biologists, the focus of interest is on explaining why that should be, whereas the statistician’s interest is in establishing that it happens, and advising the manufacturer as to where the optimum performance can be obtained.

Examples of the sort of questions which could be posed to pupils with this data picture would be:

  1. What was the median % mass loss at 30°C?

  2. At what temperature was the enzyme most effective in breaking down protein?

  3. At what temperature was the mass loss most variable?

  4. Describe how the % mass lost changes as the temperature
    increases.

  5. Sharon says that she will have to put Jonny’s sweatshirt
    into a very hot wash because he has spilt so much
    tomato sauce on it. Is she correct?

From the graph the answers would be:
  1. About 12%.
  2. 60°C (where the box plot is generally higher than any
    other, although it overlaps both the 40° and 50°).

  3. 50°C (the box plot here is longer from top to bottom
    than anywhere else).

  4. It seems to increase as the temperature increases, but
    also gets more variable, up to 50° then at 60° it is at its
    highest, and is quite consistently high, but above 60° it
    becomes much less efficient quite quickly.

  5. No. We tend to think ‘hotter will be better’ for washing,
    but the evidence here is that the enzyme actually loses
    its effectiveness in very hot water, and so a medium
    temperature is better as well as less expensive!
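
The five values that sit behind each box in a display like Figure 11 can be computed directly. The readings below are invented for illustration and are not the student's data. Note also that packages use slightly different conventions for quartiles, so the quartiles from Python's `statistics.quantiles` (default settings assumed here) need not match those from MINITAB or a graphics calculator exactly.

```python
import statistics

# Hypothetical % mass loss readings at three temperatures
# (invented for illustration, not the student's data).
mass_loss = {
    30: [8, 10, 12, 13, 15],
    50: [5, 12, 20, 28, 35],
    70: [2, 3, 4, 5, 6],
}

def five_number_summary(data):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (min(data), q1, statistics.median(data), q3, max(data))

summaries = {temp: five_number_summary(values) for temp, values in mass_loss.items()}

for temp, summary in sorted(summaries.items()):
    print(f"{temp}°C:", summary)
```

Questions about the median, the most effective temperature and the most variable temperature are then comparisons of the middle value and of the spread between the smallest and largest values in each summary.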

Apart from data from other academic subjects we should be helping students to understand how data can inform our understanding of the world we live in. The death rates data set is one example of this, and there is also a lot of data available on a country by country basis – in almanacs, and from web sites such as The World Bank at http://www.worldbank.org/data/.
Here data can be accessed by country, or by topic, and the screen shot in Figure 12 shows some of the range of data which is stored.

Fig. 12 Screen shot of World Bank’s web site

The comparative box plots in Figure 13 were produced in MINITAB® again, but graphics calculators will also produce them, and software routines are becoming available to enable such graphs, and proper (variable width!) histograms, to be drawn in Excel and elsewhere.

Fig. 13 Infant death rates in different regions of the world

Fig. 14 National wealth in different regions of the world

The infant death rates are the number of deaths per 1000 births in the country. Comparing the two sets of box plots in Figures 13 and 14, we see that there are big differences between regions for both the national wealth and the infant death rates. Figure 15 plots a scatter diagram of the two variables against one another, for samples from each region.

Fig. 15 Infant death rate v national wealth, by region

By using different symbols to identify the region, we see some startling evidence about the divisions existing between different parts of the world. In the climate of international relations existing in the world today, it seems reasonable that we might consider some of the big citizenship issues around through analysing data sources such as these.

The most striking impressions are that wealthy countries always have low infant death rates and that poor countries have very variable rates. Obviously there are other important factors apart from wealth, and the clustering of symbols in areas of the graph suggests avenues which could be explored.

There is now an amazing amount of data available on the Internet, although a lot of it still needs a considerable amount of work to get it into a form which is usable in the classroom. Some places where data is available already in usable form are:

CensusAtSchool:
http://censusatschool.ntu.ac.uk/index.html
has a lot of data at an individual level, and some aggregated data, as well as activities based on the resources which include comparative data with South Africa, New Zealand and Australia.
The Royal Statistical Society Centre for Statistical Education has just published a Toolkit in Data Handling for Projects, which includes a specialist random data selector.

Commenting on it, the Centre said:
“This toolkit shows how you can enable students to learn data handling skills through collecting data using the international CensusAtSchool project, web site and resources. It helps them to produce useful projects with meaningful conclusions.” See www.censusatschool.ntu.ac.uk for more information about the project, its data and the Toolkit.

There are also collections of web data source references available. For example:

http://www.mis.coventry.ac.uk/%7Estyrrell/data.php
adopts a different approach. There is a discussion of the range of things you might be interested in, with web links embedded in the text which is very easy to read and follow.

http://www.argonet.co.uk/oundlesch/mlink.html#stats
is another collection of data sites by Oundle School, and the site has a lot of useful mathematics links also.

Kranat et al. (2001) pp. 262–266 lists some principles for guidance in using the Internet to find data, as well as sites under different headings. It is available to download on the web version, and is reproduced by kind permission of Oxford University Press.

One of the main sources now available is the National Statistics site at: http://www.statistics.gov.uk. The front page will vary in detail as it shows the new data sets becoming available, but the screen shot in Figure 16 shows what the structure is, with the ‘Select Theme’ drop down menu highlighted so you can see the range of themes which are available.

Fig. 16 National Statistics web site

There is a wealth of data available from this site, on almost every topic in which government has some involvement. You can also use this site to view or download local area statistics, in England and Wales only, for your ward or local authority on a wide range of subjects including population, crime, health and housing.

A lot of the data needs to be manipulated to make it usable in the classroom, but over time the data available here will provide a rich resource.

Summary

There is a scarcity of materials where pupils get the opportunity to reason with multi-variable sets of data. We also need to provide guidance as to the sort of language that is appropriate in interpretation. For example where a scatter graph shows strong correlation, we should caution against assuming a direct causal relationship without contextual reasons.

Where the scatter graph shows only moderate correlation, we know there must be other factors at work also. There is an opportunity now to use new technology to help overcome some of these problems. Many of these representations reveal themselves properly only in colour. Viewed in black and white, or even the two colour printing now in many textbooks, the relationships are much harder to visualize.

The increasing availability of projection facilities in classrooms means pupils should be able to experience a much wider variety of these displays, and respond to focussed questions about them. This might act as preparation for working in the more open-ended environment of a statistics project where they decide which graphical representations would be most appropriate.

These comparative graphs are not requirements of the new coursework, but they offer pupils a way into articulating comparisons which may be useful. However, all of this should not just be seen as a response to the demands of the new coursework. If we want our young people to grow into intelligent and critical consumers of data, we need to develop resources that offer them the opportunity to develop the necessary skills, and I believe that these skills are inherently worthwhile.

Acknowledgements

Portions of MINITAB Statistical Software output contained in this article are printed with permission of Minitab Inc.
MINITAB® is a registered trademark of Minitab Inc.
The author would like to thank Neville Davies and Douglas Butler for their constructive comments on a draft of this article.

References

Kranat, J., Housden, B. and Nicholson, J.R. 2001 Statistics GCSE for AQA, Oxford University Press, Oxford.

Keywords: Data interpretation; Graphs; Statistics coursework.

Author
James Nicholson, Belfast Royal Academy, Cliftonville Road, Belfast BT14 6JL.

e-mail: j.r.nicholson@ntlworld.com


Use of the Internet

The Internet is a vast resource available, mostly free of charge, to everyone, but its very size can be very frustrating – how do you find the information you want, and in a useable form?

  • If your search is too specific, and doesn’t use exactly the keywords the site creators
    thought of, then you may not turn up relevant sites.
  • If your search is too general you can turn up literally millions of sites, and you can’t
    find the most relevant ones.

Since the Internet is literally growing and changing day by day, there is no
guarantee that a site that is here today, will still be here tomorrow.

*You will find lots of data which you might like to work with, which will take a lot of effort to get into a form you can use.

For that reason, this section will give some principles to guide you in
using the Internet, as well as identifying the ‘best sites’ currently available.

Really, the opportunities are endless, but it will help if you have a good idea
of what information you want before you start.

The information here is given in four categories:

  1. Established organisations
  2. Data warehouses
  3. Specialised sites
  4. Miscellaneous


1 Established organisations

*If the actual link is not still live you can always start at the home site.

If someone is doing the hard work of periodically updating information in a field you are interested in, it can save you a huge amount of time, and frustrated effort. Here are some examples which are likely to exist for a long time:

2 Data warehouses

Data warehouses collect a lot of information on related matters together in one place. Here are some examples:

  • The World Bank has a lot of data which is relevant to a number of curriculum study areas which use data, such as Geography, Economics, Sociology etc. The country tables provide data on more than 50 indicators for 206 of the world's countries. Other links take you to multiple data sets relating to issues such as AIDS and climate change. A good starting point for this currently is: http://www.worldbank.org/html/schools/data.htm
  • Unicef have another large collection of data, specifically about women and children in countries around the world. Go to http://www.unicef.org/statistics/index.html
  • The Broadcasters’ Audience Research Board stores data on TV viewing
    figures, which are available on a monthly and weekly basis, and
    on a national or a regional basis. Go to http://www.barb.co.uk
    Currently you have to ‘register’ by entering some contact details, but
    there is no fee for accessing the data on their site.
    * BARB do have much more information which they will sell to businesses wishing to analyse viewing habits in order to target advertising efficiently, or schedule programming.
  • The Journal of Statistics Education is an online journal which is published
    free. The archives are available freely and one of its regular features is datasets, which come with explanations of the context. Go to http://www.amstat.org/publications/jse

3 Specialised sites

These sites give information about a specific subject, or a particular company.

  • The Automobile Association. Go to http://www.theaa.com for information about all sorts of things related to motoring – you can access information about prices of second hand cars, insurance and so on.
  • The National Lottery. Go to http://lottery.merseyworld.com to get information about the winning numbers, the number of jackpot winners in each draw, the ticket sales etc. – all sorts of related data sets which provide opportunities for statistical investigations.
  • Polling and market research organisations are in the data business, so as you might expect their web sites offer a lot of data! Go to http://www.gallup.com or http://www.mori.com or http://www.nop.co.uk for archived data on surveys on just about any topic.
  • The Government. Go to http://www.statistics.gov.uk
    This gives you access to a lot of national and regional statistics on themes such as crime and justice, health and care, the economy, education, transport and many others. There are datasets here in Excel and in csv format, either of which you can save on disk. The csv format is useful if you want to use a spreadsheet other than Excel.
    *To open a csv file in Excel:
    In the Files of type menu of the Open dialog, select All Files, go to the folder you have saved the file into, and then select the file.

The screenshot shows what the Excel sheet looks like when it opens.
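
Because csv is plain text, the same file can also be read without a spreadsheet. The sketch below uses Python's standard `csv` module on a tiny stand-in for a downloaded file; the column names and values are invented for illustration, and a real download would have its own headings.

```python
import csv
import io

# A tiny stand-in for a downloaded csv file; columns and values are invented.
csv_text = "region,year,value\nNorth,2001,12.5\nSouth,2001,9.8\n"

# csv.DictReader uses the first row as the column headings.
# For a real file, replace io.StringIO(csv_text) with
# open("yourfile.csv", newline="") (the file name here is hypothetical).
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values are read as text, so convert the numeric column before calculating.
values = [float(row["value"]) for row in rows]

print(rows[0]["region"], values)
```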

The Schools Census. Go to http://censusatschool.ntu.ac.uk/default.asp and a screen like this should appear:


This is a site with a lot of information collected about school pupils in the United Kingdom, linked to the (adult) National Census in April 2001. It has datasets available for download already in Excel.

Downloading the Town Excel file gives you access to this information:

4 Miscellaneous

You can search the web for hobbies and sports clubs – use a search engine to find the sites of your favourite teams, or the sites of the national organisation for your favourite sport, and explore the links available. Match programmes, or advertising material will also carry the web address.

  • The Guinness Book of Records has an online version available at http://www.guinnessworldrecords.com/home.asp where you can get statistical information about the smallest, biggest, fastest everything in the world.
  • Another similar site is at http://www.dartmouth.edu/~chance/ (follow the link to Chance News, which contains archived material back to 1992). The material here tends to be US orientated, but the content is very good if you are interested in bigger statistical questions and issues.
  • This site takes you through setting up your own opinion poll step by step:
    http://www.opinionpower.com