Solve blog

Automatically counting crowds at Mona Foma using computer vision

Harvey Nguyen 
Brenton Crawford


The Museum of Old and New Art (Mona), located in Hobart, Tasmania, is the largest privately funded art gallery in the southern hemisphere. In addition to the gallery itself, Mona hosts a number of events and two annual music and arts festivals, Dark Mofo and Mona Foma.

These events take place in public spaces throughout Hobart and Launceston. The bigger public spaces are sometimes not suitable for Mona’s usual ticketing infrastructure, which makes it difficult for organisers to accurately determine attendance figures at some of the large-scale events. 


Our brief was to take photographs captured on a variety of Mona staff smartphones and cameras and provide an automated solution, delivered through a web app, that could accurately estimate the size of the crowd.

Crowd counting/estimation is a well-studied field in computer vision and machine learning, with many great algorithms and libraries to choose from and experiment with.

Finding the right model

Based on the types of crowd images we were expecting, we needed a model that could work well with:

  • crowds of between 500 and 3000 people,
  • densely packed crowds, often with patchy distribution,
  • partial to significant occlusion (only parts of a person visible in the crowd), and
  • images with significant variation in terms of perspective, resolution and lighting conditions.

Crowd image test datasets

To establish which model would be most appropriate, we needed a dataset to test candidate models against. Several were available (Figure 1), but the one we deemed most appropriate was the UCF_QNRF dataset (Idrees et al., 2018). Introduced in 2018, UCF_QNRF most closely matched what we expected the Mona Foma model to encounter in production: varied lighting, varied perspective and crowds with patchy distribution (Figure 2).

Figure 1 Datasets comparison

Figure 2 Example crowd images from UCF-QNRF – A Large Crowd Counting Data Set

The model

Based on the low perspectives of the crowd shots and the significant occlusion of heads and bodies (essentially people blocking a full view of the people behind them), we determined that the most suitable crowd counting model would be a density estimation-based method. Density map estimation methods estimate the crowd count by predicting the crowd density map of an image rather than explicitly detecting each person. The count is then obtained by summing the predicted density over the whole image.
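To make the density map idea concrete, here is a minimal sketch of how a density map relates to a count. It is a generic illustration, not the training procedure of any model discussed here: the head positions, image size and the sigma value are made-up assumptions, and NumPy is assumed to be installed. Each annotated head is spread into a small Gaussian that sums to 1, so summing the whole map recovers the number of people.

```python
import numpy as np


def density_map(points, shape, sigma=4.0):
    """Build a density map with one normalised Gaussian per head position."""
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros(shape, dtype=np.float64)
    for row, col in points:
        kernel = np.exp(-((ys - row) ** 2 + (xs - col) ** 2) / (2 * sigma ** 2))
        # Normalise so that each person contributes exactly 1 to the total.
        density += kernel / kernel.sum()
    return density


# Hypothetical (row, col) head annotations on a small 48x64 image.
heads = [(10, 12), (11, 15), (30, 40)]
dmap = density_map(heads, shape=(48, 64))

# The crowd count is the sum (integral) of the density map.
estimated_count = dmap.sum()
print(f"estimated count: {estimated_count:.2f}")
```

A trained network predicts a map like `dmap` directly from pixels, which is why partially hidden people can still contribute to the total without being detected individually.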

After trialling two models, CSRNet (Li et al., 2018) and SPANet (Cheng et al., 2019), we settled on a third: Bayesian Loss for Crowd Count Estimation with Point Supervision (Ma et al., 2019).

Published in August 2019, this model was new at the time of our work and had the best reported performance on the UCF_QNRF dataset (Figure 3).

Figure 3 Bayesian Loss for Crowd Count Estimation with Point Supervision (BL) has the best performance on UCF_QNRF, by a wide margin over the second- and third-placed models

Validation on Mona Foma dataset

To validate the accuracy of the model in the field, a series of images of the same crowd was taken from different distances and perspectives within a short period of time (Figure 4). A manual count of image A was then performed for comparison. This manual count gave a crowd size of 1083, with an estimated error of ±50 because some people are occluded or obstructed from view and deciding whether to count them is subjective. The manual count was difficult to undertake and time consuming, taking approximately one hour.

Overall, the results on the test images were impressive, with counts of 1022, 1147, 1054 and 1061 returned for images A to D respectively (Figure 4). The average of these counts was 1071 (12 fewer than the manual count).

Because of the occlusion problem with images such as these, we think that taking several images from different angles and orientations of the crowd and averaging these counts should yield the most accurate result.
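The arithmetic behind this comparison can be checked with a few lines of Python. The counts and the ±50 manual error are the figures reported above; the variable names are ours and purely illustrative.

```python
from statistics import mean

# Model counts for images A to D, as reported above.
model_counts = {"A": 1022, "B": 1147, "C": 1054, "D": 1061}
manual_count = 1083   # manual count of image A
manual_error = 50     # estimated uncertainty of the manual count

average = mean(model_counts.values())
difference = manual_count - average
relative_difference = difference / manual_count
within_manual_error = abs(difference) <= manual_error

print(f"average of model counts: {average}")
print(f"difference from manual count: {difference}")
print(f"relative difference: {relative_difference:.1%}")
print(f"within manual error band: {within_manual_error}")
```

The averaged estimate differs from the manual count by about 1.1%, well inside the ±50 uncertainty of the manual count itself, whereas the individual images range from 61 below to 64 above it.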

Figure 4 Four images of a Mona Foma 2020 event taken from different distances and perspectives. Image A was manually counted, giving a count of 1083


The Bayesian Loss model was straightforward to implement, as a model pre-trained on UCF_QNRF was publicly available from the authors.

However, the solution needed to be provided to the customer as a web-based application. Solve built a fast, lightweight web application using the Flask framework and hosted it on an AWS EC2 server via Elastic Beanstalk. It is worth noting that, after some trial and error, we discovered that a 32 GB instance was required to process high-resolution images of some larger scenes (photos can be as large as 3000×4000 pixels).
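A rough back-of-envelope calculation suggests why full-resolution photos are memory hungry. The sketch below is only an estimate under stated assumptions, not a measurement of our deployment: it assumes 32-bit floats and a 64-channel feature map at full image resolution, as produced by the first convolutional layers of VGG-style networks. The helper name `tensor_bytes` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: activations stored as 32-bit floats


def tensor_bytes(channels, height, width):
    """Memory needed for one dense float32 tensor of the given shape."""
    return channels * height * width * BYTES_PER_FLOAT32


height, width = 3000, 4000  # largest photo resolution mentioned above

# The RGB input image itself once converted to float32.
input_bytes = tensor_bytes(3, height, width)

# Assumption: one 64-channel feature map at full resolution.
feature_bytes = tensor_bytes(64, height, width)

print(f"input image:        {input_bytes / 1e9:.3f} GB")
print(f"one 64-channel map: {feature_bytes / 1e9:.3f} GB")
```

Under these assumptions a single early feature map already occupies about 3 GB, and a deep network holds several such tensors at once, so memory use climbs quickly with image resolution.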

Mona staff successfully used the web application (Figure 5) to accurately estimate crowds at select large-scale events during Mona Foma 2020 where their usual ticketing or manual methods could not be used effectively.


H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, M. Shah (2018). Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In Proceedings of IEEE European Conference on Computer Vision (ECCV 2018), Munich, Germany, September 8-14, 2018.
Z.-Q. Cheng, J.-X. Li, Q. Dai, X. Wu, A. Hauptmann (2019). Learning Spatial Awareness to Improve Crowd Counting (SPANet).
Z. Ma, X. Wei, X. Hong, Y. Gong (2019). Bayesian Loss for Crowd Count Estimation with Point Supervision.
Y. Li, X. Zhang, D. Chen (2018). CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes.

Figure 5 Visualisation of the resulting crowd counting web application