Mearka - Architecting and evaluation of a Sports Video Tagging Software Toolkit

Tagging involves annotating metadata for specific video sequences and events, and this tagged metadata is subsequently used in the causal analysis process.

Background and Motivation

Here it is possible to get “the best of both worlds”, since MSE is only used as long as the error is less than 𝛿, and MAE otherwise. Each sub-part of the figure is indicated by (.), for example the Mearka app in Figure 5.1 is indicated by (1.2).
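The 𝛿-thresholded loss mentioned above reads like the Huber loss. The following is a minimal sketch under that assumption; the function name, the default 𝛿 of 1.0, and the scaling constants are the common textbook form and are not taken from the thesis.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss over two equally long sequences of numbers.

    Assumption: the standard textbook form, quadratic (MSE-like) for
    errors up to delta and linear (MAE-like) beyond it.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        err = abs(truth - pred)
        if err <= delta:
            total += 0.5 * err ** 2               # quadratic region
        else:
            total += delta * (err - 0.5 * delta)  # linear region
    return total / len(y_true)
```

At an error of exactly 𝛿 both branches give the same value, so the loss is continuous across the threshold.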

Mearka Problem Definition

Methods

The engineer is expected to repeat these steps, for example, when tests show that the system does not meet the specified requirements. Finally, the system will be evaluated by experiments to determine whether the POC meets the mentioned requirements.

Scope and Limitations

This thesis is rooted in the design paradigm, as the requirement and specification are derived from the problem definition and the application domain.

Context

This study presents their experiences of using radio-based wearable positioning data systems in elite football clubs (ZXY [6]). This research is just a fraction of what the CSG at UiT, the Arctic University of Norway, has conducted over the years.

Outline

Mearka will be designed and developed in the context of the research already done by CSG on distributed systems, to work with video as a source, and to potentially detect positions to quantify soccer performance that can be used in analytics. These resources can be represented using Uniform Resource Identifiers (URIs), unique addresses used to identify them.

JavaScript Object Notation (JSON)

Because of the JSON structure, this data can be parsed and each field can be accessed based on the keys. Since it is treated as an object, one only needs to access the "eventType" attribute on that object to get the attached value, in this case "Pass".
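A minimal sketch of this kind of key-based access in Python is shown here. Only the "eventType" key and the value "Pass" come from the text; the "offset" field and the second event are hypothetical and exist only to make the example self-contained.

```python
import json

# Hypothetical tag list; only "eventType": "Pass" is taken from the text.
raw = '[{"eventType": "Pass", "offset": 12.4}, {"eventType": "Shot", "offset": 47.0}]'

events = json.loads(raw)  # parsed into a list of dictionaries


def event_types(parsed_events):
    """Return the value stored under "eventType" for every event."""
    return [event["eventType"] for event in parsed_events]


# For-loop that prints the event type of each tag.
for event in events:
    print(event["eventType"])
```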

FFMPEG

Machine Learning

Supervised Learning
Unsupervised Learning
Reinforcement Learning
OpenCV
CVlib
YOLOv4

The model is trained on the training data and gradually adjusts its weights using something like equation 2.1, based on how far its output is from the labeled truth. MSE is a good loss function to use if it is important that the model produces no outlier predictions with large errors, because squaring the error penalizes large deviations more heavily than small ones.
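The effect of outliers on MSE can be illustrated with a small numeric sketch. The values below are invented for illustration: both prediction sets have the same mean absolute error, but the one containing a single large error has a much higher MSE.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented example values.
truth = [10.0, 10.0, 10.0, 10.0]
steady = [12.0, 12.0, 12.0, 12.0]   # every prediction is off by 2
outlier = [10.0, 10.0, 10.0, 18.0]  # one prediction is off by 8

print(mae(truth, steady), mae(truth, outlier))  # same MAE for both
print(mse(truth, steady), mse(truth, outlier))  # MSE is larger for the outlier case
```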

Figure 2.3: Example machine learning model with steps and neurons

Related Work

Muithu

Bagadus

Bagadus extends this with a static camera array that provides a panorama and adds positional data through ZXY tracking sensors [6], as well as an analytical subsystem that allows the user to pan around the panorama. Mearka aims to have a similarly intuitive app that the user can use to tag events as they happen during the game.

Summary

In addition, Bagadus has a sensor subsystem that tracks the location of each player on the field using ZXY tracking sensors [6]. Muithu has an intuitive app that allows the user to tag events during a game, as well as record gameplay with consumer cameras that sync with post-game tags.

Functional

Based on the goals stated in section 1.2, we have specified a set of requirements that the system must fulfill in order to achieve the goal. Data deletion: The user owns their data, so Mearka must make it easy to delete any user data that the system has temporarily stored.

Non-functional

Output: The output of Mearka after tracking positions should be soccer metadata that allows the user to have information about the positions of players at any given time in the video. The output should be at least pixel positions in the frame as illustrated by figure 3.1, with x,y coordinates for player positions at time t.
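One way such output could be queried is sketched here. The layout (a list of entries sorted by time, each with an "offset" in seconds and a list of "positions" holding x,y pixel coordinates) and the function name are assumptions made for illustration, not the thesis's actual schema.

```python
from bisect import bisect_left


def positions_at(metadata, t):
    """Return the entry whose offset is closest to time t (in seconds).

    Assumes `metadata` is non-empty and sorted by ascending "offset".
    """
    offsets = [entry["offset"] for entry in metadata]
    i = bisect_left(offsets, t)
    # Only the neighbours on either side of t can be the closest entry.
    candidates = metadata[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda entry: abs(entry["offset"] - t))


# Hypothetical metadata for three frames of a 25 FPS video.
example = [
    {"offset": 0.00, "positions": [{"x": 100, "y": 200}]},
    {"offset": 0.04, "positions": [{"x": 102, "y": 201}]},
    {"offset": 0.08, "positions": [{"x": 105, "y": 203}]},
]
```

With this layout, asking for the positions at any time t in the video becomes a binary search rather than a scan of every frame.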

End user interactions

Summary

This means that it should be possible to download useful football metadata from the system within 12 hours. One-click export: Because the system should be easy to use, the resulting football metadata that Mearka generates should be easy to download once generated.

Figure 3.1: Example position detection x,y at time t.

Choosing A Camera For Development

Requirements

It should be possible to set up the camera, start recording and not worry about it until the session ends. Operation: The camera must be easy to use and operate, regardless of the user's technical level.

Options

However, an advertised battery life of 82 minutes is less than the required minimum of 105 minutes of battery life. The regular 4K sensor has an advertised battery life of 75 minutes, so it does not qualify either.

Mearka-App

System Design

When the Mearka app starts recording, it sends a notification to the backend that it wants to start recording. If the backend is working and responding as expected, the Mearka app allows the user to start tagging.

User Interface

Pressing Stop Recording notifies the backend that the session has ended and prompts the backend to send any soccer metadata it has, tags and otherwise, back to the Mearka app. Upon receiving the metadata, the Mearka app prompts the user to share it wherever needed.

Mearka Web-Interface

User Interface

This thesis focuses on the position detection functionality, and therefore the essential part of the user interface shown in Figure 4.4 is the button to upload a video. When the button is pressed, a dialog box opens and the user can select one or more videos to upload, as Figure 4.5 illustrates.

Backend

REST API
Soccer Metadata
Mearka-app
Mearka Web-interface
Position Detection Component Communication

This folder is shared between the backend server and Mearka's position detection component. After the position detection component finishes detecting positions, it sends the positions file to the backend.

Position Detection Component

Concatenate Video

Once this file is created, FFMPEG is run with the appropriate command on the operating system and concatenates the videos listed in the file into a new video stored in the UUID folder. For this script to be useful, the videos must be named in ascending, alphabetical or numerical order.
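The list file and the command can be sketched as follows. This assumes FFmpeg's concat demuxer with stream copy; the text does not state which FFmpeg options Mearka actually uses, and the function names are hypothetical.

```python
from pathlib import Path


def write_concat_list(video_names, list_path):
    """Write a concat-demuxer list file with one `file '...'` line per video.

    Names are sorted lexicographically, so "10.mp4" sorts before "2.mp4";
    this is why the naming order of the uploaded videos matters.
    File names containing single quotes would need escaping (not handled).
    """
    lines = [f"file '{name}'" for name in sorted(video_names)]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def concat_command(list_path, output_path):
    """Build the FFmpeg command line (assumed options, stream copy)."""
    return [
        "ffmpeg",
        "-f", "concat",   # use the concat demuxer
        "-safe", "0",     # accept paths the demuxer would otherwise reject
        "-i", str(list_path),
        "-c", "copy",     # copy streams without re-encoding
        str(output_path),
    ]
```

The resulting argument list could then be passed to `subprocess.run(..., check=True)`. Relative paths in the list file are resolved relative to the list file's own directory.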

Position Detection

For each frame, the time offset in the video is calculated and stored together with a list of pixel positions. CVlib returns a list of labels describing what it found, a list of pixel positions, and a list of confidence levels.
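Given three parallel lists like those described above, keeping only the detected people could look like the sketch below. It does not call CVlib itself. The [x1, y1, x2, y2] box layout, the "person" label string, the confidence threshold, and the choice of the box center as the pixel position are all assumptions.

```python
def person_positions(boxes, labels, confidences, min_confidence=0.5):
    """Return one {"x", "y"} pixel position per detected person.

    Assumptions: boxes are [x1, y1, x2, y2] corner coordinates and the
    position of a person is taken to be the center of the bounding box.
    """
    people = []
    for box, label, confidence in zip(boxes, labels, confidences):
        if label != "person" or confidence < min_confidence:
            continue  # skip other objects and low-confidence detections
        x1, y1, x2, y2 = box
        people.append({"x": (x1 + x2) // 2, "y": (y1 + y2) // 2})
    return people
```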

Position Detection Component Server

This information is used to calculate the offset in the video where the positions are found, using equation 4.1. The detected labels are iterated, and if people have been located, their positions are stored in a position object along with the time offset in the video.
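Equation 4.1 is not reproduced in this section. The sketch below assumes the common form in which the offset in seconds is the frame index divided by the frame rate. The field names of the position object are hypothetical.

```python
def frame_offset(frame_index, fps):
    """Time offset in seconds of a frame (assumed form of equation 4.1)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_index / fps


def position_object(frame_index, fps, people):
    """Bundle the positions found in one frame with its time offset."""
    return {"offset": frame_offset(frame_index, fps), "positions": people}
```

For a 25 FPS video, frame 50 would map to an offset of 2.0 seconds under this assumption.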

Summary

There are two separate REST APIs on the back end, one that communicates with the front end (2.1), as well as one for internal communication with the position detection component (2.2). Flask [69] is used to run the position detection server, implementing a small REST API (3.1) that enables communication with the backend.

Web

Api calls

If not, it obtains the UUID by requesting it from the backend via a GET request to the "/utils/get-uuid" endpoint. When the videos are in the backend, the backend makes sure that the multiple video segments are concatenated into one video and then starts position detection.

Mearka-App

When you upload a video or several video segments to the backend, the frontend counts how many files are being sent. Once the position detection is complete, it is possible to download the available football metadata from the back end.

Backend

Video concatenation is performed when the user uploads multiple videos from the front end to the back end. The backend receives the video and stores it on a storage volume shared between the backend and the position detection component.

Position Detection Component

Concatenate video

Filtering is done by iterating through the list returned from the OS library and adding each file with the extension ".mov" or ".mp4" to a new list. FFMPEG is then invoked through its CLI to create a new video with the same extension (".mp4" or ".mov"), which contains the content of the videos sent by the user.
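The filtering step can be sketched as follows. Lower-casing the extension before comparing is an assumption added here so that files such as "clip.MOV" are also accepted; the function name is hypothetical.

```python
import os

VIDEO_EXTENSIONS = (".mov", ".mp4")


def video_files(names):
    """Keep only names whose extension is .mov or .mp4, in the given order."""
    videos = []
    for name in names:
        extension = os.path.splitext(name)[1].lower()
        if extension in VIDEO_EXTENSIONS:
            videos.append(name)
    return videos
```

A typical call would be `video_files(os.listdir(directory))`, where `directory` is the folder holding the uploaded segments.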

Position detection

This script takes a path to the directory containing the video files as a command line argument from the server. Once the video concatenation is complete, the new file name is sent back to the backend. section 3), even if the positions are pixel coordinates in the frame.

Summary

Battery life

A requirement for the camera is that it be able to record for more than 105 minutes. From the graph in Figure 6.3, it can be concluded that recording with the screen off will increase the overall recording time of the camera.

Figure 6.1: Recording times with screen on.

File size

If it is important to have the longest recording time possible, then “1080p25” is the best option. It doubles the resolution while still providing ample battery life and has the second best recording time to file size ratio.

Position detection

Test system
Resolution speed
Framerate Speed
Detection Accuracy

The result of this experiment is shown in Figure 6.6, and more detailed figures are listed in Table 6.4. The box plot in Figure 6.6 illustrates where the median, mean, and 25th and 75th percentiles lie.

Figure 6.6: Position detection time on 30s video

Metadata Size

Having opening and closing tags for each element in the metadata means more characters need to be stored to convey the same information, compared to JSON, which is illustrated in 5.8. The table illustrates the file size difference between JSON and XML of metadata created from the same 30 and 60 second video clips.
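The overhead of opening and closing tags can be checked with a small sketch that serializes the same data both ways. The field and element names are hypothetical and do not reproduce the thesis's metadata schema, so absolute sizes will differ from the table.

```python
import json
import xml.etree.ElementTree as ET


def to_json(entries):
    """Serialize the entries as compact JSON."""
    return json.dumps(entries, separators=(",", ":"))


def to_xml(entries):
    """Serialize the same entries as XML, one element per field."""
    root = ET.Element("metadata")
    for entry in entries:
        node = ET.SubElement(root, "entry")
        ET.SubElement(node, "offset").text = str(entry["offset"])
        for position in entry["positions"]:
            pos_node = ET.SubElement(node, "position")
            ET.SubElement(pos_node, "x").text = str(position["x"])
            ET.SubElement(pos_node, "y").text = str(position["y"])
    return ET.tostring(root, encoding="unicode")


# Hypothetical metadata: one frame with two player positions.
sample = [{"offset": 0.0, "positions": [{"x": 100, "y": 200}, {"x": 340, "y": 95}]}]

print(len(to_json(sample)), len(to_xml(sample)))  # character counts of each format
```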

Speedup

Summary

Unlike the web interface, the Mearka app does not need to request a UUID from the backend when you start a recording. The Mearka app has separate endpoints in the backend and the backend assumes a new recording means a new match or training session.

Mearka Web-interface

There were two options for the video player on the web interface: play the video directly from the frontend, or send the video from the backend to play in the web interface. A third option could be to play the files as a playlist, but this functionality was not implemented, in order to reduce the complexity of the web interface.

Backend

All soccer metadata found on the backend is stored in memory and not in persistent storage. Consequently, all soccer metadata is lost if the backend shuts down or crashes.

Position Detection Component

It is possible to develop a football tagging system based on inexpensive, commercial off-the-shelf components. Commercial off-the-shelf (COTS) components: Mearka needs to be as cheap as possible, so it is implemented using COTS components.

Summary

Future Work

Streaming
Tracking
Extend Tagging Option
Translate pixel-positions to real world positions
Video queue
Possible Real-Time

“Soccer video and player position dataset.” In: Proceedings of the 5th ACM Multimedia Systems Conference.
“Supervised Machine Learning: A Survey.” In: 2021 4th International Conference on Advanced Communication Technologies and Networks (CommNet).

Figure 8.1: Example: Pixel-coordinate translated to field-positions

Example JSON

For-loop that prints the event type

Example machine learning model with steps and neurons

Example position detection x,y at time t

Example of positions at time t

Mearka data and communication illustration

App flowchart

Mearka-app UI overview

The backend and the position detection component each expose a REST API that can be reached by other components in the system. In case the user uploads multiple files, the backend asks the position detection component to merge the videos.

Web UI

Web UI - upload multiple files

Web UI - confirm send to backend

Web UI - extracting-positions

Web UI - remove data or export metadata

Backend API endpoints

Data flow when using the Position Detection component

Component system overview

React typescript component example

Use a React Typescript component

React Native example

Setup backend endpoints using Gin for Golang

Example: soccer metadata used to know positions over time.

List of Python libraries used

Machine Learning endpoints

Example positional object for one offset

Example list of positional objects

Example JSON metadata

Recording times with screen on

Recording times with screen off

File sizes between resolutions and framerates

Recording time to file size ratio

Position detection time on 30s video

Accuracy example

Position detection accuracy

Alternative camera angle

Number of misclassifications within six frames

Example XML Metadata

Example: Pixel-coordinate translated to field-positions

Progressive image

Interlaced video overview

Settings combination for camera and record options

Resolution pixel differences

Example: compare total pixels/second for 1512p25 and 1080p60.

Time difference between 25 and 60 FPS

Resolution accuracy

Size difference between JSON and XML