
6.2 Dataset description

In the document PhD Dissertation - Department of Mathematics (pages 110-113)

We start by providing a practical dataset description and afterwards provide a formal description. The purpose of the formal description is straightforward: it gives an abstract formulation of the framework, containing only the necessary assumptions, in order to simplify and clarify the structural assumptions on the dataset. The aim is to make our results easy to understand and replicate.

6.2.1 Specific dataset description

Our dataset consists of electronic health records (EHRs) from 169 patients.

These were collected at Diagnostisk Center (DC), Regionshospitalet Silkeborg, the diagnostic unit at the regional hospital in Silkeborg, Denmark. Our patients are typically chronically ill and often suffer from chronic pulmonary disease alongside several other ailments, including high blood pressure and high cholesterol, for which they are often under treatment. It is thus important to perceive these patients as multi-sick, i.e. suffering from several diseases/ailments concurrently, which complicates coherent symptoms and successful diagnosis. Adding to this, doctors personalize treatment with their professional, but subjective, judgment, which further increases data heterogeneity. For each patient, the EHR is comprised of entries from a number of diverse sources (e.g. blood samples, general practitioner visits, medications, surgeries), and as such the data appears highly heterogeneous.

Patient ID   Date (YYYY-MM-DD)   Data source    Event name
6            2010-12-18          Pharmacy       Mandolgin
6            2014-11-29          Blood Sample   P-Natrium
15           2015-09-05          Procedure      EKG
39           2016-09-02          Pharmacy       Hjerdyl
...          ...                 ...            ...

Table 6.1: EHR sample entries.

The records are divided into three groups according to the final diagnosis from DC, namely lung cancer, colon cancer and arthritis. In this setting, the arthritis patients act as a control group. EHR entries from two years prior to DC referral until seven days prior to DC referral are included to emulate the diagnostic time frame of a general practitioner. The EHR entries consist of the following variables: anonymous patient ID, data source, event and event date, as illustrated in Table 6.1. The data source variable reflects the database from which the event in the EHR was pulled; this could for example be a pharmaceutical database or a blood sample database. We note that many events occur in batches: for example, a single blood sample may be used to perform eight blood tests, which results in eight entries in the electronic health record.
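As a concrete illustration of this structure, entries like those in Table 6.1 can be grouped into per-patient, date-ordered event sequences. This is a minimal sketch with hypothetical helper names; the actual preprocessing code is not part of the dissertation.

```python
from collections import defaultdict

# (patient_id, date "YYYY-MM-DD", data_source, event_name), as in Table 6.1
entries = [
    (6, "2014-11-29", "Blood Sample", "P-Natrium"),
    (6, "2010-12-18", "Pharmacy", "Mandolgin"),
    (15, "2015-09-05", "Procedure", "EKG"),
    (39, "2016-09-02", "Pharmacy", "Hjerdyl"),
]

def build_sequences(entries):
    """Group entries by patient and sort each patient's events by date."""
    per_patient = defaultdict(list)
    for pid, date, source, event in entries:
        per_patient[pid].append((date, source, event))
    # ISO-formatted dates sort correctly as strings
    return {pid: [e for _, _, e in sorted(rows)]
            for pid, rows in per_patient.items()}

sequences = build_sequences(entries)
print(sequences[6])  # ['Mandolgin', 'P-Natrium']
```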

The timestamps in each patient record have been shifted by a random time to de-identify the data while preserving the sequential and temporal structure. A number of summary statistics for the dataset are presented in Tables 6.2-6.4.
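A minimal sketch of such a de-identifying shift, assuming one random offset per patient record; the dissertation does not specify the offset distribution, so the range below is illustrative only.

```python
import random
from datetime import date, timedelta

def shift_record(dates, rng=random):
    """Shift all dates in one patient record by the same random offset,
    preserving event order and the gaps between events."""
    offset = timedelta(days=rng.randint(-365, 365))  # assumed range
    return [d + offset for d in dates]

record = [date(2010, 12, 18), date(2014, 11, 29)]
shifted = shift_record(record)
# Order and inter-event gaps are unchanged by a common shift
assert shifted[1] - shifted[0] == record[1] - record[0]
```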

Level                      Feature / group            Count
Patient-level statistics   No. of patients            168
                           Arthritis                  74
                           Colon cancer               38
                           Lung cancer                56
Event-level statistics     No. of unique events       1141
                           Most common event count    4441
                           Average event count        82

Table 6.2: SSI dataset statistics.

Feature / group            Value
Total no. of entries       93405
  Arthritis                36088
  Colon cancer             20799
  Lung cancer              36518
Max. no. in one EHR        4021
Min. no. in one EHR        36
Average no. of entries     556
  Arthritis                488
  Colon cancer             547
  Lung cancer              652

Table 6.3: Entry-level statistics.

6.2.2 Formal dataset description

Formally, the dataset can be described in the following way. We consider an unordered collection of sequences,

S = [s_1, s_2, ..., s_p],

for some p ∈ N. Each sequence s_j is an ordered set consisting of items (i.e. entry names or symbols) and is denoted by

s_j = (i_1, i_2, ..., i_{n_j}),

Data source             No. of events   No. of entries
Total                   1141            93405
Blood sample            320             40883
GP-related activity     174             19766
Pharmacy/Prescription   436             16765
Procedure               177             8473
Radiology               30              378
Hospitalization         2               127
Ambulant                1               1145
Unknown                 1               5868

Table 6.4: Data source distribution.

for some n_j ∈ N and i_j ∈ I, where I denotes the set of all unique items in the database S, given by

I := { i | ∃ j ∈ {1, 2, ..., p} : i ∈ s_j }.

Note that an item i belongs to a sequence s = (i_1, i_2, ..., i_{n_s}) (consisting of n_s items) in the sense that

i ∈ s ⟺ ∃ k ∈ {1, 2, ..., n_s} : i = i_k.

The definition of I allows us to pair each symbol with a unique numerical symbol ID in the natural numbers. Thus, henceforth, an item or event may also refer to its corresponding ID. The items and their order define the sequence, and hence the sequences

s_a := (i_1, i_2, i_3),    s_b := (i_2, i_1, i_3)

are not equal, provided that i_1 ≠ i_2. Examples of this structure could be a corpus of documents (sequences) with words as items, or a database of electronic health records with recorded entry names as items. In the context of Section 6.2.1, the collection of sequences corresponds to the dataset of 169 electronic health records (sequences), consisting of events (items) ordered by their time and date. The underlying presumption is that the sequential structure and local context (as defined by the ordering) define the purpose and meaning of each item; in natural language processing this is called the Distributional Hypothesis [9]. Given Table 6.1, the sequence for patient 6 starts as

s_6 = (Mandolgin, P-Natrium, ...).

Assuming that Mandolgin and P-Natrium are given event IDs 1 and 2, respectively, the sequence would start as

s_6 = (1, 2, ...).
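The pairing of items with numerical IDs can be sketched as follows. This is a minimal illustration; assigning IDs by order of first appearance, starting from 1, is an assumption, as the dissertation does not fix the assignment order.

```python
def build_vocabulary(sequences):
    """Collect the set I of unique items and assign each a numerical ID,
    here by order of first appearance (an assumption)."""
    item_to_id = {}
    for seq in sequences:
        for item in seq:
            if item not in item_to_id:
                item_to_id[item] = len(item_to_id) + 1  # IDs start at 1
    return item_to_id

s6 = ["Mandolgin", "P-Natrium"]
vocab = build_vocabulary([s6])
encoded = [vocab[i] for i in s6]
print(encoded)  # [1, 2]
```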

As added auxiliary information in our dataset, we know the data source of each item/event. To formalize this, let DS denote the space of possible data sources (in our case 8 different data sources, as shown in Table 6.4) and observe that

∀ i ∈ I ∃! d ∈ DS : the data source of i is d,

where ∃! d denotes that there exists a unique d. To ease the notation, we will in Section 6.4.3 simply refer to d as DS(i). Additionally, for each sequence s ∈ S we know the resulting diagnosis Diag(s) ∈ {arthritis, colon cancer, lung cancer}, which is relevant in Section 6.4.5.

Exactly how we utilize the sequential structure from Section 6.2.2 is a modeling question; one possible choice is the embeddings from Word2vec [16].

Embeddings arise in natural language processing and concern encodings, i.e. mathematical representations of words (or items). The roughest possible encoding is the one-hot encoding, where each unique symbol is represented by a |I|-dimensional standard unit vector; i.e., letting y denote an item with item ID k,

y = (0, 0, ..., 0, 1, 0, ..., 0) ∈ R^{|I|},

with the 1 at the kth index. In natural language processing, this illustrates how we would understand words if we were only able to form sentences consisting of a single word, with no relational structure (i.e. no sentences of multiple words) surrounding it.
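The one-hot encoding above can be written out directly. A small sketch; k is 1-indexed, matching the formula.

```python
def one_hot(k, vocab_size):
    """Return the |I|-dimensional standard unit vector with a 1 at the
    kth index (1-based, as in the text)."""
    vec = [0.0] * vocab_size
    vec[k - 1] = 1.0
    return vec

print(one_hot(2, 5))  # [0.0, 1.0, 0.0, 0.0, 0.0]
```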

6.2.3 Workflow description

The ecosystem, or workflow, for the data analysis is shown in Figure 6.1. It illustrates how we first clean the raw records by applying a cut-off (masking each event which occurs below a certain threshold with a dummy value 0), then apply the word2vec Skip-Gram model to obtain event embeddings. These embeddings are then fed into classification tasks, and the performance can be measured. The main point of the embeddings is that they are trained prior to training the classifier, providing a stronger starting point than inputs which have not been pre-trained.
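The cut-off step can be sketched in isolation, as a hedged illustration: the threshold value is a free parameter, the dummy value 0 follows the description above, and the subsequent Skip-Gram training (e.g. via a word2vec library) is not shown here.

```python
from collections import Counter

def apply_cutoff(sequences, threshold):
    """Mask events occurring fewer than `threshold` times across all
    records with the dummy value 0, keeping sequence lengths intact."""
    counts = Counter(e for seq in sequences for e in seq)
    return [[e if counts[e] >= threshold else 0 for e in seq]
            for seq in sequences]

seqs = [["EKG", "P-Natrium", "EKG"], ["EKG", "Hjerdyl"]]
print(apply_cutoff(seqs, 2))  # [['EKG', 0, 'EKG'], ['EKG', 0]]
```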
