3.2 Data Preprocessing
3.2.2 Instance Labelling Variable
The instance labelling variable gives a label to each assessment so that assessments sharing the same label can then be aggregated into instances. Ideally, a label would correspond to the exact date on which an assessment was performed, so that each instance would be a snapshot of a patient at a precise moment. This is unrealistic in the present study: the scheduled exams cannot all be performed in a single day, as this would overburden the patients, and they are often spread over a timespan of weeks. The ADNI exam schedule sets the approximate date on which to perform each date-specific battery of examinations and allows a timespan of 2 weeks to complete all the examinations of that visit. Nevertheless, some data features from a specific assessment were gathered over more than 2 weeks (1.22±2.65 weeks), probably due to patient unavailability.
The exams performed within a scheduled time window are tagged with the same visit code (viscode) (Table 3.3), which identifies the battery of tests to which each exam belongs.
Table 3.3: Possible visit code labels and respective descriptions.
As such, we chose this code as the label for each instance. Each data entry already has a visit code assigned, which could in principle be used directly. However, the rule for assigning an exam to an instance must be the exam date, and upon analysing the data we found some anomalous cases (Fig. 3.4) in which the exam dates of the same patient under the same visit code varied too much to be ignored. To bypass these inconsistencies, we kept the visit code as the instance labelling variable but corrected it based on the exam date, accepting a visit code only within a given temporal window.
Figure 3.4: Example of an anomalously assigned visit code. Patient 51 has the first assessments in December 2005, which was the patient's baseline. The m06 assessments followed approximately 6 months later, in May 2006. However, some baseline assessments are dated September 2011, a 6-year difference from the other assessments.
Checking the registry of patient 51 shows that what was recorded as baseline in the dataset was in fact visit m72. The problem arose in the transition to ADNI2.
At this point, we had to choose a temporal limit to decide whether or not an assessment belongs to a given visit. Short time windows merge into instances only assessments with the same visit code and dates very close to each other, possibly dismissing more data, whereas wide time windows include a greater number of assessments with the same viscode but introduce error from exams with irregular dates.
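Inconsistencies of this kind can be surfaced by measuring, for each patient and visit code, the spread between the earliest and latest exam date. The following is only a minimal sketch of such a check, not the procedure used in this work: the function name `flag_inconsistent_viscodes`, the record layout and the 182-day limit are assumptions, and the day-of-month values in the example are invented for illustration (only the months follow Fig. 3.4).

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_inconsistent_viscodes(records, max_spread=timedelta(days=182)):
    """Return {(patient_id, viscode): spread} for suspicious groups.

    `records` is an iterable of (patient_id, viscode, exam_date) tuples.
    A group is flagged when its exam dates span more than `max_spread`
    (assumed here to be roughly 6 months).
    """
    groups = defaultdict(list)
    for patient_id, viscode, exam_date in records:
        groups[(patient_id, viscode)].append(exam_date)

    flagged = {}
    for key, dates in groups.items():
        spread = max(dates) - min(dates)
        if spread > max_spread:
            flagged[key] = spread
    return flagged


# Illustrative records; the days of the month are made up.
records = [
    (51, "bl", date(2005, 12, 1)),
    (51, "bl", date(2011, 9, 1)),
    (51, "m06", date(2006, 5, 1)),
    (51, "m06", date(2006, 5, 15)),
]
flagged = flag_inconsistent_viscodes(records)
```

With these records only the baseline group of patient 51 is flagged, since its exam dates lie almost six years apart.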
The approach taken was based on the schedule defined in the ADNI protocol, in which visits are performed every 6 months until the 24th month, after which visits become annual, with intermediary phone review assessments. To maximize the quantity of data while maintaining coherence with the ADNI study schedule, we decided to aggregate assessments within a maximum time difference of 6 months, that is, up to 3 months on either side of a visit date.
Specifically, the visit code correction classifies each data entry according to the time difference between its exam date and the date associated with a visit (the pivot date), extracted from the Registry file. We therefore assigned a visit code regardless of the one already attached to the data entry, relying only on the exam date and the information in Registry. For each data entry, the registry date closest to the entry’s exam date is found, and if the difference between them is below 3 months, the entry is classified with the visit code of that pivot date (e.g. Fig. 3.5). If two exams of the same assessment type are assigned to the same visit code, the exam closest to the pivot date is kept and the other is discarded.
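The correction rule described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the code used in this work: the names `assign_viscode`, `correct_viscodes` and `THRESHOLD` are hypothetical, "3 months" is approximated as 91 days, and the pivot and exam dates in the example are invented.

```python
from datetime import date, timedelta

# Assumption: "3 months" is approximated as 91 days; the text gives no day count.
THRESHOLD = timedelta(days=91)


def assign_viscode(exam_date, pivots, threshold=THRESHOLD):
    """Return the visit code of the closest pivot date, or None (NA).

    `pivots` maps visit code -> registry date for a single patient.
    """
    if not pivots:
        return None
    code, pivot_date = min(pivots.items(), key=lambda kv: abs(kv[1] - exam_date))
    # The entry is accepted only if it lies strictly below the threshold.
    return code if abs(pivot_date - exam_date) < threshold else None


def correct_viscodes(entries, pivots, threshold=THRESHOLD):
    """Relabel entries and keep one exam per (assessment type, visit code).

    `entries` is a list of (assessment_type, exam_date) tuples for one
    patient. When two exams of the same type fall on the same visit, the
    one closest to the pivot date is kept.
    """
    best = {}
    for assessment, exam_date in entries:
        code = assign_viscode(exam_date, pivots, threshold)
        if code is None:
            continue  # no pivot date close enough: entry is discarded
        gap = abs(pivots[code] - exam_date)
        key = (assessment, code)
        if key not in best or gap < best[key][1]:
            best[key] = (exam_date, gap)
    return {key: exam_date for key, (exam_date, _gap) in best.items()}


# Invented example dates for one patient.
pivots = {"bl": date(2020, 1, 10), "m06": date(2020, 7, 5), "m12": date(2020, 12, 20)}
print(assign_viscode(date(2020, 2, 1), pivots))  # bl (22 days from the pivot)
print(assign_viscode(date(2021, 6, 1), pivots))  # None (163 days from m12)
```

Ignoring the visit code already attached to an entry, as done here, mirrors the choice made in the text: only the exam date and the registry dates take part in the decision.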
Figure 3.5: Assigned visit code as a function of the exam date for an example set of pivot dates. The registry file includes 5 visits (baseline, “m06”, “m12”, “m18” and “m24”) and their respective registry dates, denoted by the bullets under the time axis. As can be seen, the registry dates may not fall exactly 6, 12, 18 or 24 months after baseline. Here, for instance, “m12” is slightly earlier than 12 months, while “m18” and “m24” are slightly later. This shortens the 3-month limit in the case of “m12” and “m06”. Exam dates with no pivot date less than 3 months away are discarded as missing values (NA).
As depicted in Table 3.3, some visit codes cannot be included in the dataset since they do not fit the formulation made so far. These include the viscodes of the screening visits, along with visits labelled as ‘no visit defined’ and phone review assessments (m30, m42, etc.).
Specifically, screening cannot be treated as an independent visit code because it is not sufficiently separated from other visits. Screening is performed up to about a month before the baseline visit but can also be carried out in between other visits, due to a change of ADNI phase or for verification. In the cases where it is not tagged as a screening fail, the screening provides extra information from a set of variables that can be used later in the classifier.
Consequently, we did not directly remove the assessments with unwanted viscodes in the preprocessing phase. Instead, we built into the viscode correction routine the possibility of incorporating those assessment values, by removing the registry pivot dates that carry those viscodes. In this way, for each assessment with an unwanted viscode, the approved visits less than 3 months away are sought, and the assessment is assigned to an approved viscode if one lies within the 3-month range.
Regarding m03 visits, we opted for the same approach as for the screening visits. Visits in the 3rd month occur only in ADNI GO and ADNI 2 and correspond to a battery of purely imaging exams. Therefore, m03 instances would have every feature as a missing value except for the imaging set. Additionally, it would be incoherent to have some instances representing a timespan of 6 months while others represent 3-month windows. Therefore, m03 is left out of the set of registry dates, and the m03 exams are assigned to a visit code according to the 3-month threshold.
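The handling of screening, ‘no visit defined’, phone review and m03 visits amounts to filtering the pivot dates before the closest-date search. The sketch below illustrates that idea only; the code strings in `EXCLUDED_CODES` are assumed encodings (Table 3.3 is the authority), the phone review rule is inferred from the schedule described earlier (months after the 24th that fall between annual visits, such as m30 and m42), the 91-day threshold is an approximation of 3 months, and all names and dates are hypothetical.

```python
import re
from datetime import date, timedelta

THRESHOLD = timedelta(days=91)  # assumption: about 3 months

# Assumed string encodings for screening, 'no visit defined' and month 3.
EXCLUDED_CODES = {"sc", "nv", "m03"}


def is_excluded(code):
    """True for viscodes that must not act as pivot dates."""
    if code in EXCLUDED_CODES:
        return True
    match = re.fullmatch(r"m(\d+)", code)
    if match is None:
        return False
    month = int(match.group(1))
    # Assumed phone review pattern: m30, m42, ... (after m24, between annual visits).
    return month > 24 and month % 12 == 6


def approved_pivots(registry):
    """Drop the registry dates whose viscode is excluded."""
    return {code: d for code, d in registry.items() if not is_excluded(code)}


def reassign(exam_date, registry, threshold=THRESHOLD):
    """Assign an exam to the closest approved visit, or None if too far."""
    pivots = approved_pivots(registry)
    if not pivots:
        return None
    code, pivot_date = min(pivots.items(), key=lambda kv: abs(kv[1] - exam_date))
    return code if abs(pivot_date - exam_date) < threshold else None


# Invented registry for one patient.
registry = {
    "sc": date(2020, 1, 10),
    "bl": date(2020, 2, 1),
    "m03": date(2020, 4, 20),
    "m06": date(2020, 8, 3),
}
print(reassign(date(2020, 1, 10), registry))  # bl: screening exam, 22 days before baseline
print(reassign(date(2020, 4, 20), registry))  # bl: m03 exam, 79 days after baseline
```

Because the excluded codes are removed from the pivots rather than from the data, their assessments are kept whenever an approved visit is close enough and dropped otherwise.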
The resulting overall picture of corrected visit codes is depicted in Fig. 3.7. An example of corrected viscodes is shown in Fig. 3.6.
Figure 3.6: Example of 3 different visit code corrections. ‘VISCODE.pre’ is the viscode before the correction. Patient 51 (represented in Fig. 3.4) had some inconsistencies in bl visits, which were solved by assigning the closest visit code, m72. The other two examples, subjects 47 and 61, had assessments too far from the registry date.
Here we can observe that m30 is ignored and that some values, such as the assessments of patient 61 performed in August 2008, are dismissed.