A case study on environmental sustainability: a study of the trophic changes in fish species as a result of the damming of rivers through clustering analysis


Computers & Industrial Engineering


Ricardo de Almeida (a), Maria Teresinha Arns Steiner (a,⁎), Leandro dos Santos Coelho (a), Cláudia Aparecida Cavalheiro Francisco (b), Pedro José Steiner Neto (c)

(a) Industrial and Systems Engineering Graduate Program, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, PR, Brazil

(b) Industrial Engineering Graduate Program, Universidade Federal do Rio Grande do Norte (UFRN), Natal, RN, Brazil

(c) Business Graduate Program, Universidade Positivo (UP), Curitiba, PR, Brazil

ARTICLE INFO

Keywords: Environment sustainability; Clustering analysis; Trophic categories of fish; River phase; Reservoir phase

ABSTRACT

The damming of rivers has long been used for electricity generation and is among the most widely used sources of renewable energy. However, building dams may cause several transformations in the environment. Changes in fish assemblages are one important consequence, especially where communities rely on fishing as a source of income. The aim of the present study is to analyze the trophic changes in fish species caused by the damming of rivers. Trophic data (stomach content) on fish from the Corumbá Reservoir in the State of Goiás, Brazil, collected before (River phase) and after (Reservoir phase) the building of the dam, were analyzed using Clustering techniques. The methodology comprised data exploratory analysis, followed by the assignment of clusters for the later application of knowledge. The definition of the number of clusters, the use of different types of clustering distances, and the use of validation indices are discussed. A modified version of the Teitz & Bart algorithm, originally used for facility location problems, was introduced for Clustering problems, and the results were compared with three well-known Clustering algorithms from the literature. The clustering approaches were applied separately to both phases and, in both cases, five large clusters of fish were determined: generalists, insectivores, herbivores, piscivores, and detritivores. This evaluation can be used by biologists to assess environmental effects, and managers can develop strategies to address the social and economic impacts on the communities that depend on fishing.

1. Introduction

Flooding land to create a reservoir for a dam causes a huge transformation of nature: the climate changes, fish species disappear, animals flee to dry ground, and high-quality trees become rotten wood. On top of that, a social problem arises: thousands of people are forced to leave their homes and restart life elsewhere. As a side effect of the damming of rivers, changes may occur in the structure of fish communities, with the proliferation of sedentary species and the reduction or elimination of migratory species. There is an increase in terrestrial biomass incorporated into the reservoir (grass, trees, foliage), which reduces the availability of dissolved oxygen due to decomposition of the organic material. This drastically alters the environment and interferes with the living conditions of the fish, especially their food. A high rate of mortality is observed during the damming period, and fish species unable to adjust to the environmental changes may be replaced by others better adapted to the new environment (Cunico & Agostinho, 2006). Additionally, the reaction of the consumer market to changes in the species on offer is unknown; therefore, fish processing companies may be driven out of business and local fishermen may have to look for alternative sources of income.

Knowledge of fish assemblages also provides grounding in the conditions of aquatic ecosystems, i.e., fish assemblages are excellent bio-indicators and provide information on the quality of the water (Ibarra, Gevrey, Park, Lim, & Lek, 2003). The more information is available on the relationships in the environment, the more adequately the environment can be handled, with a view to better conservation and monitoring, and for the benefit of the population that depends on it. The relationships between fish assemblages and the environment are highly complex and yield non-linear, high-dimensional data, which makes them hard to interpret.

https://doi.org/10.1016/j.cie.2018.09.032

⁎ Corresponding author: M.T.A. Steiner.

Computers & Industrial Engineering 135 (2019) 1239–1252

Available online 18 September 2018

The aim of the present work is to study an environmental sustainability problem, more specifically, the trophic changes in fish species as a result of the damming of rivers, using clustering analysis. The grouping (clustering) of fish species was done before and after the damming of the river through the analysis of their stomach content. The study was carried out on a specific case, the Corumbá River, located in the State of Goiás, which belongs to the Paraná River Basin, in Brazil. To this end, a methodology composed of three phases was used: data exploratory analysis, clustering analysis, and interpretation of the results. With this evaluation, biologists will be able to make decisions in environmental terms, due to alterations in the environment, and/or in social terms, due to the communities that depend on fishing.

In the exploratory analysis, the data were preprocessed and the most adequate number of clusters was defined using the L-method (Salvador & Chan, 2005). An extensive cluster analysis was then performed, comparing four different clustering techniques: k-means (Forgy, 1965; Jain, 2010; MacQueen, 1967); Teitz & Bart (Teitz & Bart, 1968) adapted to clustering (T&BaC); Differential Evolution (DE; Storn and Price, 1997; Hruschka, Campello, Freitas, & Carvalho, 2009; Kuila and Jana, 2014; Xiang, Zhu, Ma, Meng, & An, 2015); and Potential-based Hierarchical Agglomerative clustering (PHA; Shi, Yang, Wang, & Zheng, 2002). Each clustering technique was further varied by applying five different distance measures: Euclidean, Chebyshev, Mahalanobis, Cosine, and Hamming (Fielding, 2007; Theodoridis and Koutrumbas, 2009). The quality of the clusters formed by each technique was evaluated by the following Clustering Validity Indices (CVI): the Dunn index (DI; Dunn, 1973), the Calinski-Harabasz index (CH; Calinski and Harabasz, 1974), the Davies-Bouldin index (DB; Davies & Bouldin, 1979), and the Silhouette index (Rousseeuw, 1987).

Grouping or Clustering analysis seeks groups of objects such that objects belonging to the same group are as similar as possible and, at the same time, dissimilar to objects belonging to other groups (Johnson & Wichern, 2005). Clustering similarities and dissimilarities are based on distance measures among the objects. This subject has been reviewed by several authors, who have introduced the fundamental concepts of clustering (Halkidi, Batistakis, & Vazirgiannis, 2001) and addressed the importance of assessing the quality of the clustering results. Clustering analysis is an important subject in statistics, computer science, and machine learning (Xu & Wunsch II, 2005). José-García and Gómez-Flores (2016) discussed the fundamental problem in clustering analysis, which is to determine the best estimate of the number of clusters. The authors highlight the complexity of this task, especially when the data have many dimensions, when clusters differ widely in shape, size, and density, and when overlapping points among groups are observed.

The method presented in this study can be used not only to evaluate the socioeconomic impact of damming rivers but also in other similar studies that require a sequence of steps for defining a clustering method. Further analysis of the groups formed can establish how the groups' characteristics were transformed, by comparing the traits of populations before and after an intervention. By comparing those transformations, a researcher can assess the impacts of the intervention.

This study can also benefit planners, who, based on the presented results, can elaborate strategic actions to deal not only with the environmental impact of damming rivers but also with policies to address social and economic issues related to the communities that rely on the river as a source of income. The importance of the different sustainability dimensions (Economic, Environmental, and Social) is highlighted in the work of Sahebjamnia, Fard, and Hajiaghaei-Keshteli (2018), in which a sustainable tire closed-loop supply chain network was developed. Economic, social and environmental aspects are also considered in the work of Hajiaghaei-Keshteli and Fard (2018), where a mixed integer nonlinear programming model was developed to formulate a multi-objective sustainable closed-loop supply chain network.

The remainder of this paper is organized as follows. Section 2 reviews related work that addresses Clustering Analysis. Section 3 presents the problem regarding the Corumbá Reservoir and its data. Section 4 contains the methodology used, which includes crucial issues such as defining the number of clusters, the distance measures, and the clustering validity indices used. In Section 5, the results are presented and discussed. Finally, Section 6 contains the conclusions and suggestions for future studies.

2. Literature review

There is much research on sustainability problems that relies on optimization techniques. Kannegiesser, Günther, and Autenrieb (2015), for example, used a linear optimization model, called “Minimize the Time-to-Sustainability” (TTS), based on the triple bottom line sustainability concept, in order to minimize the time to reach the target values related to the sustainability objectives. Three variants of this approach were evaluated in the automotive industry. Babazadeh, Razmi, Pishvaee, and Rabbani (2017) presented a multi-objective programming model to design a second-generation biodiesel supply chain network under risk. The authors solved the proposed model through a hybrid solution approach that was capable of finding efficient solutions from the Pareto-optimal set. The procedures were evaluated in a real case study in Iran. Fard and Hajiaghaei-Keshteli (2018) addressed a tri-level location-allocation design problem considering, simultaneously, the forward and reverse network: the problem was formulated as a static Stackelberg game between the Distribution Centers, Customer Zones and Recover Centers. For that, the authors used Variable Neighborhood Search (VNS), Tabu Search (TS), Particle Swarm Optimization (PSO), Keshtel Algorithm (KA) and Water Wave Optimization (WWO). The efficiency of the algorithms was validated through a real case study in the glass industry. Arampantzi and Minis (2017) proposed a Multi-objective Mixed Integer Linear Programming (MMILP) model, which considers decisions about designing or re-designing high-performance, sustainable supply chains. In order to solve the model, the authors employed goal programming and the ε-constraint method to obtain efficient trade-offs among the three objectives (investment, operational, and emissions costs). The model was applied to a large case study of a global manufacturer of commercial refrigerators.

Clustering Analysis (CA) techniques have been applied by several authors through a great diversity of procedures. The approaches reported here are listed in chronological order in Table 1. A Hybrid Scenario Cluster Decomposition (HSCD) was presented by Zanjani, Bajgiran, and Nourelfath (2016), which decomposed the original scenario tree into smaller sub-trees in order to solve a Multistage Stochastic Mixed-Integer Programming mathematical model. The authors tested their approach on a set of realistic-scale test cases, which showed its efficiency in terms of solution quality and computational time. A clustering model which uses available dissimilarity matrices to identify individual clustering objects in a similar way was used by Santi, Aloise, and Blanchard (2016). The model was solved by the meta-heuristic algorithm VNS, and an empirical application to the perception of chocolate candy is shown. A methodology to help manage investment in green-supplier development and in business-supplier-development practices, which requires the management of large sets of data, was introduced by Bai, Dhavale, and Sarkis (2016). The methodology combines rough set theory and Fuzzy Clustering Means approaches. Dietrich, Popp, and Lotze-Campen (2013) presented an aggregation approach through clustering methods for agricultural land-use models. The authors observed that clustering reduces the loss of information due to aggregation by choosing an appropriate aggregation pattern.

CA has also been successfully used in Biology applications. Wiwie, Baumbach, and Röttger (2015) assessed 13 well-known methods using 24 datasets ranging from gene expression to protein domains. The use of the Capacitated Clustering Problem (CCP) in the Sibling Reconstruction Problem (SRP) was investigated by Chou, Chaovalitwongse, Berger-Wolf, DasGupta, and Ashey (2012). Azzag, Venturini, Oliver, and Cuinot (2007) applied CA to the analysis of healthy human skin, the online mining of website usage, and the automatic construction of web portals. The Kohonen Self-Organizing Map (SOM) clustering algorithm and k-means were compared in order to estimate the invasiveness of insect pest species by Watts and Worner (2009).

Some authors have used CA together with Data Envelopment Analysis (DEA) concepts. Dai and Kuosmanen (2014) proposed a new approach that applied clustering methods to identify groups of DMUs that are similar in their input-output or other characteristics; the authors presented an application to the regulation of electricity distribution networks in Finland. Bi, Song, and Wu (2014) used a non-radial DEA framework (Slacks-Based Measure, SBM) to classify the environmental performance of Chinese industry, forming a benchmark-based clustering approach.

Rabello, Mauri, Ribeiro, and Lorena (2014) proposed the use of a Clustering Search metaheuristic as an alternative to solve the PFCLP (Point-Feature Cartographic Label Placement) problem. Alhourani (2013) and Oliveira, Ribeiro, and Seok (2009) used clustering algorithms to design cellular manufacturing systems, based on similarity coefficients that calculate the similarity between machine groups. Chuang, Lee, and Lai (2012) proposed a two-stage Clustering-Assignment Problem Model (CAPM) for customized-order picking and applied the procedure to a drug distribution center. Abbas (2008) and Mingoti and Lima (2006) compared different data clustering algorithms. Abbas (2008) compared k-means, a hierarchical clustering algorithm, SOM and the Expectation Maximization algorithm according to the following factors: number of clusters, size of the dataset, type of software and type of the dataset, while Mingoti and Lima (2006) compared the SOM, k-means and Fuzzy c-means methods. Srinivasan and Moon (1999) introduced a clustering methodology for supporting inventory management in supply chain networks.

Qu, Nie, Li, Thurer, and Huang (2017) considered a high number of enterprises/suppliers using a general Assembly Cluster Supply Chain Configuration (ACSCC) model and, in order to reduce the complexity of the problem, presented a decomposition-based solution method called Augmented Lagrangian Coordination (ALC). Nilashi, Bagherifard, Rahmani, and Rafe (2017) proposed a method based on multi-criteria CF (Collaborative Filtering) to enhance the predictive accuracy of recommender systems in the tourism domain using clustering, dimensionality reduction and prediction methods. Franco and Steiner (2018) analyzed the possibility of installing solar power plants in unproductive areas; for that, a new hybrid fuzzy c-means (HFCM) algorithm was applied, initialized, for comparison, by three metaheuristics: Differential Evolution (DE), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO).

The main contribution of this work, compared with the works summarized in Table 1, is the analysis of an environmental sustainability problem, more specifically, the trophic changes in fish species as a result of the damming of rivers. A further contribution is the methodology, which, as well as being complete and innovative, is explained in detail throughout the text. The methodology comprised three phases: data exploratory analysis, clustering analysis, and interpretation of the results. In the exploratory analysis, the number of clusters was defined using the L-method, which is an effective but little-explored technique. In the clustering analysis, we emphasize the use of the Teitz & Bart (T&B) algorithm, which was adapted in this paper for clustering tasks (T&BaC). The comparative use of five different distances to measure proximity/similarity between data points also deserves to be highlighted. Through this study (and other similar ones), biologists will be able to make decisions in environmental and/or social terms.

3. Description of the problem

The large rivers of the Paraná Basin in Brazil with hydroelectric potential are already being used at almost maximum electricity generation capacity. As a result, these rivers have been compartmentalized by dams, with the consequent formation of large reservoirs. The rivers with the greatest hydroelectric potential are generally those that run through valleys and whose levels vary considerably. According to Oliveira, Goulart, and Minte-Vera (2004), after a reservoir is formed, many fish species disappear or, if they resist, the remainder of their once abundant population is mainly confined to the tributaries.

Table 1

Selected publications on clustering analysis.

| Publication | Problem | Methods |
|---|---|---|
| Srinivasan and Moon (1999) | Inventory management | Clustering methodology |
| Mingoti and Lima (2006) | 2530 simulated data sets | SOM; k-means and FCM; HCA |
| Azzag et al. (2007) | Healthy human skin; mining of websites; construction of portal sites | AntTree; k-means; ANTCLASS; AHC |
| Abbas (2008) | Public data | k-means; HCA; SOM and EMA |
| Oliveira et al. (2009) | Intercellular movements | A proposed MCF algorithm |
| Watts and Worner (2009) | Invasiveness of insect pest species | SOM; k-means |
| Chou et al. (2012) | Sibling reconstruction | GRASP |
| Chuang et al. (2012) | Picking (drug distribution center) | Two-stage CAPM |
| Dietrich et al. (2013) | Agricultural land-use | Aggregation approach |
| Alhourani (2013) | Intercellular movements | Multiple process routing |
| Bi et al. (2014) | Environmental performance of Chinese industry | DEA |
| Dai and Kuosmanen (2014) | Regulation of electricity distribution networks | DEA |
| Rabello et al. (2014) | Point-Feature Cartographic Label Placement | ECS (SA; GRASP; TS) |
| Wiwie et al. (2015) | 24 datasets (gene expression to protein domains) | 13 well-known methods |
| Bai et al. (2016) | Investments | RS; FCM |
| Santi et al. (2016) | Chocolate candy | VNS |
| Zanjani et al. (2016) | Supply chain tactical planning | HSCD |
| Qu et al. (2017) | Industrial enterprises/suppliers | ALC |
| Nilashi et al. (2017) | Tourism industry | SOM; EMA |
| Franco and Steiner (2018) | Solar energy facilities | FCM; DE; GA; PSO |

AHC, Ascending Hierarchical Clustering; ALC, Augmented Lagrangian Coordination; CA, Clustering Analysis; CAPM, Clustering-assignment Problem Model; CCP, Capacitated Clustering Problem; DE, Differential Evolution; DEA, Data Envelopment Analysis; ECS, Evolutionary Clustering Search; EMA, Expectation Maximization Algorithm; FCM, Fuzzy Clustering Means; GA, Genetic Algorithm; GRASP, Greedy Randomized Adaptive Search Procedure; HCA, Hierarchical Clustering Algorithms; HSCD, Hybrid Scenario Cluster Decomposition; MCF, Manufacturing Cell Formation; PSO, Particle Swarm Optimization; RS, Rough Sets; SA, Simulated Annealing; SOM, Self-organizing Map; TS, Tabu Search; VNS, Variable Neighborhood Search.

The changes caused to fish fauna by large dams are widely documented (Agostinho, Gomes, & Pelicice, 2007). The species that successfully colonize these environments are mostly sedentary and generalist, while species that depend on lotic environments (rapids) diminish in abundance, as is the case with fish that migrate over long distances.

The disappearance or reduction of large migrators, which reproduce in lotic environments, is one of the most noticeable consequences in rivers where a series of dams are built. The abundance of large migrators in the dammed area depends on the existence of large river stretches above the reservoir or large undammed tributaries (Agostinho et al., 2007). Some small and medium-sized species also have to adapt and reproduce in the rapids, and may see their populations reduced in dammed areas. Many species, on the other hand, reproduce indiscriminately in lotic and lentic environments, and these are the most successful in dammed environments.

Given the need to quantify the impacts of these works on the aquatic biota, this study proposes the evaluation of the fish assemblage structure in the river phase (before the dam is built) and the reservoir phase (after the dam is built) at the Corumbá Reservoir in Goiás State, Brazil. With this evaluation, biologists will be able to make decisions, in environmental and social terms, in response to the alterations.

3.1. Data collection

To conduct this study, data were used from the project on “Ichthyologic Studies on the Area Affected by the Corumbá Hydroelectric Plant” by the Research Nucleus on Limnology, Ichthyology, and Aquaculture (Nupélia) of Universidade Estadual de Maringá (UEM) in conjunction with FURNAS Centrais Elétricas S.A., Brazil.

The data were collected in the region of the Corumbá Reservoir between April 1996 and February 2000. The Corumbá River is a tributary of the right bank of the Paranaíba River and is part of the system of reservoirs of the Paraná River drainage basin. The timespan included different phases of the formation of the reservoir. The pre-dam phase (the months before the closing of the dam gates) ran from April to August 1996. This was followed by the filling phase, which lasted for the next six months, from September 1996 to February 1997, and by the reservoir phase (from March 1997 to February 2000). In the three phases, the fish were sampled at the same 12 stations established in different environments. To analyze the main species, the sampling stations were divided into five environments: Downstream, located below the dam; Lacustrine, the region deeper inside the reservoir; Transition, representing the transition zone between the lotic and lentic environments; Upstream, the free stretch of the Corumbá River above the reservoir; and Pacu and Fish, both located in the tributaries.

Table A (Columns A to D), in the appendix, lists 52 species of fish with their quantities and corresponding percentages sampled in the river and reservoir phases. For example, species S9, denominated B. Stramine, had 266 individuals in the river phase, representing 14% of the sample (total of 1895 fish), and 96 individuals in the reservoir phase, representing 1.7% of the sample (total of 5541 fish). The percentage distributions of each species in both river and reservoir phases are illustrated in Fig. 1.
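As a quick check of the example above, the two percentages follow directly from the counts quoted in the text (no other data are assumed):

$$\frac{266}{1895} \times 100\% \approx 14.0\%, \qquad \frac{96}{5541} \times 100\% \approx 1.7\%$$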

The 11 data attributes, defined by biologists from FURNAS Centrais Elétricas S.A., correspond to the percentage volume of different types of food found in the fish stomachs, as follows: micro-crustacean (Micro); aquatic insects (Aquins); terrestrial insects (Terins); other aquatic invertebrates (Otaqinv); other terrestrial invertebrates (Otterinv); fish (Fish); other terrestrial vertebrates (Ottervert); algae (Algae); aquatic vegetation (Aqveg); terrestrial vegetation (Terveg); debris/sediment (Debsed).

4. Methodology

The methodology used was composed of three phases, as presented in Fig. 2: Data Exploratory Analysis, Clustering Analysis, and Knowledge application.

4.1. Data exploratory analysis

Initially, the data (instances) were collected as shown in Section 3.1. The data were then cleaned, and a pairwise correlation test was performed to gauge whether some of the attributes could be merged. Highly correlated variables could be transformed into a single attribute, which helps the analysis and the performance of the algorithms.
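The pairwise correlation screening can be sketched as follows. This is a minimal illustration, assuming the Pearson coefficient and a cut-off of 0.9; the text specifies neither, and the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute shows no linear association
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for attribute pairs with |r| >= threshold.

    `columns` maps an attribute name to its list of values (one per species).
    The 0.9 threshold is an assumption, not a value taken from the study.
    """
    names = list(columns)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs

# Toy values, invented for illustration only.
toy = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [1, 0, 1]}
candidates = highly_correlated_pairs(toy)
```

Pairs returned by `highly_correlated_pairs` would be the candidates for merging into a single attribute.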

The number of clusters is one of the most important parameters in clustering analysis, and defining it may be especially difficult in high-dimensional and noisy datasets. Choosing an inappropriate number of clusters could result in segments that carry no meaningful information about the problem.

One popular approach used to define the best number of clusters is based on finding the “knee of a curve”. The knee of a curve is defined as the point of maximum curvature in a graph of the number of clusters versus an evaluation metric. Taking the within-cluster distance w_k as a metric of cluster quality, when two similar clusters are merged, a slow growth in the sum of the w_k distances (W) is observed. On the other hand, when two dissimilar clusters are merged, a rapid growth in W is observed, indicating that the new clusters are becoming less internally homogeneous.

In order to find the knee of the evaluation graph, we apply the L-method proposed by Salvador and Chan (2005). This method takes advantage of the fact that both the right and left sides of evaluation graphs are often approximately linear. If one line is fitted to the right side and another to the left, the intersection between the two lines represents the region of the knee. The corresponding value on the x-axis is then used as the number of clusters, since the knee region contains a balance of clusters that are both highly homogeneous and dissimilar to each other. This method can be categorized as a variance-based approach (Mirkin, 2011) and was first proposed for hierarchical clustering methods, since in that case the evaluation graph can be constructed with only one run of the algorithm.
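The two-line fit can be sketched as below. This is a minimal illustration rather than the Matlab code used in the study: it assumes the common formulation in which every split of the graph into a left and a right segment is tried and the split with the lowest size-weighted root mean squared error is kept, with the weights taken as point counts (a simplification). The function names and the toy W values are hypothetical.

```python
import math

def fit_line_rmse(xs, ys):
    """Fit a least-squares line to (xs, ys) and return its RMSE."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / n)

def l_method_knee(ks, ws):
    """Return the number of clusters at the knee of the evaluation graph.

    ks: numbers of clusters in increasing order; ws: the metric W for each k.
    """
    n = len(ks)
    if n < 4:
        raise ValueError("need at least four points to fit two lines")
    best_k, best_err = None, float("inf")
    for c in range(2, n - 1):  # left = first c points, right = the other n - c
        left = fit_line_rmse(ks[:c], ws[:c])
        right = fit_line_rmse(ks[c:], ws[c:])
        err = (c / n) * left + ((n - c) / n) * right
        if err < best_err:
            best_k, best_err = ks[c - 1], err  # knee = last point of the left line
    return best_k

# Toy evaluation graph: W drops quickly up to k = 4 and slowly afterwards.
ks = list(range(1, 11))
ws = [100, 80, 60, 40, 22, 20, 18, 16, 14, 12]
knee = l_method_knee(ks, ws)
```

Each segment needs at least two points, which is why the loop stops two points short of either end of the graph.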

To define the number of clusters, we ran each combination of clustering algorithm and distance measure 10 times for each number of clusters, ranging from 1 to 30, and stored the value of W of the best run. This process generated the data for the evaluation graphs. The L-method was computed in Matlab®, based on the code implemented by Zagouras, Inman, and Coimbra (2014) and available on Matlab File Exchange.

4.2. Clustering techniques

In this paper, we evaluated four clustering algorithms whose approaches differ widely from one another: (1) the k-means algorithm, which is one of the most popular clustering techniques and aims to find the k partitions of a dataset that minimize the within-cluster variance; (2) PHA (Lu and Wan, 2013), which relies on a potential model to build a hypothetical potential field for each data point that considers not only the local but also the global data distribution, with the data points hierarchically grouped based on the similarity of their potential fields and the distance between them; (3) the T&B heuristic algorithm adapted to clustering (T&BaC), which was originally used for the Facility Location Problem (FLP; Teitz & Bart, 1968); and (4) DE, which is an evolutionary algorithm introduced by Storn and Price in 1997 (Kwedlo, 2011). Among these four techniques, we present the T&BaC heuristic algorithm step by step.


4.2.1. Teitz & Bart algorithm adapted to clustering (T&BaC)

Let V be the set of all data points and C the set of cluster seeds. Given the number of clusters p and the distance d_ij from the ith data point to the jth candidate seed, the transmission number σ(C) can be obtained by summing the distances of all data points in V to the closest seed in C, as shown in Eq. (1).

$$\sigma(C) = \sum_{i=1}^{|V|} \min\left(d_{ij},\ \forall j \in C\right) \qquad (1)$$

The T&BaC algorithm iterates through all seed candidates, searching for a set of seeds C that results in the lowest σ(C). The first set of seed candidates C is chosen randomly. Each point of the set (V − C) is analyzed by substituting it for each seed candidate, one at a time, and the replacement that results in the maximum improvement, if any, is adopted to form a new set C. When all points in the set (V − C) have been analyzed, the iteration is finished. If a change in set C is observed, a new iteration is started using the new set C; otherwise, the points in (V − C) are assigned to the closest seed in C to create the clusters and the algorithm finishes.

The following steps (1–7) summarize the T&BaC algorithm:

(1) Compute the distance matrix between all data points in V;

(2) Select a set C, with |C| = p, as the initial seeds of the clusters and compute σ(C);

(3) Label all the data points v_i ∉ C as “not analyzed”;

(4) Label all seeds c_j ∈ C as “analyzed”;

(5) Evaluate the “not analyzed” data points:

While there are “not analyzed” data points in (V − C):

- Select a “not analyzed” data point v_i ∈ (V − C) and compute the difference Δ_ij of the transmission number for all c_j ∈ C:

o Δ_ij = σ(C) − σ(C ∪ {v_i} − {c_j})

- Mark v_i as “analyzed”;

- If max[Δ_ij] > 0:

o Update set C: C ← (C ∪ {v_i} − {c_j}), where c_j is the seed that attains the maximum;

(6) If after the execution of Step 5 there is a modification in set C, return to Step 3. Otherwise, go to Step 7;

(7) Assign all data points in V to the closest seed c_j of C. This is a heuristic solution for the clustering problem.
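Steps 1 to 7 can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' Matlab implementation: it assumes the Euclidean distance in Step 1 and a fixed random seed for repeatability, and the names `transmission` and `tbac` are hypothetical.

```python
import math
import random

def transmission(dist, seeds):
    """sigma(C): sum, over all points, of the distance to the closest seed."""
    return sum(min(row[j] for j in seeds) for row in dist)

def tbac(points, p, rng=None):
    """Vertex-substitution sketch of T&BaC; returns (sorted seeds, labels)."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    n = len(points)
    # Step 1: distance matrix (Euclidean here; other measures could be swapped in).
    dist = [[math.dist(a, b) for b in points] for a in points]
    # Step 2: random initial seed set C with |C| = p, and its transmission number.
    seeds = set(rng.sample(range(n), p))
    sigma = transmission(dist, seeds)
    changed = True
    while changed:  # Step 6: start a new iteration while C keeps changing
        changed = False
        # Steps 3-4: non-seed points are "not analyzed"; current seeds are skipped.
        not_analyzed = [i for i in range(n) if i not in seeds]
        for v in not_analyzed:  # Step 5
            best_sigma, best_out = sigma, None
            for c in seeds:
                trial = transmission(dist, (seeds - {c}) | {v})
                if trial < best_sigma:  # equivalent to delta = sigma - trial > 0
                    best_sigma, best_out = trial, c
            if best_out is not None:  # adopt the swap with the largest improvement
                seeds = (seeds - {best_out}) | {v}
                sigma = best_sigma
                changed = True
    # Step 7: assign every point to its closest seed.
    labels = [min(sorted(seeds), key=lambda j: dist[i][j]) for i in range(n)]
    return sorted(seeds), labels

# Toy data: two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds, labels = tbac(pts, 2)
```

Because σ(C) strictly decreases with every accepted swap and there are finitely many seed sets, the loop terminates; the result is a local optimum with respect to single swaps, which is why the text calls it a heuristic solution.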

The process of analyzing data points is illustrated in Fig. 3.

For the Clustering Analysis, the k-means and T&BaC algorithms were coded by the authors in Matlab®. The authors also adapted a DE Matlab code, based on the code developed by Yarpiz® (www.yarpiz.com, 2015). The parameters used for the DE algorithm were the following: the fitness function was the total within-cluster distance (W); the stop criterion was 1000 iterations; the population size was set to 50 individuals (sets of clusters); the scale factor ranged from 0.2 to 0.8; and the crossover probability was 20%. The code used in this work for the PHA method is also available on Matlab File Exchange (Lu & Wan, 2013).

4.3. Distance measures

Clustering distances measure the proximity between instances to form groups (Theodoridis & Koutrumbas, 2009). Selecting an appropriate measure is important because, depending on the kind of proximity measure, different groupings can be created (Fielding, 2007). A proximity measure can be either a distance or a similarity (dissimilarity) between a pair of instances.

In this paper, we evaluate five different distance measures, as already mentioned: Euclidean, Chebyshev, Mahalanobis, Cosine, and Hamming. Several distance/similarity measures are used because each one is designed to cope with different data distributions and levels of noise. This not only helps to detect clusters with different shapes but also increases the reliability of the results.
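For reference, the five measures can be written compactly as below. The sketch assumes the usual textbook definitions (cosine distance as one minus the cosine of the angle, Hamming distance as the fraction of differing coordinates), which the text does not restate; for the Mahalanobis distance, the inverse covariance matrix `inv_cov` is assumed to be supplied by the caller.

```python
import math

def euclidean(x, y):
    """Straight-line distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebyshev(x, y):
    """Largest absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    """Cosine distance: 1 - cos(angle between x and y); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def hamming(x, y):
    """Fraction of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T S^-1 (x - y)), with S^-1 passed in as `inv_cov`."""
    d = [a - b for a, b in zip(x, y)]
    size = len(d)
    return math.sqrt(sum(d[i] * inv_cov[i][j] * d[j]
                         for i in range(size) for j in range(size)))
```

With the identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean one, which is a convenient sanity check.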

Fig. 1. Distribution of species in each phase (River and Reservoir). [Bar chart; x-axis: Species; y-axis: Percentage (%); series: River Phase and Reservoir Phase.]


4.4. Clustering validity indices (CVI)

Cluster evaluation can address different aspects: determining whether there is a non-random structure in the data; comparing the results of the clustering analysis with external results; evaluating how well the results of the clustering analysis fit the data without reference to external information; comparing the results of two different sets of cluster analyses to determine which one is better; and determining the “correct” number of clusters (Tan, Steinbach, & Kumar, 2005).

According to Tan et al. (2005), the numerical measures applied to judge various aspects of cluster evaluation are classified into three types: external indices, which measure the extent to which the cluster labels correspond to externally supplied class labels; internal indices, which measure how good the clustering structure is without reference to external information; and relative indices, which are used to compare two different clusterings or clusters.

To evaluate the quality of the clusters, we chose four common CVIs: the Dunn index, the Calinski-Harabasz (CH) index, the Davies-Bouldin (DB) index, and the Silhouette (Si) index. The Dunn index is a classical and widely used validity index proposed by Dunn (1973) that considers not only within-cluster cohesion but also the separation between clusters. In the CH index, cohesion is estimated by the sum of the distances of the patterns to their respective centroid, and separation is measured by the sum of the distances from each centroid to the global prototype (Calinski & Harabasz, 1974). The DB index estimates cohesion by the mean distance of the objects to their respective centroid, and separation is measured by the distance between centroids (Davies & Bouldin, 1979). The Si index computes its cohesion measure using the sum of the distances between all the points in the same cluster, and its separation is based on the nearest-neighbor distance between points in different groups (Rousseeuw, 1987). The functions used to calculate the CVIs are built-in Matlab® functions, except for the Dunn index, for which a function available on the Matlab File Exchange was used.
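Since the Dunn index was the only CVI that required an external function, a minimal Python sketch of it may be helpful. It assumes the classical formulation (smallest distance between points of different clusters divided by the largest cluster diameter) with Euclidean distance; the File Exchange function used by the authors may differ in its details, and the function name is illustrative.

```python
import math
from itertools import combinations, product


def dunn_index(clusters):
    """Classical Dunn index for a list of clusters, each a list of points (tuples)."""
    # Separation: smallest distance between two points in different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    )
    # Cohesion: largest distance between two points in the same cluster.
    diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return separation / diameter
```

The index needs at least two non-empty clusters, and higher values indicate compact, well-separated clusters.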

5. Results

In the next subsections, the results of the experiments are presented and discussed.

5.1. Data exploratory analysis

The data (7436 instances with 11 attributes) were cleaned. Species with no records in a phase were eliminated from that phase (in the River phase, six species were excluded: S13, S19, S30, S36, S39 and S43; in the Reservoir phase, four species: S8, S10, S16 and S32). All the others were considered, even those with low representation. The pairwise correlations showed that, in both phases, the attributes are weakly correlated with each other; consequently, all of them are used in the clustering process.

The behavior of the within-cluster distance (W) as a function of the number of clusters determines whether the L-method can be used to choose the number of clusters. Fig. 4 shows an example of this behavior for the k-means algorithm, using Euclidean distance in both the River and Reservoir phases. The sharp drop in W on the left side and the smooth decrease on the right side indicate that a good approximation of the adequate number of clusters can be obtained using the L-method.

Fig. 3. T&BaC search process.


In Fig. 5, we present an example of the outputs of the L-method for k-means using Euclidean distance in the River phase. The point where the lowest Root Mean Square Error (RMSE) is observed indicates the number of clusters that results in the best balance between within-cluster homogeneity and dissimilarity among clusters.
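The idea behind the L-method can be sketched as follows: for every candidate knee, one straight line is fitted to the points on its left and another to the points on its right, and the candidate with the lowest combined RMSE is kept. The Python sketch below weights each side by its number of points, which is a common formulation but an assumption here; the function names and the synthetic curve are illustrative.

```python
import math


def _line_rmse(xs, ys):
    """RMSE of the least-squares straight line fitted through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n)


def l_method(ks, ws):
    """Return (suggested number of clusters, combined RMSE) for an evaluation graph."""
    n = len(ks)
    best_k, best_rmse = None, math.inf
    for c in range(2, n - 1):  # both sides keep at least two points
        left = _line_rmse(ks[:c], ws[:c])
        right = _line_rmse(ks[c:], ws[c:])
        combined = (c * left + (n - c) * right) / n
        if combined < best_rmse:
            best_k, best_rmse = ks[c - 1], combined
    return best_k, best_rmse


# Synthetic evaluation graph: sharp drop up to k = 4, smooth decrease afterwards.
example_ks = list(range(1, 11))
example_ws = [100.0, 70.0, 40.0, 10.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]
```

When the evaluation graph has no clear two-line shape, the minimum of the combined RMSE is poorly defined and the suggested number of clusters becomes unreliable.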

The results of the L-method for all algorithms and their variations (different distance measures) studied in this work are summarized in Tables 2 and 3, for the River and Reservoir phases, respectively. The tables show the number of clusters where the lowest RMSE was observed for each algorithm and its respective distance measure.

From the results presented in Tables 2 and 3, obtained with the L-method, we observed that for the Euclidean, Chebyshev and Cosine distances, the algorithms resulted in a similar number of clusters in both phases. A wider range was observed when the Mahalanobis and Hamming distances were applied. This is due to the behavior of W with respect to the number of clusters, which does not allow the right-side and left-side lines to be fitted so as to find the knee of the curve. This behavior is shown in Fig. 6 for the k-means algorithm using Mahalanobis distance. Although this combination resulted in 6 clusters, this value is not adequate as an indication of the balance between within-cluster homogeneity and dissimilarity among clusters.

Fig. 4. Evaluation graph for River and Reservoir phases: W vs. the number of clusters. [Two panels, River phase and Reservoir phase; x-axis: number of clusters; y-axis: W.]

Fig. 5. Graphical output of L-method for k-means using Euclidean W (River phase).

Table 2
Results of the suggested number of clusters by the L-method in the River phase.

| Algorithm | Euclidean | Chebyshev | Mahalanobis | Cosine | Hamming |
|---|---|---|---|---|---|
| k-means | 6 | 5 | 6 | 5 | 12 |
| T&BaC | 6 | 6 | 6 | 5 | 4 |
| DE | 5 | 4 | 11 | 5 | 2 |

Considering the results of all algorithms with the Euclidean, Chebyshev and Cosine distances, the average number of clusters was 5.1 in the River phase and 4.8 in the Reservoir phase. This indicates the existence of similar groups, according to stomach content, in both phases. Therefore, the further analysis of species and their trophic niches considers 5 clusters for both the River and Reservoir phases.
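As a check on the Reservoir-phase average, the twelve values reported in Table 3 for the Euclidean, Chebyshev and Cosine distances can be summed algorithm by algorithm (k-means, T&BaC, DE, PHA); the symbol for the average is introduced here only for illustration:

```latex
\[
\bar{k}_{\text{Reservoir}}
  = \frac{(5+4+5) + (5+4+4) + (4+4+4) + (5+6+7)}{12}
  = \frac{57}{12}
  = 4.75 \approx 4.8
\]
```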

5.2. Clustering analysis

In this section, the results of the clustering algorithms studied in this work and their variations are presented. First, the clusters formed by each algorithm are evaluated in terms of W and computational time. Each algorithm/distance combination was run 30 times, except PHA, which is a deterministic algorithm, and the mean, minimum, maximum and standard deviation of W were saved. To evaluate the efficiency of each algorithm, the computational time, measured in seconds, is also presented. Furthermore, unlike the other algorithms evaluated in this work, the PHA requires the computation of the dissimilarity matrix, and this computation is included in the PHA computational time. The results of the River and Reservoir phases are presented in Tables 4 and 5, respectively.
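The experimental protocol (repeated runs from different random seeds, followed by summary statistics of W) can be sketched in Python as below. The plain k-means routine, the Euclidean distance, the use of the sample standard deviation and all names are assumptions made for illustration; the study itself used Matlab implementations.

```python
import math
import random
import statistics


def kmeans_w(points, k, rng, max_iter=100):
    """One k-means run from random seeds; returns the total within-cluster distance W."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


def summarize_runs(points, k, runs=30, seed=0):
    """Mean, minimum, maximum and standard deviation of W over repeated runs."""
    rng = random.Random(seed)
    ws = [kmeans_w(points, k, rng) for _ in range(runs)]
    return {
        "mean": statistics.mean(ws),
        "min": min(ws),
        "max": max(ws),
        "std": statistics.stdev(ws),
    }


# Toy data (points must be tuples): two well-separated groups of three points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
```

A deterministic method such as PHA would return the same W in every run, which is why a single run suffices for it.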

The results show that the T&BaC algorithm proved to be an effective method for clustering. For the problems studied in this paper, it outperformed the other algorithms in terms of W, in both phases, except when the Cosine distance is used, in which case DE resulted in the lowest average W. T&BaC also showed robustness in the search process: regardless of the initial seeds, it converged to the same solution in all experiments, except in the River phase with Euclidean distance and in the Reservoir phase with Cosine distance, where the standard deviation of W was nonzero (Tables 4 and 5). The second-best algorithm in terms of W was DE; despite its good results, it showed slow convergence compared to the other approaches. The k-means algorithm had the shortest computational times in all experiments; however, it is very sensitive to the initial seeds, which causes a high variance of W. The PHA also showed good computational performance, with times on the order of half those of T&BaC, but in some cases its W was worse than the average W of k-means.

In this work, we further validate the resulting clusters using the CVIs described in Section 4.4. The CVIs were applied to the best solution of the 30 runs used to generate Tables 4 and 5. The results are presented in Tables 6 and 7 for the River and Reservoir phases, respectively. It is important to note that the Dunn, CH, and Si indices are to be maximized, while the DB index is to be minimized.

Tables 6 and 7 also show the number of clusters actually formed by each algorithm and distance. When this number is lower than the specified value (5), empty clusters exist. For instance, the T&BaC algorithm using Mahalanobis distance formed 4 clusters in the Reservoir phase, which means that, during the search process, the best result was achieved when no data points were assigned to one of the seeds. From the results of Tables 6 and 7, it can be observed that there is no consensus among the CVIs, i.e., each CVI indicates a different algorithm/distance combination as the best one. To overcome this issue, we constructed a unified index, which simply consists of the sum of all the normalized (0–1) indices. The results are presented in Tables 8 and 9 for the River and Reservoir phases, respectively.
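One plausible reading of the unified index is sketched below in Python: each CVI is min-max scaled over all algorithm/distance combinations, the DB index is flipped because it is a minimization index, and the four scaled values are summed. Both the scaling and the flipping are assumptions, as are the names. Applied to the River-phase values of Table 6, this reading gives about 0.385 for k-means with Hamming distance and about 3.34 for k-means with Euclidean distance (Table 8 reports 0.3845 and 3.3348), but some other entries differ more noticeably, so the authors' exact normalization may not be identical.

```python
# (algorithm, distance): (Dunn, CH, DB, Si), copied from Table 6 (River phase).
# Combinations reported with a dash are left out.
RIVER_CVIS = {
    ("k-means", "Euclidean"): (0.010, 2007.66, 0.653, 0.777),
    ("k-means", "Chebyshev"): (0.006, 1991.57, 0.665, 0.777),
    ("k-means", "Mahalanobis"): (0.006, 1720.89, 0.572, 0.749),
    ("k-means", "Cosine"): (0.004, 1943.94, 0.626, 0.779),
    ("k-means", "Hamming"): (0.010, 6.632, 3.858, -0.133),
    ("T&BaC", "Euclidean"): (0.010, 1152.99, 1.036, 0.506),
    ("T&BaC", "Chebyshev"): (0.001, 963.88, 1.008, 0.293),
    ("T&BaC", "Mahalanobis"): (0.010, 1216.54, 1.059, 0.532),
    ("T&BaC", "Cosine"): (0.000, 1276.97, 0.916, 0.716),
    ("T&BaC", "Hamming"): (0.000, 272.75, 1.247, 0.279),
    ("DE", "Euclidean"): (0.006, 1981.74, 0.732, 0.764),
    ("DE", "Chebyshev"): (0.006, 1960.46, 0.656, 0.768),
    ("DE", "Cosine"): (0.009, 1959.65, 0.629, 0.780),
    ("DE", "Hamming"): (0.000, 548.603, 1.704, 0.427),
    ("PHA", "Euclidean"): (0.013, 1167.73, 0.511, 0.637),
    ("PHA", "Chebyshev"): (0.013, 1183.18, 0.596, 0.655),
    ("PHA", "Cosine"): (0.026, 1676.98, 0.591, 0.745),
}
DB_POSITION = 2  # position of the DB index in each tuple


def unified_index(cvis):
    """Sum of min-max scaled CVIs, with DB flipped (assumed normalization)."""
    columns = list(zip(*cvis.values()))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    result = {}
    for key, values in cvis.items():
        scaled = [(v - lo) / (hi - lo) for v, lo, hi in zip(values, lows, highs)]
        scaled[DB_POSITION] = 1.0 - scaled[DB_POSITION]  # lower DB is better
        result[key] = sum(scaled)
    return result


river_unified = unified_index(RIVER_CVIS)
best_combination = max(river_unified, key=river_unified.get)
```

Under these assumptions `best_combination` is PHA with Cosine distance, which agrees with the choice reported for the River phase.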

Considering the unified index, the best choice for the River phase was the PHA algorithm using Cosine distance, whereas for the Reservoir phase the method that generated the best sum of normalized indices was k-means using Euclidean distance. These results diverge from those of Tables 4 and 5, where only a measure of cohesion (W) is considered. Since the CVIs also consider the separateness of the clusters, their results give a better indication of the quality of the formed clusters. The best results in terms of CVIs are used to evaluate the trophic changes of fish species in the next section.

Table 3
Results of the suggested number of clusters by the L-method in the Reservoir phase.

| Algorithm | Euclidean | Chebyshev | Mahalanobis | Cosine | Hamming |
|---|---|---|---|---|---|
| k-means | 5 | 4 | 7 | 5 | 10 |
| T&BaC | 5 | 4 | 6 | 4 | 4 |
| DE | 4 | 4 | 28 | 4 | 27 |
| PHA | 5 | 6 | 17 | 7 | 2 |

5.3. Results interpretation

The stomach content analysis of each group (cluster) showed that each group has characteristic feeding habits and that these habits are generally maintained after the damming of the river. Table 10 shows the average percentage of stomach content of each group in the River phase, considering the results obtained through the PHA technique with Cosine distance (Table 8), and of the corresponding group in the Reservoir phase, considering the results obtained through the k-means technique with Euclidean distance (Table 9). The food items that correspond to over 5% of the total content are highlighted, and the last column shows the percentage of the sampled fish present in each group in both phases.

There was a significant reduction in the population of Cluster 1, which represented 51.4% of the sample in the River phase and dropped to 16.2% in the Reservoir phase. There was also a significant change in Cluster 5, whose population rose from 6.2% to 34.8% of the sample. Regarding feeding habits, the share of aquatic insects (Aquins) increased in the diet of Cluster 1 and decreased in that of Cluster 2, which saw an increase in the consumption of terrestrial insects (Terins). In Cluster 3, there was a significant rise in the consumption of micro-crustaceans (Micro), while the consumption of algae (Algae) and debris (Debsed) was reduced. Important changes in Clusters 4 and 5 were the rise of terrestrial vegetation (Terveg) and fish (Fish), respectively, in the diet of the corresponding species.

According to the biologists, a “formal” classification of each fish species in each cluster requires verifying the amount of food in their stomachs. A species is included in the class in which it has the greatest representation of stomach content. The classes can be “insectivore”, “piscivore”, “herbivore” and “detritivore”. Species that are not representative of any single cluster (with a certain level of representation in two or more clusters) may be considered “generalists”, with several different eating habits.
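The classification rule can be illustrated with a short Python sketch. The source does not state how much representation in a second class makes a species a generalist, so the 10-percentage-point margin below is a purely hypothetical threshold, as are the function and constant names; adding up the shares of the two River-phase insectivore clusters is also an interpretation. The percentages are the River-phase shares per cluster listed in Table A1.

```python
# Trophic class of each River-phase cluster, as described in the text.
RIVER_CLUSTER_CLASS = {
    1: "insectivore",
    2: "insectivore",
    3: "detritivore",
    4: "herbivore",
    5: "piscivore",
}
MIN_MARGIN = 10.0  # hypothetical threshold in percentage points, not from the source


def classify(cluster_pct, cluster_class=RIVER_CLUSTER_CLASS, min_margin=MIN_MARGIN):
    """Assign a species to its dominant trophic class, or to 'generalist'."""
    totals = {}
    for cluster, pct in cluster_pct.items():
        label = cluster_class[cluster]
        totals[label] = totals.get(label, 0.0) + pct  # clusters of the same class add up
    ranked = sorted(totals.values(), reverse=True)
    dominant = max(totals, key=totals.get)
    return dominant if ranked[0] - ranked[1] >= min_margin else "generalist"


# River-phase percentages per cluster (1 to 5) for a few species of Table A1.
RIVER_SHARES = {
    "S1": {1: 15.8, 2: 5.3, 3: 78.9, 4: 0.0, 5: 0.0},
    "S3": {1: 37.6, 2: 31.2, 3: 18.3, 4: 12.9, 5: 0.0},
    "S20": {1: 7.5, 2: 2.5, 3: 0.0, 4: 2.5, 5: 87.5},
    "S26": {1: 25.0, 2: 0.0, 3: 37.5, 4: 37.5, 5: 0.0},
    "S35": {1: 38.1, 2: 9.5, 3: 42.9, 4: 4.8, 5: 4.8},
}
river_classes = {species: classify(shares) for species, shares in RIVER_SHARES.items()}
```

With this margin, the five species shown receive the same River-phase categories as in Table 11 (S1 detritivore, S3 insectivore, S20 piscivore, S26 and S35 generalists); other thresholds may reproduce the table equally well.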

More specifically, in Table A1, which shows the quantity of each species in each cluster generated by PHA and k-means for the River and Reservoir phases, respectively, it is possible to observe that, in the River phase, Clusters 1 and 2 are “insectivores”. However, Cluster 2 was not clearly defined because different species from other clusters were also allocated to it. Only Species S7 had major representation (53.3%). Cluster 3 is made up of “detritivores”, Cluster 4 of “herbivores”, and Cluster 5 of “piscivores”. The generalists of the River phase were Species S4, S26, S27, S35 and S45, as they have different feeding habits.

Table 4
Performance of each algorithm/distance combination in the River phase (W and computational time in seconds).

| Algorithm | Statistic | Euclidean W | Euclidean Time | Chebyshev W | Chebyshev Time | Mahalanobis W | Mahalanobis Time | Cosine W | Cosine Time | Hamming W | Hamming Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k-means | Mean | 556.3 | 0.010 | 458.4 | 0.010 | 2926.0 | 0.022 | 209.6 | 0.017 | 1883.6 | 0.003 |
| k-means | Min | 470.8 | 0.004 | 384.0 | 0.005 | 2325.0 | 0.012 | 127.2 | 0.008 | 1553.9 | 0.002 |
| k-means | Max | 823.7 | 0.019 | 713.8 | 0.023 | 3445.2 | 0.058 | 424.1 | 0.057 | 1895.0 | 0.017 |
| k-means | std | 88.1 | 0.004 | 68.1 | 0.005 | 283.7 | 0.011 | 73.8 | 0.011 | 62.3 | 0.003 |
| T&BaC | Mean | 429.5 | 1.376 | 321.4 | 1.136 | 2047.8 | 1.425 | 129.7 | 1.485 | 291.5 | 1.069 |
| T&BaC | Min | 429.5 | 1.054 | 321.4 | 1.069 | 2047.8 | 1.365 | 129.7 | 1.095 | 291.5 | 0.996 |
| T&BaC | Max | 429.6 | 2.063 | 321.4 | 1.566 | 2047.8 | 1.551 | 129.7 | 2.166 | 291.5 | 1.167 |
| T&BaC | std | 0.1 | 0.268 | 0.0 | 0.089 | 0.0 | 0.050 | 0.0 | 0.303 | 0.0 | 0.053 |
| DE | Mean | 429.7 | 25.744 | 363.4 | 29.325 | 4.69E+06 | 96.971 | 127.5 | 52.354 | 334.1 | 26.347 |
| DE | Min | 425.9 | 25.591 | 320.9 | 29.003 | 6055.4 | 96.794 | 126.8 | 51.504 | 321.2 | 26.056 |
| DE | Max | 448.2 | 25.912 | 428.8 | 29.568 | 1.49E+07 | 97.183 | 129.1 | 62.040 | 346.4 | 26.575 |
| DE | std | 4.9 | 0.079 | 22.9 | 0.127 | 3.66E+06 | 0.097 | 0.6 | 1.855 | 10.7 | 0.130 |
| PHA | Mean | 574.9 | 0.435 | 472.6 | 0.444 | 4124.4 | 0.748 | 146.3 | 0.422 | 1890.0 | 0.453 |
| PHA | Min | 574.9 | 0.429 | 472.6 | 0.436 | 4124.4 | 0.739 | 146.3 | 0.416 | 1890.0 | 0.449 |
| PHA | Max | 574.9 | 0.450 | 472.6 | 0.499 | 4124.4 | 0.873 | 146.3 | 0.430 | 1890.0 | 0.459 |
| PHA | std | 0.0 | 0.005 | 0.0 | 0.012 | 0.0 | 0.025 | 0.0 | 0.004 | 0.0 | 0.003 |

Table 5
Performance of each algorithm/distance combination in the Reservoir phase (W and computational time in seconds).

| Algorithm | Statistic | Euclidean W | Euclidean Time | Chebyshev W | Chebyshev Time | Mahalanobis W | Mahalanobis Time | Cosine W | Cosine Time | Hamming W | Hamming Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k-means | Mean | 1706.6 | 0.013 | 1342.0 | 0.012 | 8332.9 | 0.046 | 716.0 | 0.019 | 5524.2 | 0.006 |
| k-means | Min | 1042.5 | 0.007 | 844.4 | 0.006 | 5999.5 | 0.023 | 318.5 | 0.010 | 5037.4 | 0.004 |
| k-means | Max | 3157.2 | 0.029 | 2887.8 | 0.026 | 11047.7 | 0.083 | 1424.9 | 0.047 | 5541.0 | 0.017 |
| k-means | std | 658.9 | 0.005 | 509.2 | 0.006 | 1306.7 | 0.017 | 374.8 | 0.008 | 92.0 | 0.002 |
| T&BaC | Mean | 901.9 | 7.510 | 665.1 | 7.397 | 4980.8 | 8.844 | 325.1 | 7.585 | 511.5 | 7.461 |
| T&BaC | Min | 901.9 | 7.252 | 665.1 | 7.295 | 4980.8 | 8.673 | 323.0 | 7.253 | 511.5 | 7.304 |
| T&BaC | Max | 901.9 | 8.416 | 665.1 | 7.719 | 4980.8 | 9.044 | 332.0 | 10.670 | 511.5 | 8.227 |
| T&BaC | std | 0.0 | 0.285 | 0.0 | 0.093 | 0.0 | 0.106 | 3.9 | 0.825 | 0.0 | 0.180 |
| DE | Mean | 905.7 | 54.803 | 862.1 | 61.786 | 6.78E+06 | 229.748 | 320.1 | 89.911 | 706.0 | 57.316 |
| DE | Min | 896.7 | 54.209 | 706.4 | 61.236 | 27088.4 | 228.366 | 318.0 | 85.852 | 632.4 | 55.529 |
| DE | Max | 949.9 | 57.663 | 1020.2 | 64.875 | 2.51E+07 | 240.265 | 324.0 | 97.249 | 788.8 | 60.270 |
| DE | std | 12.4 | 0.798 | 88.5 | 0.808 | 6.14E+06 | 2.893 | 1.6 | 3.043 | 68.7 | 1.534 |
| PHA | Mean | 1227.8 | 3.356 | 2093.5 | 3.400 | 12694.3 | 6.353 | 1206.6 | 3.216 | 5536.0 | 3.437 |
| PHA | Min | 1227.8 | 3.336 | 2093.5 | 3.369 | 12694.3 | 6.320 | 1206.6 | 3.188 | 5536.0 | 3.419 |
| PHA | Max | 1227.8 | 3.393 | 2093.5 | 3.452 | 12694.3 | 6.504 | 1206.6 | 3.250 | 5536.0 | 3.456 |
| PHA | std | 0.0 | 0.011 | 0.0 | 0.018 | 0.0 | 0.037 | 0.0 | 0.013 | 0.0 | 0.009 |

In the Reservoir phase, Cluster 1 was made up of “insectivores”. Cluster 2 had no fixed representation. Cluster 3 was made up of “detritivores”, Cluster 4 of “herbivores”, and Cluster 5 of “piscivores”. The generalists of the Reservoir phase were Species S26 and S42. Species S8, S10, S16 and S32 disappeared in the Reservoir phase. All this information is summarized in Table 11.

6. Final considerations

This study analyzes a practical application of the clustering process to the trophic profile of fish collected from the Corumbá Reservoir before and after the damming process, in order to evaluate any possible environmental impacts. It is very important to know beforehand, i.e., before the damming, how a community that depends on fishing, and the environment in general, will be affected.

Table 6
CVIs for each algorithm/distance combination in the River phase.

| Algorithm | Row | Euclidean | Chebyshev | Mahalanobis | Cosine | Hamming |
|---|---|---|---|---|---|---|
| k-means | Clusters | 5 | 5 | 5 | 5 | 2 |
| k-means | Dunn | 0.010 | 0.006 | 0.006 | 0.004 | 0.010 |
| k-means | CH | 2007.66 | 1991.57 | 1720.89 | 1943.94 | 6.632 |
| k-means | DB | 0.653 | 0.665 | 0.572 | 0.626 | 3.858 |
| k-means | Si | 0.777 | 0.777 | 0.749 | 0.779 | -0.133 |
| T&BaC | Clusters | 5 | 5 | 5 | 5 | 5 |
| T&BaC | Dunn | 0.010 | 0.001 | 0.010 | 0.000 | 0.000 |
| T&BaC | CH | 1152.99 | 963.88 | 1216.54 | 1276.97 | 272.75 |
| T&BaC | DB | 1.036 | 1.008 | 1.059 | 0.916 | 1.247 |
| T&BaC | Si | 0.506 | 0.293 | 0.532 | 0.716 | 0.279 |
| DE | Clusters | 5 | 5 | 1 | 5 | 2 |
| DE | Dunn | 0.006 | 0.006 | – | 0.009 | 0.000 |
| DE | CH | 1981.74 | 1960.46 | – | 1959.65 | 548.603 |
| DE | DB | 0.732 | 0.656 | – | 0.629 | 1.704 |
| DE | Si | 0.764 | 0.768 | – | 0.780 | 0.427 |
| PHA | Clusters | 5 | 5 | 1 | 5 | 1 |
| PHA | Dunn | 0.013 | 0.013 | – | 0.026 | – |
| PHA | CH | 1167.73 | 1183.18 | – | 1676.98 | – |
| PHA | DB | 0.511 | 0.596 | – | 0.591 | – |
| PHA | Si | 0.637 | 0.655 | – | 0.745 | – |

Table 7
CVIs for each algorithm/distance combination in the Reservoir phase.

| Algorithm | Row | Euclidean | Chebyshev | Mahalanobis | Cosine | Hamming |
|---|---|---|---|---|---|---|
| k-means | Clusters | 5 | 5 | 5 | 5 | 1 |
| k-means | Dunn | 0.025 | 0.024 | 0.022 | 0.024 | – |
| k-means | CH | 8637.61 | 8886.11 | 7083.76 | 8791.29 | – |
| k-means | DB | 0.582 | 0.796 | 0.445 | 0.758 | – |
| k-means | Si | 0.851 | 0.853 | 0.833 | 0.856 | – |
| T&BaC | Clusters | 5 | 5 | 4 | 5 | 5 |
| T&BaC | Dunn | 0.004 | 0.000 | 0.002 | 0.018 | 0.000 |
| T&BaC | CH | 3911.33 | 2187.61 | 4394.71 | 8629.80 | 977.11 |
| T&BaC | DB | 0.827 | 0.773 | 0.608 | 0.717 | 1.134 |
| T&BaC | Si | 0.656 | 0.639 | 0.724 | 0.854 | 0.206 |
| DE | Clusters | 5 | 5 | 1 | 5 | 2 |
| DE | Dunn | 0.002 | 0.016 | – | 0.006 | 0.000 |
| DE | CH | 8603.30 | 8818.49 | – | 8791.82 | 3103.51 |
| DE | DB | 0.576 | 0.827 | – | 0.757 | 0.974 |
| DE | Si | 0.851 | 0.847 | – | 0.855 | 0.502 |
| PHA | Clusters | 5 | 4 | 1 | 5 | 1 |
| PHA | Dunn | 0.021 | 0.021 | – | 0.014 | – |
| PHA | CH | 6887.94 | 2001.02 | – | 1418.06 | – |
| PHA | DB | 0.482 | 0.828 | – | 0.793 | – |
| PHA | Si | 0.817 | 0.572 | – | 0.560 | – |

Table 8
Unified indices for each algorithm/distance combination in the River phase.

| Algorithm | Euclidean Clusters | Euclidean Unif. index | Chebyshev Clusters | Chebyshev Unif. index | Mahalanobis Clusters | Mahalanobis Unif. index | Cosine Clusters | Cosine Unif. index | Hamming Clusters | Hamming Unif. index |
|---|---|---|---|---|---|---|---|---|---|---|
| k-means | 5 | 3.3348 | 5 | 3.1760 | 5 | 3.0104 | 5 | 3.0707 | 2 | 0.3845 |
| T&BaC | 5 | 2.4169 | 5 | 1.7365 | 5 | 2.4804 | 5 | 2.3926 | 5 | 1.2116 |
| DE | 5 | 3.1213 | 5 | 3.1475 | 1 | – | 5 | 3.2598 | 2 | 1.3991 |

Different clustering methods were analyzed and compared: k-means, DE, PHA, and T&BaC. The latter is based on an algorithm originally used for facility location problems (T&B) and, to the best of the authors’ knowledge, this is the first time that it has been used in a Clustering Analysis problem. T&BaC proved to be a promising technique for clustering, achieving excellent results in terms of objective function and computational time, with good convergence, as can be observed from the low standard deviation of the results. It was the only method evaluated that produced satisfactory results for all the distance measures analyzed.

One of the most important stages of the clustering process is defining the number of clusters. In the present work, the L-method was employed and proved to be effective for this purpose. Despite being computationally costly for non-hierarchical algorithms, it presented highly robust results. The cluster validation stage is another essential phase of the clustering process, although it is often ignored. In the data analyzed for this study, the methods with the best results in terms of total distance were T&BaC (for all distances except Cosine) and DE (for Cosine distance), in both the River and Reservoir phases. However, in terms of validation indices, which consider measures other than distance alone, PHA and k-means produced the best clusters in the River and Reservoir phases, respectively. This highlights the difficulty involved in selecting a suitable method for each specific case of clustering, and the importance of careful verification of the results to define the clusters adequately, especially in high-dimensional problems, where visual analysis is difficult.

The diet patterns observed in the River phase were maintained during the Reservoir phase, with some changes in the feeding behavior of Clusters 1, 2 and 3. There were significant alterations in the percentages of the population in Clusters 1 and 5 after the damming. Cluster 1, which accounted for 51.4% of the total sample in the River phase, was reduced to 16.2% after the building of the dam. Cluster 5, which accounted for only 6.2% of the sample in the River phase, accounted for 34.8% of the sample in the Reservoir phase. It was observed that, within a species, the fish had different feeding habits, with almost all species being found in more than one group. A better discrimination between the clusters was obtained and is shown in Table 11.

An important limitation of this work was the number of representatives of each species and the imbalance between these numbers, as shown in Fig. 1 and Table A1. However, this invalidates neither the methodology nor the results obtained, which provide at least an introductory report for biologists. Furthermore, the methodology used can be applied to other problems in the field and to other clustering problems.

The effectiveness shown by the T&BaC algorithm provides motivation for future studies, such as its application to benchmark problems, its comparison with other traditional approaches, and changes in the optimization function to consider separateness during the clustering search.

Table 9
Unified indices for each algorithm/distance combination in the Reservoir phase.

| Algorithm | Euclidean Clusters | Euclidean Unif. index | Chebyshev Clusters | Chebyshev Unif. index | Mahalanobis Clusters | Mahalanobis Unif. index | Cosine Clusters | Cosine Unif. index | Hamming Clusters | Hamming Unif. index |
|---|---|---|---|---|---|---|---|---|---|---|
| k-means | 5 | 3.7628 | 5 | 3.4236 | 5 | 3.6108 | 5 | 3.4704 | 1 | – |
| T&BaC | 5 | 1.6612 | 5 | 1.3599 | 4 | 2.0738 | 5 | 3.2625 | 5 | 0.000 |
| DE | 5 | 2.8465 | 5 | 3.0647 | 1 | – | 5 | 2.7818 | 2 | 0.9572 |
| PHA | 5 | 3.4561 | 4 | 1.9605 | 1 | – | 5 | 1.6549 | 1 | – |

Table 10
Percentage of each type of food per group in the River and Reservoir phases.

| Cluster | Phase | Micro (%) | Aquins (%) | Terins (%) | Otaqinv (%) | Otterinv (%) | Fish (%) | Ottervert (%) | Algae (%) | Aquveg (%) | Terveg (%) | Debsed (%) | % of sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | River | 0.85 | 80.30 | 5.32 | 0.86 | 0.41 | 0.62 | 0.00 | 0.74 | 1.00 | 1.56 | 8.33 | 51.4 |
| 1 | Reservoir | 1.25 | 86.19 | 2.34 | 0.49 | 0.16 | 0.83 | 0.00 | 0.28 | 0.17 | 2.38 | 5.90 | 16.2 |
| 2 | River | 0.40 | 14.96 | 79.33 | 0.23 | 0.55 | 0.68 | 1.06 | 0.96 | 0.35 | 0.74 | 0.74 | 9.3 |
| 2 | Reservoir | 0.54 | 6.10 | 86.08 | 0.20 | 2.66 | 0.84 | 0.00 | 0.10 | 0.00 | 3.10 | 0.38 | 6.5 |
| 3 | River | 0.19 | 3.57 | 1.34 | 0.47 | 0.00 | 0.24 | 0.00 | 20.90 | 0.39 | 2.35 | 70.55 | 14.9 |
| 3 | Reservoir | 10.28 | 3.78 | 0.22 | 0.40 | 1.36 | 0.64 | 0.08 | 15.27 | 1.46 | 2.00 | 64.50 | 21.5 |
| 4 | River | 0.11 | 5.17 | 3.09 | 0.08 | 0.06 | 0.12 | 0.00 | 1.53 | 0.05 | 88.75 | 1.03 | 18.2 |
| 4 | Reservoir | 0.09 | 2.43 | 1.56 | 0.03 | 0.07 | 1.17 | 0.00 | 0.34 | 0.13 | 93.55 | 0.64 | 21.0 |
| 5 | River | 0.01 | 1.97 | 1.66 | 0.13 | 0.00 | 93.62 | 0.00 | 0.17 | 0.00 | 1.51 | 0.92 | 6.2 |
| 5 | Reservoir | 0.02 | 0.47 | 0.20 | 0.03 | 0.03 | 98.45 | 0.00 | 0.06 | 0.01 | 0.66 | 0.06 | 34.8 |

Table 11
Categorization of species of fish (from Table A1).

| Category | River Phase species | Reservoir Phase species |
|---|---|---|
| Generalist (G) | S4; S26; S27; S35; S45 | S26; S42 |
| Specialist A: Insectivores | S3; S5; S6; S7; S9; S10; S11; S12; S15; S16; S17; S18; S24; S25; S29; S31; S32; S38; S40; S42; S44; S47; S48; S52 | S3; S5; S9; S11; S12; S15; S17; S18; S19; S24; S25; S30; S31; S48; S52 |
| Specialist B: Piscivores | S14; S20; S21; S37; S49; S51 | S13; S14; S20; S21; S27; S37; S39; S40; S43; S45; S49; S51 |
| Specialist C: Herbivores | S2; S28; S33; S34; S50 | S2; S4; S7; S28; S29; S33; S34; S35; S38; S44; S50 |
| Specialist D: Detritivores | S1; S8; S22; S23; S41; S46 | S1; S6; S22; S23; S36; S41; S46; S47 |

Acknowledgments

The authors would like to thank the personnel of the Research Nucleus on Limnology, Ichthyology and Aquaculture (Nupélia) of Universidade Estadual de Maringá (UEM) and of FURNAS Centrais Elétricas S.A., Brazil, involved in the “Ichthyologic Studies on the Area Affected by the Corumbá Hydroelectric Plant” project, for providing data and insights for this study. The first author would also like to thank CAPES and CNPq for the financial support, and the second and third authors thank CNPq for awarding them Productivity in Research scholarships.

Appendix A.

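Table A1
Distribution of 52 fish species (partial) in each phase (River and Reservoir; columns A to D) and their classification in clusters 1–5 (columns E and F).

River Phase (E) clusters: C1 = Cluster 1 (insectivores), C2 = Cluster 2 (insectivores), C3 = Cluster 3 (detritivores), C4 = Cluster 4 (herbivores).

| Sp (A) | Species Name (B) | River Phase (C) Qty | River Phase (C) % | Reservoir Phase (D) Qty | Reservoir Phase (D) % | E: C1 Qty | E: C1 % | E: C2 Qty | E: C2 % | E: C3 Qty | E: C3 % | E: C4 Qty | E: C4 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | A. affinis | 39 | 2.1 | 127 | 2.3 | 7 | 15.8 | 2 | 5.3 | 30 | 78.9 | 0 | 0.0 |
| S2 | A. bimacul | 84 | 4.4 | 478 | 8.6 | 8 | 9.5 | 16 | 19.0 | 0 | 0.0 | 60 | 71.4 |
| S3 | A. eigenman | 93 | 4.9 | 9 | 0.2 | 35 | 37.6 | 29 | 31.2 | 17 | 18.3 | 12 | 12.9 |
| S4 | A. fasciatu | 85 | 4.5 | 254 | 4.6 | 33 | 38.8 | 8 | 9.4 | 5 | 5.9 | 37 | 43.5 |
| S5 | A. ibitiens | 10 | 0.5 | 16 | 0.3 | 6 | 60.0 | 0 | 0.0 | 1 | 10.0 | 3 | 30.0 |
| S6 | A. piracica | 29 | 1.5 | 17 | 0.3 | 16 | 55.2 | 0 | 0.0 | 11 | 37.9 | 2 | 6.9 |
| S7 | A. scabripi | 15 | 0.8 | 5 | 0.1 | 2 | 13.3 | 8 | 53.3 | 3 | 20.0 | 2 | 13.3 |
| S8 | B. natterer | 5 | 0.3 | 0 | 0.0 | 0 | 0.0 | 2 | 40.0 | 3 | 60.0 | 0 | 0.0 |
| S9 | B. stramine | 266 | 14.0 | 96 | 1.7 | 196 | 73.7 | 50 | 18.8 | 4 | 1.5 | 13 | 4.9 |
| S10 | Brycona sp3 | 100 | 5.3 | 0 | 0.0 | 73 | 73.0 | 19 | 19.0 | 5 | 5.0 | 1 | 1.0 |
| S11 | C. haroldoi | 21 | 1.1 | 6 | 0.1 | 13 | 61.9 | 1 | 4.8 | 0 | 0.0 | 5 | 23.8 |
| S12 | C. iheringi | 24 | 1.3 | 10 | 0.2 | 21 | 87.5 | 3 | 12.5 | 0 | 0.0 | 0 | 0.0 |
| S13 | C. monoculu | 0 | 0.0 | 104 | 1.9 | 0 | | 0 | | 0 | | 0 | |
| S14 | C. paranaen | 7 | 0.4 | 11 | 0.2 | 1 | 14.3 | 0 | 0.0 | 1 | 14.3 | 0 | 0.0 |
| S15 | C. zebra | 12 | 0.6 | 10 | 0.2 | 9 | 75.0 | 0 | 0.0 | 0 | 0.0 | 1 | 8.3 |
| S16 | Corydor sp | 27 | 1.4 | 0 | 0.0 | 24 | 88.9 | 0 | 0.0 | 2 | 7.4 | 1 | 3.7 |
| S17 | Eigenma sp | 20 | 1.1 | 12 | 0.2 | 19 | 95.0 | 1 | 5.0 | 0 | 0.0 | 0 | 0.0 |
| S18 | G. carapo | 11 | 0.6 | 12 | 0.2 | 6 | 54.5 | 0 | 0.0 | 5 | 45.5 | 0 | 0.0 |
| S19 | G. cesarpin | 0 | 0.0 | 9 | 0.2 | 0 | | 0 | | 0 | | 0 | |
| S20 | G. knerii | 40 | 2.1 | 935 | 16.9 | 3 | 7.5 | 1 | 2.5 | 0 | 0.0 | 1 | 2.5 |
| S21 | H. malabari | 9 | 0.5 | 99 | 1.8 | 1 | 11.1 | 1 | 11.1 | 1 | 11.1 | 0 | 0.0 |
| S22 | H. regani | 38 | 2.0 | 70 | 1.3 | 2 | 5.3 | 0 | 0.0 | 34 | 89.5 | 2 | 5.3 |
| S23 | Hyposto sp | 17 | 0.9 | 7 | 0.1 | 2 | 11.8 | 0 | 0.0 | 15 | 88.2 | 0 | 0.0 |
| S24 | I. labrosus | 140 | 7.4 | 234 | 4.2 | 123 | 87.9 | 1 | 0.7 | 15 | 10.7 | 0 | 0.0 |
| S25 | L. amblyrhy | 57 | 3.0 | 69 | 1.2 | 52 | 91.2 | 0 | 0.0 | 3 | 5.3 | 1 | 1.8 |
| S26 | L. elongatu | 8 | 0.4 | 151 | 2.7 | 2 | 25.0 | 0 | 0.0 | 3 | 37.5 | 3 | 37.5 |
| S27 | L. frideri | 35 | 1.8 | 501 | 9.0 | 3 | 8.6 | 0 | 0.0 | 1 | 2.9 | 16 | 45.7 |
| S28 | L. octofasc | 6 | 0.3 | 13 | 0.2 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 5 | 83.3 |
| S29 | L. reticula | 14 | 0.7 | 7 | 0.1 | 6 | 42.9 | 1 | 7.1 | 4 | 28.6 | 3 | 21.4 |
| S30 | L. vittatus | 0 | 0.0 | 12 | 0.2 | 0 | | 0 | | 0 | | 0 | |
| S31 | M. intermed | 16 | 0.8 | 258 | 4.7 | 14 | 87.5 | 2 | 12.5 | 0 | 0.0 | 0 | 0.0 |
| S32 | M. platanus | 5 | 0.3 | 0 | 0.0 | 2 | 40.0 | 2 | 40.0 | 0 | 0.0 | 0 | 0.0 |
| S33 | M. tiete | 33 | 1.7 | 63 | 1.1 | 2 | 6.1 | 0 | 0.0 | 1 | 3.0 | 28 | 84.8 |
| S34 | Metynni sp | 2 | 0.1 | 2 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 100.0 |
| S35 | Odontos sp | 21 | 1.1 | 7 | 0.1 | 8 | 38.1 | 2 | 9.5 | 9 | 42.9 | 1 | 4.8 |
| S36 | O. niloticu | 0 | 0.0 | 1 | 0.0 | 0 | | 0 | | 0 | | 0 | |
| S37 | O. planalt | 8 | 0.4 | 19 | 0.3 | 1 | 12.5 | 0 | 0.0 | 2 | 25.0 | 0 | 0.0 |
| S38 | P. argentea | 269 | 14.2 | 340 | 6.1 | 187 | 69.5 | 15 | 5.6 | 10 | 3.7 | 55 | 20.4 |
| S39 | P. corrusca | 0 | 0.0 | 14 | 0.3 | 0 | | 0 | | 0 | | 0 | |
| … | … and so on … | … | … | … | … | … | … | … | … | … | … | … | … |
| S52 | T. neivae | 15 | 0.8 | 12 | 0.2 | 13 | 86.7 | 1 | 6.7 | 1 | 6.7 | 0 | 0.0 |
| Total | | 1895 | 100.0 | 5541 | 100.0 | 975 | | 176 | | 282 | | 345 | |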

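River Phase (E): C5 = Cluster 5 (piscivores). Reservoir Phase (F): C1 = Cluster 1 (insectivores), C2 = Cluster 2 (no represent.), C3 = Cluster 3 (detritivores), C4 = Cluster 4 (herbivores), C5 = Cluster 5 (piscivores).

| Sp (A) | E: C5 Qty | E: C5 % | F: C1 Qty | F: C1 % | F: C2 Qty | F: C2 % | F: C3 Qty | F: C3 % | F: C4 Qty | F: C4 % | F: C5 Qty | F: C5 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 0 | 0.0 | 3 | 2.4 | 0 | 0.0 | 122 | 96.1 | 2 | 1.6 | 0 | 0.0 |
| S2 | 0 | 0.0 | 77 | 16.1 | 88 | 18.4 | 37 | 7.7 | 206 | 43.1 | 70 | 14.6 |
| S3 | 0 | 0.0 | 4 | 44.4 | 2 | 22.2 | 1 | 11.1 | 2 | 22.2 | 0 | 0.0 |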


References

Abbas, O. A. (2008). Comparisons between data clustering algorithms. The International Arab Journal of Information Technology, 5(3), 320–325.

Agostinho, A. A., Gomes, L. C., & Pelicice, F. M. (2007). Ecologia e manejo de recursos pesqueiros em reservatórios do Brasil. Universidade Estadual de Maringá (UEM).

Alhourani, F. (2013). Clustering algorithm for solving group technology problem with multiple process routings. Computers & Industrial Engineering, 66(4), 781–790.

Arampantzi, C., & Minis, I. (2017). A new model for designing sustainable supply chain networks and its application to a global manufacturer. Journal of Cleaner Production, 156, 276–292.

Azzag, H., Venturini, G., Oliver, A., & Cuinot, C. (2007). A hierarchical ant based clustering algorithm and its use in three real-world applications. European Journal of Operational Research, 179, 906–922.

Babazadeh, R., Razmi, J., Pishvaee, M. S., & Rabbani, M. (2017). A sustainable second-generation biodiesel supply chain network design problem under risk. Omega, 66, 258–277.

Bai, C., Dhavale, D., & Sarkis, J. (2016). Complex investment decisions using rough set and fuzzy c-means: An example of investment in green supply chains. European Journal of Operational Research, 248(2), 507–521.

Bi, G.-B., Song, W., & Wu, J. (2014). A clustering method for evaluating the environmental performance based on slacks-based measure. Computers & Industrial Engineering, 72, 169–177.

Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communication in Statistics Theory Methods, 3(1), 1–27.

Chou, C.-A., Chaovalitwongse, W. A., Berger-Wolf, T. Y., DasGupta, B., & Ashey, M. V. (2012). Capacitated clustering problem in computational biology: Combinatorial and statistical approach for sibling reconstruction. Computers & Operations Research, 39, 609–619.

Chuang, Y.-F., Lee, H.-T., & Lai, Y.-C. (2012). Item-associated cluster assignment model on storage allocation problems. Computers & Industrial Engineering, 63, 1171–1177.

Cunico, A. M., & Agostinho, A. M. (2006). Morphological Patterns of Fish and Their Relationships with Reservoirs Hydrodynamics. Brazilian Archives of Biology and Technology, 49(1), 125–134.

Dai, X., & Kuosmanen, T. (2014). Best-practice benchmarking using clustering methods: Application to energy regulation. Omega, 42, 179–188.

Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224–227.

Dietrich, J. P., Popp, A., & Lotze-Campen, H. (2013). Reducing the loss of information and gaining accuracy with clustering methods in a global land-use model. Ecological Modelling, 263, 233–243.

Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3), 32–57.

Fard, A. M. F., & Hajaghaei-Keshteli, M. (2018). A tri-level location-allocation model for forward/reverse supply chain. Applied Soft Computing, 62, 328–346.

Fielding, A. H. (2007). Cluster and classification techniques for the biosciences. New York, USA: Cambridge University Press.

Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics, 21, 768–769.

Franco, D. G. B., & Steiner, M. T. A. (2018). Clustering of solar energy facilities using a hybrid fuzzy c-means algorithm initialized by metaheuristics. Journal of Cleaner Production, 191, 445–457.

Hajiaghaei-Keshteli, A., & Fard, A. M. F. (2018). Sustainable closed-loop supply chain network design with discount supposition. Neural Computing and Applications, 1–35.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2/3), 107–145.

Hruschka, E. R., Campello, R. J. G. B., Freitas, A. A., & Carvalho, A. C. P. L. F. (2009). A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man, and Cybernetics, 39(2), 133–155.

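Ibarra, A., Gevrey, M., Park, Y.-S., Lim, P., & Lek, S. (2003). Modeling the factors that influence fish guilds composition using a back-propagation network: Assessment of metrics for indices of biotic integrity. Ecological Modelling, 160, 281–290.

Table A1 (continued)

River Phase (E): C5 = Cluster 5 (piscivores). Reservoir Phase (F): C1 = Cluster 1 (insectivores), C2 = Cluster 2 (no represent.), C3 = Cluster 3 (detritivores), C4 = Cluster 4 (herbivores), C5 = Cluster 5 (piscivores).

| Sp (A) | E: C5 Qty | E: C5 % | F: C1 Qty | F: C1 % | F: C2 Qty | F: C2 % | F: C3 Qty | F: C3 % | F: C4 Qty | F: C4 % | F: C5 Qty | F: C5 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S5 | 0 | 0.0 | 14 | 87.5 | 0 | 0.0 | 1 | 6.3 | 0 | 0.0 | 1 | 6.3 |
| S6 | 0 | 0.0 | 1 | 5.9 | 0 | 0.0 | 13 | 76.5 | 3 | 17.6 | 0 | 0.0 |
| S7 | 0 | 0.0 | 1 | 20.0 | 0 | 0.0 | 0 | 0.0 | 4 | 80.0 | 0 | 0.0 |
| S8 | 0 | 0.0 | 0 | | 0 | | 0 | | 0 | | 0 | |
| S9 | 3 | 1.1 | 51 | 53.1 | 42 | 43.8 | 2 | 2.1 | 1 | 1.0 | 0 | 0.0 |
| S10 | 2 | 2.0 | 0 | | 0 | | 0 | | 0 | | 0 | |
| S11 | 2 | 9.5 | 3 | 50.0 | 2 | 33.3 | 0 | 0.0 | 0 | 0.0 | 1 | 16.7 |
| S12 | 0 | 0.0 | 8 | 80.0 | 2 | 20.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| S13 | 0 | | 0 | 0.0 | 3 | 2.9 | 17 | 16.3 | 1 | 1.0 | 83 | 79.8 |
| S14 | 5 | 71.4 | 2 | 18.2 | 0 | 0.0 | 1 | 9.1 | 1 | 9.1 | 7 | 63.6 |
| S15 | 2 | 16.7 | 9 | 90.0 | 1 | 10.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| S16 | 0 | 0.0 | 0 | | 0 | | 0 | | 0 | | 0 | |
| S17 | 0 | 0.0 | 11 | 91.7 | 0 | 0.0 | 1 | 8.3 | 0 | 0.0 | 0 | 0.0 |
| S18 | 0 | 0.0 | 7 | 58.3 | 2 | 16.7 | 2 | 16.7 | 0 | 0.0 | 1 | 8.3 |
| S19 | 0 | | 6 | 66.7 | 3 | 33.3 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| S20 | 35 | 87.5 | 6 | 0.6 | 2 | 0.2 | 1 | 0.1 | 6 | 0.6 | 920 | 98.4 |
| S21 | 6 | 66.7 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 99 | 100.0 |
| S22 | 0 | 0.0 | 1 | 1.4 | 1 | 1.4 | 67 | 95.7 | 1 | 1.4 | 0 | 0.0 |
| S23 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 7 | 100.0 | 0 | 0.0 | 0 | 0.0 |
| S24 | 1 | 0.7 | 160 | 68.4 | 3 | 1.3 | 58 | 24.8 | 9 | 3.8 | 4 | 1.7 |
| S25 | 1 | 1.8 | 38 | 55.1 | 0 | 0.0 | 27 | 39.1 | 4 | 5.8 | 0 | 0.0 |
| S26 | 0 | 0.0 | 56 | 37.1 | 0 | 0.0 | 38 | 25.2 | 47 | 31.1 | 10 | 6.6 |
| S27 | 15 | 42.9 | 16 | 3.2 | 5 | 1.0 | 32 | 6.4 | 163 | 32.5 | 285 | 56.9 |
| S28 | 1 | 16.7 | 0 | 0.0 | 0 | 0.0 | 3 | 23.1 | 7 | 53.8 | 3 | 23.1 |
| S29 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 7 | 100.0 | 0 | 0.0 |
| S30 | 0 | | 12 | 100.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| S31 | 0 | 0.0 | 124 | 48.1 | 41 | 15.9 | 52 | 20.2 | 38 | 14.7 | 3 | 1.2 |
| S32 | 1 | 20.0 | 0 | | 0 | | 0 | | 0 | | 0 | |
| S33 | 2 | 6.1 | 0 | 0.0 | 0 | 0.0 | 2 | 3.2 | 61 | 96.8 | 0 | 0.0 |
| S34 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 100.0 | 0 | 0.0 |
| S35 | 1 | 4.8 | 0 | 0.0 | 0 | 0.0 | 1 | 14.3 | 6 | 85.7 | 0 | 0.0 |
| S36 | 0 | | 0 | 0.0 | 0 | 0.0 | 1 | 100.0 | 0 | 0.0 | 0 | 0.0 |
| S37 | 5 | 62.5 | 5 | 26.3 | 5 | 26.3 | 1 | 5.3 | 0 | 0.0 | 8 | 42.1 |
| S38 | 2 | 0.7 | 95 | 27.9 | 40 | 11.8 | 12 | 3.5 | 189 | 55.6 | 4 | 1.2 |
| S39 | 0 | | 0 | 0.0 | 1 | 7.1 | 0 | 0.0 | 1 | 7.1 | 12 | 85.7 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| S52 | 0 | 0.0 | 10 | 83.3 | 1 | 8.3 | 0 | 0.0 | 1 | 8.3 | 0 | 0.0 |
| Total | 117 | | 900 | | 362 | | 1189 | | 1163 | | 1927 | |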

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31, 651–666.

Johnson, R. A., & Wichern, D. W. (2005). Applied multivariate statistical analysis (fourth ed.). New Jersey: Prentice Hall.

José-García, A., & Gómez-Flores, W. (2016). Automatic clustering using nature inspired metaheuristics: A survey. Applied Soft Computing, 41, 192–213.

Kannegiesser, M., Günther, H.-O., & Autenrieb, N. (2015). The time-to-sustainability optimization strategy for sustainable supply network design. Journal of Cleaner Production, 108, 451–463.

Kuila, P., & Jana, P. K. (2014). A novel differential evolution based clustering algorithm for wireless sensor networks. Applied Soft Computing, 25, 414–425.

Kwedlo, W. (2011). A clustering method combining differential evolution with the k-means algorithm. Pattern Recognition Letters, 32, 1613–1621.

Lu, Y., & Wan, Y. (2013). PHA: A fast potential-based hierarchical agglomerative clustering method. Pattern Recognition, 46, 1227–1239.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–296.

Mingoti, S. A., & Lima, J. O. (2006). Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms. European Journal of Operational Research, 174, 1742–1759.

Mirkin, B. (2011). Choosing the number of clusters. WIREs Data Mining and Knowledge Discovery, 1, 252–260.

Nilashi, M., Bagherifard, K., Rahmani, M., & Rafe, V. (2017). A recommender system for tourism industry using cluster ensemble and prediction machine learning techniques. Computers & Industrial Engineering, 109, 357–368.

Oliveira, E. F., Goulart, E., & Minte-Vera, C. V. (2004). Fish diversity along spatial gradients in the Itaipu Reservoir, Paraná, Brazil. Brazilian Journal of Biology, 64(3A), 447–458.

Oliveira, S., Ribeiro, J. F. F., & Seok, S. C. (2009). A spectral clustering algorithm for manufacturing cell formation. Computers & Industrial Engineering, 57, 1008–1014.

Qu, T., Nie, D. X., Li, C. D., Thurer, M., & Huang, G. Q. (2017). Optimal configuration of assembly supply chains based on Hybrid augmented Lagrangian coordination in an industrial cluster. Computers & Industrial Engineering, 112, 511–512.

Rabello, R. L., Mauri, G. R., Ribeiro, G. M., & Lorena, L. A. N. (2014). A clustering search metaheuristic for the point-feature cartographic label placement problem. European Journal of Operational Research, 234, 802–808.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

Sahebjamnia, N., Fard, A. M. F., & Hajiaghaei-Keshteli, M. (2018). Sustainable tire closed-loop supply chain network design: Hybrid metaheuristic algorithms for large-scale networks. Journal of Cleaner Production, 196, 273–296.

Salvador, S., & Chan, P. (2005). Learning states and rules for detecting anomalies in time series. Applied Intelligence, 23, 241–255.

Santi, E., Aloise, D., & Blanchard, S. J. (2016). A model for clustering data from heterogeneous dissimilarities. European Journal of Operational Research, 253(3), 659–672.

Shi, S., Yang, G., Wang, W., & Zheng, W. (2002). Potential-based hierarchical clustering. In Proceedings of the 16th international conference on pattern recognition, Quebec, Canada, 4 (pp. 272–275).

Srinivasan, M., & Moon, Y. B. (1999). A comprehensive clustering algorithm for strategic analysis of supply chain networks. Computers & Industrial Engineering, 36, 615–633.

Storn, R., & Price, K. (1997). Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11, 341–359.

Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining (1st ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Teitz, M. B., & Bart, P. (1968). Heuristic methods for estimating the generalized vertex median of a weighted graph. Operations Research, 16, 955–961.

Theodoridis, S., & Koutroumbas, K. (2009). Pattern recognition (4th ed.). Elsevier Inc.

Watts, M. J., & Worner, S. P. (2009). Estimating the risk of insect species invasion: Kohonen self-organising maps versus k-means clustering. Ecological Modelling, 220, 821–829.

Wiwie, C., Baumbach, J., & Röttger, R. (2015). Comparing the performance of biomedical clustering methods. Nature Methods, 12(11), 1033–1038.

Xiang, W., Zhu, N., Ma, S., Meng, X., & An, M. (2015). A dynamic shuffled differential evolution algorithm for data clustering. Neurocomputing, 158, 144–154.

Xu, R., & Wunsch, D., II (2005). Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678.

Zagouras, A., Inman, R. H., & Coimbra, C. F. M. (2014). On the determination of coherent solar microclimates for utility planning and operations. Solar Energy, 102, 173–188.

Zanjani, M. K., Bajgiran, O. S., & Nourelfath, M. (2016). A hybrid scenario cluster decomposition algorithm for supply chain tactical planning under uncertainty. European Journal of Operational Research, 252(2), 466–476.