Management of research data at U.Porto:
from researcher needs to curation workflows
supported on a data repository
Eugénia Matos Fernandes – efernand@reit.up.pt
Cristina Ribeiro – mcr@fe.up.pt
João Correia Lopes – jlopes@fe.up.pt
João Rocha da Silva – joaorosilva@gmail.com
RECOLECTA Webinar, December 19, 2011
Contents
§ U.Porto: a research university
§ The U.Porto Information System and Institutional Repository
§ Scientific Data Curation Project
§ The Data Audit
§ Data Curation Workflow
§ Data Repository
U.PORTO
Porto Metropolitan Area = 1 000 000 inhabitants
State University created on 22 March 1911
Origins date back to the 18th century
U.PORTO :: Geographic distribution
Pole 1 Pole 2 Pole 3
U.PORTO :: Schools and Research Units
§ Rectorate/Central Services
§ 14 Schools
Ø School of Architecture
Ø School of Fine Arts
Ø School of Sciences
Ø School of Nutrition and Food Science
Ø School of Sport
Ø School of Law
Ø School of Economics
Ø School of Engineering
Ø School of Pharmacy
Ø School of Arts
Ø School of Medicine
Ø School of Dental Medicine
Ø School of Psychology and Education Science
Ø Institute of Biomedical Sciences Abel Salazar
Ø Business School
§ ~70 R&D+i units
Ø 31 assessed as excellent or very good
§ 30 Libraries + 12 Museums
§ Student Support Services
U.PORTO :: Academic Community
§ Students
Ø 30.898 (total)
Ø 8% mobility
Ø 1st cycle → 9.647
Ø Integrated Master → 12.758
Ø Master + 2nd cycle → 5.406
Ø Specialization → 4258
Ø PhD + 3rd cycle → 2.828
§ Teachers & researchers
Ø 2.366
→ 76% PhD
§ Technical & Administrative staff
U.Porto :: Teachers & Researchers with PhD
76% of all the academic community
U.Porto :: Position in international rankings
International Rankings | Portugal 2011 | Europe 2011 | World 2011 | Portugal 2010 | Europe 2010 | World 2010
Academic Ranking of World Universities (Shanghai Jiao Tong University) | 1 | 124-164 | 301-400 | 1 | 169-204 | 401-500
Performance Ranking of Scientific Papers for World Universities (Taiwan) | 1 | 141 | 320 | 1 | 141 | 328
Quacquarelli Symonds (QS) World University Rankings | 2 | 185-203 | 401-450 | 3 | - | 451-500
Webometrics (CSIC, Madrid) | 1 | 50 | 178 | 1 | 79 | 230
The Leiden Ranking | 1 | 112 | 280 | 1 | 136 | -
SCImago Institutions Rankings (SIR) | 1 | 77 | 254 | 1 | 90 | 265
University Ranking by Academic Performance
U.PORTO: INFORMATION SYSTEM &
INSTITUTIONAL REPOSITORY
U.PORTO :: Institutional Repository
Nov 2007: < 1.000 publications
Nov 2011: + 18.000 publications
Publications :: From SIGARRA to the Open Repository
SIGARRA
OPEN REPOSITORY
Migration of full text & open access publications
U.PORTO :: Open Repository :: 2008-2011
U.Porto :: Scientific Domains
§ EXACT SCIENCES
§ NATURAL SCIENCES
§ HEALTH SCIENCES
§ ENGINEERING AND TECHNOLOGY
§ SOCIAL SCIENCES
U.Porto :: Scientific Domains and Sub-domains
§ EXACT SCIENCES
Ø Physics
Ø Mathematics
Ø Chemistry
§ NATURAL SCIENCES
Ø Earth and Space Sciences
Ø Biological Sciences
Ø Agricultural Sciences
Ø …
§ ARTS AND HUMANITIES
Ø Literature Studies
Ø Biological Sciences
Ø Art Studies
Ø ...
§ SOCIAL SCIENCES
Ø Economics and Management
Ø Law and Political Sciences
Ø Educational Sciences and Policies
Ø Communication Sciences
Ø …
§ ENGINEERING AND TECHNOLOGY SCIENCES
Ø Civil Engineering
Ø Electrical Engineering
Ø Informatics
Ø Mechanical Engineering
Ø …
§ HEALTH SCIENCES
2012-2015 :: Scientific Data at U.Porto
[Diagram labels: Full text, Open access, SIGARRA, Open Repository, Thematic Repository, Scientific Data Repository]
INSTITUTIONAL REPOSITORY
• Ingest
• Storage
• Preservation
• Access
• Dissemination
MANAGING RESEARCH DATA
AT U.PORTO
The “standard” research workflow
Base Data Publication
However…
U.PORTO :: Data Curation Initiative
§ Curation of and access to the scientific data generated by researchers
§ Short study: February 1 to September 30, 2011
§ Expected results
Ø A first impression of researchers’ needs
Ø Sample of existing curation practices
Ø Sample of existing datasets
Ø Analysis of datasets from a technical point of view
Ø Collection of datasets in a standard repository platform
Ø Experimental interrogation of datasets
Project Phases
Gather Datasets & Use Cases
Specify Workflow
Build platform
Evaluating the research data management effort
§ Interviewing researchers in several areas
§ Collecting data samples
§ Documenting use cases for research data
§ Identifying data curation practices
§ ... evaluate resources and select the problems to be addressed
Our users, the researchers
§ …are not data preservation experts
§ ...use many document formats
Address researchers’ needs
§ Repositories cannot be “graveyards for data”, they have to provide effective ways to access the stored data
§ Data has to be well annotated or else cannot be reused (experiment contexts, meanings of variables…)
§ Better ways to find data (e.g. domain-specific restrictions and not just generic metadata)
§ Easy sharing of data (e.g. sending a link to the place where a user can find a specific dataset)
§ Researchers can be cited by their peers through the datasets that they offer
Project Phases
Gather Datasets & Use Cases
Specify Workflow
Build platform
Deposit Datasets
Phase 1: Interviews
Interviews :: Nature of data
§ Data managed by the researchers
Ø Personally collected in the context of projects
Ø Obtained in the context of contracts with external entities
Ø Automatically collected from experimental setups
Interviews :: Curation Practices
§ Mostly informal
Ø Researchers keep copies of data in personal machines and additional removable media
Ø Group leaders keep record of experiments and associate data to published results
Ø For some non-active data, only paper records exist
§ Exception: ecology group
Ø Preparing a curation plan in the context of an international project
§ Some data can be re-generated
Ø Queries to databases of official statistics
§ Some data is processed by specialized software
Interviews :: Use Cases
§ Publication
Ø Relation with published material very relevant
§ Re-use within a group
§ Sharing with project partners
§ Use in industry
Ø Data with relevance for economic processes (e.g. gravimetry)
Ø Data collected by industrial partners for contract work (e.g. pollutant analysis)
§ Search data
Interviews :: Metadata
§ Mostly nonexistent
Ø Researchers add some annotations for their own use
Ø Dataset-level metadata missing
§ Data from interviews (social sciences)
Ø Some metadata from interview scripts
§ Possible source: experimental setup scripts
§ Some domains are more advanced
Data :: Domains and Access Conditions
Domain | Dataset | Access
Astronomy | Gravimetry | Free
Chemical Engineering | Pollutant analysis | Contract pending
Mechanical Engineering | Material fracture | Embargoed
Civil Engineering | High-speed railways | Embargoed
Educational Science | Interviews | Embargoed
Psychology | Interaction records | Embargoed
Economy | Population | Embargoed
Ecology | Plant distribution | Embargoed
Interviews :: What is left out
§ Several interviews revealed complex cases
Ø Data which resulted from past projects and is no longer used
Ø Data in non-digital formats
Ø Data with complex ethics constraints
§ Current concern is with data for which the creators are available and interested in curation
THE RESEARCH WORKFLOW:
Project Phases :: Phase 2: Specify Workflow
The role of the “Data Curator”
[Diagram: Data Curator and Researcher]
Data curation meeting
Annotating data
[Figure: annotated data file]
Metadata annotations: dc:contributor.author, dc:title, dc:lastModified, dc:rights
Metadata values shown: Silva, João Rocha; Azores GPS Run; 01-01-2011; License: CC ShareAlike
Labels: Table-level metadata, Data Dimensions, END_METADATA; Terceira, Flores
Columns: time.gps_sow, latitude, longitude, gravity.specific
Sample values: time.gps_sow 488496.999194 to 488500.999190; latitude 38.760267485 to 38.760267507; longitude -27.084113746 to -27.084113730; gravity.specific -107.391006 to -53.750371
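The annotated file on this slide suggests a metadata block, closed by END_METADATA, followed by named data columns. A minimal Python sketch of reading such a file follows. The tab-separated layout, the key/value form of the metadata lines and the function name parse_annotated_table are assumptions made for illustration, not the project's actual file format; the two sample rows reuse values from the slide, and their pairing into rows is also an assumption.

```python
import csv


def parse_annotated_table(text):
    """Split an annotated table into (metadata, column names, rows).

    Assumed (hypothetical) layout: 'key<TAB>value' metadata lines, a line
    holding END_METADATA, one header line of column names, then data rows.
    """
    lines = text.strip().splitlines()
    marker = lines.index("END_METADATA")
    # Everything before the marker is dataset metadata.
    metadata = dict(line.split("\t", 1) for line in lines[:marker])
    # Everything after the marker is a tab-separated table with a header row.
    reader = csv.reader(lines[marker + 1:], delimiter="\t")
    columns = next(reader)
    rows = [dict(zip(columns, map(float, row))) for row in reader]
    return metadata, columns, rows


# Sample built from values visible on the slide (layout is assumed).
SAMPLE = "\n".join([
    "dc:contributor.author\tSilva, João Rocha",
    "dc:title\tAzores GPS Run",
    "dc:rights\tCC ShareAlike",
    "END_METADATA",
    "time.gps_sow\tlatitude\tlongitude",
    "488500.999190\t38.760267493\t-27.084113746",
    "488499.999191\t38.760267489\t-27.084113743",
])

metadata, columns, rows = parse_annotated_table(SAMPLE)
print(metadata["dc:title"], columns, len(rows))
```

Keeping the metadata and the column names in the same file as the data means a curator and a researcher can annotate a dataset during the meeting without a separate metadata form.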
After the meeting
Repository
How other researchers will see it
• Explore
• Filter
• Download just what you need
Researcher
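The "filter, then download just what you need" idea can be sketched in a few lines of Python. This is only an illustration of the interaction, not the repository's implementation: the function names filter_rows and to_csv are hypothetical, the rows reuse sample values from the annotated file, and the range bounds are chosen arbitrarily.

```python
import csv
import io


def filter_rows(rows, column, low, high):
    """Keep rows whose value in `column` lies in the closed range [low, high]."""
    return [row for row in rows if low <= row[column] <= high]


def to_csv(rows, columns):
    """Serialise the selected rows, restricted to the chosen columns, as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical in-memory dataset; a real repository would read stored data.
ROWS = [
    {"time.gps_sow": 488500.999190, "latitude": 38.760267493, "longitude": -27.084113746},
    {"time.gps_sow": 488499.999191, "latitude": 38.760267489, "longitude": -27.084113743},
]

subset = filter_rows(ROWS, "latitude", 38.760267490, 38.760267500)
download = to_csv(subset, ["time.gps_sow", "latitude"])
print(download)
```

Restricting both rows and columns before export is what lets another researcher take a small slice of a large dataset instead of the whole file.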
Project Phases :: Phase 3: Build tools to support the workflow
[Architecture diagram: UPData Scientific Data Module on DSpace Core]
Actors: Researcher, Curator
Components: Ingestion page, XLSX Parser, XSLT Transformer, XML Manager, Query translator (XQuery FLWOR), Data Access, Dynamic Table
Data flows: Original File, Formatted Spreadsheet, Translated Document (XML), Filtering Request, Filtering Query (JSON), Results (Data + Metadata), Formatted Results
Numbered steps 1 to 5 (order not recoverable from the extracted text)
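The diagram names a query translator that turns a JSON filtering query into an XQuery FLWOR expression. The Python sketch below shows one way such a translation could look. The JSON shape ({"filters": [{"column", "op", "value"}]}), the XML layout (/table/row/<column>), the document name and the function to_flwor are all assumptions for illustration; the slides do not specify the module's actual formats.

```python
import json
import re

# Map of assumed JSON operator names to XQuery comparison operators.
OPERATORS = {"eq": "=", "lt": "<", "le": "<=", "gt": ">", "ge": ">="}
# Accept only simple element names so a column cannot inject query text.
COLUMN_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*$")


def to_flwor(filter_json, doc_name="dataset.xml"):
    """Translate a JSON filtering query into an XQuery FLWOR string.

    Both the JSON shape and the XML layout are hypothetical.
    """
    query = json.loads(filter_json)
    conditions = []
    for item in query["filters"]:
        column = item["column"]
        if not COLUMN_NAME.match(column):
            raise ValueError(f"invalid column name: {column!r}")
        operator = OPERATORS[item["op"]]   # KeyError on an unknown operator
        value = float(item["value"])       # numeric comparisons only
        conditions.append(f"number($row/{column}) {operator} {value!r}")
    where = " and ".join(conditions) or "true()"
    return (
        f'for $row in doc("{doc_name}")/table/row\n'
        f"where {where}\n"
        f"return $row"
    )


request = json.dumps({"filters": [
    {"column": "latitude", "op": "gt", "value": 38.76},
    {"column": "longitude", "op": "lt", "value": -27.08},
]})
print(to_flwor(request))
```

Validating column names and converting values to numbers before building the query string keeps user-supplied filter input from being interpreted as XQuery code.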
Project Phases :: Phase 4: Test tool using real world data
DATA DEPOSIT
DATA EXPLORING AND DOWNLOAD
- DEMO (VIDEO 2)
FIND DATASETS
- DEMO (VIDEO 3)
Data Curation :: Preliminary conclusions
§ Interaction with researchers is crucial
Ø Data with very different structure, contents and volume
§ Similar use cases in data search
Ø Suggests models with common search features
§ U.Porto Data Repository
Ø Project encourages the definition of a data curation policy
§ DSpace has been successfully customized to include data exploration capabilities for tabular data
§ Open Access is not yet an issue
Ø The project first aims to build researchers’ confidence in the approach
Data Curation :: Validating the prototype
§ Next steps with the researchers
Ø Presenting their data in the developed repository platform
Ø Evaluating the perceived usefulness of the implemented features
Ø Gathering feedback on additional features to be implemented
→ Connecting datasets to their publications?
→ Offering more sophisticated data access controls?
Data Management :: A Service
§ What does a data management service look like?
Ø Data curation as an ongoing process and not only at the end
Ø Online documentation to help researchers know what their role is in the process
Ø Support/training in the usage of the platform for self-deposit
Future Work
Ø Gather feedback on the data repository extension from the group of researchers who have been interviewed
Ø Additional features of the repository
→ Fine-grained data access control
→ Data dissemination through standard representations (OAIS…)
Ø Dataset-level metadata
→ DCMI - Science Metadata
Ø Features of a data management service for U.Porto
→ Require further exploration
Ø Data management policy for U.Porto
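Dataset-level metadata, listed above as future work, could start from a simple Dublin Core record. The Python sketch below builds one with the standard library. It uses the plain Dublin Core element set (creator, title, rights) as a stand-in, which is an assumption: the slides mention DCMI Science Metadata and DSpace-style qualified names such as dc:contributor.author, and neither is modelled here. The wrapper element name record and the function dublin_core_record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a simple Dublin Core record from (element, value) pairs."""
    record = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


# Values reused from the annotated file shown earlier in the slides.
xml_text = dublin_core_record([
    ("creator", "Silva, João Rocha"),
    ("title", "Azores GPS Run"),
    ("rights", "CC ShareAlike"),
])
print(xml_text)
```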