A Content Development Plan
for Datasets at the British Library

Jez Cope, Data Services Lead

IDCC, 19–21 February 2024

1. Introduction

1.0.1. What this talk is not

  • A description of the British Library’s strategy for datasets
  • The British Library’s official position

1.0.2. What this talk is

  • Observations that shape my current thinking
  • Heavily influenced by ideas from many others
  • A provocation and conversation starter

1.1. The British Library

  • National Library of the United Kingdom
  • One of 6 Legal Deposit libraries
    • Includes non-print since 2013
  • Over 170 million items
  • I usually have more fun stats than this…

“Our Mission: We make our intellectual heritage accessible to everyone, for research, inspiration and enjoyment”

1.2. Datasets at the BL

  • Not a traditional focus of the British Library!
  • Over a decade of work in various areas
  • Data, Data Science, ML/AI increasingly prominent in strategy

“We will build on this world-leading research, working in partnership to improve access to our data and new scalable, ethical machine learning tools that can unlock value from digitised collections.”
Knowledge Matters: The British Library 2023–2030

1.2.1. DataCite

  • British Library a founding partner in DataCite in 2009/10
  • Consistent growth over 10 years in the UK
    • Now a consortium of 111 members with 167 repositories
  • Includes HEIs, cultural heritage organisations, government- and charity-funded research institutions

1.2.2. Strengths

  • Lots of existing datasets available
    • Collection metadata (BNB)
    • Digitised books, manuscripts, etc.
    • Derived datasets in our research repository bl.iro.bl.uk
    • Flagship research projects e.g. Living with Machines

1.2.3. Strengths (continued)

  • Wealth of expertise and enthusiasm to draw on
  • Recent research e.g. into Emerging Formats
  • Broad national & international network
  • Ongoing Digital Scholarship training programme

1.2.4. Challenges

  • Plenty of good practice, but scattered
  • Primary focus on print (and print-like) content
  • Priority is maintenance of the existing collection

2. Content Development at the British Library

2.1. Key principles

  • There is one British Library collection
  • Legal Deposit is the foundation of our collection building
  • Priorities are based on demand and value to researchers
  • Avoid duplication; consider collections held elsewhere
  • Prefer digital; connect vs. collect

2.2. Overarching Content Strategy

  • Long timescale
  • Cross-cutting principles
  • Governance and decision-making structure

2.3. Content Development Plans (CDPs)

  • Shorter timescale
  • Driven by evidence of user need
  • Two types
    1. Subject-based
    2. Format-based

3. A CDP for Datasets

  • Bring datasets into Content Strategy framework
  • Buy-in from senior stakeholders
  • Embed into our core collecting practice
  • Format-based: cuts across all collection areas

3.1. Observations

3.1.1. Without data, we are not seeing the whole picture

  • Part of our cultural heritage is data
  • Part of the scholarly record is data

3.1.2. Innovations in research methodology have outstripped our capacity to provide data to those using them

3.1.3. Data derived from the collection is

  1. Valuable
  2. Expensive to (re)create, and
  3. Worth preserving

3.1.4. The rights environment is complex

  • Datasets not covered by Non-Print Legal Deposit
  • Changing this needs primary legislation
  • Legal Deposit complicates even open-licensed content

3.1.5. Datasets need specialised discovery mechanisms

3.2. Priorities

  1. Data already in the collection
  2. Data derived from existing collection items
  3. Data not currently collected

3.3. (Potential) actions

3.3.1. Audit and integrate existing datasets more closely into the national collection

  • Establish documentation using datasheets for datasets† as standard practice
  • Connect to known external sources of data

† See e.g. Alkemade, H. et al. (2023) “Datasheets for digital cultural heritage datasets” 10.5334/johd.124
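
To make the datasheet idea concrete, here is a minimal sketch of how a datasheet might be held as a structured record, so that unfinished documentation can be spotted automatically. The class name, the field names and the example values are all assumptions for illustration: they loosely follow section headings commonly used in datasheet templates and do not represent a British Library schema or the template in the paper cited above.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical minimal datasheet; field names are illustrative only."""

    title: str
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Invented example record: only two sections have been filled in so far.
sheet = Datasheet(
    title="Example derived dataset",
    motivation="Created to support text mining of digitised books.",
    composition="One plain-text file per volume.",
)
print(sheet.missing_sections())
```

A check like this could run as part of an audit, flagging datasets whose documentation is incomplete before they are integrated more closely into the collection.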

3.3.2. Further develop training & guidance for curators

  • Affordances of cultural heritage data & how researchers use it
  • Subject-specific selection criteria & priorities

3.3.3. Clarify rights and licensing options

  • “As open as possible; as closed as necessary”
  • Allow open licences to override Legal Deposit restrictions

3.3.4. Collaborate across the sector

  • Don’t reinvent the wheel
  • Don’t hoard expertise
  • Involvement in RDA, AI4LAM, LIBER Data Science for Libraries, …

3.3.5. Research & development

  • What infrastructure is needed to collect datasets at a national scale?
  • How can we facilitate computational access under Legal Deposit law?
  • Where does software fit in?
    • To facilitate access to data
    • As a first-class part of the scholarly record

4. What do you think?

  • What is the role of a National Library?
  • What do you & your users need?

5. Thanks!

  • Rachael Kotarski; Torsten Reimer
  • Blanka Matkovic; Hannah Liebeschuetz; Research Infrastructure Services
  • Digital Scholarship; Research Development; BL Labs; UK Web Archive; Curators and Metadata Specialists
  • Amongst many others…

6. Any questions?