May 16, 2024 (Code4Lib Post-Conference Session)

  • Eric Phetteplace ran a workshop on Python4Lib at Code4Lib 2024 in Ann Arbor
  • Started with an open discussion where we talked about people's experience with Python and some general topics
    • Some folks were mainly familiar with running Python in notebooks, others were more familiar with running Python scripts
    • We spoke a bit about managing dependencies and tools like Pipenv/Poetry that help with this and abstract over virtual environments
    • We discussed asyncio and asynchronous programming generally, when to use it, what types of problems it addresses, and CPU-bound (computation heavy) vs IO-bound (network/files heavy) tasks
    • Eric introduced his marcgrep CLI tool for searching MARC records
  • We worked through the c4l24-python4lib repo which has notebooks on several topics. The only topics we covered specifically were:
    • Jupyter Notebooks (the material was delivered as notebooks)
    • Pymarc and common usage patterns, the most foolproof ways to get and modify record information
    • Pandas and its fundamental concepts (DataFrames, Series), how to summarize loaded data, stopped after introducing how to filter via bracket expressions

April 30, 2024

  • David asked if anyone had experience with or knew of any automated discard assessment tools
    • Javier said he has 25,000 volumes to assess for discard
    • Tomasz said other groups may know more about these types of tools because tech services may not have responsibility for collections assessment. Reference librarians may know more about potential tools to use.
    • Sara Amato has used OCLC API “to look at WC holdings and compare also to HathiTrust and comparisons to other libraries in our group to help make decisions - not great for large scale projects but good for smaller lists. I don’t have the code up anywhere though… and it doesn’t have any item level data like circ.”
  • Tomasz asked if Pymarc will have a new release due to a change in how indicators are handled
    • Indicators will be a named tuple that can only have two positions rather than a list which could be of any length
    • Ed: No scheduled release, reluctant to introduce another major version with breaking changes
    • More discussion of the change is in the pymarc google group
  • Michael asked if anyone has experience working with APIs for wikimedia/wikimedia commons
    • He has copyright-free newspaper images he would like to upload in bulk as PDFs (rather than image files, which the other Wikimedia Commons tools can use)
    • Javier mentioned using the APIs to get data out of wikimedia commons but not to POST data
  • Tomasz asked about Michael’s involvement in movement to preserve Ukrainian cultural heritage materials after the start of the full scale invasion
    • Michael noted there are two parts to this preservation work:
      • SUCHO works on preserving publicly available materials
      • There is a separate effort to back up digital materials that are not publicly available
    • Michael mentioned Maryna Paliienko, a Fulbright Scholar from Taras Shevchenko University, whose project focuses on archives
  • Michelle asked for help figuring out why her API calls hang when she tries to upload large files
  • John asked if anyone had recommendations for tools to use to take messy data from google docs and publish it to a dashboard a couple of times a year
    • Has been looking at Streamlit and Pygwalker as potential options (a minimal Streamlit sketch follows this list)
      • Pygwalker has a Tableau-like display
    • Jeremy used streamlit for a project with Hopkins Marine Station: https://taxa.stanford.edu/
      • One issue he noted was that every time a user would interact with the dashboard it would completely reload
  • Michael mentioned stumbling across a tool called Discorpy and thought it may be of interest after discussion in last Python4Lib session about image cropping/manipulation
    • It is a tool for measuring lens distortion in a camera
  • Yamil mentioned he is learning about SeleniumBase
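A minimal Streamlit sketch of the kind of dashboard discussed above; the CSV file and column names are hypothetical placeholders, not John's actual data:

import pandas as pd
import streamlit as st

# run with: streamlit run dashboard.py
st.title("Library stats dashboard")

df = pd.read_csv("cleaned_stats.csv")             # data exported/cleaned from the Google Docs source
st.dataframe(df)                                  # interactive, sortable table
st.bar_chart(df.set_index("month")["checkouts"])  # simple chart of one hypothetical column

Note that, as Jeremy observed, Streamlit reruns the whole script on every user interaction, which is why dashboards can feel like they completely reload.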

April 16, 2024

April 2, 2024

  • Charlotte and Tomasz have released a new version (1.0) of Bookops-Worldcat, a Python wrapper for the WorldCat Metadata API.
  • Lauren at Rice is working on a reclamation project, gave a shoutout to Rebecca for some python notes she shared in the past.
  • Rebecca talked about her recent work using Tkinter. She has been changing code written using PySimpleGUI to Tkinter after PySimpleGUI changed their licensing and would require a fee for higher ed use.
  • Emily had a question about using pymarc for some batch edits, but it did not work as she hoped(?)
    • “At my institution, we’ve got one person (me) identifying OCLC numbers for changes in one, now pymarc script, that a second person then feeds into the Metadata API 2.0 to make changes. Using the BookOps library would we be able to integrate the script searching for identifiers with the script that makes batch changes?”
  • Charles shared a new project he and Eddie are working on using Flask to connect to the Alma API
  • Javier asked about Charles' use of ChatGPT 4 and whether he could share reasons that justify the cost of ChatGPT 4
    • Javier also asked about the various “personas” that Charles used.
    • Charles then explained how to give “context” to each “persona,” like stating that the human user is already experienced in programming.
    • Charles also mentioned that he asks ChatGPT questions that ChatGPT may need answered before it can properly answer a particular prompt (or all prompts going forward for a single “persona”)
    • Charles also recommended other LLMs that worked well for him for code questions if you cannot pay for ChatGPT 4 (some of the ones below have paid versions too)

March 19th, 2024

March 5th, 2024

  • Rebecca mentioned that PySimpleGUI has moved to a paid license model and was wondering if it is common for a package to move to a closed license
    • Clinton mentioned he has seen it maybe 5 times
    • It makes projects very brittle because every person needs to get a key annually
  • We discussed alternatives to PySimpleGUI
  • Tomasz mentioned that Python isn’t really known for Windows desktop apps, especially because Tkinter, while part of the standard library, looks very dated
  • Rebecca asked how to ensure that one won’t be burned in the future
    • Clinton suggested focusing on tools with very wide adoption (like Flask or Django)
    • Tools that are widely used can’t make that sort of change without it being too disruptive
  • If anyone would like to evaluate any of these tools and present on their findings it would be a welcome presentation
  • Rebecca mentioned a self-checkout tool that she is developing and asked for feedback
    • She is working with a group within CUNY to develop this tool
    • It will run in a terminal where someone could enter their User ID and check out a book
  • Charlotte asked for feedback on bookops-worldcat
  • David mentioned that he and Lauren are working on an OCLC reclamation using bookops-worldcat
  • Clinton offered to present on creating simple APIs in the future
  • Kate asked about adding 758 fields to ILS records
    • She is exploring adding them to their collection in a batch

February 20th, 2024

(Missing notes from Jeremy's presentation on pyscript)

February 6, 2024

  • Upcoming scheduled presentations/chats:
    • Jeremy Nelson will talk about pyscript on Feb 20
    • Charlotte and Yamil will be talking about virtual environments on Mar 19
  • Rebecca recently gave a chat about something she built with PySimpleGUI
    • there will be a video of this soon
  • Michael went over how he solved his PDF batch change issue by using pikepdf
    • He just wanted to batch-change some simple, low-level PDF file metadata, like the “author” field for the whole PDF file, but pikepdf can do a lot more with PDFs
    • He mentioned how PDFs store file metadata in two ways (the older document info dictionary and the newer XMP metadata), and pikepdf lets him access either
    • He also mentioned an older Perl-based tool called exiftool that is good for grabbing file metadata info
  • He fired up the PyCharm Python IDE and ran the debugger on some sample code to show us some issues that he initially had, but has since solved
      from pikepdf import Pdf
    
      with Pdf.open('original.pdf') as pdf:
        with pdf.open_metadata() as meta:
          del meta['dc:description']
          del meta['pdf:Keywords']
        pdf.save('clean.pdf')
    
    
  • Yamil mentioned the upcoming PyCon 2024 and the $100 online-only registration option. Also, the videos will be posted on their YouTube channel after a month or so.
  • David asked about any new projects people have started with Python lately
    • He mentioned that he is teaching a colleague to update OCLC holdings with Python using the OCLC Metadata API
    • He also mentioned bookops-worldcat, Tomasz's library that acts as a “wrapper” for the OCLC Metadata API
      • “... Bookops-Worldcat is a Python wrapper around OCLC’s Worldcat Metadata API which supports changes released in the version 1.1 (May 2020) of the web service. The package features methods that utilize search functionality of the API as well as read-write endpoints. The Bookops-Worldcat package simplifies some of the OCLC API boilerplate, and ideally lowers the technological threshold for cataloging departments that may not have sufficient programming support to access and utilize those web services. Python language, with its gentle learning curve, has the potential to be a perfect vehicle towards this goal. ...”
    • David said he will share some sample code to show how he uses the OCLC Metadata API to update holdings with Python
  • Alison asked if anyone has successfully used Alma APIs and scripting to bulk change loan due dates for expired patrons
    • Alma doesn’t automatically do this when patron expiration dates change, which is a huge issue.
      • Rebecca: I haven’t changed loan dates but I have done other small things with the user/fulfillment API so far
      • Matt: I’ve used Python & the API once or twice to make bulk change due dates for specific users, but it’s been a while. Should be possible to do what you’re asking, though
      • David: I think our systems librarian does something like that at the end of the semester or FY. I can check with him and see if there’s anything he’d be willing to share.

January 23, 2024

  • Mike was having issues making bulk edits to the built-in metadata (e.g., author) in PDF files using the pypdf module
  • David mentioned that his library is migrating into Ex Libris Alma/Primo in the near future.
    • He asked about existing Alma API wrappers you use and if anyone had experience using them
    • No one had suggestions for an API wrapper for Alma but many suggested he ask on the various Code4lib Slack channels
    • There is a possibly outdated project from UC Davis from about 5 years ago
  • Clinton put in a plug for using Postman to quickly use APIs
    • https://www.postman.com/
    • Craig also suggested Insomnia as an alternative for working with APIs manually
    • We may try to have a presentation in this group on the very basics of Postman in the future
  • David E. asked about how folks have been using chatGPT for coding python
    • Many folks had success writing code with ChatGPT, but ChatGPT does not know a lot about some technologies
      • It doesn't know some details of OpenSearch and has invented functions in pymarc when asked
    • HuggingChat was suggested as a better alternative to ChatGPT, since it has a more recently updated model
      • ChatGPT’s 3.x model is from 2021 and HuggingChat's model is supposed to be newer
      • it has an option to “search the web” that, when enabled, will try to complement its answers with information queried from the web
    • Eric has used chatGPT for creating unit tests with more advanced features like “test parameterization”
  • Eric mentioned that he proposed a post-conference session at Code4lib 2024 for this group (python{4}lib)
    • He asked for topic suggestions and volunteers
    • The session will happen in the morning
  • David E. asked if folks are starting new projects that will necessitate using python to finish the projects
  • Daniel asked for suggestions for PAID software for digital humanities, since they have a budget for it

January 9, 2024

John Dewees, DAM Lead at the University of Rochester, gave a presentation on the pax-opex-utility, which is "a graphical utility to format PAX objects and OPEX metadata for ingest into Preservica as SIPs to be synced with ArchivesSpace"

  • He used a PySimpleGUI utility to create a Windows executable
    • https://www.pysimplegui.org/en/latest/
    • the pax-opex-utility only works on Windows at this time
    • from David E.:
      • One thought on implementing on Mac vs. PC: I think there are different pathing formats/norms to follow. Depending on users they may need to make some adjustments if certain paths are hard coded. (I’ve made that an issue for myself by cleverly coding between a laptop and work PC.)
  • someone asked about libraries that can be used to package up assets for Archivematica and libraries that can be used to work with metadata in ArchivesSpace
  • Tomasz asked how this software is “shipped” to users
    • John said the users download software from the software’s Github repo’s release section
  • Someone asked if the code had unit tests, and some were not familiar with unit tests
  • We talked about how to save credentials in your OS and not in the app
    • Tomasz mentioned the Python keyring module, which can help with this (a minimal sketch follows this list):
      • “The Python keyring library provides an easy way to access the system keyring service from python. It can be used in any application that needs safe password storage. These recommended keyring backends are supported:”
        • macOS Keychain
        • Freedesktop Secret Service supports many DE including GNOME (requires secretstorage)
        • KDE4 & KDE5 KWallet (requires dbus)
        • Windows Credential Locker
  • We talked about how to handle file paths in your code so it works on more than one OS
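A minimal sketch of the keyring module mentioned above; the service and account names here are hypothetical:

import keyring

# store a secret once in the OS keychain (macOS Keychain, Windows Credential Locker, etc.)
keyring.set_password("pax-opex-utility", "aspace_api", "s3cret-token")

# retrieve it later from the application instead of hard-coding it in the source
token = keyring.get_password("pax-opex-utility", "aspace_api")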

Screenshots from John's presentation: pax-opex1 pax-opex2 pax-opex3

December 13, 2023

We briefly talked about the “for … else” construct that was recently mentioned in the #python Slack channel

  • I have only used it once, but I was very confused the first time I saw it
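For reference, a small sketch of the construct: the else block runs only when the loop finishes without hitting a break.

numbers = [3, 7, 11]

for n in numbers:
    if n % 2 == 0:
        print("found an even number:", n)
        break
else:
    # no break happened, so this branch runs
    print("no even number found")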

“This is a summary of what features appeared in which versions of Python.”

We talked about using Google Colab as a way to try to run a python script with more resources than on your local machine. For example, you may be able to tap into GPUs with Google Colab.

  • “Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs and TPUs.”

Someone asked about running Python or non-Python projects on DigitalOcean; some have used it and were happy with it. I use the DigitalOcean help docs for Unix/shell and even Python topics quite often

Daniel talked briefly about a new project called jupyter-ai (and gave a live demo)

We spoke about doing quick python tests or experiments with a local Jupyter notebook

Book suggestion from John:

  • I’ve just started this book to try and build more programming practice into my workday: Python Workout: 50 ten-minute exercises
  • It’s included on O’Reilly if you have an institutional subscription.

On the topic of new things we have tried lately

  • I finally started using the coverage.py Python module
  • “Coverage.py is a tool for measuring code coverage of Python programs. It monitors your program, noting which parts of the code have been executed, then analyzes the source to identify code that could have been executed but was not.”
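A minimal sketch of coverage.py's Python API (the more common route is the CLI: coverage run -m pytest, then coverage report); my_module here is a hypothetical module under test:

import coverage

cov = coverage.Coverage()
cov.start()

import my_module            # code imported/run while coverage is recording
my_module.do_something()

cov.stop()
cov.save()
cov.report()                # print a line-coverage summary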

We spoke about coming up with New Year's Python learning resolutions, or a "7 days of code" challenge

Also the group was asked if we should continue to have a mix of scheduled presentations and free chat time

  • the group would like to keep this mix

We talked about John’s earlier idea (from Slack) of finding any Python-related presentations proposed for Code4lib (whether accepted or not) that could be given during these Python group meetings for those who cannot attend Code4lib

Tomasz mentioned how he suddenly found out that distutils (https://docs.python.org/3.10/library/distutils.html) was removed from the new Python 3.12 release

  • “distutils is deprecated with removal planned for Python 3.12. See the What’s New entry for more information.”
  • we talked a bit about how Python does remove features, but it tries to give “deprecation warnings” for a year or so before a feature/module is removed
  • “You get what you pay for” reminds me of this: https://xkcd.com/2347
    • Susan: That xkcd reminds me of the node.js/javascript library whose developer yanked it from all the public repos a few years back, and it broke basically everything. Was it underscore?

The removal of the distutils module led to a discussion about Python virtual environments (also known as a venv, which is the name of the built-in Python module)

  • by default, a virtual environment works with whichever single Python version is installed on your OS (a minimal sketch of creating one follows this list)
  • you still need to set up a separate Python version (and there are multiple ways to do that) to have a virtual environment that also runs a different version of Python locally
  • these are two ways (of several) to have more than one version of Python, with tricks like
  • https://github.com/pyenv/pyenv
  • Docker
  • this group may have a future presentation on Python virtual environments (Yamil and Charlotte agreed to present on the topic)
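A minimal sketch of creating a virtual environment with the built-in venv module (equivalent to running python3 -m venv .venv on the command line):

import venv

# creates ./.venv with its own site-packages and pip, using the interpreter running this script
venv.create(".venv", with_pip=True)

Activate it afterwards with source .venv/bin/activate (macOS/Linux) or .venv\Scripts\activate (Windows); the environment uses the same Python version that created it.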

November 28, 2023

Michael Benowitz, a Tech Lead at the NYPL, gave a presentation on Airflow. "Apache Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.” Link to slides will be forthcoming, I will include screenshots of a few of the slides in the meantime.

  • Wikipedia article on Airflow
  • It is a free and open-source product, but for production use it typically needs to run on a central VM/server instead of just on your own workstation. There are “cloud” providers for handling the hosting for you.
  • Airflow can be part of an ETL workflow
  • Airflow can make scheduling easier compared to older tools like cron, and it comes with a GUI

Airflow cloud options:

Additional technologies used and/or mentioned:

Screenshots of Mike's presentation: airflow1 airflow2 airflow3 airflow4 airflow5 airflow7

November 14, 2023

  • We talked about the MARC21 standard, how each record has a max size of 99,999 bytes/octets, and that individual fields can only have a maximum of 9,999 bytes/octets in size https://www.loc.gov/marc/specifications/specrecstruc.html
  • I then shared a Python pymarc snippet that inspired this size talk, that processed a large 80k record MARCXML file export to find if any individual records were larger than 99,999 bytes/octets https://pymarc.readthedocs.io/en/latest/
  • I was happy to find a convenient pymarc method that reads in MARCXML files and returns a Python list of individual pymarc records
records = pymarc.marcxml.parse_xml_to_array('myfile.xml')
  • though this method loads all the data into RAM and could seriously impact your computer's performance if you don’t have a lot of RAM available
  • there are other functions and approaches to only load a few XML records at a time
  • the resulting code found 4 records in our data that exceeded the limit (a sketch of the approach follows this list)
  • there was then a question about how hard it is to use pymarc to analyze subject data in a batch of records
  • we then shared a few more examples of how simple it can be to use pymarc, and how general knowledge of Python concepts like looping through lists and using conditional statements goes a long way toward making pymarc easy to use
  • see the images of Eric’s pymarc example code that was shared
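A minimal sketch of the approach (not the exact snippet shared in the meeting); the file name is hypothetical:

import pymarc

records = pymarc.marcxml.parse_xml_to_array("myfile.xml")   # loads the whole export into RAM

for record in records:
    size = len(record.as_marc())    # size of the record serialized as binary MARC21
    if size > 99999:
        field_001 = record["001"]
        print(field_001.value() if field_001 else "(no 001)", size)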

Screenshots of the pymarc examples: marc1 marc2

October 31st, 2023

  • Introductions, refreshing memories of returning attendees and new attendees; common threads from intros:
    • Alma
    • OCLC API (APIs in general)
    • ArchivesSpace
  • John Dewees asked a question about CSVs: generally, how big is too big for Python to handle? Is there a point where something is too big to be ingested and handled properly?
  • John Pillbeam mentioned SQLite might work well here; it is essentially a database in a single file on disk and is adaptable for quite a few operations.
  • Bruce Orcutt mentioned SQLite might be the best way to go as well, though think of the upfront maintenance.
  • Paul Clough mentioned you may want an Object Relational Mapping (ORM) in front of the SQLite database. It helps translate between the application and the database (abstracting the SQL away).
  • Emily Frazier mentioned using a Python script which loads 8 million rows of a TSV into pandas. It worked but was a bit slow. (A chunked-loading sketch follows this list.)
  • Rebecca Hyams mentioned an Alma project which helps draw out certain elements of MARC data. You can get really granular from API. ENUG Presentations including Rebecca’s presentation on item/inventory and PySimpleGUI
  • Comments about documenting projects. Susan mentioned good comments in code and a narrative of it in a separate word doc.
  • Bruce asked about Constellate.
  • John Pillbeam linked to the courses/workshops at constellate.org/events.
  • John P. linked to another course by one of the Constellate devs. Currently going through this free online course/textbook that one of the Constellate trainers created: https://pandas.pythonhumanities.com/
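A sketch of the chunked-loading idea from the CSV discussion above: stream a large TSV into a local SQLite file with pandas so it never has to fit in RAM all at once (file, table, and column names are hypothetical):

import sqlite3
import pandas as pd

conn = sqlite3.connect("bigdata.db")

# read 100,000 rows at a time and append each chunk to a SQLite table
for chunk in pd.read_csv("bigdata.tsv", sep="\t", chunksize=100_000):
    chunk.to_sql("records", conn, if_exists="append", index=False)

# later, pull back only what you need with SQL instead of loading everything
sample = pd.read_sql_query("SELECT * FROM records LIMIT 10", conn)
print(sample)
conn.close()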

October 17th, 2023

  • we talked about FRBR
    • Talked about record-rollups
  • Susan mentioned that she started working through Adam Emery’s “Learn Python” tutorials.
  • Eric has recently liked working with the Spacy site to learn about Natural Language Processing (NLP)
  • Yamil liked the tutorials that this site has, since you can run examples right on their site without having to install anything locally
  • Susan later asked if they should use a locally installed version of Python or use Jupyter notebooks for her first real project
    • John: the consensus is that it is better to have a locally installed version
    • though Jupyter notebooks or Google Colab can be great to practice or prototype things
  • “I just discovered this via that Glyph blog post - an updater for the python.org Mac installer: https://mopup.readthedocs.io/en/latest/
  • David shared a free online Python tutorial:
  • using pyenv to easily manage having more than one version of Python on your machine
  • a few people mentioned that they are liking using Poetry for “packaging and dependency management”
  • John D. mentioned: “Just finished the official PySimpleGUI Udemy course and created my first graphical utility which has been fun”
    • This group may have a future session to demonstrate PySimpleGUI
  • Tomasz asked if folks knew about Python tools for “transliteration” of non-Latin text
  • We went back to talking about tools for local development

October 3rd, 2023

  • Guest Speakers:
  • Side note, Charles works on an open source LibGuides alternative.
  • Some general chat about the nature of open source projects - great grassroots! Though it can be fragile/risky.
  • Some code was generated by ChatGPT for the basic LMS from the list of exercises Charles provided
  • Think small and tailor the items to the library discipline. Build upon one thing to the next?
  • GUI? Connect to WorldCat?
  • Carpentries lessons, link to git space? https://carpentries.org/community-lessons/
  • John Pillbeam mentioned the incubator for finding concepts that may not be included in main lesson plans yet.

September 19th, 2023

  • Ben asked how to respond to others who say they want to use Python with AI, specifically with the ChatGPT API
    • We spoke about how some API calls can be run for free with version 3.5, though there is a cost for running API calls with the 4.x version
    • The pricing for Hugging Face https://huggingface.co/pricing was mentioned as an alternative
    • From David: Hugging Face also has a variety of tags around different areas of AI. So there’s the Natural Language Processing stuff, but ChatGPT is the big player there. But things like object detection and audio tools are there.
    • Yamil suggested running tutorials of the https://scikit-learn.org/stable/
      • Simple and efficient tools for predictive data analysis
      • Accessible to everybody, and reusable in various contexts
      • Built on NumPy, SciPy, and matplotlib
      • Open source, commercially usable - BSD license
    • Recent post from Simon Willison on Python and OpenAI tools: https://simonwillison.net/2023/Sep/12/llm-clip-and-chat/
    • We talked about concerns about the AI hype and over-reliance on AI.
    • We very briefly spoke about NLP (Natural Language Processing), and how it is just a small part of the “engine” behind a platform like ChatGPT
      • to try to learn NLP I ran some tutorials using the Python module https://spacy.io/ (a minimal sketch follows this list)
      • “spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.”
      • “If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context?”
  • We spoke about Charles's new repository with exercises to learn Python skills
  • Tomasz talked about issues with being an organizational customer of Naxos, which is a streaming audio/video content vendor
    • For example, how to make sure the catalog is serving the correct sets of valid MARC files with also valid 856 tags that lead to the content
    • Here is a presentation on the pitfalls of keeping your holdings in sync with vendors
    • Everything is Broken, but by How Much Exactly (video)? (slides)
    • Tomasz would like to see if he can use Python to automate the process of keeping the holdings in sync, meaning that MARC records for content that is no longer available via Naxos are deleted from the catalog in a timely manner
      • For example, doing some analysis with Pandas
  • Kate wrote:
    • Once we migrate to our new ILS (Symphony), we will eventually (hopefully!) start using their eResource Central system for all our eContent and be able to do away with MARC records for eContent. But for now we use a combination of extracting batches of records in order to use MarcEdit’s link checker or other link checkers, or just periodically wiping out all our MARC records for a particular vendor and loading a new batch from the vendor for all our holdings
    • We’re about to do that now with Axis 360 since they’ve switched to “Boundless”. We have over 30,000 MARC records for Axis 360, so just too much to handle
    • Mentioned the difficulty of trying to fix problems in the large vendor MARC record sets that need to be added to our catalogs, for example misspellings or bad records
  • We spoke about the limitations of licensing content from Naxos (or similar vendors) versus actually storing that content locally
  • Briefly mentioned the ongoing “Internet Archive lawsuit”
  • Here is an article from the New York Times about the lawsuit that is a few weeks old. A key quote from the article that we talked about: “Libraries came before publishers,” the 62-year-old librarian said in a recent interview in the former Christian Science church in western San Francisco that houses the archive. “We came before copyright. But publishers now think of libraries as customer service departments for their database products.”
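A minimal spaCy sketch along the lines of the NLP experiments mentioned above; it assumes the small English model has been downloaded with python -m spacy download en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The New York Public Library digitized 180,000 items in 2016.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # named entities and their types (ORG, DATE, CARDINAL, ...)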

September 5th, 2023

  • Charles showed some code that batch creates APA & AMA citations
  • Carlos wanted feedback on how to add small improvements to their code that creates citations
    • for example, when there is no volume number for a citation, how to elegantly leave the volume number out
  • someone suggested using Python 3.10's “case” functionality, which is formally called “Structural Pattern Matching” (a minimal sketch follows this list)
  • we briefly talked about how PEP stands for “Python Enhancement Proposal”
  • Here is a site with a brief explanation on how to use “Structural Pattern Matching” in Python 3.10 https://realpython.com/python310-new-features/#structural-pattern-matching
  • Eduardo, who works with Charles, mentioned that they are trying to figure out how to encode that some parts of the citation have to be in italic when using Pandas to batch create citations
  • Tom has this suggestion for dealing with citation data
  • Yamil talked about using “unittest” for a pre-existing python code base, but mentioned that you can keep older tests as unittest style and just add new tests that use pytest
  • we talked about “Library Carpentry” classes and how helpful they have been. They can cover various topics, including Python
    • https://librarycarpentry.org/index.html
    • “Library Carpentry focuses on building software and data skills within library and information-related communities. Our goal is to empower people in these roles to use software and data in their own work and to become advocates for and train others in efficient, effective and reproducible data and software practices. Our workshops are based on our lessons. ”
    • The umbrella organization for Library Carpentry (The Carpentries) also includes Data Carpentry and Software Carpentry
  • Yamil was asked to briefly speak about a session at the Open Library Foundation’s (OLF) conference (WOLFCon) that covered the FOLIO ILS and the use of Python for post migration clean up by folks at Wellesley
  • this site was suggested for improving your Python skills, but other programming languages are supported
  • we spoke about Python community’s preferred writing style versus Ruby’s
  • We spoke about PEP8, which is the main Python style guide
  • spoke about Black, which can be used to change your code to match PEP8
    • “Black: The uncompromising code formatter”
    • We spoke about how the PyCharm Python editor is great about reminding you to follow PEP8 when you write your code, and it also gives the option to automatically reformat individual code snippets to follow PEP8, instead of just reformatting all of your code
    • Yamil also mentioned how he has opened up existing Python codebases in PyCharm, and the PyCharm indexer has found many hidden bugs in code that had never run or code that had logic flaws
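A minimal Structural Pattern Matching sketch (Python 3.10+) applied to the missing-volume situation mentioned above; the citation dictionary is hypothetical:

citation = {"journal": "College & Research Libraries", "year": 2023, "volume": None}

match citation:
    case {"volume": None} | {"volume": ""}:
        volume_part = ""                 # elegantly leave the volume out
    case {"volume": vol}:
        volume_part = f", {vol}"
    case _:
        volume_part = ""

print(f'{citation["journal"]} ({citation["year"]}){volume_part}')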

August 22, 2023

... missing ... 😭

August 8, 2023

Our meeting focused on Pydantic. Matt Lincoln from JSTOR Labs gave a brief introduction to the tool and its uses.

Matt used this jupyter notebook to demo basic Pydantic syntax and validation functionality.

  • Data validation can be done using Python type hints
  • Fast and extensible, Pydantic plays nicely with your linters/IDE/brain. Define how data should be in pure, canonical Python 3.7+; validate it with Pydantic.
  • We briefly talked about wanting to review how to create classes and objects in Python in a future meeting.
  • Pydantic can help with IDE / editor auto complete / auto suggest
  • Pydantic has an x.json() function/method to serialize data to JSON
  • great for writing APIs
  • Pydantic has a x.schema() method (which uses JSON schemas)
    • the schema can then be used to create API documentation for using the API
  • The FastAPI framework for Python-based APIs uses Pydantic a lot
  • FYI: Pydantic version 2 is just coming out, and some products/Python modules that use Pydantic may not be ready for version 2 yet, but should still support version 1
  • we also briefly talked about Python’s built-in “data classes”
  • “In Python, a data class is a class that is designed to only hold data values. They aren’t different from regular classes, but they usually don’t have any other methods. They are typically used to store information that will be passed between different parts of a program or a system.”
  • we talked about how Pydantic is not a replacement for “JSON Schema”; it is a complementary tool
  • talked about Pydantic validators and their application
  • we briefly talked about typing in Python in general, and how helpful it can be
  • questions for Matt:
    • is there any integration between pydantic and popular ORMs (like sqlalchemy for example)? Answer: yes, pydantic data classes should work well with most ORMs
    • can pydantic validation features be useful in format crosswalks when we do not care about JSON output? Answer: yes, although in some cases more strict and detailed validation may be required. Still, out-of-the-box validation in Pydantic would be very useful, in Matt's opinion
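A minimal Pydantic sketch using the v1-style .json() and .schema() methods mentioned in the notes (the Item model is hypothetical, not from Matt's notebook):

from pydantic import BaseModel, ValidationError

class Item(BaseModel):
    barcode: str
    title: str
    pages: int = 0                 # type hints drive validation and coercion

item = Item(barcode="39031012345678", title="Foo", pages="312")  # "312" is coerced to int
print(item.json())                 # serialize to JSON (v1 API; v2 renames this model_dump_json())
print(Item.schema())               # JSON Schema dict, handy for generating API docs

try:
    Item(barcode="39031", title=None)   # None is not a valid str, so this raises
except ValidationError as exc:
    print(exc)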

July 25, 2023

  • Rebecca:
    • Inventory tool to active scan vs. lists, processes, & jobs https://github.com/LibraryNinja/alma_inventory_utility/tree/main
    • Utilizes: pysimplegui, auto-py-to-exe
    • Old method: Make a barcode set, run job on Alma to update
    • Problem of not really knowing if something wasn’t found or had a status (loan, out of place, etc.)
    • This is loosely based off of Jeremy Hobbs Lazy Lists utility to adapt to an inventory project. (https://github.com/MrJeremyHobbs/LazyLists)
    • Examines items in XML
    • Pulls in some basic information to confirm for users.
    • Indicates set aside for problematic titles (tech services would handle)
    • Used auto-py-to-exe to allow student workers to run this small utility on their machines.
  • Julie:
    • Sierra had a shelflist/inventory feature but it did not really work well, so a Python inventory tool is great!
    • Had used SQL lists to help scan/match with selenium
    • Tools for link checking?
    • Authentication with EZ Proxy
    • https://pypi.org/project/LinkChecker/
  • Charles:

July 11, 2023

Rough and incomplete summary of topics covered in today’s (2023-07-11) Python{4}Lib group meeting

  • we talked about TAP - Text Analysis Pedagogy classes

  • Eric mentioned Wagtail, a Python CMS built on top of Django, whose development has been sponsored by Google

  • We briefly talked about using Gunicorn (https://gunicorn.org/), a Python WSGI HTTP server, to serve Python software like Django or Flask

  • Eric also mentioned a Python-based institutional repository, and how it compared to the PHP-based Islandora digital repository

  • we talked about using http://docopt.org/ instead of using the Python built-in argparse module for parsing command line (CLI) parameters

  • We then talked about parsing ezproxy “audit” files with Python

  • then Eric shared a script that he created to parse a data file for the Koha ILS using docopt to parse the CLI parameters that are listed in the comments at the top of the file

  • We talked about how to improve your coding style before posting your Python code on GitHub or on the internet.

  • Yamil recommended this book which helped him write in more standard/professional Python style: “Beyond the Basic Stuff with Python / Al Sweigart”

  • We then talked about when to use the try/except

    • Python syntax to catch exceptions, since folks often did not see try/except being used a lot in other people's code
    • some of us mentioned that we don’t use it all of the time, but in some situations we always make sure to use it. For example, it is common to use try/except when you are calling a method that commonly raises exceptions.
    • Like in the Python Selenium module for writing “functional tests” for web pages. There are several Selenium methods that start with find_*() and can easily trigger an exception if what you are looking for in a webpage is not found. In this context I always use a try/except statement around calls like find_element_by_css_selector()
    • there is of course a lot more that can be said about when to use try/except in your Python code
    • this chapter from the “Beyond the Basic Stuff with Python” book, among many tips, includes how to use the built-in dictionary get() method, which can be used to avoid accidentally triggering a KeyError exception when you try to access a dictionary key that does not actually exist
      • Writing Pythonic Code - Pythonic Ways to Use Dictionaries
      • using the get() dictionary method to avoid KeyError exceptions
    my_dict = {'username': 'joe'}
    my_dict['password']  # raises a KeyError exception because the key does not exist
    my_dict.get('password', False)  # simply returns False, or whatever is placed in the 2nd parameter of get()
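A sketch of the try/except pattern around the kind of Selenium lookup described above; it assumes a working Firefox WebDriver and uses a hypothetical page and CSS selector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("https://example.org")

try:
    # find_element() raises NoSuchElementException when nothing matches the selector
    banner = driver.find_element(By.CSS_SELECTOR, "#site-banner")
    print("banner text:", banner.text)
except NoSuchElementException:
    # expected failure mode: the element simply is not on the page
    print("no banner found on this page")
finally:
    driver.quit()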

June 27, 2023

import requests

# download the file in chunks and write it to disk (example from Automate the Boring Stuff)
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()

with open('RomeoAndJuliet.txt', 'wb') as playFile:
    for chunk in res.iter_content(100000):
        playFile.write(chunk)
  • Also we mentioned networkX: NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
  • Tomasz asked if anyone was doing any batch work on images with Python, to find a faster way to process a larger number of images. We talked about perhaps using multiprocessing for this.
    • Again from the book "Automate the Boring Stuff with Python", Ch. 19 talks about using the Pillow Python module to batch-change images
    • Also, there should be ways to use the very well-known non-Python library ImageMagick, controlled through Python, to make batch changes to images. Yamil has worked with many projects, like the Drupal/PHP-based Islandora project, that use ImageMagick for making changes to images
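A sketch of the multiprocessing idea from the discussion above: batch-resize images in parallel with Pillow (the input and output folder names are hypothetical):

from multiprocessing import Pool
from pathlib import Path
from PIL import Image

def make_thumbnail(path):
    out = Path("thumbs") / path.name
    with Image.open(path) as img:
        img.thumbnail((1024, 1024))    # shrink in place, preserving aspect ratio
        img.save(out)
    return out

if __name__ == "__main__":
    Path("thumbs").mkdir(exist_ok=True)
    paths = sorted(Path("originals").glob("*.tif"))
    with Pool() as pool:               # one worker process per CPU core by default
        for result in pool.map(make_thumbnail, paths):
            print("wrote", result)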

June 13, 2023

  • Python podcasts suggestion from Tomasz: PythonBytes
  • David talked about a new Python module called “Pandas AI” that he found useful if you have a paid ChatGPT account
  • We talked about when we have used ChatGPT to write some Python code snippets, and what our results were.
    • The results were mostly positive, but we talked about the benefits of already knowing Python well enough to formulate requests more precisely and evaluate how good the ChatGPT responses were
    • Someone mentioned that ChatGPT has become an alternative to Stack Overflow, especially if you are in a hurry
    • Someone mentioned Github Copilot: “Those of us who have GitHub educator accounts have free access to Copilot. Have not tried it. Very reluctant, personally.” Which uses AI to write code for you.
    • Will StackOverflow become obsolete with the revolution in AI? Yamil thinks that it is a good inspiration for prompts, and still has great information
    • We saw an example of sharing a snippet of object oriented Python code to ask chatGPT to explain what is missing
    • One of the participants was glad to get the explanations from chatGPT of what was missing in their object oriented code
    • A recent Code4Lib article that talked about using AI generated code was shared “Utilizing R and Python for Institutional Repository Daily Jobs”
    • We briefly talked about the ethics of using AI-written code that was trained on code that others published publicly on GitHub, but without their explicit consent
    • Podcast example created by AI: “I’ve been listening to this series in the Planet Money podcast where they try to make an entire podcast episode made by AI:” https://www.npr.org/series/1178395718/planet-money-makes-an-episode-using-ai
    • Charles asked if anyone was using Python to automate work with the Azure cloud computing platform
  • We talked about a great site and free book that many people use to get started with Python, “Python for Everybody”
  • We also talked about the well-known and still very popular Python Requests module, but also the newer and “async compatible” HTTPX module, which was also mentioned on the Python Slack channel.

May 30, 2023

  • David shared his code utilizing pymarc to harvest and clean OCLC records. An older example of code: https://github.com/derlandson/PyCat
  • Demo of Match MARC toolset as well.
  • Tomasz reported his first experiences using pymarc v.5
  • Discussed a potential pymarc feature: ordering subfields according to a particular field's cataloging practice
    • challenge: no clear, outlined rules to base it on
  • Rebecca demoed a script created to have circ desk staff click a single button for simple questions (directions, tech, find a book, etc.) Creates output file and emails results as csv once per month. Currently doesn’t need admin permissions but various features may impact this.

May 16, 2023

  • We had a brief discussion about pymarc and MARC authority data
    • sparked by Benjamin's issues with using pymarc for authority records
    • Tomasz ran some quick tests and they looked good: pymarc was able to read such data, but more tests are needed to see if manipulating and writing is done correctly. There were concerns about differences in the leader field between bibliographic and authority data

Ed Summers intro to new pymarc

  • David introduced Ed
  • Ed stated pymarc is the work of many people; Ed's involvement is more as the maintainer
Breaking changes in pymarc v.5:
  • new class pymarc.Field.Subfield

  • helper properties instead of methods

    • old: record.title(), new: record.title
    • old: record.publisher(), new: record.publisher
  • automatically sets the UTF-8 code in the record leader at position 9

    • pymarc always converts data to unicode, but before it did not attempt to change the code in the leader to reflect that
    • most people don't want to write MARC-8, and want UTF-8 encoded data
  • Ed shows off doing live coding! Uses Google Colab and Jupyter notebooks (tip: you can pip install packages in Colab: !pip install pymarc; the exclamation mark tells the notebook that the cell is not Python code but a command-line command)

  • Ed shows initiating new record instance, and adding fields with the new model for subfields

  • Subfield is a Python namedtuple

New:

from pymarc import Record, Field, Subfield

record = Record()
record.add_field(
    Field(
        tag="245",
        indicators=["0", "0"],
        subfields=[
            Subfield(code="a", value="Foo :"),
            Subfield(code="b", value=" bar /"),
            Subfield(code="c", value="Spam.")
        ]
    ))

or simply:

field = Field(
    tag="245",
    indicators=["0", "0"],
    subfields=[
        Subfield("a", "Foo :"),
        Subfield("b", "bar /"),
        Subfield("c", "Spam.")
    ])

Old:

record.add_field(
    Field(
        tag="245",
        indicators=["0", "0"],
        subfields=["a", "Foo :", "b", "bar /", "c", "Spam."]
    ))
  • The new model has advantages over subfields as a list of strings:

    • matches how catalogers think about subfields - as code-value pairs (Tomasz)
    • helps guard against errors such as missing an element needed to properly create a subfield
  • discussed briefly the differences between pymarc and the similar Perl library [MARC::Record](https://metacpan.org/pod/MARC::Record)

  • Ed showed a tip for how to avoid malformed or otherwise invalid records when looping over a file: records with errors are returned as None (malformed bibs, leader length problems, etc.)

from pymarc import MARCReader

with open("foo.mrc", "rb") as marcfile:
    reader = MARCReader(marcfile)
    for record in reader:
        if record is None:
            print(reader.current_exception)
        else:
            # do something with the record
            pass
  • talked about potential new features in pymarc, for example handling of linked 880 fields that include parallel data in non-Latin scripts

May 2, 2023

April 18, 2023

At today's meeting @michelle.janowiecki gave a short presentation on Pandas, partially based on a longer Pandas presentation she has given before: "Speedy pandas: a super brief intro to Python's pandas library" (see slides). Here are a couple of useful links from her presentation...

Pandas Official resources

Pandas Additional resources

Examples of the code Michelle demonstrated

import pandas as pd

filename = "sampleData.csv"
df = pd.read_csv(filename)
print(df.head())

print(df.columns)

degree_department = df["degree_department"]
department_unique = degree_department.unique()
print(department_unique)
unique_list = list(department_unique)
print(unique_list)
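# next snippet: basic cleaning (drop empty rows/columns, drop duplicates, strip whitespace)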
import pandas as pd

filename = "sampleData.csv"
df = pd.read_csv(filename)

print(df.shape)
df = df.dropna(axis=0, how="all")
df = df.dropna(axis=1, how="all")
df = df.drop_duplicates()
df["title"] = df["title"].str.strip()

print(df.head())
print(df.shape)

df.to_csv("sampleData_cleaned.csv", index=False)
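# next snippet: merge two CSV files on a shared subject_id column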
import pandas as pd

df_1 = pd.read_csv("frame_1.csv")
df_2 = pd.read_csv("frame_2.csv")

merged = pd.merge(df_1, df_2, how="left", on="subject_id")
print(merged.head())

merged.to_csv("merged_frames.csv", index=False)

These are some of the Pandas features @michelle.janowiecki demonstrated today

  • drop_duplicates()
  • dropna()
  • merge()

After the presentation we all exchanged pandas usage tips

April 4, 2023

The mini-workshop "An Introduction to Python for Absolute Beginners":

A very basic intro to Python for librarians who have little to no experience with Python but who want to get started.

  • What is Python and why is it useful? (5 min)
  • Hands-on practice with basic operations in Python, using Google Colaboratory (25 min)
  • Print function
  • Data types
  • Arithmetic operations
  • String concatenation
  • Variable assignment
  • Q&A/Resources (15 min)
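For reference, a tiny example touching each of the hands-on topics above:

# print function, data types, arithmetic, string concatenation, variable assignment
name = "Ada"                    # variable assignment (a string)
books_read = 3                  # an integer
pages_per_book = 250.5          # a float

total_pages = books_read * pages_per_book     # arithmetic
greeting = "Hello, " + name + "!"             # string concatenation

print(greeting)
print("Approximate pages read:", total_pages)
print(type(total_pages))        # data types: prints <class 'float'>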

Notes

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data.tsv', sep='\t')

March 21, 2023

  • Talked about this group’s new repository, and that we want to encourage others to contribute changes via PRs (or reach out to the group)
  • Talked about combining JS and python for web visualization
  • Talked about whether on macOS we should currently be using Homebrew for installing Python
  • Talked about Library Carpentry lessons on Python and other skills like bash, OpenRefine
  • Spoke a bit about Google Colab, which is essentially Jupyter Notebooks in the cloud, with no need for local installation
  • Pivoted to talk about interesting things seen during Code4lib
    • the Python GUI package mentioned named Gooey
    • “There was a poster about updating subject headings as well. Which was something we had briefly talked about briefly a week before C4L.”
  • Touched on a suggested breaking change to pymarc, MR details
    • this change uses Python “namedtuples”
    • this change is welcomed by many
    • We then covered how to use pymarc with authority records, as opposed to bibliographic records - more research needs to be done
  • NOTE: this Python group in the future plans to host a pymarc “code recipe” sharing session
  • Talked about current issues in pymarc with MARC bib tag 880

March 7, 2023

  • Introductions with a few new members
  • Moved the Python{4}Lib resource page to Code{4}Lib, thanks @klinga
  • @Rebecca Hyams working on an ELUNA Dev. Day presentation gathering specific holding data (granular) from Alma via API and parsing it via python script. Chat about maintaining authorities when you’ve decided to change from standard language. Is/should there be a tool to check for changes for authorities you select?
  • A project for a heat map visual for circulation might be a new way of helping to weed/collection develop.
    • Perhaps there's interest to have a working group dive into different projects. Could be helpful for design ideas.
  • Dashboards and/or developing scripts that can translate one form of data to another; identifying transformation steps and when to streamline them in one script vs. multiple.
  • IPEDS data transformations. A lot of data isn’t as streamlined as we’d like every time IPEDS comes up. Still quite local though. (Changes year to year?)