Commit eeafd69, an initial commit with 0 parents, showing 18 changed files with 1,037 additions and 0 deletions.
Gemfile
@@ -0,0 +1,15 @@
source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem 'jekyll'

group :jekyll_plugins do
  gem 'github-pages'
  gem 'jekyll-remote-theme'
  gem 'jekyll-include-cache'
  gem 'webrick'
end

# gem "rails"
README.md
@@ -0,0 +1,25 @@
# PMLR 264

To suggest fixes to this volume, please make a pull request containing the requested changes and a justification for them.

To edit the details of this conference, edit the [_config.yml](./_config.yml) file and submit a pull request.

To make changes to the individual paper details, edit the associated paper file in the [./_posts](./_posts) subdirectory.

For details of how to publish in PMLR, please check https://proceedings.mlr.press/faq.html

For details of what is required to submit a proceedings, please check https://proceedings.mlr.press/spec.html

Published as Volume 264 by the Proceedings of Machine Learning Research on 28 January 2025.

Volume Edited by:
* Sheng Li
* Zhongmin Cui
* Jiasen Lu
* Deborah Harris
* Shumin Jing

Series Editors:
* Neil D. Lawrence
_config.yml
@@ -0,0 +1,93 @@
---
booktitle: Proceedings of Large Foundation Models for Educational Assessment
shortname: FM-EduAssess2024
year: '2024'
volume: '264'
start: &1 2024-12-15
end: 2024-12-16
published: 2025-01-28
sections:
- name: Preface
  title: Preface
- name: Contributed Papers
  title: Contributed Papers
layout: proceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: FM-EduAssess2024
month: 0
cycles: false
bibtex_editor: Li, Sheng and Cui, Zhongmin and Lu, Jiasen and Harris, Deborah and
  Jing, Shumin
editor:
- given: Sheng
  family: Li
- given: Zhongmin
  family: Cui
- given: Jiasen
  family: Lu
- given: Deborah
  family: Harris
- given: Shumin
  family: Jing
title: Proceedings of Machine Learning Research
description: |
  Proceedings of Large Foundation Models for Educational Assessment
  Held in Vancouver, BC, Canada on 15-16 December 2024
  Published as Volume 264 by the Proceedings of Machine Learning Research on 28 January 2025.
  Volume Edited by:
    Sheng Li
    Zhongmin Cui
    Jiasen Lu
    Deborah Harris
    Shumin Jing
  Series Editors:
    Neil D. Lawrence
date_str: 15--16 Dec
url: https://proceedings.mlr.press
author:
  name: PMLR
baseurl: "/v264"
twitter_username: MLResearchPress
github_username: mlresearch
markdown: kramdown
exclude:
- README.md
- Gemfile
- ".gitignore"
plugins:
- jekyll-feed
- jekyll-seo-tag
- jekyll-remote-theme
remote_theme: mlresearch/jekyll-theme
style: pmlr
permalink: "/:title.html"
ghub:
  edit: true
  repository: v264
display:
  copy_button:
    bibtex: true
    endnote: true
    apa: true
  comments: false
volume_type: Volume
volume_dir: v264
email: ''
conference:
  name: Large Foundation Models for Educational Assessment
  url: https://neurips2024edu.github.io/
  location: Vancouver, BC, Canada
  dates:
  - *1
  - 2024-12-16
analytics:
  google:
    tracking_id: UA-92432422-1
orig_bibfile: "/Users/neil/mlresearch/v264/FMEduAssess2024_corrected.bib"
# Site settings
# Original source: /Users/neil/mlresearch/v264/FMEduAssess2024_corrected.bib
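One detail worth noting in this config: `start: &1 2024-12-15` defines a YAML anchor, and the `- *1` under `conference.dates` is an alias that refers back to it, so the start date is written once and reused. A minimal sketch of how a parser resolves the alias, using a trimmed excerpt of the config (assumes PyYAML is installed):

```python
# Trimmed excerpt of _config.yml showing the &1 anchor and *1 alias.
import yaml

snippet = """
start: &1 2024-12-15
end: 2024-12-16
conference:
  dates:
  - *1
  - 2024-12-16
"""

config = yaml.safe_load(snippet)
# The *1 alias resolves to the value anchored by &1, so the first
# conference date equals the top-level start date.
assert config["conference"]["dates"][0] == config["start"]
print(config["conference"]["dates"])
# -> [datetime.date(2024, 12, 15), datetime.date(2024, 12, 16)]
```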
Paper entry bleiweiss25a (in ./_posts)
@@ -0,0 +1,42 @@
---
title: A Large Foundation Model for Assessing Spatially Distributed Personality Traits
abstract: We explored emulating textually encoded personality information in a large
  language model. Given its predominant empirical validation, we chose the five-factor
  model of personality compiled for a broad range of natural languages. Our study
  assessed personality traits from a multicultural viewpoint over a diverse set of
  thirty universal contexts, thus contributing to the wider comprehension of generalizing
  relationships among personality traits across cultures. We administered psychometric
  tests to the language model, examined links between location and personality, and
  cross-validated measures at various levels of the trait hierarchy.
section: Contributed Papers
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: bleiweiss25a
month: 0
tex_title: A Large Foundation Model for Assessing Spatially Distributed Personality
  Traits
firstpage: 173
lastpage: 185
page: 173-185
order: 173
cycles: false
bibtex_author: Bleiweiss, Avi
author:
- given: Avi
  family: Bleiweiss
date: 2025-01-28
address:
container-title: Proceedings of Large Foundation Models for Educational Assessment
volume: '264'
genre: inproceedings
issued:
  date-parts:
  - 2025
  - 1
  - 28
pdf: https://raw.githubusercontent.com/mlresearch/v264/main/assets/bleiweiss25a/bleiweiss25a.pdf
extras: []
# Format based on Martin Fenner's citeproc: https://blog.front-matter.io/posts/citeproc-yaml-for-bibliographies/
---
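Every paper in the volume is stored this way: a Jekyll post whose YAML front matter, between the `---` delimiters, carries all of the citation metadata. A minimal sketch of pulling those fields out with Python; it assumes PyYAML, and the filename is hypothetical (actual post names follow Jekyll's date-prefixed convention):

```python
# Minimal sketch: extract the YAML front matter from a paper entry.
# Assumes PyYAML; the path below is hypothetical, not from the repo.
import yaml

def read_front_matter(path):
    """Return the YAML front matter of a Jekyll post as a dict."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # The front matter sits between the first two '---' delimiters.
    _, block, _ = text.split("---", 2)
    return yaml.safe_load(block)

meta = read_front_matter("_posts/2025-01-28-bleiweiss25a.md")  # hypothetical name
print(meta["title"])
print(meta["page"], meta["pdf"])
```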
Paper entry deroy25a (in ./_posts)
@@ -0,0 +1,55 @@
---
title: 'MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question
  Generation'
abstract: Automatic question generation is a critical task that involves evaluating
  question quality by considering factors such as engagement, pedagogical value, and
  the ability to stimulate critical thinking. These aspects require human-like understanding
  and judgment, which automated systems currently lack. However, human evaluations
  are costly and impractical for large-scale samples of generated questions. Therefore,
  we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized
  Rating), which leverages large language models (LLMs) to automate the evaluation
  process for questions generated by automated question generation systems. We experimented
  with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed
  that the scores of human evaluation metrics, namely relevance, appropriateness,
  novelty, complexity, and grammaticality, improved when using the feedback-based
  approach called MIRROR, tending to be closer to the human baseline scores. Furthermore,
  we observed that Pearson’s correlation coefficient between GPT-4 and human experts
  improved when using our proposed feedback-based approach, MIRROR, compared to direct
  prompting for evaluation. Error analysis shows that our proposed approach, MIRROR,
  significantly helps to improve relevance and appropriateness.
section: Contributed Papers
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: deroy25a
month: 0
tex_title: 'MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question
  Generation'
firstpage: 3
lastpage: 32
page: 3-32
order: 3
cycles: false
bibtex_author: Deroy, Aniket and Maity, Subhankar and Sarkar, Sudeshna
author:
- given: Aniket
  family: Deroy
- given: Subhankar
  family: Maity
- given: Sudeshna
  family: Sarkar
date: 2025-01-28
address:
container-title: Proceedings of Large Foundation Models for Educational Assessment
volume: '264'
genre: inproceedings
issued:
  date-parts:
  - 2025
  - 1
  - 28
pdf: https://raw.githubusercontent.com/mlresearch/v264/main/assets/deroy25a/deroy25a.pdf
extras: []
# Format based on Martin Fenner's citeproc: https://blog.front-matter.io/posts/citeproc-yaml-for-bibliographies/
---
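The deroy25a abstract reports that Pearson’s correlation between GPT-4’s scores and human experts’ scores improved under MIRROR. For readers unfamiliar with the metric, a minimal sketch of computing it over two score vectors; the numbers are made up and only illustrate the calculation (assumes numpy):

```python
# Minimal sketch: Pearson's correlation between model and human scores,
# the agreement metric cited in the deroy25a abstract. Data is made up.
import numpy as np

human = np.array([4.0, 3.5, 5.0, 2.0, 4.5, 3.0])  # hypothetical expert ratings
model = np.array([4.2, 3.0, 4.8, 2.5, 4.4, 3.2])  # hypothetical GPT-4 ratings

r = np.corrcoef(human, model)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(f"Pearson r = {r:.3f}")
```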
Paper entry gao25a (in ./_posts)
@@ -0,0 +1,67 @@
---
title: 'Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual
  Question Evaluation in Engineering'
abstract: This study explores the feasibility of using large language models (LLMs),
  specifically GPT-4o (ChatGPT), for automated grading of conceptual questions in
  an undergraduate Mechanical Engineering course. We compared the grading performance
  of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from
  the MEEN 361 course at Texas A&M University, each answered by approximately 225
  students. Both the LLM and TAs followed the same instructor-provided rubric to ensure
  grading consistency. We evaluated performance using Spearman’s rank correlation
  coefficient and Root Mean Square Error (RMSE) to assess the alignment between rankings
  and the accuracy of scores assigned by GPT-4o and TAs under zero- and few-shot grading
  settings. In the zero-shot setting, GPT-4o demonstrated a strong correlation with
  TA grading, with Spearman’s rank correlation coefficient exceeding 0.6 in seven
  out of ten datasets and reaching a high of 0.9387. Our analysis reveals that GPT-4o
  performs well when grading criteria are straightforward but struggles with nuanced
  answers, particularly those involving synonyms not present in the rubric. The model
  also tends to grade more stringently in ambiguous cases compared to human TAs. Overall,
  ChatGPT shows promise as a tool for grading conceptual questions, offering scalability
  and consistency.
section: Contributed Papers
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: gao25a
month: 0
tex_title: 'Towards Scalable Automated Grading: Leveraging Large Language Models for
  Conceptual Question Evaluation in Engineering'
firstpage: 186
lastpage: 206
page: 186-206
order: 186
cycles: false
bibtex_author: Gao, Rujun and Guo, Xiaosu and Li, Xiaodi and Narayanan, Arun Balajiee
  Lekshmi and Thomas, Naveen and Srinivasa, Arun R.
author:
- given: Rujun
  family: Gao
- given: Xiaosu
  family: Guo
- given: Xiaodi
  family: Li
- given: Arun Balajiee Lekshmi
  family: Narayanan
- given: Naveen
  family: Thomas
- given: Arun R.
  family: Srinivasa
date: 2025-01-28
address:
container-title: Proceedings of Large Foundation Models for Educational Assessment
volume: '264'
genre: inproceedings
issued:
  date-parts:
  - 2025
  - 1
  - 28
pdf: https://raw.githubusercontent.com/mlresearch/v264/main/assets/gao25a/gao25a.pdf
extras: []
# Format based on Martin Fenner's citeproc: https://blog.front-matter.io/posts/citeproc-yaml-for-bibliographies/
---
Paper entry lee25a (in ./_posts)
@@ -0,0 +1,54 @@
---
title: 'Gemini Pro Defeated by GPT-4V: Evidence from Education'
abstract: This study compared the classification performance of Gemini Pro and GPT-4V
  in educational settings. Employing visual question-answering (VQA) techniques, the
  study examined both models’ ability to read text-based rubrics and automatically
  score student-drawn models in science education. We employed quantitative and qualitative
  analyses using a dataset derived from student-drawn scientific models and NERIF
  (Notation-Enhanced Rubrics for Image Feedback) prompting methods. The findings reveal
  that GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and
  quadratic weighted kappa. The qualitative analysis shows that the differences may
  be due to the models’ ability to process fine-grained texts in images and overall
  image classification performance. Even when the NERIF approach was adapted by further
  down-sizing the input images, Gemini Pro was unable to perform as well as GPT-4V.
  The findings suggest GPT-4V’s superior capability in handling complex multimodal
  educational tasks. The study concludes that while both models represent advancements
  in AI, GPT-4V’s higher performance makes it a more suitable tool for educational
  applications involving multimodal data interpretation.
section: Contributed Papers
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: lee25a
month: 0
tex_title: 'Gemini Pro Defeated by GPT-4V: Evidence from Education'
firstpage: 33
lastpage: 60
page: 33-60
order: 33
cycles: false
bibtex_author: Lee, Gyeonggeon and Shi, Lehong and Latif, Ehsan and Zhai, Xiaoming
author:
- given: Gyeonggeon
  family: Lee
- given: Lehong
  family: Shi
- given: Ehsan
  family: Latif
- given: Xiaoming
  family: Zhai
date: 2025-01-28
address:
container-title: Proceedings of Large Foundation Models for Educational Assessment
volume: '264'
genre: inproceedings
issued:
  date-parts:
  - 2025
  - 1
  - 28
pdf: https://raw.githubusercontent.com/mlresearch/v264/main/assets/lee25a/lee25a.pdf
extras: []
# Format based on Martin Fenner's citeproc: https://blog.front-matter.io/posts/citeproc-yaml-for-bibliographies/
---
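lee25a compares the two models on scoring accuracy and quadratic weighted kappa, a chance-corrected agreement statistic that penalizes large disagreements on an ordinal scale more heavily than small ones. A minimal sketch using scikit-learn’s implementation, on made-up rubric levels:

```python
# Minimal sketch: quadratic weighted kappa, the agreement statistic the
# lee25a abstract uses to compare GPT-4V and Gemini Pro scoring.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 0, 1, 2, 1, 0, 2, 1]  # hypothetical rubric levels
model_scores = [2, 0, 1, 1, 1, 0, 2, 2]  # hypothetical model output

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Quadratic weighted kappa = {qwk:.3f}")
```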