Annotation workflow v2 #19
We should probably own user creation so we have some visibility into who's making changes.
I'll try to find some time soon (probably this week or next) to help migrate the instance over to DigitalOcean.
thanks, @nfmcclure ! Much appreciated. Our pattern for other apps is to have a GitHub repo on our org, set to auto-deploy over on DO. I'm guessing you won't have the permissions you need, and PDAP staff are happy to pick up where you get stuck.
@nfmcclure a nudge, any update on the source code? we're happy to do any migration work, if that would help.
@josh-chamberlain I'm willing to help tackle this. I have a few things I'll need to fully complete the task.
As I understand it, this annotation workflow will include several components
Label Studio can be configured to accept Source (Input) and Target (Output) databases. Unfortunately, the non-local file options available for both are rather limited. I would need to know our preferred method of storage, and what information I would need in order to properly load preprocessed data into the Source Database. The process for setting up Label Studio on DigitalOcean seems fairly simple.
Additional, lower priority tasks include:
My Next Task
Current Blockers (cc: @josh-chamberlain )
#38 is a draft pull request for the preprocessing pipeline, which would be included in the data source identification repository. Being separate from the rest of the logic in that repository, it could easily be moved elsewhere. However, because it's associated with this issue, I'm keeping it here for now. The pipeline is relatively simple: it loads the data from the relevant source, harvests the HTML data for each URL, groups each with its respective URL, and then uploads that data to the relevant target. The source and target classes are placeholders, designed to be easily substituted for the actual classes and to guide the final implementation. Once I obtain the information about the source and target databases discussed in the blockers above, I will be able to complete this and submit it as a full pull request.
My Next Tasks
@maxachis yes, setting up in DigitalOcean should be pretty simple! In general we'll get things merged into a GitHub repo and working locally across our different machines, then Marty will push to a DO droplet. I think it's up to @mbodeantor how to set up the target and source databases—I'm open to discussing this tomorrow if needed.
@maxachis @josh-chamberlain Yeah, I would like to discuss; I'm unclear about the use case of Label Studio vs Doccano.
@mbodeantor either way, we'll need to start from scratch—we can't get hold of the source code for Doccano, though from what I understand we had a pretty vanilla implementation. The key difference is that Label Studio allows for labeling richer content like embedded images / rendered HTML, whereas Doccano is text-only. Label Studio also seems to have a more mainstream user base, better documentation, etc.
label_studio_config.zip
@maxachis nice, let's wait for @mbodeantor to make sure he's on board with using Label Studio.
Currently working on setting up a Droplet on DigitalOcean. Outstanding tasks:
A Label Studio version is online, of sorts. It can be accessed at http://167.71.177.131:8080. I'm hitting a possible blocker here in terms of Role-Based Access Control—specifically, the free version of Label Studio doesn't have it. That means that anyone who accesses our Label Studio instance can, if they are unscrupulous, change any settings they want in the project. If we're comfortable with the honor system, that's not a problem. If we aren't, that would require some restructuring. The enterprise version of Label Studio would avert this problem and would allow other functionality as well. However, the pricing for Enterprise is unknown—we'd have to contact sales about it. @josh-chamberlain My current question is as follows:
@maxachis interesting. I reached out to them for pricing info. Can we limit access to the entire instance, to at least control who can sign up? Allowing anyone to sign up is scary. If we have some control over who can sign up, that's different. If people have to come to us to get an account, I am comfortable with the honor system for now—provided we periodically extract labeled content in case someone messes up / sabotages us. This worked just fine with Doccano. I'm strongly against creating our own annotation pipeline from scratch. We can find a different free/cheap one which has RBAC if necessary...we are not the first to face this problem.
There might be ways to limit access to the entire instance, but that would probably require us to implement things that go beyond the capabilities of free Label Studio. One option could be finding a way to dockerize our particular configuration of Label Studio and then hand that out to volunteers. If we configure it to point to cloud target and source databases, then in theory they would just need to spin it up and get started. But this wouldn't stop them from sharing it around or enable us to track their activity. And it does add an extra step (or several) to getting someone to help with annotation. My personal opinion is that if we're willing to go to such lengths, then considering another annotator would probably be more effective for the effort.
Additionally, in case this affects consideration of whether to use Label Studio, I will point out that the feature of displaying HTML would probably run into some problems if we tried to display the HTML of the web pages we're looking at. Since the HTML of these pages sometimes relies on relative addressing, rendered HTML content absent the context of the web server it comes from might sometimes appear broken. Not always, but sometimes.
@maxachis let's keep it simple and just try to annotate the text then: the URL plus meta and header content being collected by the tag collector. We can save displaying the page content for a future enhancement.
Regarding auth/users, I do have a call with Label Studio tomorrow to find out about enterprise pricing. I suggest:
We have a labelstudio 2 week trial. Some things I'd like to test:
@josh-chamberlain What are the full suite of roles we're envisioning here? Based on the interface, I can see two right off the bat:
Any others I might be missing?
I've created a simple URL taxonomy labeling task based on my original design, which anyone can try out easily enough. This is accessible on the project website as "URL Labeling and Annotation".
The relevant documentation on pre-generated predictions can be found here. The video example provided shows someone modifying a JSON file with a pre-existing prediction, which is then displayed in the annotation task as a pre-selected option. It does not show someone being able to accept or reject an option as though the annotation task has already been performed. Thus, I have a few questions which I will investigate, but which are also worth asking the Label Studio team during the free-trial check-in:
I will note that if the answer to 1 is that we can't skip the annotation task directly to the review/accept/reject portion, we could nonetheless make a workaround -- for example, by displaying a URL, the predicted classification, and a binary Approve/Reject option. And, if I'm understanding things correctly, we can bypass the review process such that these pseudo-review annotations are not reviewed a second time. This would also allow us to create a workaround for 2 as well -- in this case, the pre-annotated data is treated as contextual information, and the label is the manual "Approve/Reject".
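To make the workaround concrete, here is a sketch of what a single pre-annotated task looks like in Label Studio's JSON format, based on its documentation on predictions. The control names (`record_type`, `url`) and the label value are illustrative placeholders, not our actual project config:

```python
import json

# A single task carrying one model prediction. "from_name"/"to_name" must
# match the names in the project's labeling config; the ones below are
# placeholders, as is the record-type label itself.
task = {
    "data": {"url": "https://example.com/police/dispatch"},
    "predictions": [
        {
            "model_version": "v0-placeholder",
            "score": 0.87,
            "result": [
                {
                    "from_name": "record_type",  # the choices control
                    "to_name": "url",            # the data field it labels
                    "type": "choices",
                    "value": {"choices": ["Calls for Service & Dispatch Logs"]},
                }
            ],
        }
    ],
}

print(json.dumps(task, indent=2))
```

Importing a list of such tasks should surface the prediction as a pre-selected option in the annotation view.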
Another point of interest is being able to integrate a machine learning backend with Label Studio and creating an automated active learning loop. This could synchronize well with #41, "Make training happen on digital ocean". I would need to investigate the implementation further, however, and I may benefit from @EvilDrPurple and @mbodeantor's insight into the machine learning pipeline and how easily we could integrate that into a Digital Ocean/Label Studio union. I'm currently playing with their ML Loop example, and y'all can follow along with my forked version of the repo here if you're curious.
@maxachis re: roles, 3. someone getting data into/out of label studio, or otherwise integrating with hugging face or the API |
Findings on Annotator and Reviewer Roles
Administrators can create projects. Assigning manual reviewers is fairly easy and intuitive. If I log in as an annotator, I only see the projects I'm added to. Thus, for annotations, several steps need to be done for access:
The user experience as an annotator is very user-friendly. Simple as click and go: annotators can submit or skip. The experience is similar for reviewers; however, depending on the settings, reviewers can also annotate, so pay attention to the settings. I'd additionally note that the project dashboard provides useful information, such as how long it takes people to complete a task. Recommend looking at that further.
I'll look into this next. As I said before, there is the option for machine learning integration, but there also appear to be simpler options that can involve either manual import/export of data, or else hooking it up to cloud-based storage options such as Amazon S3. |
Note that the cloud storage options available (for both the Source and Target databases) are limited to:
I've been able to set up a source data pipeline that can automatically pull in data for a particular project. A few observations:
I'll additionally point out that Label Studio has an API which seems like it could be useful, albeit with some limitations. This might make components such as setting up users, linking to specific projects, and so forth easier. UPDATE: Removed a portion expressing uncertainty about whether we can directly assign roles to users via the API -- I have confirmed that we can.
Looks like we can import data through the API: https://labelstud.io/api#tag/Import/operation/api_projects_import_create |
We can also export the data similarly through the API. These would probably be the better options to take, as opposed to hooking them up to cloud providers. Helps keep things more flexible. |
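A rough sketch of what wrapping those two calls might look like, using only the standard library. The host and token are placeholders, the import route is the one documented at `api_projects_import_create`, and the export route (including its `exportType` parameter) is an assumption to verify against the API reference:

```python
import json
import urllib.request

# Placeholders: point these at the real instance and an API token.
BASE = "http://localhost:8080"
TOKEN = "replace-with-api-token"

def import_url(project_id, base=BASE):
    # Documented import endpoint: POST /api/projects/{id}/import
    return f"{base}/api/projects/{project_id}/import"

def export_url(project_id, base=BASE):
    # Assumed export endpoint; check against the Export section of the API docs.
    return f"{base}/api/projects/{project_id}/export?exportType=JSON"

def import_tasks(project_id, tasks):
    """POST a batch of task dicts into a project and return the parsed response."""
    req = urllib.request.Request(
        import_url(project_id),
        data=json.dumps(tasks).encode("utf-8"),
        headers={
            "Authorization": f"Token {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Keeping the transfer in plain HTTP calls like this is what makes the API route more flexible than wiring the project to a specific cloud provider.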
@josh-chamberlain @mbodeantor I have created a draft pull request at #47 that can serve as a proof-of-concept for demonstrating how to transfer data into and out of the project, utilizing the API. If we wish to go forward with Label Studio, this can be used as a starting point for further modifications. I'll next work on modifying the data to test import pre-annotated data, to simulate what could be done with a machine learning pipeline.
Observations from trial
On the active learning functionality and deeper ML integration
It is interesting, and I think we could benefit from utilizing it. By selecting only the samples our machine learning model is most uncertain about, we could solve the problem we've had of certain training data being underrepresented. However, that process would require a more complicated setup, and would probably benefit from having an active learning setup already developed. Thus, it might not be useful to explore right now, given the limited amount of time we have on this trial. Additionally, the documentation for the machine learning portion is lacking and in some cases appears to contain contradictions. For example: model.py in the GitHub repository for the Label Studio ML Backend includes two methods,
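For concreteness, the selection step of such a loop is simple even if the integration isn't. A toy sketch of uncertainty sampling — not Label Studio code, just the logic an active learning loop would use to decide which URLs get sent to human annotators:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less sure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, k=2):
    """predictions: list of (task_id, [class probabilities]) pairs.
    Returns the k task ids the model is least confident about."""
    ranked = sorted(predictions, key=lambda tp: entropy(tp[1]), reverse=True)
    return [task_id for task_id, _ in ranked[:k]]

preds = [
    ("url-a", [0.90, 0.05, 0.05]),  # confident prediction
    ("url-b", [0.50, 0.30, 0.20]),  # somewhat unsure
    ("url-c", [0.34, 0.33, 0.33]),  # maximally unsure
]
print(most_uncertain(preds, k=2))  # ['url-c', 'url-b']
```

Routing only these high-entropy samples to annotators is what would address the underrepresented-training-data problem mentioned above.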
Questions for Label Studio Team
I'll update this comment with additional questions as I progress:
Creating/Rotating Users
Can be done using the API. I've linked to the relevant commands.
I updated my PR to include functionality for updating a member's role, as well as an integration test demonstrating this. |
We got some responses: It was nice meeting you today. To follow up on our conversation today, here are the answers to your questions:
I'm working on creating code that can convert our data into the requisite format for pre-annotations. Bear in mind, the data must be in a very precise format, which is not always optimally documented. I may also need @EvilDrPurple's insight as to how the label data output from the ML pipeline is currently represented, as I will need to know how to convert data from that format into Label Studio's bespoke format.
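Since the pipeline's output format is still to be confirmed, here is a hedged sketch of what that converter might look like. It assumes a simple row shape of `{"url", "label", "confidence"}`, and the control names (`record_type`, `url`) are placeholders for whatever the real labeling config uses:

```python
# Assumption: each ML pipeline output row looks like
# {"url": ..., "label": ..., "confidence": ...}. The actual row shape
# needs confirming, and "record_type"/"url" must match the project's
# labeling config.
def to_label_studio_tasks(rows):
    """Convert assumed pipeline rows into pre-annotated Label Studio tasks."""
    return [
        {
            "data": {"url": row["url"]},
            "predictions": [
                {
                    "model_version": "pipeline-v0",
                    "score": row["confidence"],
                    "result": [
                        {
                            "from_name": "record_type",
                            "to_name": "url",
                            "type": "choices",
                            "value": {"choices": [row["label"]]},
                        }
                    ],
                }
            ],
        }
        for row in rows
    ]

rows = [{"url": "https://example.com/logs", "label": "Dispatch Logs", "confidence": 0.91}]
tasks = to_label_studio_tasks(rows)
print(tasks[0]["predictions"][0]["result"][0]["value"])  # {'choices': ['Dispatch Logs']}
```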
Can confirm I've successfully been able to import pre-annotated data into Label Studio, which can then be reviewed directly, bypassing the annotation stage. In other words, we can create a full pipeline with either unannotated or pre-annotated data. My next priority will be to create an example pipeline, using fake data, that people can run to illustrate how it would work. I'll be putting aside a demonstration of the programmatic user rotation functionality (which we'd need to decide whether we want to pursue) as well as the active machine learning, which is not Minimum Viable Product for this issue.
I have created and linked #47, a draft PR that at the moment mainly exists to demonstrate the functionality of Label Studio and how it would look to utilize it in a (simplified) pipeline. @josh-chamberlain @mbodeantor I invite y'all to check it out and see
@josh-chamberlain Since it's been over a week, I wanted to additionally ping you on this, in case it got lost in the shuffle. |
@maxachis sorry about the delay, I wasn't getting notifications. I'm looking at this now. |
@maxachis I made a project.
Labeling interface
It's easy enough to make these 3 separate labeling tasks—but I think it's better if each URL only goes through the pipeline once, because it takes time for someone to read and understand what they're looking at.
@josh-chamberlain This interface is wayyy better looking and more comprehensible than what I came up with, so no complaints there. I also have no issue with having this be one task, and in fact think it's probably considerably easier for the user that way. I'm also happy to close this issue. I think after this we'd just need to create one or two issues for the process of ETL'ing data into and out of this.
Context
v1: Existing Doccano instance
Our volunteer @nfmcclure made us a Doccano instance, which helped us label hundreds of URLs. This is an update to that original code, or a fresh start if needed. We need to label more data sources, and our next version of the pipeline needs to be more user-friendly to answer as many volunteer questions as possible.
Doccano instance: http://35.90.222.49:8000/projects/1
Comment or DM for access.
Doccano alternatives
Since we're starting fresh, we should probably use something like Label Studio. It's more fully featured: it supports labeling rendered HTML, not just full text. This could really help us label a wide variety of things.
Requirements
`record type`
- Add `Inaccessible` for pages which are inaccessible, broken, null, 404, where the URL resolves to a file location of a jpeg/image, or languages other than English. These should be pruned from the training data.
- Merge `Calls for Service` and `Dispatch Logs` into `Calls for Service & Dispatch Logs`
- `Budgets & Finances`
- Rename `Not Criminal Justice Related` → `Not relevant`; `Poor Data Source` → `Relevant, but not a data source`
- Add `Individual Record` as a boolean which defaults to FALSE but can be TRUE if clicked during annotation

`record type` label

Docs