Basic Usage

Preface

Perspective

There are many ways to use an API. This documentation explains the API from the perspective of recreating our frontend.

main_string vs valid_string

A vocabulary term can have multiple ways to refer to it. For example, ‘mus musculus’ can also be called ‘mouse’. So, ‘main_string’ refers to the ‘most official’ way to reference a term, and ‘valid_string’ is any member of the term's set of synonyms, which includes the ‘main_string’ itself.

Curating a User’s Vocabulary

There are two resources associated with curating a user’s vocabulary. Their ordering depends on the logic of the frontend.

In our frontend, we try automatic curation first. It is based on two signals: a nearest-neighbors model over TF-IDF vectors of n-grams, and the use_count property, which tracks how often a term has been used before (we assume that previously used terms are more likely to be used again). We do this via

a post request to :4999/predictvocabularytermsresource/

With a payload like

{
    "header":"species",
    "written_strings":["musk muskulus","homo"],
    "neighbors_to_retrieve":100
}
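As a rough illustration, the request above could be built in Python with only the standard library. The host in BASE_URL and the helper names build_post and send are assumptions for this sketch (the document only gives the port and the resource name), and send assumes the server answers with JSON.

```python
import json
import urllib.request

# Assumption: the host is not given in this document, only the port.
BASE_URL = "http://localhost:4999"


def build_post(resource, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for one API resource."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{resource.strip('/')}/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_post(
    "predictvocabularytermsresource",
    {
        "header": "species",
        "written_strings": ["musk muskulus", "homo"],
        "neighbors_to_retrieve": 100,
    },
)
# send(request) would perform the call against a running server.
```

Separating build_post from send keeps the payload construction easy to inspect without a running server.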

The frontend then handles the “validation” by the user (in our frontend we only offer the top match), as well as the actual transforms on the currently submitted sample-metadata matrix if the user accepts the proposal.

In our frontend, terms whose proposals are not accepted are given the chance to be curated using substring matching. We do this via

a post request to :4999/generatesubstringmatchesresource

With a payload like

{
    "header":"species",
    "substring":"porcupi"
}

From this, a list of main_strings and valid_strings is generated. It is worth noting that, for certain categories, our frontend blocks calls when the substring is shorter than 3 characters, to prevent slowdowns. For other categories, it always allows calls, because very short strings can be legitimate. For example, ‘m’ is a valid_string for ‘meter’.
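A minimal sketch of such a client-side guard is shown below. The function name, the always_allowed_headers parameter, and the "unit" header used in the usage lines are hypothetical; this document does not list which categories are exempt from the length check.

```python
MIN_SUBSTRING_LENGTH = 3  # threshold described in the text above


def should_query_substring(header, substring, always_allowed_headers=frozenset()):
    """Decide whether the frontend should call the substring-matching resource.

    Headers in always_allowed_headers skip the length check, because very
    short strings can be legitimate there (for example 'm' for 'meter').
    """
    if header in always_allowed_headers:
        return True
    return len(substring.strip()) >= MIN_SUBSTRING_LENGTH


# Hypothetical usage: "unit" stands in for a category that is always allowed.
print(should_query_substring("species", "po"))
print(should_query_substring("species", "porcupi"))
print(should_query_substring("unit", "m", always_allowed_headers={"unit"}))
```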

These are the only tools for curation. The third step on the frontend is just the submission of a new term, not matching to existing terms.

Submitting a Study

Steps Always Taken

There are two steps always taken upon completion of the curation process on our frontend. One is the updating of use_count and one is the submission of the study.

The use_count property is updated via

a post request to :4999/updateusecountresource/

With a payload like

{
    "header":"species",
    "main_string":"Bacteria"
}

We update use_count because it is an important part of finding the best term. We found that the cosine similarity of n-grams alone often led to close-but-not-quite answers. One can loosely think of this as a sloppy Bayesian approach.

We pass the main_string because we want to update the core vocabulary term. If you need the main_string and only have the valid_string, you can access the main_string via

a post request to :4999/retrievevocabrowsresource/

With a payload like

{
    "header":"organ",
    "valid_string":"kidney"
}

It is important to specify the header, because the same valid_string (and possibly the same main_string) can appear under multiple headers. For example, DDT is a pesticide as well as a gene.
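To illustrate why the header is needed, here is a toy lookup keyed by both header and valid_string. This is only a sketch of the idea, not the service's actual storage; the header names "pesticide" and "gene" and the table contents are hypothetical examples.

```python
# Hypothetical in-memory vocabulary: (header, valid_string) -> main_string.
VOCABULARY = {
    ("pesticide", "ddt"): "dichlorodiphenyltrichloroethane",
    ("gene", "ddt"): "D-dopachrome tautomerase",
    ("species", "mouse"): "mus musculus",
    ("species", "mus musculus"): "mus musculus",  # the main_string is also a valid_string
}


def main_string_for(header, valid_string):
    """Return the main_string for a valid_string under one header, or None."""
    return VOCABULARY.get((header, valid_string.lower()))


# The same valid_string resolves to different terms depending on the header.
print(main_string_for("pesticide", "DDT"))
print(main_string_for("gene", "DDT"))
```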

Finally, we always submit the completed study to the database. Again, the actual data submitted comes from the frontend, which orchestrates the transformation. We do this via

a post request to :4999/addstudytodatabase/

With a payload like

{
    "provided_author_name": "Parker Bremer",
    "sample_metadata_sheet_panda":
    [
        {"species.0": "Homo sapiens", "organ.0": "Kidney", "cellLine.0": "not available", "cellCount.0": "not available", "mass.0": "5.0", "massUnit.0": "milligram", "drugName.0": "control", "drugDoseMagnitude.0": "not available", "drugDoseUnit.0": "not available"},
        {"species.0": "Homo sapiens", "organ.0": "Kidney", "cellLine.0": "not available", "cellCount.0": "not available", "mass.0": "5.0", "massUnit.0": "milligram", "drugName.0": "control", "drugDoseMagnitude.0": "not available", "drugDoseUnit.0": "not available"},
        {"species.0": "Homo sapiens", "organ.0": "Kidney", "cellLine.0": "not available", "cellCount.0": "not available", "mass.0": "5.0", "massUnit.0": "milligram", "drugName.0": "control", "drugDoseMagnitude.0": "not available", "drugDoseUnit.0": "not available"},
        {"species.0": "Homo sapiens", "organ.0": "Kidney", "cellLine.0": "not available", "cellCount.0": "not available", "mass.0": "5.0", "massUnit.0": "milligram", "drugName.0": "KERENDIA", "drugDoseMagnitude.0": "20.0", "drugDoseUnit.0": "milligram"},
        {"species.0": "Homo sapiens", "organ.0": "Kidney", "cellLine.0": "not available", "cellCount.0": "not available", "mass.0": "5.0", "massUnit.0": "milligram", "drugName.0": "KERENDIA", "drugDoseMagnitude.0": "20.0", "drugDoseUnit.0": "milligram"},
        {"species.0": "Homo sapiens", "organ.0": "Kidney", "cellLine.0": "not available", "cellCount.0": "not available", "mass.0": "5.0", "massUnit.0": "milligram", "drugName.0": "KERENDIA", "drugDoseMagnitude.0": "20.0", "drugDoseUnit.0": "milligram"},
        {"species.0": "Homo sapiens", "organ.0": "not available", "cellLine.0": "HEK293", "cellCount.0": "1000000.0", "mass.0": "not available", "massUnit.0": "not available", "drugName.0": "control", "drugDoseMagnitude.0": "not available", "drugDoseUnit.0": "not available"},
        {"species.0": "Homo sapiens", "organ.0": "not available", "cellLine.0": "HEK293", "cellCount.0": "1000000.0", "mass.0": "not available", "massUnit.0": "not available", "drugName.0": "control", "drugDoseMagnitude.0": "not available", "drugDoseUnit.0": "not available"},
        {"species.0": "Homo sapiens", "organ.0": "not available", "cellLine.0": "HEK293", "cellCount.0": "1000000.0", "mass.0": "not available", "massUnit.0": "not available", "drugName.0": "control", "drugDoseMagnitude.0": "not available", "drugDoseUnit.0": "not available"},
        {"species.0": "Homo sapiens", "organ.0": "not available", "cellLine.0": "HEK293", "cellCount.0": "1000000.0", "mass.0": "not available", "massUnit.0": "not available", "drugName.0": "KERENDIA", "drugDoseMagnitude.0": "20.0", "drugDoseUnit.0": "milligram"},
        {"species.0": "Homo sapiens", "organ.0": "not available", "cellLine.0": "HEK293", "cellCount.0": "1000000.0", "mass.0": "not available", "massUnit.0": "not available", "drugName.0": "KERENDIA", "drugDoseMagnitude.0": "20.0", "drugDoseUnit.0": "milligram"},
        {"species.0": "Homo sapiens", "organ.0": "not available", "cellLine.0": "HEK293", "cellCount.0": "1000000.0", "mass.0": "not available", "massUnit.0": "not available", "drugName.0": "KERENDIA", "drugDoseMagnitude.0": "20.0", "drugDoseUnit.0": "milligram"}
    ]
}
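The payload above reads as four experimental conditions with three identical rows each. Under that reading, it could be assembled programmatically as sketched below. The helper make_sample and the REPLICATES constant are hypothetical names introduced for this illustration.

```python
import json

NA = "not available"  # placeholder the payload above uses for missing values


def make_sample(organ, cell_line, cell_count, mass, mass_unit, drug, dose, dose_unit):
    """Return one row of the sample-metadata sheet."""
    return {
        "species.0": "Homo sapiens",
        "organ.0": organ,
        "cellLine.0": cell_line,
        "cellCount.0": cell_count,
        "mass.0": mass,
        "massUnit.0": mass_unit,
        "drugName.0": drug,
        "drugDoseMagnitude.0": dose,
        "drugDoseUnit.0": dose_unit,
    }


# One entry per condition, in the same order as the payload above.
conditions = [
    make_sample("Kidney", NA, NA, "5.0", "milligram", "control", NA, NA),
    make_sample("Kidney", NA, NA, "5.0", "milligram", "KERENDIA", "20.0", "milligram"),
    make_sample(NA, "HEK293", "1000000.0", NA, NA, "control", NA, NA),
    make_sample(NA, "HEK293", "1000000.0", NA, NA, "KERENDIA", "20.0", "milligram"),
]
REPLICATES = 3  # each condition appears three times in the payload above

payload = {
    "provided_author_name": "Parker Bremer",
    "sample_metadata_sheet_panda": [
        dict(row) for row in conditions for _ in range(REPLICATES)
    ],
}
body = json.dumps(payload)  # JSON text to send as the POST body
```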

Steps Taken if New Vocabulary

If there is new vocabulary, then additional steps must be taken.

We must first validate that these terms are legitimate to add (the frontend actually does this before the study is submitted). We do this via

a post request to :4999/validatetermsfortrainingresource/

With a payload like

{
    "new_vocabulary":["Bacteria","Azorhizobium"]
}

This will do things like check to make sure that terms are long enough.

Next, we add terms to the vocabulary. This is a FAST step (unlike training), so we recommend performing it every time. We do this via

a post request to :4999/addtermstovocabularyresource/

With a payload like

{
    "header":"species",
    "new_vocabulary":["new species 1","new species 2"]
}

It is worth noting that, for new terms, the valid_string and main_string will be the same - there is no synonym feature.

However, training is a SLOW step. One possible approach is to make it a background process that runs periodically (for example, once a day). Being slightly out of date is OK, because the FAST vocabulary step ensures that the full vocabulary will be there at the next training. We trigger training via

a post request to :4999/trainvocabularyresource/

With a payload like

{
    "header":"species"
}
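The ordering of these three steps can be sketched as a small planner that returns the resources and payloads to post, in order. The function name new_vocabulary_requests and its train flag are hypothetical. Because the response format of the validation resource is not described here, reacting to a failed validation is left to the caller.

```python
def new_vocabulary_requests(header, terms, train=False):
    """List the (resource, payload) pairs to post for new vocabulary, in order.

    Validation and the FAST add step are always included; the SLOW training
    step is included only when train=True, so it can be deferred to a
    periodic background job.
    """
    terms = list(terms)
    requests = [
        ("validatetermsfortrainingresource", {"new_vocabulary": terms}),
        ("addtermstovocabularyresource", {"header": header, "new_vocabulary": terms}),
    ]
    if train:
        requests.append(("trainvocabularyresource", {"header": header}))
    return requests


# Per-study call: validate and add, but do not train.
for resource, payload in new_vocabulary_requests("species", ["new species 1", "new species 2"]):
    print(resource, payload)
```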

Accessing Submitted Studies

On our frontend, users are presented with a result Excel file containing the standardized metadata. That is good, but we might also want programmatic access to submitted studies for connection to other data processing.

We can do this without any knowledge of what the user submitted, as long as their submission was reasonably logical.

First, we can determine all authors that have ever submitted a study using

a post request to :4999/authorid/

With a payload like

{}

Then, select the author that makes sense and get all submitted study IDs using

a post request to :4999/studyid/

With a payload like

{
    "author_id":"parkerbremer"
}

The resulting study IDs are timestamps counted from the Unix epoch (the example value below, 1686247553.2546, is in seconds). So, we can usually just take the largest number, as that corresponds to the most recently submitted study. Then, we can get the full transformed samples via

a post request to :4999/samples/

With a payload like

{
    "study_id":"1686247553.2546"
}
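Picking the most recent study can be sketched as follows. The sketch assumes the study IDs arrive as a list of strings and that they are seconds since the Unix epoch, which is how the example value reads. Only the first ID below comes from this document; the other two are made up for illustration.

```python
from datetime import datetime, timezone


def most_recent_study_id(study_ids):
    """Return the ID with the largest numeric value (compare as numbers, not text)."""
    return max(study_ids, key=float)


def submitted_at(study_id):
    """Interpret a study ID as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(float(study_id), tz=timezone.utc)


# The first ID is the example above; the others are invented for this sketch.
study_ids = ["1686247553.2546", "1686160000.0001", "999999999.5"]

latest = most_recent_study_id(study_ids)
payload = {"study_id": latest}  # body for the request to the samples resource
print(latest, submitted_at(latest).date())
```

Comparing with key=float matters because the IDs are strings: a plain max over the strings would pick "999999999.5", which sorts last as text but is the oldest timestamp.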