  1. IGB
  2. IGBF-3236

Interpret nucleotide symbols when searching by residue

    Details

    • Story Points:
      3
    • Sprint:
      Summer 3 2023 June 12, Summer 4 2023 June 26, Summer 5 2023 July 10

      Description

      Situation: Under the Advanced Search tab, the search can be set to Residues. This is extremely useful, as a user can search for primer locations or motifs. While the Residues search does allow wildcards (.[]*), it does not appear to understand nucleotide symbols such as R (G or A), Y (C or T), etc. So a motif found in a paper, such as CACRTS, does not work correctly under the Advanced Search for Residues in IGB.

      Task: Expand the logic for IGB Advanced Search for Residues so that IGB can understand Nucleotide Symbols.

      Symbol  Bases
      R       A or G
      Y       C or T
      S       G or C
      W       A or T
      K       G or T
      M       A or C
      B       C or G or T
      D       A or G or T
      H       A or C or T
      V       A or C or G
      N       any base

      For example, if a user currently uses the Advanced Search for Residues in IGB to look for the motif RYSNATCG, IGB is not able to find the motif, as IGB does not understand what RYSN refers to. New logic needs to be added to IGB so that when searching, IGB understands that R can match either A or G, Y can match either C or T, etc.
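
A minimal sketch of the requested expansion, assuming each symbol in the table above is rewritten as a regex character class before the search runs. The class and method names (`IupacExpander`, `classFor`, `expand`) are hypothetical and are not taken from the IGB code base; escape handling is deliberately left out here.

```java
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical, not IGB code.
class IupacExpander {
    // Returns the character class for an ambiguity symbol, or null for any other character.
    static String classFor(char c) {
        switch (Character.toUpperCase(c)) {
            case 'R': return "[AG]";
            case 'Y': return "[CT]";
            case 'S': return "[GC]";
            case 'W': return "[AT]";
            case 'K': return "[GT]";
            case 'M': return "[AC]";
            case 'B': return "[CGT]";
            case 'D': return "[AGT]";
            case 'H': return "[ACT]";
            case 'V': return "[ACG]";
            case 'N': return "[ACGT]";
            default:  return null;
        }
    }

    // Replaces each ambiguity symbol with its character class; copies everything else unchanged.
    static String expand(String query) {
        StringBuilder out = new StringBuilder();
        for (char c : query.toCharArray()) {
            String cls = classFor(c);
            if (cls != null) {
                out.append(cls);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String regex = expand("CACRTS");
        System.out.println(regex); // CAC[AG]T[GC]
        // Case-insensitive search, so lower-case residues would match as well.
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        System.out.println(p.matcher("ttCACGTGaa").find()); // true
    }
}
```

With this mapping, the motif RYSNATCG from the example becomes `[AG][CT][GC][ACGT]ATCG`.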

        Attachments

        1. escape_working.png (35 kB)
        2. pattern_error.png (86 kB)
        3. regex.png (3 kB)
        4. result.png (101 kB)
        5. with_escape_character.png (31 kB)
        6. with_escape_character.png (31 kB)
        7. without_escape_character.png (40 kB)
        8. without_escape_character.png (40 kB)

            Activity

            nfreese Nowlan Freese added a comment - - edited

            The CACRTS example is from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8318262/

            The paper is examining the Arabidopsis thaliana genome.

            kgopu Kaushik Gopu added a comment -

            When a user enters a sequence containing nucleotide ambiguity symbols in the search box (say NACT or MACT), the logic interprets those symbols and converts them into their respective substitutes (e.g. NACT -> [AGCT]ACT).

            How the Pattern column is displayed to the user:
            For example, when a user enters NACT in the search box, IGB actually searches for the pattern [AGCT]ACT, but it does not disclose this regex under the Pattern column. Instead, it shows the original input, which is NACT itself (check the image below).

            nfreese Nowlan Freese added a comment - - edited

            Additional request:

            1) Display the exact user search in the Pattern column.
            For example, if a user searches for:

            [ATCG]T.*NNNRR

            the pattern displayed in the pattern column would be:

            [ATCG]T.*NNNRR

            2) Users need to be able to search for N literals. The four nucleotides are ATCG but genomes often also include N when parts of the genome are unknown. This may require escaping the N meta character so that it searches for literal N and not [ATCG].
            For example, if a user searches for:

            \N

            this should match only the N and not the other nucleotides: ATCG N ATCG

            kgopu Kaushik Gopu added a comment -

            \Q<LETTER>\E works for escaping characters.
            There are two ways to escape a metacharacter:
            1) Precede the metacharacter with a backslash (\)
            2) Enclose the metacharacter between \Q and \E
            The first one is not working for some reason, but the second one works as expected.

            ann.loraine Ann Loraine added a comment -

            Do please try to determine why the backslash is failing to properly function as an escape character. I bet there is something wonky in our existing code that is preventing it from functioning as expected.

            ann.loraine Ann Loraine added a comment -

            Kaushik Gopu suspects \N may be getting interpreted as a newline symbol.

            kgopu Kaushik Gopu added a comment - - edited

            My observation about escape characters:
            We cannot escape a letter using \ for the following reason:
            "It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct." (check the pattern_error image)

            This leaves us only one option for escaping characters, which is \Q<literal>\E.
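
The behavior quoted above can be checked with a few lines of plain `java.util.regex`. This is a minimal sketch, not IGB code, and the class and helper names are hypothetical. Note that backslashes are doubled in Java string literals; a user typing into the IGB search box would enter a single backslash.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustrative only: names are hypothetical, not IGB code.
class EscapeDemo {
    // True if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True if the regex is found anywhere in the sequence.
    static boolean found(String regex, String sequence) {
        return Pattern.compile(regex).matcher(sequence).find();
    }

    public static void main(String[] args) {
        // Backslash before the letter N is not a defined escape, so compilation fails.
        System.out.println(compiles("\\N")); // false
        // Backslash before a non-alphabetic character is allowed.
        System.out.println(compiles("\\.")); // true
        // \Q...\E quotes its contents, so N is matched literally.
        System.out.println(found("\\QN\\E", "ATCGNATCG")); // true
        System.out.println(found("\\QN\\E", "ATCGATCG"));  // false
    }
}
```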

            nfreese Nowlan Freese added a comment -

            [~aloraine] - Please see Kaushik's comment above regarding escaping characters.

            We could leave Kaushik's code the way it currently is (N is replaced with [ATCG]), so if a user wants to find a literal N they would need to escape it. Escaping multiple N characters does work in IGB, for example:

            GTA\QNNN\E

            i.e. the user does not have to enter \QN\E\QN\E for every single N they are searching for.

            nfreese Nowlan Freese added a comment -

            The general consensus is to move forward with Kaushik's current changes.

            Searching for N will find A, T, C, or G.
            Searching for \QN\E will find N.

            Kaushik Gopu - please push your changes to your repository and move ticket to needs 1st level review.

            kgopu Kaushik Gopu added a comment - - edited

            My current logic simply interprets nucleotide symbols and replaces them with their respective substitutes. With regard to "N", I need to ensure that it is not replaced when it is surrounded by escape characters. I am currently working on this particular case, which is not straightforward and requires investigation.

            kgopu Kaushik Gopu added a comment - - edited

            How I handled "N" case:

            I have created one regex, which is
            Explanation of above regex:

            If the search sequence contains the character "N", the regex checks whether it is surrounded by escape characters. If so, no substitution is made; otherwise the N is replaced with its respective symbols.

            Breakdown of regex:
            (?<![\\\\]): not \
            (?<![Q]) : not Q
            [Nn]: matches N or n (since the search is case-insensitive)
            ((?<![\\\\]): not \
            (?<![E])): not E
            Overall: leave N alone if it is surrounded by \Q and \E; otherwise replace it.

            We can test any regex using the online tool https://regex101.com/ (switch to Java 8 before testing). As of now it works fine, but I want to spend some time testing it, and after that I will push the changes to the remote.

            Please check the attached image below for the results.

            kgopu Kaushik Gopu added a comment - - edited

            Search functionality works as expected. However, there is an open question about what should appear under the Pattern column.

            Last time we agreed that the user's search string should be shown under the Pattern column. However, when a user enters AGCN, the Pattern column shows AGC[AGCT], which is technically correct, but the agreed result is AGCN. To make this happen, we would need to pass the original string to the function so that it can be set in the Pattern column. Unfortunately, this function implements an interface that is also used by one other class. In my opinion, modifying the interface is not a great idea for such a small feature.

            My solution:
            AGCN will be shown as AGC[AGCT] under the Pattern column, AGCY as AGC[CT], and so on. This is technically correct because it shows the user that special characters are being replaced with their substitutes. Moreover, we do not need to disturb any existing code.

            Apart from this, everything works fine, including skipping escaped characters. [~aloraine] and Nowlan Freese, please let me know your thoughts on this.

            ann.loraine Ann Loraine added a comment -

            During scrum, team agreed the above plan is the best choice.

            kgopu Kaushik Gopu added a comment - - edited

            I have committed the changes to the repo and they are ready for review.
            Link to the branch: https://bitbucket.org/kaushik-gopu/kgopu_integrated-genome-browser/src/IGBF-3236/
            Link to the downloads folder after running the pipeline: https://bitbucket.org/kaushik-gopu/kgopu_integrated-genome-browser/downloads/

            Things to be tested:
            i) Search for AGCN. N will be replaced by [AGCT] and results will be produced for it. Test this not only for N but also for the other nucleotide symbols.
            ii) Search for \QN\E. N will not be replaced.
            iii) Since you (Nowlan Freese or [~aloraine]) know this tool well, please try to test possible edge cases.

            Note: I have not made any changes to the Pattern column because I do not want to disturb other parts of the code that have nothing to do with this feature. Changing the interface (specification) for this feature is not a good choice. Instead, we can create good documentation about regular expressions (regex) so that users have the knowledge they need to use them effectively.

            nfreese Nowlan Freese added a comment - - edited

            While testing I found a minor issue:
            The \Q\E escape characters work for a single N but not for multiple.
            For example, the following works in IGB release (Arabidopsis genome Chr1:14,511,700-14,511,744):

            GGAGA\QNNNNN\E

            But no results are found on Kaushik's branch.
            Surrounding every N with the escape characters like the following will work, but is somewhat annoying for the user:

            GGAGA\QN\E\QN\E\QN\E

            This most likely indicates that Kaushik's regex for determining whether an N is surrounded by escape characters handles only a single N.

            I recommend seeing if this can be quickly addressed via Kaushik's regex code.
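
A sketch of how this kind of failure can arise. The pattern below is a hypothetical stand-in, not the regex from Kaushik's branch: it treats an N as escaped only when \Q sits directly before it or \E directly after it, so the N characters in the middle of a longer quoted run are still expanded.

```java
// Illustrative only: UNESCAPED_N is an assumed pattern, not the one used on the branch.
class LookaroundDemo {
    // Regex text: (?<!\\Q)[Nn](?!\\E)
    // An N that is neither directly preceded by \Q nor directly followed by \E.
    static final String UNESCAPED_N = "(?<!\\\\Q)[Nn](?!\\\\E)";

    // Expands every N that the pattern considers unescaped.
    static String expandN(String query) {
        return query.replaceAll(UNESCAPED_N, "[ACGT]");
    }

    public static void main(String[] args) {
        // A single quoted N is left alone.
        System.out.println(expandN("GGAGA\\QN\\E"));
        // prints GGAGA\QN\E

        // Only the first and last N are adjacent to \Q or \E; the middle ones are expanded.
        System.out.println(expandN("GGAGA\\QNNNNN\\E"));
        // prints GGAGA\QN[ACGT][ACGT][ACGT]N\E
    }
}
```

Because everything between \Q and \E is then matched literally, the second query would look for the text `N[ACGT][ACGT][ACGT]N` in the sequence and find nothing, which matches the symptom described above.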

            ann.loraine Ann Loraine added a comment -

            Good catch

            I think the problem is not too serious.

            Here is why:

            To my way of thinking, the ability to escape the "N" character in a residues search is a specialized capability that a small number / percentage of IGB's audience will use. So, I feel it is OK to require extra knowledge to use the new capability.

            I recommend creating a section / linkable page documenting this aspect of the residues search capability. The section headline could say something like "Specialized searching: finding pattern instances in genomic sequence" or "Escaping a wild card character to find its literal value in the genomic sequence".

            Speaking of which, currently a user has to load the entire genomic sequence into memory in order to perform a genomic sequence search. This is problematic because these sequences can get very large. I think it would be very useful to create a REST (or similar) Web service that could provide the search results to IGB on demand, using next to zero local resources.

            How it could work: A user would click in the Advanced Search Tab and view the same interface as now, except there would now be a new "whole genome"-style option for defining the search space. This would be the default setting. Then, when a user clicks the "find" button to launch a search, IGB would next forward a request to run that search on to a remote host, configured to answer such requests from IGB clients only and of course transmitted on an encrypted channel.

            Once the search finishes, the host would deliver the results in an IGB-friendly format.

            kgopu Kaushik Gopu added a comment -

            I appreciate your valuable input, Dr. Freese, as it has shed light on the issue from a different perspective. Based on your insights, I decided to take a different approach rather than relying on complex regex expressions. So far, the new approach seems to be functioning well. Once I am confident that everything is working as intended, I will provide an update on the progress.

            kgopu Kaushik Gopu added a comment -

            I have committed changes to the repo, and the code should now allow multiple N's between escape characters. Please test and let me know if any corrections are needed.
            link to branch: https://bitbucket.org/kaushik-gopu/kgopu_integrated-genome-browser/branch/IGBF-3236
            link to downloads folder: https://bitbucket.org/kaushik-gopu/kgopu_integrated-genome-browser/downloads/

            nfreese Nowlan Freese added a comment -

            A couple of changes/issues:

            1. Make sure that you do not import * in Java classes. The IDE is probably doing this automatically, there should be a way to disable that behavior.
            2. Searching for NNNN is only finding literal N instead of [ATCG].
            3. I am unable to escape characters. For example,
              \Q.....\E

              should not find any matches.

            kgopu Kaushik Gopu added a comment - - edited

            Revamped the code to enable escaping of any character within the escape literals. The code should now be able to handle all types of searches.
            Requesting first level review: https://bitbucket.org/kaushik-gopu/kgopu_integrated-genome-browser/branch/IGBF-3236

            nfreese Nowlan Freese added a comment -

            Testing on Kaushik's branch with Arabidopsis Chr1:14,511,701-14,511,743

            • N -> finds [ACGT]
            • \QN\E -> finds N
            • \QGAGANNNN\E -> finds GAGANNNN
            • RR\QNNNN\E -> finds [AG][AG]NNNN
            • SS -> finds [GC][GC]
            • \QSS\E -> finds nothing
            • . -> finds any letter
            • \Q.\E -> finds nothing
            • Other metacharacters find appropriate matches

            Every residue search I can think of appears to be working correctly.
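
One way to get the behavior listed above without a lookaround regex is a single left-to-right scan that tracks whether the current position is inside \Q...\E. This is a minimal sketch under that assumption; the ticket does not show the merged implementation, and the names `ResidueQueryTranslator`, `classFor`, `translate`, and `found` are hypothetical. Symbols inside a bracketed class such as [RY] are not treated specially here.

```java
import java.util.regex.Pattern;

// Illustrative only: not the merged IGB implementation.
class ResidueQueryTranslator {
    // Character class for an ambiguity symbol, or null for any other character.
    static String classFor(char c) {
        switch (Character.toUpperCase(c)) {
            case 'R': return "[AG]";
            case 'Y': return "[CT]";
            case 'S': return "[GC]";
            case 'W': return "[AT]";
            case 'K': return "[GT]";
            case 'M': return "[AC]";
            case 'B': return "[CGT]";
            case 'D': return "[AGT]";
            case 'H': return "[ACT]";
            case 'V': return "[ACG]";
            case 'N': return "[ACGT]";
            default:  return null;
        }
    }

    // Expands ambiguity symbols outside \Q...\E and copies quoted text unchanged.
    static String translate(String query) {
        StringBuilder out = new StringBuilder();
        boolean quoted = false;
        int i = 0;
        while (i < query.length()) {
            // \Q opens a quoted run, \E closes it; copy the marker and flip the state.
            if (query.startsWith(quoted ? "\\E" : "\\Q", i)) {
                out.append(query, i, i + 2);
                quoted = !quoted;
                i += 2;
                continue;
            }
            char c = query.charAt(i);
            // Outside quotes, copy a backslash escape such as \. verbatim.
            if (!quoted && c == '\\' && i + 1 < query.length()) {
                out.append(c).append(query.charAt(i + 1));
                i += 2;
                continue;
            }
            String cls = quoted ? null : classFor(c);
            if (cls != null) {
                out.append(cls);
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }

    // Case-insensitive search of the translated query in a sequence.
    static boolean found(String query, String sequence) {
        return Pattern.compile(translate(query), Pattern.CASE_INSENSITIVE)
                .matcher(sequence).find();
    }

    public static void main(String[] args) {
        System.out.println(translate("RR\\QNNNN\\E"));             // [AG][AG]\QNNNN\E
        System.out.println(found("RR\\QNNNN\\E", "GGAGANNNNNAC")); // true
        System.out.println(found("\\QSS\\E", "GGAGANNNNNAC"));     // false
    }
}
```

Because the scan keeps its own quoted/unquoted state, the number of characters between \Q and \E does not matter.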

            Ready for pull request - please squash commits prior to submitting pull request, if not sure how best to do this please ask Nowlan.

            kgopu Kaushik Gopu added a comment -

            Pull request submitted: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/932

            ann.loraine Ann Loraine added a comment - - edited

            The PR's source branch is missing a commit that is on the target branch in the target repository - "master".

            See: https://bitbucket.org/kaushik-gopu/kgopu_integrated-genome-browser/commits/?search=IGBF-3236

            To ensure the cleanest possible history, rebase development branch IGBF-3236 onto the most recent commit on the master branch, from the team repository.

            To check that my concern is right, run the branch IGBF-3236 installer and look for Kaushik Gopu's name in the credits, as this appears to be the missing commit.

            kgopu Kaushik Gopu added a comment -

            rebased IGBF-3236 onto the most recent commit on the master branch, from the team repo.
            pull request submitted: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/933

            ann.loraine Ann Loraine added a comment -

            PR is merged. Master branch installers are built and ready for testing.

            nfreese Nowlan Freese added a comment -

            Tested on Mac using master branch installer.

            Following the same testing outlined in the previous comment, all searches are working correctly.

            Closing ticket.


              People

              • Assignee:
                kgopu Kaushik Gopu
                Reporter:
                nfreese Nowlan Freese