14.2.18. crate_anon.common.stringfunc

crate_anon/common/stringfunc.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Simple string functions.

crate_anon.common.stringfunc.compress_docstring(docstring: str) -> str

Splats a docstring onto a single line, compressing all whitespace.
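The effect described above can be approximated in one line. The following is a minimal sketch, not the CRATE implementation; the name compress_whitespace_sketch is hypothetical, and it assumes that "compressing all whitespace" means replacing each run of whitespace with a single space.

```python
def compress_whitespace_sketch(docstring: str) -> str:
    """Hypothetical stand-in: collapse a docstring onto one line."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace.
    return " ".join(docstring.split())


example = """
    Fetches a record.

    Returns None if absent.
"""
print(compress_whitespace_sketch(example))
```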

crate_anon.common.stringfunc.does_text_contain_word_chars(text: str) -> bool

Is a string worth treating as interesting text – does it contain “word” characters?
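One plausible reading of "word" characters is the regular expression class \w (letters, digits and underscore). The sketch below assumes that reading; the function name is hypothetical and the real test in CRATE may differ.

```python
import re

# Assumption: a "word" character is anything matched by \w.
_WORD_CHAR_RE = re.compile(r"\w")


def contains_word_chars_sketch(text: str) -> bool:
    """Hypothetical stand-in: True if any word character is present."""
    return _WORD_CHAR_RE.search(text) is not None
```

Under this assumption, a string of punctuation or whitespace alone is not "interesting", whereas a single digit is.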

crate_anon.common.stringfunc.get_digit_string_from_vaguely_numeric_string(s: str) -> str

Strips non-digit characters from a string.

For example, converts "(01223) 123456" to "01223123456".
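The telephone-number example can be reproduced with a simple filter. This is an illustrative sketch under a hypothetical name; it assumes that "digit" means the ASCII digits 0 to 9 only.

```python
def digits_only_sketch(s: str) -> str:
    """Hypothetical stand-in: keep only the ASCII digits in a string."""
    # Testing membership in an explicit digit set avoids str.isdigit(),
    # which also accepts some non-ASCII digit characters.
    return "".join(ch for ch in s if ch in "0123456789")


print(digits_only_sketch("(01223) 123456"))
```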

crate_anon.common.stringfunc.get_docstring(cls: Type) -> str

Fetches a docstring from a class.

crate_anon.common.stringfunc.get_spec_match_regex(spec: str) -> Pattern

Returns a compiled, case-insensitive regular expression representing a shell-style pattern (using *, ? and similar wildcards; see https://docs.python.org/3.5/library/fnmatch.html).

Parameters

spec – the pattern to pass to fnmatch, e.g. "patient_addr*".

Returns

the compiled regular expression
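A shell-style pattern can be turned into a case-insensitive regular expression with the standard library alone. The sketch below is an assumption about how such a function might be built (fnmatch.translate followed by re.compile); the function name is hypothetical and the column names are invented for illustration.

```python
import fnmatch
import re
from typing import Pattern


def spec_match_regex_sketch(spec: str) -> Pattern:
    """Hypothetical stand-in: compile a shell-style pattern to a regex."""
    # fnmatch.translate() converts wildcards such as * and ? into regular
    # expression syntax; re.IGNORECASE makes matching case-insensitive.
    return re.compile(fnmatch.translate(spec), re.IGNORECASE)


regex = spec_match_regex_sketch("patient_addr*")
print(bool(regex.match("PATIENT_ADDRESS_1")))  # matches despite the case
print(bool(regex.match("gp_address")))         # does not match the pattern
```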

crate_anon.common.stringfunc.make_twocol_table(colnames: List[str], rows: List[List[str]], max_table_width: int = 79, padding_width: int = 1, vertical_lines: bool = True, rewrap_right_col: bool = True) -> str

Formats a two-column table. Tries not to split/wrap the left-hand column, but resizes the right-hand column.
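To illustrate the layout idea (fixed left-hand column, right-hand column wrapped to fit the width budget), here is a rough sketch using only textwrap. It is not how CRATE draws its tables: the function name is hypothetical, the border style is invented, the right-hand column always fills the remaining width, and the vertical_lines and rewrap_right_col options are not modelled.

```python
import textwrap
from typing import List


def twocol_table_sketch(colnames: List[str], rows: List[List[str]],
                        max_table_width: int = 79,
                        padding_width: int = 1) -> str:
    """Hypothetical stand-in: never wrap the left column; wrap the right."""
    pad = " " * padding_width
    # The left column is as wide as its longest entry (header included).
    left_width = max(len(r[0]) for r in [colnames] + rows)
    # Three border characters plus padding on both sides of both cells.
    overhead = 3 + 4 * padding_width
    right_width = max(1, max_table_width - left_width - overhead)

    def fmt(left: str, right: str) -> List[str]:
        # An empty right-hand cell still produces one output line.
        chunks = textwrap.wrap(right, width=right_width) or [""]
        return [
            "|{p}{l}{p}|{p}{r}{p}|".format(
                p=pad,
                l=(left if i == 0 else "").ljust(left_width),
                r=chunk.ljust(right_width),
            )
            for i, chunk in enumerate(chunks)
        ]

    rule = ("+" + "-" * (left_width + 2 * padding_width)
            + "+" + "-" * (right_width + 2 * padding_width) + "+")
    lines = [rule] + fmt(colnames[0], colnames[1]) + [rule]
    for left, right in rows:
        lines.extend(fmt(left, right))
    lines.append(rule)
    return "\n".join(lines)


table = twocol_table_sketch(
    ["Name", "Description"],
    [["alpha", "a long description that needs wrapping " * 2]],
    max_table_width=40,
)
print(table)
```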

crate_anon.common.stringfunc.reduce_to_alphanumeric(s: str) -> str

Strips non-alphanumeric characters from a string.

For example, converts "PE12 3AB" to "PE123AB".
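Stripping everything that is not a letter or digit from a postcode such as "PE12 3AB" removes the space and gives "PE123AB". The following sketch shows one way to do this; the name is hypothetical, and it assumes str.isalnum() is an acceptable definition of "alphanumeric" (it is Unicode-aware, so accented letters are kept, which may differ from the real function).

```python
def alphanumeric_only_sketch(s: str) -> str:
    """Hypothetical stand-in: drop every non-alphanumeric character."""
    return "".join(ch for ch in s if ch.isalnum())


print(alphanumeric_only_sketch("PE12 3AB"))
```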

crate_anon.common.stringfunc.relevant_for_nlp(x: Optional[str]) -> bool

Does this string contain content that’s relevant for NLP? We want to eliminate None values, and strings that do not contain relevant content. A string containing only whitespace is not relevant.
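The filtering rule described above might look like the sketch below. It is an assumption, not the CRATE code: the name is hypothetical, and "relevant content" is taken to mean "contains at least one \w character".

```python
import re
from typing import Optional


def relevant_for_nlp_sketch(x: Optional[str]) -> bool:
    """Hypothetical stand-in: is this value worth sending to NLP?"""
    if not x:
        # Covers both None and the empty string.
        return False
    # Assumption: content is relevant if it has any word character,
    # so whitespace-only or punctuation-only strings are rejected.
    return re.search(r"\w", x) is not None
```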

crate_anon.common.stringfunc.remove_whitespace(s: str) -> str

Removes whitespace from a string.
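A minimal sketch of whitespace removal follows. The name is hypothetical, and it assumes that all whitespace is removed (internal as well as leading and trailing), not just the ends of the string.

```python
def remove_whitespace_sketch(s: str) -> str:
    """Hypothetical stand-in: delete every whitespace character."""
    # split() breaks on runs of whitespace; joining with "" removes them.
    return "".join(s.split())
```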

crate_anon.common.stringfunc.trim_docstring(docstring: str) -> str

Removes initial/terminal blank lines and leading whitespace from docstrings.

This is the PEP257 implementation (https://peps.python.org/pep-0257/), except with sys.maxint replaced by sys.maxsize (see https://docs.python.org/3.1/whatsnew/3.0.html#integers).

Demonstration:

from crate_anon.common.stringfunc import trim_docstring
print(trim_docstring.__doc__)
print(trim_docstring(trim_docstring.__doc__))
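For readers without CRATE installed, the PEP 257 algorithm referred to above can be sketched standalone. The version below follows the PEP's published trim function with sys.maxsize in place of sys.maxint; the name trim_sketch is hypothetical, and it is not guaranteed to be identical to the CRATE function.

```python
import sys


def trim_sketch(docstring: str) -> str:
    """Hypothetical stand-in following the PEP 257 trim algorithm."""
    if not docstring:
        return ""
    # Convert tabs to spaces and split into lines.
    lines = docstring.expandtabs().splitlines()
    # Find the minimum indentation, ignoring the first line and blanks.
    indent = sys.maxsize
    for line in lines[1:]:
        stripped = line.lstrip()
        if stripped:
            indent = min(indent, len(line) - len(stripped))
    # Strip that indentation (the first line is handled separately).
    trimmed = [lines[0].strip()]
    if indent < sys.maxsize:
        for line in lines[1:]:
            trimmed.append(line[indent:].rstrip())
    # Remove trailing and leading blank lines.
    while trimmed and not trimmed[-1]:
        trimmed.pop()
    while trimmed and not trimmed[0]:
        trimmed.pop(0)
    return "\n".join(trimmed)


print(trim_sketch("\n    First line.\n\n        Indented.\n    "))
```

Relative indentation of later lines is preserved; only the common leading indent is removed.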

crate_anon.common.stringfunc.uprint(*objects: Any, sep: str = ' ', end: str = '\n', file: TextIO = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) -> None

Prints strings to outputs that support UTF-8 encoding, but also to those that do not (e.g. Windows stdout, sometimes).

Parameters
  • *objects – things to print

  • sep – separator between those objects

  • end – print this at the end

  • file – file-like object to print to

See https://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined

Examples:

  • Linux, Python 3.6.8 console: sys.stdout.encoding == "UTF-8"

  • Windows, Python 3.7.4 console: sys.stdout.encoding == "utf-8"

  • Windows, Python 3.7.4, from script: sys.stdout.encoding == "cp1252"
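The cp1252 case above is the one that raises UnicodeEncodeError with a plain print(). One common workaround, in the spirit of the Stack Overflow discussion linked above, is to escape characters the target encoding cannot represent. The sketch below is an assumption about the approach rather than the CRATE code; uprint_sketch is a hypothetical name, and an in-memory cp1252 stream stands in for a Windows console.

```python
import io
import sys
from typing import Any, Optional, TextIO


def uprint_sketch(*objects: Any, sep: str = " ", end: str = "\n",
                  file: Optional[TextIO] = None) -> None:
    """Hypothetical stand-in: print safely to non-UTF-8 outputs."""
    if file is None:
        file = sys.stdout
    enc = (getattr(file, "encoding", None) or "utf-8").lower()
    if enc in ("utf-8", "utf8"):
        print(*objects, sep=sep, end=end, file=file)
        return

    def safe(obj: Any) -> str:
        # Characters that the encoding cannot represent become
        # backslash escapes (e.g. \u2192) instead of raising an error.
        return str(obj).encode(enc, errors="backslashreplace").decode(enc)

    print(*map(safe, objects), sep=sep, end=end, file=file)


# Simulate an output stream that uses cp1252.
buffer = io.BytesIO()
legacy = io.TextIOWrapper(buffer, encoding="cp1252", newline="\n")
uprint_sketch("arrow: \u2192", file=legacy)
legacy.flush()
print(buffer.getvalue())
```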